Senior Software Engineer — AI Evaluation & Benchmarks (Python)

Remote, USA • Full-time • Posted 2026-05-31

Before Applying
This role is open to contractors in accepted locations only. Please confirm your country is on the list before applying — we're unable to process applications from unlisted locations. List of accepted countries and locations.
For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.

What You'll Be Doing
Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:
Design coding benchmarks that evaluate frontier models on real-world programming tasks — reasoning, debugging, and production-quality code

Build and maintain scalable data pipelines for evaluation workflows

Analyze model-generated code for correctness, reliability, and edge-case failures

Construct structured evaluation scenarios across large repos and multi-language environments

Provide detailed technical feedback on model performance and failure patterns

Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do — and shape how the next generation is trained and improved.
AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.
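As a rough illustration of that loop, here is a minimal sketch in Python. Every name in it (Task, Result, evaluate, stub_model, safe_div) is hypothetical and chosen for this example only; it is not the team's actual harness. The stub stands in for a real model call and deliberately returns code that misses an edge case, so the failure-analysis step has something to report.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    """One benchmark task: a prompt plus (args, expected) checks."""
    name: str
    prompt: str
    entry_point: str
    cases: list[tuple[tuple, object]]


@dataclass
class Result:
    task: str
    passed: int
    total: int
    failures: list[str] = field(default_factory=list)


def evaluate(task: Task, generate: Callable[[str], str]) -> Result:
    """Run the model on one task and score its code against the task's cases."""
    result = Result(task.name, passed=0, total=len(task.cases))
    namespace: dict = {}
    try:
        # Assumption for brevity: exec() in-process. Running untrusted model
        # output this way is unsafe; a real harness would isolate it in a
        # sandboxed subprocess or container.
        exec(generate(task.prompt), namespace)
        func = namespace[task.entry_point]
    except Exception as exc:
        result.failures.append(f"load error: {exc!r}")
        return result
    for args, expected in task.cases:
        try:
            actual = func(*args)
        except Exception as exc:
            result.failures.append(f"{args}: raised {exc!r}")
            continue
        if actual == expected:
            result.passed += 1
        else:
            result.failures.append(f"{args}: expected {expected!r}, got {actual!r}")
    return result


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; ignores the zero-divisor case."""
    return "def safe_div(a, b):\n    return a / b\n"


task = Task(
    name="safe_div",
    prompt="Write safe_div(a, b) that returns a / b, or None when b is 0.",
    entry_point="safe_div",
    cases=[((6, 3), 2.0), ((1, 0), None)],
)
report = evaluate(task, stub_model)
print(f"{report.task}: {report.passed}/{report.total}", report.failures)
```

The failure list is the part that feeds back into the benchmark: a task that every model passes, or that fails only on load errors, tells you little, so cases are added or tightened until results separate strong submissions from weak ones.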

What You'll Need
4+ years of professional software engineering experience (non-negotiable)

Expert Python — clean, performant, well-tested code

Hands-on experience working in large, complex codebases

Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines

Strong command of Git and modern development workflows

Track record at a high-growth tech company or top-tier software organization

Strong written English communication

Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.

Nice to have
Senior or Lead-level profile with a history of technical ownership

Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)

Proficiency in additional languages: JavaScript, Go, C++, or others

Experience with CI/CD and with writing robust unit tests (pytest, Mocha, JUnit)

Background in security engineering or significant open-source contributions

Familiarity with AI/ML evaluation methodologies or model benchmarking

Logistics
Location: Fully remote — work from anywhere on the accepted locations list

Compensation: $80–$100/hr based on location and seniority

Contract length: 3 months, with potential for extension

Hours: Full-time availability preferred — hours vary by project and are not guaranteed week to week

Engagement: 1099 independent contractor

Payment: Weekly via PayPal or Stripe

⚠️ Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.
