Mirrai Careers

Research Scientist, Agentic Data & Benchmarking

ifm-us

Sunnyvale, CA | Full-time | $150k–$450k / year | Posted 2w ago
Market rate. This role pays around the $209k median for similar USD roles (19 comparable postings in our corpus).

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) is a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you'll work at the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You'll help build groundbreaking AI systems with the potential to reshape entire industries, and contribute to establishing MBZUAI as a global hub for high-performance computing and deep learning.

About the role

The Agents team trains advanced agentic language models that use reasoning and tool use to complete real tasks on a computer. This is a specialist role at the center of the loop that drives those models: the data we train on and the benchmarks we measure against.

You'll own the agentic data pipeline end-to-end (sourcing and generating high-quality trajectories, tool-use data, and RL environments) and the evaluation suite that tells us, rigorously and reproducibly, what our agents can actually do. These two halves are inseparable: benchmarks expose where models fail, and targeted data closes the gap. The agents are only as good as the data they learn from and the evals that keep us honest, and this role owns both.

This is a research scientist position for someone who wants depth in data and measurement rather than breadth across the whole stack. You should be the kind of person who reads through datasets line by line, distrusts a metric until it's been validated, and gets satisfaction from making an eval suite that nobody questions.

KEY RESPONSIBILITIES

Benchmarking & evaluation

* Design and run evaluations of agentic capabilities (multi-step reasoning, tool use, long-horizon planning, computer use, and safety properties), turning ambiguous notions of "intelligence" into defensible, reproducible metrics.
* Build and harden evaluation harnesses so benchmarks run reliably at scale against training checkpoints, with clear signal on regressions and model health.
* Run experiments characterizing how prompting, sampling, scaffolding, and environment design affect agentic performance on internal and public benchmarks.
* Diagnose anomalous eval results mid-training-run: determine whether the cause is the model, the data, the harness, or the infrastructure, and communicate the answer clearly.

Agentic data

* Source, generate, and curate high-quality agentic training data: trajectories, tool-use traces, and task datasets for new capabilities.
* Design and scale RL environments and reward signals, and measure their impact on model performance.
* Manage technical relationships with external data vendors and domain experts, evaluating data quality and iterating quickly on feedback.
* Develop QA frameworks that catch reward hacking, label noise, and contamination, keeping data and benchmark quality high.

Across both

* Contribute to technical reports, research publications, and open-source benchmarks and tooling.
* Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.

QUALIFICATIONS

Academic qualifications

* BS, MS, or PhD (or equivalent experience) in Computer Science, Machine Learning, or a related field.

Minimum qualifications

* 2+ years of experience with a clear emphasis on evaluations and/or training-data curation for ML systems (related areas: LLM training/fine-tuning, RL, or distributed ML systems).
* Strong Python and PyTorch development experience.
* Demonstrated experience designing and deep-diving into evaluations, or curating and generating training datasets, ideally both.
* Hands-on experience using LLM agents in your personal or professional work.
* A habit of reading through raw data and trajectories to understand them and spot issues, and an instinct to distrust a metric until it's validated.

Preferred qualifications

* Experience with reinforcement learning, reward design, or RL environment construction for LLMs.
* Background in statistics and experimental design: a feel for signal-to-noise, statistical power, and contamination in evaluations.
* Experience with large-scale dataset sourcing, curation, and processing, including working with external vendors or domain experts.
* Strong knowledge of the literature on agent evaluation, RL, LLM reasoning, and tool use.
* Experience building or operating data pipelines and evaluation infrastructure reliable at scale (e.g., PyTorch, Ray).
* Experience evaluating or generating data for software-engineering or computer-use agents.
* Contributions to published research, public benchmarks, and/or open-source ML software.

REPRESENTATIVE PROJECTS

* Stand up a new agentic benchmark from scratch: define the task, build the dataset and scoring, validate against known signals, and ship a view that makes the result legible to researchers and leadership.
* Build an RL environment for a new high-value capability: design the reward, generate and QA the trajectory data, and measure the lift on model performance.
* Diagnose a mid-training regression: an eval suite returns anomalous numbers and you determine whether it's the model, the harness, the data, or the infrastructure.
* Partner with an external data vendor or domain expert to source high-quality trajectories, then build the QA framework that keeps reward hacking and contamination out.
* Take a flaky distributed eval pipeline and make it reliable: better retries, better observability, faster feedback to researchers.

We encourage you to apply even if you don't meet every qualification listed. Strong candidates rarely match every line, and we'd rather hear from you than have you rule yourself out.

Company

MIRRAI CHAT LTD (Company No. 16403306)

71-75 Shelton Street, Covent Garden

London, WC2H 9JQ, UNITED KINGDOM

© 2026 Mirrai Careers. All rights reserved.