Announcing otto-SR

Helping researchers perform end-to-end systematic reviews in hours, not months

By Christian, Jason, and Paul | June 12, 2025

otto-SR is a new AI-powered workflow built to support and automate systematic reviews, the backbone of public health and clinical practice guidelines. It is the first system to exceed human performance in both screening and data-extraction tasks, while running 3000× faster.

TLDR

  • Automates the most labour-intensive steps of systematic reviews: screening and data extraction.
  • Built and tested by researchers globally—from Harvard, MIT, and the University of Toronto to Cochrane—and former Stripe engineers.
  • Achieves state-of-the-art performance in screening and data-extraction benchmarks, directly outperforming human reviewers in sensitivity and accuracy across every review.
  • Reproduced an entire Cochrane issue, nearly doubling the number of eligible articles and altering key statistical conclusions. Work that would take researchers ~12 work-years was completed in < 48 hours.
  • Designed to complement researchers: it follows the traditional SR process and requires only a study protocol, search results, and extraction variables. Humans stay in the loop at every critical step.

Preprint: l.ottosr.com/preprint

Why does this matter?

Systematic reviews (SRs) distil research into established facts, which in turn inform public-health policy and support clinical decision-making. But medicine's scientific corpus grows by millions of papers each year while SRs still demand rigour, so they are taking longer (often more than a year) and becoming more expensive to produce (over $100,000)[1]. These delays have real public-health consequences: they prolong the use of ineffective or harmful treatments initially supported by less rigorous evidence.

Traditional SR tools merely streamline human workflows. Newer automated solutions promise faster results but sacrifice comprehensiveness and validation. They search only a subset of papers, cut corners, and often lack benchmarking against human comparators. Crucially, none have been tested on real-world tasks where outcomes directly inform clinical and policy decisions.

What we do

otto-SR is an end-to-end platform using large language models (LLMs) to support and automate the SR workflow from the initial search through to analysis. We leverage OpenAI's GPT-4.1 for screening and o3-mini for data extraction—applying each model where it performs best.

In head-to-head benchmarks against graduate-level researchers, otto-SR included more relevant papers (96.2 % sensitivity) than humans (81.7 %) while maintaining comparable specificity (96.9 % vs 98.2 %).

Figure 2 – screening performance
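
To make the two screening metrics concrete, here is a minimal sketch of how sensitivity and specificity are computed from include/exclude counts. The counts are hypothetical and chosen only for easy arithmetic; they are not otto-SR's benchmark data, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Confusion-matrix counts for include/exclude screening decisions."""
    true_positives: int   # relevant papers correctly included
    false_negatives: int  # relevant papers wrongly excluded
    true_negatives: int   # irrelevant papers correctly excluded
    false_positives: int  # irrelevant papers wrongly included


def sensitivity(c: ScreeningCounts) -> float:
    """Share of truly relevant papers that the screener included."""
    return c.true_positives / (c.true_positives + c.false_negatives)


def specificity(c: ScreeningCounts) -> float:
    """Share of truly irrelevant papers that the screener excluded."""
    return c.true_negatives / (c.true_negatives + c.false_positives)


# Hypothetical counts, not taken from the otto-SR benchmarks.
example = ScreeningCounts(true_positives=90, false_negatives=10,
                          true_negatives=950, false_positives=50)
print(f"sensitivity = {sensitivity(example):.1%}")  # 90 / (90 + 10)
print(f"specificity = {specificity(example):.1%}")  # 950 / (950 + 50)
```

In a systematic review, a missed relevant paper (a false negative) is usually the costlier error, which is why sensitivity is the headline number in the comparison above.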

In data-extraction benchmarks, otto-SR outperformed humans and other LLM tools across every review, scoring 93.1 % accuracy on average compared to 79.6 % for human reviewers.

Figure 3 – data-extraction accuracy

To test otto-SR in a real-world setting, we reproduced and updated the April 2024 issue of Cochrane Reviews (n = 12). The original reviews found only 64 eligible studies; otto-SR identified 54 additional studies likely missed by manual screening—a 78 % increase. Including these studies altered statistical conclusions in several reviews: two became statistically significant, while one lost significance.

Figure 4 – Cochrane reproduction results

In short, the gold standard is no longer human.

Our mission

We believe tools like otto-SR mark a new paradigm. When conducting a review eventually takes minutes, every review can be truly "living" and continuously updated with emerging evidence. otto-SR is the first step toward the infrastructure required for this vision—enabling researchers to broaden access to timely and reliable information.

Even with super-human performance, transparency is critical. That's why otto-SR prioritises explainable AI (XAI): every screening and extraction decision is accompanied by source-linked reasoning so researchers can audit and intervene at any stage. We also emphasise research-grade reproducibility—otto-SR is the only system to date that has been rigorously benchmarked and published, including our earlier work in Annals of Internal Medicine and our latest preprint.
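
As an illustration of what an auditable decision could look like in practice, the sketch below pairs each screening call with its reasoning and quoted source passages, and lets a human override it. This rests on assumptions and is not otto-SR's actual data model; every name in it (ScreeningDecision, SourceQuote, needs_review) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceQuote:
    """A verbatim passage from the paper that supports a decision."""
    section: str  # e.g. "Methods"
    text: str


@dataclass
class ScreeningDecision:
    """One include/exclude call plus the material needed to audit it."""
    record_id: str
    include: bool
    reasoning: str
    evidence: List[SourceQuote] = field(default_factory=list)
    reviewer_override: Optional[bool] = None  # set when a human disagrees

    def final_include(self) -> bool:
        """A human override, when present, takes precedence over the model."""
        if self.reviewer_override is not None:
            return self.reviewer_override
        return self.include


def needs_review(decision: ScreeningDecision) -> bool:
    """Flag decisions with no quoted evidence, since they cannot be audited."""
    return not decision.evidence


# Hypothetical record: the model excludes, a human reviewer overrides.
decision = ScreeningDecision(
    record_id="example-001",
    include=False,
    reasoning="Population is paediatric; the protocol requires adults.",
    evidence=[SourceQuote(section="Methods",
                          text="Participants were aged 6 to 12 years.")],
)
decision.reviewer_override = True
print(decision.final_include(), needs_review(decision))
```

Keeping the model's original call alongside the override, instead of overwriting it, preserves a record of where humans and the model disagreed.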

Who we are

We're a small team of PhD and medical students from Harvard and the University of Toronto, plus former Stripe engineers (hi Patrick!), building automation for research tasks. Our bar is very high: bad tools risk flawed science and harmful clinical decisions. If you enjoy solving difficult problems, tell us about one you've solved at careers@ottosr.com; we'd love for you to join us!

We’re advised by leading figures in medicine and evidence synthesis, including George M. Church, a pioneer in synthetic biology. We also collaborate closely with David Moher, founder of the PRISMA guidelines, and Isabelle Boutron, Director of Cochrane France.

Try it out!

otto-SR is in research preview and is free for the first few users who sign up: https://ottosr.com/sign-up.

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC6722281/