Announcing otto-SR

Helping researchers perform end-to-end systematic reviews in hours, not months

By Christian, Jason, and Paul | June 12, 2025

otto-SR is a new AI-powered workflow built to support and automate systematic reviews, the backbone of public health and clinical practice guidelines. It is the first system to exceed human performance in both screening and data-extraction tasks, while running 3000× faster.

TLDR

Automates the most labour-intensive steps of systematic reviews: screening and data extraction.
Built and tested by researchers globally—from Harvard, MIT, and the University of Toronto to Cochrane—and former Stripe engineers.
Achieves state-of-the-art performance in screening and data-extraction benchmarks, directly outperforming human reviewers in sensitivity and accuracy across every review.
Reproduced an entire Cochrane issue, doubling the number of eligible articles and altering key statistical conclusions—what would take researchers ~12 work-years was completed in < 48 hours.
Designed to complement researchers: it follows the traditional SR process and requires only a study protocol, search results, and extraction variables. Humans stay in the loop at every critical step.

Preprint: l.ottosr.com/preprint

Why does this matter?

Systematic reviews (SRs) shape research into facts, which in turn inform public-health policy and support clinical decision-making. But medicine's scientific corpus grows by millions of papers each year while SRs still demand rigour, so they are taking longer (often more than a year) and becoming more expensive to produce (over $100,000)^[1]. These delays have real public-health consequences, prolonging the use of ineffective or harmful treatments initially supported by less-rigorous evidence.

Traditional SR tools merely streamline human workflows. Newer automated solutions promise faster results but sacrifice comprehensiveness and validation. They search only a subset of papers, cut corners, and often lack benchmarking against human comparators. Crucially, none have been tested on real-world tasks where outcomes directly inform clinical and policy decisions.

What we do

otto-SR is an end-to-end platform using large language models (LLMs) to support and automate the SR workflow from the initial search through to analysis. We leverage OpenAI's GPT-4.1 for screening and o3-mini for data extraction—applying each model where it performs best.

In head-to-head benchmarks against graduate-level researchers, otto-SR included more relevant papers (96.2 % sensitivity) than humans (81.7 %) while maintaining comparable specificity (96.9 % vs 98.2 %).
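For readers less familiar with the two screening metrics quoted above, here is a minimal sketch of how sensitivity and specificity are computed from screening decisions. The class name, function names, and counts are hypothetical and chosen for illustration only; they are not the benchmark data or otto-SR's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Screening outcomes compared against a reference standard (hypothetical)."""
    true_positives: int   # relevant papers correctly included
    false_negatives: int  # relevant papers wrongly excluded
    true_negatives: int   # irrelevant papers correctly excluded
    false_positives: int  # irrelevant papers wrongly included


def sensitivity(c: ScreeningCounts) -> float:
    """Share of truly relevant papers that the screener included."""
    return c.true_positives / (c.true_positives + c.false_negatives)


def specificity(c: ScreeningCounts) -> float:
    """Share of truly irrelevant papers that the screener excluded."""
    return c.true_negatives / (c.true_negatives + c.false_positives)


# Illustrative counts only, not figures from the otto-SR benchmark.
example = ScreeningCounts(
    true_positives=90,
    false_negatives=10,
    true_negatives=950,
    false_positives=50,
)

print(f"sensitivity = {sensitivity(example):.1%}")
print(f"specificity = {specificity(example):.1%}")
```

In screening, a missed relevant paper (a false negative) is usually costlier than an extra paper passed on to full-text review, which is why sensitivity is the headline number and specificity only needs to stay comparable.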

In data-extraction benchmarks, otto-SR outperformed humans and other LLM tools across every review, scoring 93.1 % accuracy on average compared to 79.6 % for human reviewers.

To test otto-SR in a real-world setting, we reproduced and updated the April 2024 issue of Cochrane Reviews (n = 12). The original reviews found only 64 eligible studies; otto-SR identified 54 additional studies likely missed by manual screening—a 78 % increase. Including these studies altered statistical conclusions in several reviews: two became statistically significant, while one lost significance.

Figure 4 – Cochrane reproduction results

In short, the gold standard is no longer human.

Our mission

We believe tools like otto-SR mark a new paradigm. When conducting a review eventually takes minutes, every review can be truly "living" and continuously updated with emerging evidence. otto-SR is the first step toward the infrastructure required for this vision—enabling researchers to broaden access to timely and reliable information.

Even with super-human performance, transparency is critical. That's why otto-SR prioritises explainable AI (XAI): every screening and extraction decision is accompanied by source-linked reasoning so researchers can audit and intervene at any stage. We also emphasise research-grade reproducibility—otto-SR is the only system to date that has been rigorously benchmarked and published, including our earlier work in Annals of Internal Medicine and our latest preprint.

Who we are

We're a small team of PhD and medical students from Harvard and the University of Toronto, plus former Stripe engineers (hi Patrick!), building automation for research tasks. Our bar is very high: bad tools risk flawed science and harmful clinical decisions. If you enjoy solving difficult problems, tell us about one you've solved at careers@ottosr.com; we'd love for you to join us!

We're advised by leading figures in medicine and evidence synthesis, including George M. Church, a pioneer in synthetic biology. We also collaborate closely with Isabelle Boutron, Director of Cochrane France.

Try it out!

otto-SR is in a research preview and is free for the first few who sign up: https://ottosr.com/sign-up.

^[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC6722281/