Research Article (preprint): “DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research”
The preprint linked below was recently posted on arXiv.
Title
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
Authors
João Coelho
Carnegie Mellon University
IST and INESC-ID
Jingjie Ning
Carnegie Mellon University
Jingyuan He
Carnegie Mellon University
Kangrui Mao
Carnegie Mellon University
Abhijay Paladugu
Carnegie Mellon University
Pranav Setlur
Carnegie Mellon University
Jiahe Jin
Carnegie Mellon University
Jamie Callan
Carnegie Mellon University
João Magalhães
NOVA LINCS
Bruno Martins
IST and INESC-ID
Chenyan Xiong
Carnegie Mellon University
Source
via arXiv
DOI: 10.48550/arXiv.2505.19253
Abstract
Deep research systems represent an emerging class of agentic information retrieval methods that generate comprehensive and well-supported reports to complex queries. However, most existing frameworks rely on dynamic commercial search APIs, which pose reproducibility and transparency challenges in addition to their cost. To address these limitations, we introduce DeepResearchGym, an open-source sandbox that combines a reproducible search API with a rigorous evaluation protocol for benchmarking deep research systems. The API indexes large-scale public web corpora, namely ClueWeb22 and FineWeb, using a state-of-the-art dense retriever and approximate nearest neighbor search via DiskANN. It achieves lower latency than popular commercial APIs while ensuring stable document rankings across runs, and is freely available for research use. To evaluate deep research systems’ outputs, we extend the Researchy Questions benchmark with automatic metrics through LLM-as-a-judge assessments to measure alignment with users’ information needs, retrieval faithfulness, and report quality. Experimental results show that systems integrated with DeepResearchGym achieve performance comparable to those using commercial APIs, with performance rankings remaining consistent across evaluation metrics. A human evaluation study further confirms that our automatic protocol aligns with human preferences, validating the framework’s ability to help support controlled assessment of deep research systems. Our code and API documentation are available at this https URL.
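For readers unfamiliar with dense retrieval, the idea of a search API that returns stable rankings can be illustrated with a toy sketch. This is not the DeepResearchGym API or its DiskANN index: the function names, the three-dimensional "embeddings", and the document ids below are all made up for illustration, and the search is exact brute-force cosine similarity rather than approximate nearest neighbor search. It only shows why a fixed index with a deterministic tie-break yields the same ranking on every run.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, k=2):
    """Return the k document ids most similar to query_vec.

    Ties are broken by document id, so the ranking is fully deterministic.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy index: a real system would hold embeddings produced by a
# trained dense retriever over a fixed corpus snapshot.
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]

first = search(query, index)
second = search(query, index)
print(first, first == second)
```

Because the corpus and the scoring are fixed, repeating the query returns the same ordered list, which is the property a live commercial search API cannot guarantee.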
Direct to Abstract Page + Link to Full Text Article
About Gary Price
Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.

