Research

Latest

Recchia, G., Mangat, C. S., Li, I., & Krishnakumar, G. (2025). FindTheFlaws: Annotated errors for use in scalable oversight research. arXiv:2503.22989. Link

Work in progress

Recchia, G., Mangat, C., Nyachhyon, J., Sharma, M., Canavan, C., Epstein-Gross, D., & Abdulbari, M. (in prep.). Automation bias: A challenge for scalable oversight. Presents the results of two sandwiching-like experiments intended to establish baselines for simple approaches to scalable oversight.

In Bowman et al.’s Measuring Progress on Scalable Oversight for Large Language Models, humans who conversed with language models answered questions more accurately than either language models alone or unassisted humans. Can a systematic investigation of the transcripts reveal the key factors that differentiated successful from unsuccessful question-answering attempts? (investigation complete; write-up pending)

Contributed to

Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., … & Krueger, D. (2024). Foundational challenges in assuring alignment and safety of large language models. Transactions on Machine Learning Research. Link

Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., … & Verbeken, B. (2025). Humanity’s Last Exam. Link. Co-author by virtue of contributing question(s) selected for the dataset.

McKenzie, I. R., Lyzhov, A., Pieler, M., Parrish, A., Mueller, A., Prabhu, A., … & Perez, E. (2023). Inverse scaling: When bigger isn’t better. Transactions on Machine Learning Research. Link. Co-author by virtue of submitting a winning task (i.e., a task on which language model performance decreases with scale).