Latest
Recchia, G., Mangat, C. S., Li, I., & Krishnakumar, G. (2025). FindTheFlaws: Annotated errors for use in scalable oversight research. arXiv:2503.22989. Link
Work in progress
Recchia, G., Mangat, C., Nyachhyon, J., Sharma, M., Canavan, C., Epstein-Gross, D., & Abdulbari, M. (in prep.). Automation bias: A challenge for scalable oversight. Presents results of two sandwiching-like experiments intended to establish baselines for simple approaches to scalable oversight.
In Bowman et al.'s Measuring Progress on Scalable Oversight for Large Language Models, humans who conversed with language models answered questions more accurately than either language models alone or humans alone. Can a systematic investigation of the transcripts reveal the key factors that differentiated successful from unsuccessful question-answering attempts? (investigation complete; write-up pending)
Contributed to
Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., … & Krueger, D. (2024). Foundational challenges in assuring alignment and safety of large language models. Transactions on Machine Learning Research. Link
Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., … & Verbeken, B. (2025). Humanity's Last Exam. arXiv:2501.14249. Link. Co-author on account of contributing question(s) that were selected for the dataset.
McKenzie, I. R., Lyzhov, A., Pieler, M., Parrish, A., Mueller, A., Prabhu, A., … & Perez, E. (2023). Inverse scaling: When bigger isn't better. Transactions on Machine Learning Research. Link. Co-author on account of submitting a winning task (i.e., identifying a task on which language model performance decreases with scale).