Selected Publications
A jailbreak built for one model often breaks another. We trace that transfer to shared internal representations: models that encode concepts alike inherit each other's vulnerabilities.
We propose a perspective on interpretability grounded in causal mediation analysis.
Even models trained almost entirely on English are fluent in other languages. We trace that fluency to shared, language-agnostic representations of grammar that are reused across languages.
The internals of the largest open models are effectively out of reach for most researchers. NNsight and NDIF open them up through a transparent interface for running interventions at scale.
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Sparse autoencoders decompose activations into candidate features, but there's no ground truth for whether a decomposition is faithful. Board-game models offer one: the true state of the board is known.
Positions
Mar 2026 – Present
OpenAI, Contractor
Multi-agent red-teaming of frontier models in collaboration with OpenAI's safety team.
Jun 2025 – Oct 2025
J.P. Morgan AI Research, Research Intern
Synthetic data generation and post-training methods for mathematical reasoning in language models, resulting in a NeurIPS 2025 workshop paper.
Nov 2024 – Mar 2025
NYU Center for Data Science, Visiting Researcher
Safety and adversarial robustness of language models with Prof. He He, resulting in a paper on the cross-model transferability of jailbreak attacks.
Apr 2024 – Aug 2024
Northeastern University, Visiting Researcher
Structure and interpretation of neural networks with Prof. David Bau, resulting in a journal publication and conference papers at ICLR, NAACL (Oral), and NeurIPS.