Jannik Brinkmann — Mechanistic Interpretability & AI Safety

I am a Ph.D. student interested in the structure and interpretation of natural and artificial systems. I am advised by Christian Bartelt and affiliated with the Interpretable Neural Networks group, led by David Bau. I currently red-team frontier models at OpenAI, building on our Agents of Chaos study of the risks of multi-agent systems.

Google Scholar · GitHub

Selected Publications

Safety · ICLR 2026

Jailbreak Transferability Emerges From Shared Representations

A jailbreak built for one model often breaks another. We trace that transfer to shared internal representations: models that encode concepts alike inherit each other's vulnerabilities.

Interpretability · Computational Linguistics 2025

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability

We propose a perspective on interpretability grounded in causal mediation analysis.

Interpretability · NAACL 2025 · Oral

Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages

Even models trained almost entirely on English are fluent in other languages. We trace that fluency to shared, language-agnostic representations of grammar that are reused across languages.

Tools · ICLR 2025

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

The internals of the largest open models are effectively out of reach for most researchers. NNsight and NDIF open them up through a transparent interface for running interventions at scale.

Interpretability · NeurIPS 2024

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Sparse autoencoders decompose activations into candidate features, but there's no ground truth for whether a decomposition is faithful. Board-game models offer one: the true state of the board is known.

Positions

Mar 2026 – Present

OpenAI · Contractor

Multi-agent red-teaming of frontier models in collaboration with OpenAI's safety team.

Jun 2025 – Oct 2025

J.P. Morgan AI Research · Research Intern

Synthetic data generation and post-training methods for mathematical reasoning in language models, resulting in a NeurIPS 2025 workshop paper.

Nov 2024 – Mar 2025

NYU Center for Data Science · Visiting Researcher

Safety and adversarial robustness of language models with Prof. He He, resulting in a paper on the cross-model transferability of jailbreak attacks.

Apr 2024 – Aug 2024

Northeastern University · Visiting Researcher

Structure and interpretation of neural networks with Prof. David Bau, resulting in a journal publication and conference papers at ICLR, NAACL (Oral), and NeurIPS.