As AI systems grow more powerful, ensuring their decisions remain understandable becomes a critical challenge. We often can't directly oversee their complex processes, making it essential to find ways to monitor their internal reasoning. One promising approach is to scrutinize a model's chain-of-thought: the explicit steps it takes before reaching a conclusion. In reasoning models such as GPT-5 Thinking, this offers a more effective way to detect potential missteps than observing a model's actions or final outputs alone. Yet while this 'chain-of-thought monitorability' seems promising, researchers at OpenAI and elsewhere worry it might be fragile. Changes in training methods, data sources, or even the sheer scale of these models could undermine our ability to monitor their reasoning.
We urgently need to ensure that chain-of-thought monitorability remains robust as AI systems become more sophisticated and are deployed in high-stakes scenarios. We call upon the AI research community to prioritize preserving this monitorability for as long as possible and to investigate its potential as a core control mechanism for future AI systems. To achieve this, we need reliable ways to measure monitorability. However, current evaluation methods are limited and inconsistent.
Monitorability isn't just about the monitor itself; it's also about the agent being monitored. Some agents might have inherently more opaque thought processes, making them harder to understand.
In this work, we take a systematic approach to understanding chain-of-thought monitorability. We introduce a comprehensive framework and a suite of 13 evaluations (spanning 24 different scenarios) designed to rigorously measure how well we can monitor a system's reasoning. These evaluations fall into three categories: intervention, process, and outcome-property, each targeting different aspects of the reasoning process.
By analyzing these evaluations, we aim to shed light on how monitorable current large language models are, and on how that monitorability changes with increased reasoning complexity, larger training datasets, and more advanced pretraining techniques.
Our findings reveal that most advanced reasoning models are fairly, though not perfectly, monitorable. Monitoring their chains-of-thought consistently outperforms monitoring actions and outputs alone. Interestingly, models that are allowed to 'think' for longer tend to be more monitorable. Surprisingly, current reinforcement learning techniques don't seem to significantly erode monitorability.
This raises an open question: is monitorability inherently robust, or are we simply not pushing these models to their limits yet?
We also uncover a fascinating trade-off: smaller models, when given more time to 'think,' can achieve comparable performance to larger models while being easier to monitor. This comes at the cost of increased computational resources, which we term the 'monitorability tax.' Furthermore, asking follow-up questions that delve into the model's reasoning process can significantly enhance monitorability.
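To make the 'monitorability tax' concrete, here is a minimal sketch of how one might quantify it. Everything in it is an assumption for illustration: the `Config` fields, the `monitorability_tax` helper, and all of the numbers are made up and are not results from the work described here. The idea is simply to find the cheapest alternative configuration that matches a baseline's capability while being more monitorable, and report how much more inference compute it needs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Config:
    name: str
    capability: float      # task score, higher is better (hypothetical units)
    monitorability: float  # monitor score in [0, 1] (hypothetical)
    compute: float         # relative inference compute (hypothetical units)


def monitorability_tax(
    baseline: Config, candidates: List[Config]
) -> Optional[Tuple[Config, float]]:
    """Return the cheapest candidate that matches the baseline's capability
    and is more monitorable, plus its compute ratio to the baseline.
    Returns None if no candidate qualifies."""
    eligible = [
        c for c in candidates
        if c.capability >= baseline.capability
        and c.monitorability > baseline.monitorability
    ]
    if not eligible:
        return None
    best = min(eligible, key=lambda c: c.compute)
    return best, best.compute / baseline.compute


# Illustrative, made-up numbers only.
large_short = Config("large, short reasoning", 0.80, 0.70, 1.0)
options = [
    Config("small, short reasoning", 0.65, 0.72, 0.3),
    Config("small, long reasoning", 0.80, 0.85, 1.6),
    Config("small, very long reasoning", 0.82, 0.88, 2.5),
]

choice, ratio = monitorability_tax(large_short, options)
print(f"{choice.name}: {ratio:.1f}x the baseline compute")
```

In this toy setup a ratio above 1 is the tax: the extra compute paid to get the same capability from a configuration that is easier to monitor.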
Our definition of monitorability centers on a monitor's ability to predict specific aspects of an agent's behavior. This includes not only detecting misbehavior like deception or bias but also understanding benign reasoning patterns. Monitorability is a two-way street, depending on both the capabilities of the monitor and the inherent transparency of the agent's thought process.
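As a rough illustration of this definition, a monitor can be scored like any binary predictor: compare its flags against ground-truth labels for the behavior of interest. The sketch below is an assumption, not the scoring used in the work described here; the `Episode` type and `monitor_scores` function are hypothetical names. It reports the true-positive rate, the true-negative rate, and their product, which is low whenever the monitor fails on either class.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Episode:
    has_property: bool  # ground truth: did the agent exhibit the behavior?
    monitor_flag: bool  # monitor's prediction after reading the chain-of-thought


def monitor_scores(episodes: List[Episode]) -> Dict[str, float]:
    """Score a monitor's binary predictions against ground truth."""
    tp = sum(e.has_property and e.monitor_flag for e in episodes)
    fn = sum(e.has_property and not e.monitor_flag for e in episodes)
    tn = sum(not e.has_property and not e.monitor_flag for e in episodes)
    fp = sum(not e.has_property and e.monitor_flag for e in episodes)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative episode")
    tpr = tp / (tp + fn)  # fraction of real occurrences the monitor caught
    tnr = tn / (tn + fp)  # fraction of benign episodes the monitor left alone
    return {"tpr": tpr, "tnr": tnr, "tpr_x_tnr": tpr * tnr}


# Hypothetical labels: 4 episodes with the behavior, 4 without.
episodes = (
    [Episode(True, True)] * 3 + [Episode(True, False)]
    + [Episode(False, False)] * 3 + [Episode(False, True)]
)
scores = monitor_scores(episodes)
print(scores)
```

Because the score depends on both the monitor's predictions and what the agent's reasoning reveals, the same monitor can score differently on different agents, which is the two-sided dependence described above.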
While we focus on chain-of-thought monitorability, it's crucial to remember that it's not a silver bullet. Mechanistic interpretability, which aims to understand the inner workings of a model, remains a complementary and essential tool.
We categorize our evaluations into three distinct types:
Intervention Evaluations: Inspired by recent research, these evaluations introduce controlled changes in the agent's environment and assess whether the monitor can identify the reason for the change by analyzing the chain-of-thought.
Process Evaluations: These focus on tasks with limited solution paths, allowing us to determine if the monitor can accurately track the specific steps the model took.
Outcome-Property Evaluations: These evaluations target measurable properties of the model's output, such as reward hacking, and assess the monitor's ability to predict these properties from the reasoning process.
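The intervention category can be sketched in a few lines. The example below is a simplified toy, not the evaluations described here: it assumes the intervention is a hint pointing at one answer, that each task is run with and without the hint, and that a monitor has already judged from the chain-of-thought whether the hint was used. The `Trial` type and `intervention_detection_rate` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Trial:
    answer_without: str           # agent's answer in the unmodified environment
    answer_with: str              # agent's answer when the hint is present
    hinted_answer: str            # the answer the hint points to
    monitor_says_hint_used: bool  # monitor's verdict from the chain-of-thought


def intervention_detection_rate(trials: List[Trial]) -> Optional[float]:
    """Among trials where the hint changed the answer to the hinted one,
    return the fraction the monitor attributed to the hint.
    Returns None if the hint never changed the agent's behavior."""
    influenced = [
        t for t in trials
        if t.answer_with == t.hinted_answer
        and t.answer_without != t.hinted_answer
    ]
    if not influenced:
        return None
    detected = sum(t.monitor_says_hint_used for t in influenced)
    return detected / len(influenced)


# Hypothetical trials for illustration.
trials = [
    Trial("B", "A", "A", True),   # switched to the hinted answer, monitor caught it
    Trial("C", "A", "A", False),  # switched to the hinted answer, monitor missed it
    Trial("A", "A", "A", False),  # already answered A, so the hint changed nothing
    Trial("B", "B", "A", False),  # ignored the hint
]
rate = intervention_detection_rate(trials)
print(rate)
```

Restricting the score to trials where behavior actually changed is the point of the controlled comparison: it separates cases where the monitor should find evidence of the intervention from cases where there is nothing to find.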
Our taxonomy further refines these evaluations, distinguishing between neutral scenarios (normal behavior), sycophancy (overly pleasing the user), bias (gender, race, etc.), and misalignment (cheating, lying, etc.).
While treating the chain-of-thought as a private workspace for problem-solving should enhance monitorability, reinforcement learning can introduce challenges. Constraints on reasoning time, indirect optimization pressures, and gradient noise can all potentially reduce monitorability. Our analysis of OpenAI's training runs suggests that current reinforcement learning scales don't significantly degrade monitorability, but the possibility of future issues remains.
As AI systems become increasingly powerful, understanding and preserving chain-of-thought monitorability is paramount. We need to develop more robust evaluation methods, explore the interplay between model size, reasoning effort, and monitorability, and investigate how to leverage follow-up questions effectively.
The ability to monitor the reasoning of powerful AI systems is not just a technical challenge; it's a crucial step towards ensuring their safe and responsible development and deployment. By deepening our understanding of chain-of-thought monitorability, we can build AI systems that are not only intelligent but also transparent and trustworthy.