CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision

NeurIPS 2025

Learning to Think: The Chain-of-Thought Advantage

Chain-of-thought (CoT) training has quickly become central to the reasoning breakthroughs of large language models. By supervising models not only on final answers but also on the intermediate reasoning steps, CoT training teaches them how to think, not just what to predict. This shift has enabled LLMs to tackle problems that require multi-step reasoning—tasks where conventional models trained end-to-end often fail.

We can see this difference in action when both models attempt the same problem.

Question. Which number is larger, 9.9 or 9.10?

ChatGPT 5 Instant (left), prompted to respond instantly, guesses the wrong answer. ChatGPT 5 Thinking (right) produces the correct answer after reasoning step-by-step.

Question. A forest has 1000 trees. Each year, loggers cut down half of the trees remaining at the start of the year. In the same year, conservationists plant 100 new trees at the end of the year. After this process repeats for two years, how many trees are in the forest?

ChatGPT 5 Instant (left), prompted to respond instantly, guesses the wrong answer. ChatGPT 5 Thinking (right) produces the correct answer after reasoning step-by-step.

Question. Suppose $x + 2y - z = 4$, $-3x - y + z = 10$, $4x + y - 2z = 15$. Solve for $x, y, z$.

ChatGPT 5 Instant (left), prompted to respond instantly, guesses the wrong answer. ChatGPT 5 Thinking (right) produces the correct answer after reasoning step-by-step.

These vignettes summarize the main observation: CoT supervision dramatically improves reasoning accuracy.

In this work, we develop a learning-theoretic framework to formalize and quantify this statistical advantage. We are motivated by the following key question: What statistical advantage does CoT training confer relative to vanilla input–output supervision?

Part I: The Role of Chain-of-Thought Supervision in Building LLMs

Before diving into theory, it’s useful to recall where chain-of-thought supervision fits in the modern large language model training pipeline.

A hitchhiker’s guide to building an LLM: the three-stage pipeline for training large language models.

Step 1: Pre-training (foundation modeling). The model learns next-token prediction over a web-scale corpus, acquiring broad linguistic and factual knowledge.

Step 2: Supervised fine-tuning. The model is trained on curated prompt–response pairs; this is the stage where chain-of-thought traces typically enter as supervision.

Step 3: Post-training (alignment and preference optimization). Methods such as RLHF and related preference-optimization techniques align the model’s outputs with human preferences.

CoT supervision plays a pivotal role in how LLMs learn to reason, follow instructions, and generalize beyond pattern matching.

CoT supervision primarily lives in Step 2 of this pipeline.

Where Do CoT Traces Come From?

CoT traces can be collected in several ways: from human-written step-by-step solutions in curated datasets1, or from model-generated reasoning chains obtained by sampling, bootstrapping, and verification2.

In what follows, we abstract away the mechanics of obtaining CoT traces or their format (e.g., natural language or otherwise). We assume the CoT dataset exists, and ask: What statistical advantage does CoT training confer relative to vanilla input–output supervision?

Part II: Formalizing Learning with CoT Supervision

A Quick Refresher on Classical Supervised Learning

To provide context and intuition, let us briefly review the classical supervised learning setup.

We aim to learn a target function $h^\star: \mathcal{X} \to \mathcal{Y}$ inside a hypothesis class $\mathcal{H} \subset \mathcal{Y}^{\mathcal{X}}$. We observe labeled examples $(x_i, y_i)$ with $x_i \overset{\text{i.i.d.}}{\sim} \mathcal{D}$ and $y_i = h^\star(x_i)$ (realizable setting). A learner

$$\mathcal{A}: S = \{(x_i, y_i)\}_{i=1}^m \longmapsto \hat{h}$$

seeks small prediction error

$$\mathcal{R}(\hat{h}) := \Pr_{x \sim \mathcal{D}}\big[\hat{h}(x) \neq y\big] \leq \epsilon.$$

Distinguishing two hypotheses $h_1$, $h_2$ requires observing an input $x$ for which they disagree. If $\Pr_{x \sim \mathcal{D}}[h_1(x) \neq h_2(x)] = \epsilon$, then $\Theta(1/\epsilon)$ samples are needed. Extending via a union bound yields the familiar sample complexity $O(\log |\mathcal{H}| / \epsilon)$.

Here, the denominator $\epsilon$ reflects the information per sample. For example, under noise (agnostic setting), where there is less “information” per sample, the rate weakens to $1/\epsilon^2$. Similarly, under low-noise assumptions, it is possible to interpolate between $1/\epsilon$ and $1/\epsilon^2$. We will see that the effect of CoT supervision appears precisely in this denominator, capturing the extra information per sample provided by observing the CoT trace.
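This $1/\epsilon$ intuition is easy to check numerically. Below is a minimal sketch (not from the paper; the function name and setup are our own) that measures, by simulation, how many i.i.d. draws are needed on average before two hypotheses that disagree with probability $\epsilon$ are told apart.

```python
import random

def samples_to_distinguish(eps: float, trials: int = 2000, seed: int = 0) -> float:
    """Average number of i.i.d. draws until we first see an input on which
    two hypotheses that disagree with probability eps actually differ."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n = 1
        while rng.random() >= eps:  # each draw hits a disagreement w.p. eps
            n += 1
        total += n
    return total / trials

# The waiting time is geometric, so the average is close to 1/eps.
for eps in (0.5, 0.1, 0.02):
    print(eps, round(samples_to_distinguish(eps), 1))
```

Halving $\epsilon$ doubles the expected number of samples needed, which is exactly the $\epsilon$ in the denominator of the rate above.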

Why End-to-End Supervision Struggles

For tasks involving long-form or multi-step reasoning—mathematics, logical deduction, or code generation—the target function $h^\star$ is highly complex. Observing only $(x, y)$ pairs reveals little about its internal computation, making learning statistically inefficient. The result is a steep sample complexity curve: vast amounts of data are required to approximate the underlying reasoning process.

Strengthening the Signal: CoT Supervision

CoT training enriches supervision by exposing how a model arrives at its answer. Instead of learning solely from $(x, y)$, the learner also observes a reasoning trace $z$.

Suppose $h_1$, $h_2$ each emit both an answer $y = h^{\mathrm{e2e}}(x)$ and a reasoning trace $z = h^{\mathrm{CoT}}(x)$. The two hypotheses can now be distinguished if they differ in either the final output or the intermediate CoT trace. Let us denote this probability of disagreement as $I(\epsilon)$, thinking of it as a modification of the end-to-end disagreement probability $\epsilon$ above:

$$I(\epsilon) := \Pr_{x \sim \mathcal{D}}\big[h_1^{\mathrm{e2e}}(x) \neq h_2^{\mathrm{e2e}}(x) \ \text{ or } \ h_1^{\mathrm{CoT}}(x) \neq h_2^{\mathrm{CoT}}(x)\big] \geq \epsilon.$$

The additional information in $I(\epsilon)$ makes hypotheses easier to distinguish—reducing the sample requirement from $O(1/\epsilon)$ to $O(1/I(\epsilon))$. Intuitively, a single CoT-labeled example can be worth many ordinary examples in terms of information gained.
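The gap between $\epsilon$ and $I(\epsilon)$ can be made concrete with a toy example. The two lookup tables below are hypothetical hypotheses made up for illustration: they rarely disagree on the final answer, but their traces disagree far more often, so $I(\epsilon)$ exceeds $\epsilon$.

```python
# Toy domain of 10 equally likely inputs. Two hypothetical rules h1, h2
# emit (answer, trace) pairs; the tables below are illustrative, not from
# the paper.
h1 = [(0, "A"), (0, "A"), (1, "B"), (1, "B"), (0, "A"),
      (1, "B"), (0, "A"), (1, "B"), (0, "A"), (1, "B")]
h2 = [(0, "A"), (0, "C"), (1, "B"), (1, "C"), (0, "A"),
      (1, "C"), (0, "A"), (1, "B"), (1, "B"), (1, "B")]

n = len(h1)
eps   = sum(a1 != a2 for (a1, _), (a2, _) in zip(h1, h2)) / n  # answers differ
i_eps = sum(p1 != p2 for p1, p2 in zip(h1, h2)) / n            # answer OR trace differs

print(eps, i_eps)  # the full-pair disagreement i_eps is always >= eps
```

Here the answers disagree on only one input in ten, but the traces expose three further disagreements, quadrupling the per-sample chance of telling the hypotheses apart.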

CoT Hypothesis Classes and CoT-PAC Learning

Formally, a CoT hypothesis class is a family $\mathcal{H} \subset (\mathcal{Y} \times \mathcal{Z})^{\mathcal{X}}$ of functions $h: \mathcal{X} \to \mathcal{Y} \times \mathcal{Z}$. We denote the two components of each hypothesis by $h = (h^{\mathrm{e2e}}, h^{\mathrm{CoT}})$, where $h^{\mathrm{e2e}}: \mathcal{X} \to \mathcal{Y}$ is the end-to-end output map and $h^{\mathrm{CoT}}: \mathcal{X} \to \mathcal{Z}$ is the trace map.

A CoT learner observes a dataset of triples $(x_i, y_i, z_i)$ and outputs a predictor in $\mathcal{Y}^{\mathcal{X}}$. We say the class is PAC-learnable under CoT supervision with sample complexity $m_{\mathcal{H}, \mathcal{D}}(\epsilon, \delta)$ if

$$\Pr_{S \sim \mathcal{D}^m}\Big[\Pr_{(x,y) \sim \mathcal{D}_{x,y}}\big[\mathcal{A}(S)(x) \neq y\big] \leq \epsilon\Big] \geq 1 - \delta \quad \text{whenever } m \geq m_{\mathcal{H}, \mathcal{D}}(\epsilon, \delta).$$

Note that while the learner observes both final outputs and CoT traces for each example, the evaluation metric is the end-to-end error.

Part III: Statistical Guarantees and CoT Information

Central Analytical Challenge: Training–Testing Asymmetry

The central analytical challenge in characterizing CoT-supervised learning lies in the asymmetry between the training and testing objectives: during training, the learner is supervised on the full pair $(y, z)$ and can enforce consistency with both outputs and traces, but at test time it is evaluated only on the end-to-end error of its final answers.

This asymmetry prevents the direct application of classical learning theory results.

One approach to studying CoT-supervised learning, taken by recent related work3, sidesteps this asymmetry by bounding the CoT risk instead of the end-to-end risk, noting that $R^{\mathrm{CoT}}_{\mathcal{D}}(h) \geq R^{\mathrm{e2e}}_{\mathcal{D}}(h)$ for all $h$. This problem is now symmetric because both the training and testing objectives are the CoT error, enabling a direct application of standard learning results. In particular, this approach yields a sample complexity of $O\big(\mathrm{VC}(\mathcal{L}^{\mathrm{CoT}}(\mathcal{H})) / \epsilon\big)$, where $\mathcal{L}^{\mathrm{CoT}}(\mathcal{H}) := \{\ell_h^{\mathrm{CoT}}: (x, y, z) \mapsto \mathbb{1}\{h(x) \neq (y, z)\}\}$.

While this provides an upper bound, the resulting rate, scaling as $1/\epsilon$, matches that of standard end-to-end supervision (in terms of the dependence on $\epsilon$). Yet we know that CoT supervision can provide significant statistical advantages in practice. This suggests that the analysis is loose, failing to capture the full benefit of CoT supervision through the extra information encoded in the traces.

To quantify the true advantage of CoT supervision, we need a measure that explicitly links the two types of risk and captures the information gain per observed trace.

The Chain-of-Thought Information

To quantify the statistical advantage of intermediate supervision, we introduce the Chain-of-Thought Information, capturing the added discriminative power that CoT supervision provides over standard end-to-end learning.

Definition: CoT Information

For a CoT hypothesis class $\mathcal{H}$ and realizable distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$, the CoT information is defined as the function

$$I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H}) := \inf_{h \in \Delta^{\mathrm{e2e}}_{\mathcal{D}}(\epsilon; \mathcal{H})} \Big\{- \log \Pr_{(x,y,z) \sim \mathcal{D}}\big[(h^{\mathrm{CoT}}(x), h^{\mathrm{e2e}}(x)) = (z, y)\big]\Big\},$$

where $\Delta^{\mathrm{e2e}}_{\mathcal{D}}(\epsilon; \mathcal{H}) := \{h \in \mathcal{H} : \Pr_{(x,y)}[h^{\mathrm{e2e}}(x) \neq y] > \epsilon\}$ is the set of hypotheses that disagree with the end-to-end reference function on more than an $\epsilon$ fraction of the inputs.

Conceptually, the CoT information quantifies how much more discriminative power is packed into each annotated example. When the CoT information is large, observing the CoT reveals additional information about the target hypothesis, enabling faster learning.

In the special case where reasoning traces provide no useful signal (for instance, if they carry no discriminative information beyond the final output), the CoT information collapses to the standard $\epsilon$ rate. Conversely, in the extreme case where the trace always uniquely identifies the target function, the CoT information becomes infinite—implying that a single example suffices to recover the target function.
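For a finite class over a small domain, the definition above can be evaluated directly. The sketch below uses a made-up three-hypothesis class and target (none of these names come from the paper) and computes the infimum by brute force.

```python
import math

# Exact computation of the CoT information for a tiny finite class.
# Each hypothesis maps an input in {0,...,5} to an (answer, trace) pair;
# the class and target below are made up for illustration.
X = range(6)  # uniform distribution over six inputs
H = [
    [(0, "s0"), (1, "s1"), (0, "s0"), (1, "s2"), (0, "s1"), (1, "s1")],  # target
    [(0, "s0"), (1, "s1"), (0, "s0"), (1, "s2"), (1, "s1"), (1, "s1")],
    [(0, "s1"), (1, "s2"), (0, "s1"), (1, "s1"), (0, "s0"), (0, "s0")],
]
target = H[0]

def cot_information(eps: float) -> float:
    vals = []
    for h in H:
        e2e_err = sum(h[x][0] != target[x][0] for x in X) / len(X)
        if e2e_err > eps:  # h lies in the set Delta_e2e(eps)
            agree = sum(h[x] == target[x] for x in X) / len(X)
            vals.append(math.inf if agree == 0 else -math.log(agree))
    return min(vals) if vals else math.inf

print(cot_information(0.1))  # -log(5/6): the closest bad hypothesis agrees on 5/6 of pairs
```

Note how the second hypothesis, whose traces disagree with the target everywhere, contributes an infinite term: a single CoT-labeled example rules it out.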

Key Properties of the CoT Information: (i) it is always at least the end-to-end baseline, $I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H}) \geq -\log(1-\epsilon) \geq \epsilon$, so CoT rates are never worse than end-to-end rates; (ii) it is nondecreasing in $\epsilon$, since the infimum in its definition is taken over a shrinking set of hypotheses; and (iii) it can be arbitrarily large, even infinite, when traces sharply identify the target.

We will see that the CoT information characterizes the sample complexity of learning under CoT supervision, improving the standard $1/\epsilon$ rate to a potentially much faster $1/I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})$. This reframes CoT supervision as a statistically stronger form of learning signal: one that not only corrects what the model predicts, but constrains how it reaches the answer.

Main Theorem: CoT Information Governs the Statistical Rate of Learning under CoT Supervision

Geometry of CoT information: consistency sets shrink much faster under CoT supervision.

Result: Sample Complexity Under CoT Supervision

For finite $\mathcal{H}$, the CoT consistency rule

$$\mathrm{CoT\text{-}Cons}(S; \mathcal{H}) := \{h \in \mathcal{H} : h^{\mathrm{e2e}}(x_i) = y_i,\ h^{\mathrm{CoT}}(x_i) = z_i \ \ \forall i\}$$

has sample complexity

$$m(\epsilon, \delta) = \frac{\log |\mathcal{H}| + \log(1/\delta)}{I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})}.$$

That is, whenever $m \geq m(\epsilon, \delta)$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$,

$$\forall h \in \mathrm{CoT\text{-}Cons}(S; \mathcal{H}): \ R^{\mathrm{e2e}}_{\mathcal{D}}(h) \leq \epsilon.$$

The takeaway: the $1/\epsilon$ rate of end-to-end learning improves to the potentially much faster $1/I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})$ rate under CoT supervision. In particular, the ratio $I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H}) / \epsilon$ can be interpreted as the relative value of a single CoT-labeled example compared to an end-to-end labeled example.
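The mechanism behind this result, namely that the consistency set shrinks faster when traces must also match, can be sketched in a few lines. The random class below is illustrative only; the CoT-consistent set is always a subset of the end-to-end-consistent set, so it shrinks at least as fast.

```python
import random

# Sketch: how fast does the consistent set shrink under end-to-end vs. CoT
# supervision? Hypotheses map an input in {0,...,7} to an (answer, trace)
# pair; the class is random and purely illustrative.
rng = random.Random(1)
X = list(range(8))
H = [tuple((rng.randint(0, 1), rng.randint(0, 3)) for _ in X) for _ in range(200)]
h_star = H[0]  # designate one hypothesis as the target

def consistent_count(m: int, use_trace: bool, seed: int) -> int:
    """Size of the set of hypotheses consistent with m random examples."""
    r = random.Random(seed)
    sample = [r.choice(X) for _ in range(m)]
    if use_trace:  # must match both answer and trace (CoT consistency)
        return sum(all(h[x] == h_star[x] for x in sample) for h in H)
    return sum(all(h[x][0] == h_star[x][0] for x in sample) for h in H)

for m in (1, 3, 6):
    print(m, consistent_count(m, False, 42), consistent_count(m, True, 42))
```

The target always survives both filters; the difference is how quickly the competing hypotheses are eliminated.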

For infinite classes, the result extends to

$$m(\epsilon, \delta) = O\!\left(\frac{\mathrm{VC}(\mathcal{L}^{\mathrm{CoT}}(\mathcal{H})) \cdot \log \big(\tfrac{1}{I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})} + 1\big) + \log(1/\delta)}{I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})}\right).$$

Here, the VC dimension of the CoT loss class $\mathcal{L}^{\mathrm{CoT}}(\mathcal{H})$ replaces the log-cardinality term, but the dependence on $I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})$ remains the key driver of the rate.

Preview of the Agnostic Setting

The guarantees above hold in the realizable case, where both outputs and reasoning traces can be represented within the hypothesis class. In practice, however, this assumption may not hold. Chain-of-thought traces are often noisy, incomplete, or stylistically inconsistent with the model’s representational structure, creating a potential mismatch between the supervision signal and the learner’s capacity.

The message becomes more nuanced in the agnostic setting: in general, CoT supervision can even be detrimental. Consider the following pathological example, where the distribution $\mathcal{D}$ has realizable outputs but unrealizable CoTs: $\inf_{h \in \mathcal{H}} R^{\mathrm{e2e}}_{\mathcal{D}}(h) = 0$ yet $\inf_{h \in \mathcal{H}} R^{\mathrm{CoT}}_{\mathcal{D}}(h) = 1$. In this case, the CoT consistency rule yields $\mathrm{CoT\text{-}ERM}(S; \mathcal{H}) = \mathcal{H}$ (no guarantees), whereas standard end-to-end learning still succeeds with $O\big(\mathrm{VC}(\mathcal{L}^{\mathrm{e2e}}(\mathcal{H})) / \epsilon\big)$ samples.

To extend our analysis to the agnostic setting, we introduce an agnostic variant of the CoT information that links the excess CoT risk to the excess end-to-end risk. This agnostic CoT information measures the alignment between observed traces and the hypothesis class, capturing how well the CoT supervision can guide learning even when perfect trace reconstruction is impossible. We refer the reader to the full paper for details.

Part IV: Information-Theoretic Lower Bounds

The upper bounds established above show that CoT information governs the sample complexity of learning under CoT supervision. But are these rates tight? Do they capture the right complexity measure?

In the following, we establish information-theoretic lower bounds that validate CoT information as the fundamental quantity governing learning with CoT supervision. Such results show that a certain number of samples, scaling inversely with the CoT information, are necessary for any learning algorithm to succeed.

Lower Bound via Two-Point Method

Result: Lower Bound via Two-Point Method

Fix a distribution $\mathcal{D}$ over $\mathcal{X}$, and draw $m$ samples $x_1, \ldots, x_m \overset{\text{i.i.d.}}{\sim} \mathcal{D}$. If

$$m < \log(1/\delta) \,/\, I^{\mathrm{CoT}}_{\mathcal{D}, h^\star}(\epsilon; \mathcal{H}),$$

then with probability $\geq \delta$ there exists an $h$ with end-to-end error $\geq \epsilon$ which is indistinguishable from $h^\star$ on the sample. Moreover, for any learning algorithm $\mathcal{A}$, we have

$$\sup_{h^\star \in \mathcal{H}} \mathbb{E}_{S \sim P_{h^\star}^{\otimes m}}\big[R^{\mathrm{e2e}}_{\mathcal{D}}(\mathcal{A}(S))\big] \geq \tfrac{1}{2} \sup_{h^\star, \epsilon} \epsilon \cdot \exp\big(-m \cdot I^{\mathrm{CoT}}_{\mathcal{D}, h^\star}(\epsilon; \mathcal{H})\big).$$

This lower bound exhibits the expected dependence on the error parameter ε\epsilon through the CoT information, confirming its role as the governing quantity for learning with CoT supervision.

However, it does not scale with the size of the hypothesis class. This is a limitation of the two-point method, which considers only a pair of hypotheses. To capture class size, we turn to a more sophisticated packing-based argument employing information-theoretic tools.

Lower Bound via Fano’s Method

Fano’s method generalizes the two-point argument by considering a packing of many hypotheses that are pairwise well-separated with respect to end-to-end error.

Result: Lower Bound via Fano's Method

Fix a distribution $\mathcal{D}$ over $\mathcal{X}$, and draw $m$ samples $x_1, \ldots, x_m \overset{\text{i.i.d.}}{\sim} \mathcal{D}$. For any algorithm $\mathcal{A}$, if the number of samples satisfies

$$m \le \frac{\log M(\mathcal{H}; d_{\mathcal{D}}^{\mathrm{e2e}}, \epsilon)}{2 \big(C_Q \cdot \sup_{\pi} \mathbb{E}_{h_1, h_2 \sim \pi}\big[I^{\mathrm{CoT}}_{\mathcal{D}}(h_1, h_2)\big] + \log 2\big)},$$

then the end-to-end error must be large for some $h^\star$:

$$\sup_{h^\star \in \mathcal{H}} \Pr\big[R^{\mathrm{e2e}}_{\mathcal{D}, h^\star}(\mathcal{A}(S)) \ge \epsilon / 2\big] \ge 1/2.$$

Here, $M(\mathcal{H}; d_{\mathcal{D}}^{\mathrm{e2e}}, \epsilon)$ denotes the packing number of the hypothesis class under the end-to-end metric, and $I^{\mathrm{CoT}}_{\mathcal{D}}(h_1, h_2)$ is the pairwise CoT information, quantifying how distinguishable two hypotheses are when reasoning traces are observed.

This result scales appropriately with both error and class complexity, confirming that the CoT information is not merely an artifact of the upper bound but a tight characterization of the sample efficiency achievable under CoT supervision.

Part V: Simulations — Theory Meets Practice

The theoretical framework suggests that the CoT information $I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})$ should govern how much faster learning occurs under CoT supervision. To test this prediction, we perform controlled simulations where all quantities can be computed exactly, comparing empirical sample complexity under standard end-to-end supervision versus CoT-augmented training.

Example 1: Linear Autoregressive Model

Our first experiment studies a simplified autoregressive generation process over a binary alphabet $\Sigma = \{0, 1\}$. Each model has window size $d = 8$, produces $T = 16$ autoregressive steps, and uses weights $w \in \{-1, 0, +1\}^d$. The CoT trace corresponds to the full sequence of intermediate tokens, while the final output $y$ is the last generated token.
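A minimal sketch of this hypothesis class is below. The exact update rule is our assumption (we take the next bit to be 1 iff the inner product of the weights with the current window is positive); the paper’s construction may differ in such details.

```python
import random

# Sketch of the linear autoregressive CoT class (window D = 8, T = 16 steps).
# Update rule assumed for illustration: next bit = 1 iff the inner product
# of the weights with the current window is positive.
D, T = 8, 16

def generate(w, x):
    """w: weights in {-1,0,+1}^D; x: initial window of D bits.
    Returns (final answer y, CoT trace z of the T generated tokens)."""
    seq = list(x)
    for _ in range(T):
        window = seq[-D:]
        seq.append(1 if sum(wi * si for wi, si in zip(w, window)) > 0 else 0)
    z = tuple(seq[D:])  # the T intermediate tokens form the trace
    return z[-1], z

rng = random.Random(0)
w_star = tuple(rng.choice((-1, 0, 1)) for _ in range(D))
x = tuple(rng.randint(0, 1) for _ in range(D))
y, z = generate(w_star, x)
print(y, z)
```

An end-to-end label reveals only the last bit of the trajectory, whereas the CoT label reveals all $T$ intermediate bits, which is exactly the extra per-sample information the theory quantifies.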

Relative CoT information between pairs of hypotheses versus their end-to-end disagreement: observing the CoT makes distinguishing hypotheses statistically easier.

The CoT information function $I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})$; the dashed line is $y = \epsilon$. The CoT information always dominates the $\epsilon$ baseline.

Empirical sample complexity for end-to-end versus CoT learning.

Probability that each learning rule achieves zero error as the sample size grows.

The empirical data closely match the theoretical prediction: $\lim_{\epsilon \to 0^+} I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H}) / \epsilon \approx 6$, corresponding to a roughly 6× theoretical gain. Measured sample complexity shows an actual ≈5× empirical improvement, validating that CoT information accurately predicts the observed acceleration in learning.

Example 2: Learning Regular Languages

Our second experiment examines multi-step reasoning via regular language recognition. Here, $h^{\mathrm{e2e}}(x)$ indicates whether a string $x$ belongs to a regular language $\mathcal{L}$, while $h^{\mathrm{CoT}}(x)$ records the sequence of states traversed by the corresponding deterministic finite automaton during recognition. The CoT thus captures the latent computational process underlying membership decisions.
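The setup can be sketched as follows; the 4-state transition table below is an illustrative automaton we made up, not the one used in the paper.

```python
# Sketch: a 4-state DFA over {0,1}. The end-to-end label is acceptance and
# the CoT trace is the visited state sequence; the transition table is an
# illustrative automaton, not the one from the paper.
DELTA = {  # (state, symbol) -> next state
    (0, "0"): 1, (0, "1"): 0,
    (1, "0"): 2, (1, "1"): 0,
    (2, "0"): 3, (2, "1"): 1,
    (3, "0"): 3, (3, "1"): 3,
}
ACCEPT = {3}

def run(x: str):
    """Return (h_e2e(x), h_CoT(x)): membership bit and state trajectory."""
    state, trace = 0, [0]
    for sym in x:
        state = DELTA[(state, sym)]
        trace.append(state)
    return state in ACCEPT, tuple(trace)

print(run("0001"))  # the trace exposes every intermediate step of the computation
```

A single accept/reject bit reveals almost nothing about the transition table, but one state trajectory pins down every transition along the visited path, which is why CoT supervision is so much more informative in this setting.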

DFA whose state trajectory forms the chain-of-thought trace.

How CoT information scales with $\epsilon$ as trace detail varies.

Empirical sample complexity for end-to-end versus CoT learning.

Probability that each learning rule achieves zero error as the sample size grows.

In this setting, the theoretical $I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H}) / \epsilon$ curve approaches $\approx 600$ as $\epsilon \to 0^+$, forecasting an improvement on the order of 600× for learning $h^\star$ to very small error. Empirical results confirm dramatic gains: between $10^2$ and $10^3$ times fewer samples are needed under CoT supervision to reach equivalent accuracy. This demonstrates that CoT information can yield orders-of-magnitude improvements in sample efficiency, depending on how much structure the reasoning trace exposes about the underlying computation.

Estimating the CoT Information from Data for Complex Infinite Classes

The above examples considered known distributions $\mathcal{D}$ and finite hypothesis classes $\mathcal{H}$ where the CoT information could be computed exactly. Here, we briefly preview how CoT information can be estimated from random samples, which is particularly relevant for infinite hypothesis classes. The quantity $I^{\mathrm{CoT}}_{\mathcal{D}}(\epsilon; \mathcal{H})$ can be estimated consistently and uniformly over $\epsilon \in (0,1)$, enabling its application to neural networks and other expressive model families. This opens new possibilities: quantifying the value of CoT supervision in large-scale architectures, comparing different methods of generating reasoning traces, and analyzing how CoT information relates to out-of-distribution generalization. In essence, it provides a concrete, measurable way to answer a fundamental question in modern LLM training: how much information is encoded in a chain of thought?
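As a rough illustration of what such an estimator might look like, the sketch below replaces the population probabilities in the definition with empirical frequencies over a sample. The synthetic domain, class, and all names are ours, and the actual estimator studied in the paper may differ.

```python
import math
import random

# Sketch of a plug-in estimator of the CoT information: replace population
# probabilities with empirical frequencies over an i.i.d. sample. The
# synthetic domain, class, and target below are illustrative only.
rng = random.Random(3)
domain = list(range(50))

def rand_h():
    return {x: (rng.randint(0, 1), rng.randint(0, 4)) for x in domain}

target = rand_h()
candidates = [rand_h() for _ in range(30)]  # stand-ins for a large class

def estimate_cot_information(sample, eps):
    vals = []
    for h in candidates:
        e2e_err = sum(h[x][0] != y for x, y, _ in sample) / len(sample)
        if e2e_err > eps:  # empirical version of the Delta_e2e(eps) set
            agree = sum(h[x] == (y, z) for x, y, z in sample) / len(sample)
            vals.append(math.inf if agree == 0 else -math.log(agree))
    return min(vals) if vals else math.inf

sample = [(x, *target[x]) for x in rng.choices(domain, k=500)]
print(estimate_cot_information(sample, eps=0.05))
```

For expressive classes one would of course sample or search over hypotheses rather than enumerate them, but the plug-in structure stays the same.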

Conclusion

This work develops a learning theory framework for understanding the statistical advantage of chain-of-thought supervision. By introducing the notion of CoT information, we provide an information-theoretic measure that quantifies how observing reasoning traces accelerates learning. The main results establish matching upper and lower bounds, showing that CoT information precisely governs the achievable rates of sample complexity under CoT supervision. Empirical simulations confirm these predictions, demonstrating substantial gains—sometimes by orders of magnitude—when reasoning traces align with the underlying computation.

Several open directions remain. Theoretically, extending the framework to the agnostic and noisy settings raises questions about robustness and alignment between traces and hypotheses. Empirically, estimating CoT information in neural architectures, comparing different trace-generation pipelines, and exploring its role in out-of-distribution generalization and reinforcement learning for reasoning represent promising next steps. Together, these directions aim toward a unified understanding of how CoT supervision reshapes the statistical foundations of reasoning in large models.

Citation

@inproceedings{altabaa2025cotinformation,
title={CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision},
author={Awni Altabaa and Omar Montasser and John Lafferty},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=OkVQJZWGfn}
}

Footnotes

  1. Cobbe et al. (GSM8K)
  2. Wang et al. (self-consistency); Zelikman et al. (STaR); Lightman et al. (Let’s Verify Step by Step).

  3. Joshi et al. (2025)