publications
publications in reverse chronological order
2024
- Disentangling and Integrating Relational and Sensory Information in Transformer Architectures. Awni Altabaa and John Lafferty. Under Review, 2024.
Relational reasoning is a central component of generally intelligent systems, enabling robust and data-efficient inductive generalization. Recent empirical evidence shows that many existing neural architectures, including Transformers, struggle with tasks requiring relational reasoning. In this work, we distinguish between two types of information: sensory information about the properties of individual objects, and relational information about the relationships between objects. While neural attention provides a powerful mechanism for controlling the flow of sensory information between objects, the Transformer lacks an explicit computational mechanism for routing and processing relational information. To address this limitation, we propose an architectural extension of the Transformer framework that we call the Dual Attention Transformer (DAT), featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information. We empirically evaluate DAT on a diverse set of tasks ranging from synthetic relational benchmarks to complex real-world tasks such as language modeling and visual processing. Our results demonstrate that integrating explicit relational computational mechanisms into the Transformer architecture leads to significant performance gains in terms of data efficiency and parameter efficiency.
@article{altabaa2024disentangling,
  title         = {Disentangling and Integrating Relational and Sensory Information in Transformer Architectures},
  author        = {Altabaa, Awni and Lafferty, John},
  journal       = {Under Review},
  year          = {2024},
  eprint        = {2405.16727},
  archiveprefix = {arXiv},
  primaryclass  = {cs.LG},
}
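To make the sensory/relational distinction concrete, here is a minimal, hypothetical PyTorch sketch of what a dual attention block could look like: a standard (sensory) head routes the features of other objects, while a relational head aggregates pairwise relation vectors formed from inner products of learned query/key maps. All names, dimensions, the single-head formulation, and the reuse of the sensory attention weights for the relational head are simplifying assumptions made here for brevity, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualAttentionSketch(nn.Module):
    """Illustrative single-head sketch: sensory attention plus relational attention."""

    def __init__(self, d_model: int, d_relation: int):
        super().__init__()
        # sensory attention: standard query/key/value maps over object features
        self.q_s = nn.Linear(d_model, d_model)
        self.k_s = nn.Linear(d_model, d_model)
        self.v_s = nn.Linear(d_model, d_model)
        # relational attention: separate query/key maps whose inner products define
        # a d_relation-dimensional relation vector between each pair of objects
        self.q_r = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(d_relation)])
        self.k_r = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(d_relation)])
        self.proj = nn.Linear(d_model + d_relation, d_model)

    def forward(self, x):  # x: (batch, n_objects, d_model)
        scores = self.q_s(x) @ self.k_s(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)           # (batch, n, n)
        sensory = attn @ self.v_s(x)               # routes other objects' features
        # pairwise relations: r[b, i, j, l] = <q_r^l(x_i), k_r^l(x_j)>
        relations = torch.stack(
            [q(x) @ k(x).transpose(-2, -1) for q, k in zip(self.q_r, self.k_r)], dim=-1
        )                                          # (batch, n, n, d_relation)
        relational = torch.einsum('bij,bijl->bil', attn, relations)  # routes relations
        return self.proj(torch.cat([sensory, relational], dim=-1))
```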
- On the Role of Information Structure in Reinforcement Learning for Partially-Observable Sequential Teams and Games. Awni Altabaa and Zhuoran Yang. Neural Information Processing Systems (NeurIPS), 2024.
In sequential decision-making problems, the information structure describes the causal dependencies between system variables, encompassing the dynamics of the environment and the agents’ actions. Classical models of reinforcement learning (e.g., MDPs, POMDPs) assume a restricted and highly regular information structure, while more general models like predictive state representations do not explicitly model the information structure. By contrast, real-world sequential decision-making problems typically involve a complex and time-varying interdependence of system variables, requiring a rich and flexible representation of information structure. In this paper, we formalize a novel reinforcement learning model which explicitly represents the information structure. We then use this model to carry out an information-structural analysis of the statistical complexity of general sequential decision-making problems, obtaining a characterization via a graph-theoretic quantity of the DAG representation of the information structure. We prove an upper bound on the sample complexity of learning a general sequential decision-making problem in terms of its information structure by exhibiting an algorithm achieving the upper bound. This recovers known tractability results and gives a novel perspective on reinforcement learning in general sequential decision-making problems, providing a systematic way of identifying new tractable classes of problems.
@article{altabaaRoleInformationStructure2024,
  title         = {On the Role of Information Structure in Reinforcement Learning for Partially-Observable Sequential Teams and Games},
  author        = {Altabaa, Awni and Yang, Zhuoran},
  journal       = {Neural Information Processing Systems (NeurIPS)},
  year          = {2024},
  number        = {arXiv:2403.00993},
  eprint        = {2403.00993},
  primaryclass  = {cs, stat},
  publisher     = {arXiv},
  doi           = {10.48550/arXiv.2403.00993},
  urldate       = {2024-03-14},
  archiveprefix = {arXiv},
}
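As a purely illustrative example of the kind of object being modeled, the information structure of a simple two-step partially observable problem can be written down as a DAG over system variables. The variable names and the networkx encoding below are assumptions for illustration; the paper's formal model and its graph-theoretic complexity measure are not reproduced here.

```python
import networkx as nx

G = nx.DiGraph()
# edges point from a variable to the variables that causally depend on it
G.add_edges_from([
    ("s1", "o1"), ("s1", "s2"),   # the state generates an observation and the next state
    ("o1", "a1"),                 # the agent's action depends only on its observation
    ("a1", "s2"),                 # the action influences the next state
    ("s2", "o2"), ("o2", "a2"),
    ("s2", "r"), ("a2", "r"),     # the reward depends on the final state and action
])

assert nx.is_directed_acyclic_graph(G)
# the information available to each decision is read directly off the DAG,
# e.g., the first action depends on the latent state only through its observation:
print(sorted(G.predecessors("a1")))   # ['o1']
```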
- Approximation of Relation Functions and Attention Mechanisms. Awni Altabaa and John Lafferty. Under Review, 2024.
Inner products of neural network feature maps arise in a wide variety of machine learning frameworks as a method of modeling relations between inputs. This work studies the approximation properties of inner products of neural networks. It is shown that the inner product of a multi-layer perceptron with itself is a universal approximator for symmetric positive-definite relation functions. In the case of asymmetric relation functions, it is shown that the inner product of two different multi-layer perceptrons is a universal approximator. In both cases, a bound is obtained on the number of neurons required to achieve a given accuracy of approximation. In the symmetric case, the function class can be identified with kernels of reproducing kernel Hilbert spaces, whereas in the asymmetric case the function class can be identified with kernels of reproducing kernel Banach spaces. Finally, these approximation results are applied to analyzing the attention mechanism underlying Transformers, showing that any retrieval mechanism defined by an abstract preorder can be approximated by attention through its inner product relations. This result uses the Debreu representation theorem in economics to represent preference relations in terms of utility functions.
@article{altabaaApproximationRelationFunctions2024,
  title         = {Approximation of Relation Functions and Attention Mechanisms},
  author        = {Altabaa, Awni and Lafferty, John},
  journal       = {Under Review},
  year          = {2024},
  number        = {arXiv:2402.08856},
  eprint        = {2402.08856},
  primaryclass  = {cs, stat},
  publisher     = {arXiv},
  doi           = {10.48550/arXiv.2402.08856},
  urldate       = {2024-03-14},
  archiveprefix = {arXiv},
}
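The central object of study can be sketched in a few lines: a relation function modeled as the inner product of neural feature maps. Using one MLP for both arguments yields a symmetric positive-definite relation, while two different MLPs allow asymmetric relations. The network sizes below are arbitrary and purely illustrative.

```python
import torch
import torch.nn as nn


def mlp(d_in: int, d_hidden: int, d_out: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))


d, d_feat = 8, 32
phi = mlp(d, 64, d_feat)   # feature map for the first argument
psi = mlp(d, 64, d_feat)   # feature map for the second argument

x, y = torch.randn(5, d), torch.randn(5, d)

symmetric_relation = (phi(x) * phi(y)).sum(-1)    # <phi(x), phi(y)>: symmetric in x, y
asymmetric_relation = (phi(x) * psi(y)).sum(-1)   # <phi(x), psi(y)>: generally asymmetric
```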
- Learning Hierarchical Relational Representations through Relational Convolutions. Awni Altabaa and John Lafferty. Transactions on Machine Learning Research (TMLR), 2024.
An evolving area of research in deep learning is the study of architectures and inductive biases that support the learning of relational feature representations. In this paper, we address the challenge of learning representations of hierarchical relations, that is, higher-order relational patterns among groups of objects. We introduce "relational convolutional networks", a neural architecture equipped with computational mechanisms that capture progressively more complex relational features through the composition of simple modules. A key component of this framework is a novel operation that captures relational patterns in groups of objects by convolving graphlet filters (learnable templates of relational patterns) against subsets of the input. Composing relational convolutions gives rise to a deep architecture that learns representations of higher-order, hierarchical relations. We present the motivation and details of the architecture, together with a set of experiments to demonstrate how relational convolutional networks can provide an effective framework for modeling relational tasks that have hierarchical structure.
@article{altabaaRelationalConvolutionalNetworks2023,
  title         = {Learning Hierarchical Relational Representations through Relational Convolutions},
  shorttitle    = {Relational Convolutional Networks},
  author        = {Altabaa, Awni and Lafferty, John},
  journal       = {Transactions on Machine Learning Research (TMLR)},
  year          = {2024},
  publication   = {https://openreview.net/forum?id=vNZlnznmV2},
  eprint        = {2310.03240},
  archiveprefix = {arXiv},
  primaryclass  = {cs.LG},
}
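A rough, hypothetical sketch of the relational convolution operation described above: given a pairwise relation tensor, learnable graphlet filters (templates of relational patterns over small groups of objects) are matched against the relational pattern within each group. The contiguous-window grouping and all dimensions are simplifying assumptions made here for brevity; the paper defines the operation over general subsets of the input.

```python
import torch
import torch.nn as nn


class RelationalConvolutionSketch(nn.Module):
    """Illustrative sketch: match graphlet filters against groups of objects."""

    def __init__(self, group_size: int, d_rel: int, n_filters: int):
        super().__init__()
        # each filter is a (group_size, group_size, d_rel) template of a relational pattern
        self.filters = nn.Parameter(torch.randn(n_filters, group_size, group_size, d_rel))
        self.group_size = group_size

    def forward(self, R):  # R: (n_objects, n_objects, d_rel) pairwise relation tensor
        n, s = R.shape[0], self.group_size
        outputs = []
        for i in range(n - s + 1):           # simplification: contiguous groups of s objects
            patch = R[i:i + s, i:i + s, :]   # relational pattern within this group
            outputs.append(torch.einsum('xyd,fxyd->f', patch, self.filters))
        return torch.stack(outputs)          # (n_groups, n_filters): one feature vector per group
```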
- The Relational Bottleneck as an Inductive Bias for Efficient Abstraction. Taylor W. Webb, Steven M. Frankland, Awni Altabaa, Kamesh Krishnamurthy, and 5 more authors. Trends in Cognitive Sciences (TICS), 2024.
A central challenge for cognitive science is to explain how abstract concepts are acquired from limited experience. This effort has often been framed in terms of a dichotomy between empiricist and nativist approaches, most recently embodied by debates concerning deep neural networks and symbolic cognitive models. Here, we highlight a recently emerging line of work that suggests a novel reconciliation of these approaches, by exploiting an inductive bias that we term the relational bottleneck. We review a family of models that employ this approach to induce abstractions in a data-efficient manner, emphasizing their potential as candidate models for the acquisition of abstract concepts in the human mind and brain.
@article{webbRelationalBottleneckInductive2023,
  title       = {The Relational Bottleneck as an Inductive Bias for Efficient Abstraction},
  author      = {Webb, Taylor W. and Frankland, Steven M. and Altabaa, Awni and Krishnamurthy, Kamesh and Campbell, Declan and Russin, Jacob and O'Reilly, Randall and Lafferty, John and Cohen, Jonathan D.},
  journal     = {Trends in Cognitive Sciences (TICS)},
  year        = {2024},
  publication = {https://www.cell.com/trends/cognitive-sciences/fulltext/S1364-6613(24)00080-9},
}
- Abstractors and Relational Cross-Attention: An Inductive Bias for Explicit Relational Reasoning in Transformers. Awni Altabaa, Taylor Webb, Jonathan Cohen, and John Lafferty. International Conference on Learning Representations (ICLR), 2024.
An extension of Transformers is proposed that enables explicit relational reasoning through a novel module called the Abstractor. At the core of the Abstractor is a variant of attention called relational cross-attention. The approach is motivated by an architectural inductive bias for relational learning that disentangles relational information from extraneous features about individual objects. This enables explicit relational reasoning, supporting abstraction and generalization from limited data. The Abstractor is first evaluated on simple discriminative relational tasks and compared to existing relational architectures. Next, the Abstractor is evaluated on purely relational sequence-to-sequence tasks, where dramatic improvements are seen in sample efficiency compared to standard Transformers. Finally, Abstractors are evaluated on a collection of tasks based on mathematical problem solving, where modest but consistent improvements in performance and sample efficiency are observed.
@article{altabaaAbstractorsRelationalCrossattention2023,
  title         = {Abstractors and Relational Cross-Attention: An Inductive Bias for Explicit Relational Reasoning in Transformers},
  shorttitle    = {Abstractors and Relational Cross-Attention},
  author        = {Altabaa, Awni and Webb, Taylor and Cohen, Jonathan and Lafferty, John},
  journal       = {International Conference on Learning Representations (ICLR)},
  year          = {2024},
  publication   = {https://openreview.net/forum?id=XNa6r6ZjoB},
  number        = {arXiv:2304.00195},
  eprint        = {2304.00195},
  primaryclass  = {cs, stat},
  publisher     = {arXiv},
  urldate       = {2023-10-30},
  archiveprefix = {arXiv},
}
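A minimal sketch of how attention can be made to carry purely relational information, in the spirit of the relational cross-attention described above. One way to realize the disentangling (used here purely for illustration, with a single head and softmax normalization as simplifying assumptions) is to compute attention scores between the input objects while letting the values be input-independent learned symbols, so that the output does not expose the features of individual objects.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalCrossAttentionSketch(nn.Module):
    """Illustrative sketch: attention over objects with learned symbols as values."""

    def __init__(self, d_model: int, max_objects: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        # learned, input-independent symbols play the role of the values
        self.symbols = nn.Parameter(torch.randn(max_objects, d_model))

    def forward(self, x):  # x: (batch, n_objects, d_model)
        n, d = x.shape[1], x.shape[2]
        relation = self.q(x) @ self.k(x).transpose(-2, -1) / d ** 0.5   # (batch, n, n)
        attn = F.softmax(relation, dim=-1)
        return attn @ self.symbols[:n]   # output depends on x only through the relations
```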
2023
- Decentralized Multi-Agent Reinforcement Learning for Continuous-Space Stochastic Games. Awni Altabaa, Bora Yongacoglu, and Serdar Yüksel. 2023 IEEE American Control Conference (ACC), Mar 2023.
Stochastic games are a popular framework for studying multi-agent reinforcement learning (MARL). Recent advances in MARL have focused primarily on games with finitely many states. In this work, we study multi-agent learning in stochastic games with general state spaces and an information structure in which agents do not observe each other’s actions. In this context, we propose a decentralized MARL algorithm and we prove the near-optimality of its policy updates. Furthermore, we study the global policy-updating dynamics for a general class of best-reply based algorithms and derive a closed-form characterization of convergence probabilities over the joint policy space.
@article{altabaaDecentralizedMultiAgentReinforcement2023,
  title       = {Decentralized Multi-Agent Reinforcement Learning for Continuous-Space Stochastic Games},
  author      = {Altabaa, Awni and Yongacoglu, Bora and Y{\"u}ksel, Serdar},
  journal     = {2023 IEEE American Control Conference (ACC)},
  year        = {2023},
  month       = mar,
  publication = {https://ieeexplore.ieee.org/abstract/document/10155828},
}
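The global policy-updating dynamics mentioned in the abstract can be caricatured by the following toy sketch of best-reply-based updating with inertia. The `best_reply` oracle, the list-of-policies representation, and all constants are placeholders introduced here for illustration; the paper works with continuous state spaces and decentralized, learned (near-)best replies, and characterizes the convergence probabilities of such dynamics over the joint policy space.

```python
import random


def best_reply_dynamics(policies, best_reply, n_rounds=100, inertia=0.8, rng=random):
    """Toy best-reply dynamics with inertia.

    policies: list with each agent's current policy (any object).
    best_reply: placeholder oracle best_reply(i, joint) -> agent i's best reply to joint.
    """
    joint = list(policies)
    for _ in range(n_rounds):
        for i in range(len(joint)):
            # with probability 1 - inertia, agent i switches to a best reply;
            # otherwise it keeps its current policy
            if rng.random() > inertia:
                joint[i] = best_reply(i, joint)
    return joint
```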
2022
- geneDRAGNN: Gene Disease Prioritization Using Graph Neural Networks. Awni Altabaa, David Huang, Ciaran Byles-Ho, Hani Khatib, and 2 more authors. 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Aug 2022.
Many human diseases exhibit a complex genetic etiology impacted by various genes and proteins in a large network of interactions. The process of evaluating gene-disease associations through in-vivo experiments is both time-consuming and expensive. Thus, network-based computational methods capable of modeling the complex interplay between molecular components can lead to more targeted evaluation. In this paper, we propose and evaluate geneDRAGNN: a general data processing and machine learning methodology for exploiting information about gene-gene interaction networks for predicting gene-disease association. We demonstrate that information derived from the gene-gene interaction network can significantly improve the performance of gene-disease association prediction models. We apply this methodology to lung adenocarcinoma, a histological subtype of lung cancer. We identify new potential gene-disease associations and provide supportive evidence for the association through gene-set enrichment and literature-based analysis.
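As a purely illustrative sketch of the general approach (not the paper's actual pipeline or features), gene-disease association can be framed as node-level scoring on the gene-gene interaction graph, combining per-gene features with neighborhood information through simple graph-convolution steps. The dimensions, the two-layer depth, and the mean-neighbor aggregation below are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class GeneScorerSketch(nn.Module):
    """Illustrative sketch: score genes for disease association on an interaction graph."""

    def __init__(self, d_features: int, d_hidden: int):
        super().__init__()
        self.gc1 = nn.Linear(d_features, d_hidden)
        self.gc2 = nn.Linear(d_hidden, d_hidden)
        self.out = nn.Linear(d_hidden, 1)   # one association score per gene

    def forward(self, x, adj):
        # adj: row-normalized adjacency (with self-loops) of the gene-gene network
        h = torch.relu(self.gc1(adj @ x))   # aggregate neighbor features, then transform
        h = torch.relu(self.gc2(adj @ h))
        return torch.sigmoid(self.out(h)).squeeze(-1)


# toy usage: 4 genes with 5-dimensional node features and a small interaction graph
x = torch.randn(4, 5)
A = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]], dtype=torch.float)
adj = A / A.sum(dim=1, keepdim=True)        # row-normalize (includes self-loops)
scores = GeneScorerSketch(5, 16)(x, adj)    # association score for each gene
```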