The Last Year of Must-Know AI Research for Businesses
The pace of AI research and innovation can feel overwhelming. Blink and you might feel left behind. Try to keep up and you could get lost separating signal from noise - struggling to identify which papers and trends will actually matter to the AI products or businesses you are building.
It's clear that AI research output is rapidly accelerating. Submissions to two leading machine learning conferences, NeurIPS and ICLR, have grown significantly in recent years. NeurIPS grew from 6,700 submissions in 2019 to 12,300 at its 2023 conference [1], while ICLR submissions shot from 1,600 in 2019 to 7,400 in 2024 [2].
However, this acceleration in research output has not been matched by private investment into AI startups and companies. Average monthly global AI startup funding was actually higher in 2019, at $25.0B, than in 2023, at $23.9B, according to Crunchbase data [3]. This doesn't tell the full story: monthly investment peaked at $58.0B in 2021, and 2024 is seeing significant investment. Regardless, there is a clear imbalance between the growth of AI research and the growth of AI investment.
Despite the hype for generative AI driven by recent advances in the transformer and other AI architectures, establishing a clear causal link between pioneering research and commercial impact remains challenging. The journey from lab to market is not straightforward.
This article highlights influential research papers from the past year and their potential commercial implications to inform strategic decisions for technology builders and business leaders driving AI adoption. For those feeling behind the curve or simply seeking an outside perspective, hopefully this helps provide signal amidst the noise.
Note: This is not an exhaustive list. In some cases, I omitted papers with niche applications like drug discovery. In others, I struggled to isolate standout papers for major themes like multimodal models and self-play, though I view these as highly impactful areas.
Memory efficient architectures
One of the transformer architecture's most notable limitations is its lack of memory recall capabilities. By design, a trained transformer cannot learn or update its representations without full retraining. The absence of memory recall means previous inputs outside of the context window are effectively forgotten unless present in the original training data.
Interestingly, this capability constraint is directly linked to the model's suboptimal scaling properties for compute and memory usage on larger workloads, a key technical bottleneck. At inference, the model must attend to representations of all preceding tokens for each new token it generates. A key-value (KV) cache avoids recomputing those representations, but memory usage still scales linearly with sequence length. The situation is worse during training, where the self-attention mechanism computes a dot product between every pair of input tokens, so compute scales quadratically.
By deriving each new token from its relationship to all previous tokens, the transformer unlocks remarkable emergent abilities in next-token prediction. However, this comes at the cost of substantial computational overhead and an innate inability to form persistent memories. While memory-enabled architectures would solve the inefficiencies, early versions could not match the transformer's text generation abilities.
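To make that scaling tradeoff concrete, here is a minimal, illustrative sketch (toy numpy code, not any production implementation) of single-head decoding with a KV cache: each new token attends over every cached key, so cache memory grows linearly with sequence length and total attention work grows quadratically.

```python
import numpy as np

def generate_step(q_new, k_new, v_new, kv_cache):
    """One decoding step with a key-value (KV) cache (toy, single head).

    q_new, k_new, v_new: (d,) projections for the newest token.
    kv_cache: dict of previously cached keys/values, each of shape (t, d).
    """
    kv_cache["k"] = np.vstack([kv_cache["k"], k_new[None, :]])
    kv_cache["v"] = np.vstack([kv_cache["v"], v_new[None, :]])

    d = q_new.shape[0]
    # The new token attends over *all* cached keys: O(t) work and memory per
    # step, and O(t^2) total attention work across a full sequence.
    scores = kv_cache["k"] @ q_new / np.sqrt(d)   # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv_cache["v"]                # context vector for next-token prediction

# Usage: the cache grows by one entry per generated token.
d = 64
cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
for _ in range(5):
    q, k, v = np.random.randn(3, d)
    _ = generate_step(q, k, v, cache)
```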
State-Space model architecture
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Findings: Selective state space models (SSMs) can improve model memory and parallelization, achieving state-of-the-art performance with linear scaling in sequence length and significantly faster inference than generative pre-trained transformers.
Why it matters: This approach addresses two major limitations of the transformer architecture. First, compute requirements scale linearly with input size, whereas transformers scale quadratically. Second, it provides "memory": the model adapts based on previous inputs and outputs, whereas transformers have no "memory". While the SSM architecture isn't new, the Mamba paper was the first time it achieved benchmark scores close to those of the more mainstream transformer.
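For intuition, here is a minimal sketch of the linear state-space recurrence that SSM-based models build on. This omits Mamba's selection mechanism and hardware-aware scan, and the matrices are placeholders, so treat it as illustrative only.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    Each step touches only a fixed-size hidden state, so compute and memory
    grow linearly with sequence length, unlike pairwise self-attention.
    x: (T, d_in) input sequence; A: (n, n); B: (n, d_in); C: (d_out, n).
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t      # the state is the model's "memory" of the past
        ys.append(C @ h)
    return np.stack(ys)

# Usage with toy shapes. In Mamba, B, C, and the step size additionally
# depend on the input (the "selective" part), letting the model decide
# what to keep in the state.
T, d_in, n, d_out = 16, 8, 32, 8
y = ssm_scan(np.random.randn(T, d_in),
             A=0.9 * np.eye(n),
             B=0.1 * np.random.randn(n, d_in),
             C=0.1 * np.random.randn(d_out, n))
```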
Long Short-Term Memory model architecture
xLSTM: Extended Long Short-Term Memory
Findings: Overcomes previous LSTM limitations in storage capacity, revising storage decisions, and parallelizability by using exponential gating and novel memory structures, demonstrating competitive performance with state-of-the-art Transformers and State Space Models.
Why it matters: Relative to transformers, this approach enables memory and supports long text with more efficient performance scaling. [4] Similar to SSMs, xLSTM builds on a well-established architecture known to be more efficient than the transformer, one that only matched transformer performance on text prediction tasks with the advancements made in this paper. The exponential gating and memory structures also make the stored memory more controllable and interpretable than in Mamba's SSM, where memory is hidden within the model state.
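For intuition on the gating change, here is a rough, simplified sketch of one sLSTM-style step with exponential input and forget gates. It omits the paper's numerical stabilizer state and the matrix-memory mLSTM variant, and the weight names are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slstm_step(x_t, h_prev, c_prev, n_prev, Wz, Wi, Wf, Wo):
    """One simplified sLSTM step with exponential input/forget gates.

    c_prev: cell state; n_prev: normalizer state. Each W* maps the
    concatenated [x_t, h_prev] to one pre-activation vector.
    """
    xh = np.concatenate([x_t, h_prev])
    z = np.tanh(Wz @ xh)      # candidate cell input
    i = np.exp(Wi @ xh)       # exponential input gate (classic LSTM uses sigmoid)
    f = np.exp(Wf @ xh)       # exponential forget gate lets storage decisions be revised
    o = sigmoid(Wo @ xh)      # output gate stays sigmoidal
    c = f * c_prev + i * z    # new cell state
    n = f * n_prev + i        # normalizer keeps the exponential gates well-scaled
    h = o * (c / n)
    return h, c, n
```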
Training
Leaderboards like the LMSYS Chatbot Arena demonstrate a very clear trend: the top state-of-the-art (SOTA) models apply reinforcement learning from human feedback (RLHF), whereas models further down the rankings stop at instruction tuning. The current predominant approach to RLHF, proximal policy optimization (PPO), is highly manual, requiring massive training datasets to be collected and scored by human raters; it is computationally complex to implement; and it is simplistic in how it rates one output as preferred over another. There is significant potential to imbue models with deeper domain expertise if feedback collection becomes more nuanced and scalable.
Improving reinforcement learning
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Findings: Simplifies the RLHF process by eliminating the need for an explicit reward model, optimizing the policy directly on human preference data, and outperforms traditional PPO approaches on stability, performance, and computational efficiency.
Why it matters: The limitations of current PPO methods used in RLHF are extensive but not widely discussed. They are heavily manual, with human scoring applied across an entire output as a generalization rather than identifying the specific aspects of the output that were positive or negative. DPO makes significant advances toward more automated and lower-cost use of feedback. This and other methods, such as Parameter Efficient Reinforcement Learning (PERL) from Human Feedback, are an important research area for making it more widely accessible to learn efficiently from user feedback and improve AI specialization over time.
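For builders evaluating DPO, here is a compact sketch of its loss on a batch of preference pairs, assuming you have already computed summed response log-probabilities under the policy and a frozen reference model (PyTorch used for illustration).

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is the summed log-probability of a full response (chosen or
    rejected) under the policy or the frozen reference model, shape (batch,).
    No explicit reward model: the implicit reward is the log-ratio to the reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```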
Small language models
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Findings: A 3.8B parameter model trained on 3.3T tokens of high-quality data matches benchmark performance of models 10-15x larger like GPT-3.5 and Mixtral-8x7B.
Why it matters: The appeal of small language models (SLMs) for on-device deployment has been offset by subpar quality compared to their larger, cloud-based counterparts. Earlier research, known as the "Chinchilla scaling laws", argued for a compute-optimal ratio of roughly 20 training tokens per model parameter. Phi-3 pushes well past that point: focused curation of ultra high-quality training data, combined with training a 3.8B parameter model on 3.3T tokens (roughly 870 tokens per parameter), enabled an SLM to match the performance of models 10x larger. This method is promising for high-performing, lightweight models that can run on personal devices.
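For reference, here is the back-of-the-envelope arithmetic behind that token-to-parameter comparison, using the figures cited above.

```python
# Chinchilla guidance: roughly 20 training tokens per parameter is compute-optimal.
chinchilla_tokens_per_param = 20

# Phi-3-mini: 3.8B parameters trained on 3.3T tokens.
phi3_tokens_per_param = 3.3e12 / 3.8e9
print(f"Phi-3 tokens per parameter: {phi3_tokens_per_param:.0f}")          # ~868
print(f"Multiple of Chinchilla guidance: "
      f"{phi3_tokens_per_param / chinchilla_tokens_per_param:.0f}x")       # ~43x
```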
Information Retrieval
Information retrieval, whether from within the context window or post-inference, is key to taking the generalized knowledge of large language models and specializing them to specific business contexts or personalized user experiences. Major consumer AI assistants like ChatGPT, Claude, and Gemini already leverage retrieval as a core feature. And, as generative AI expands across enterprises, retrieval architectures are quickly becoming a necessary component for most AI product teams. However, designing effective retrieval systems is not straightforward. Considerations like latency, access control, caching strategies, and data refresh rates introduce tradeoffs that make careful selection of tooling and implementation patterns important.
Implementing million-token context windows
Ring Attention with Blockwise Transformers for Near-Infinite Context
Findings: Significantly increases the ability of transformers to handle long context windows by breaking the model input into smaller blocks and spreading training and inference across multiple GPUs (and other AI processing units) arranged in a ring, an approach combining blockwise computation with a ring topology.
Why it matters: The length of input and output an AI model can handle is now bounded by the number of processing units a company can allocate to it. For the GPU-rich, this input/output size is effectively infinite. Recent research is also showing promising results in context window extension: taking a base model with a smaller window and using techniques like RoPE scaling to increase the tuned model's context window.
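A single-device sketch of the blockwise idea follows, using a running (online) softmax so only one key/value block of scores is materialized at a time. Ring Attention additionally distributes these blocks across devices arranged in a ring and overlaps communication with compute; this toy version only illustrates the blockwise accumulation.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=128):
    """Attention computed one key/value block at a time (toy, single head).

    q: (Tq, d); k, v: (Tk, d). A running max and denominator keep the softmax
    numerically exact while only one block of scores exists in memory.
    """
    Tq, d = q.shape
    out = np.zeros((Tq, v.shape[1]))
    running_max = np.full((Tq, 1), -np.inf)
    running_den = np.zeros((Tq, 1))
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        scores = q @ kb.T / np.sqrt(d)                        # (Tq, block)
        new_max = np.maximum(running_max, scores.max(axis=1, keepdims=True))
        scale = np.exp(running_max - new_max)                 # rescale earlier accumulations
        p = np.exp(scores - new_max)
        out = out * scale + p @ vb
        running_den = running_den * scale + p.sum(axis=1, keepdims=True)
        running_max = new_max
    return out / running_den
```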
Needle in haystack
Retrieval Head Mechanistically Explains Long-Context Factuality
Findings: Demonstrates that a specific subset of attention heads in a transformer model, dubbed "retrieval heads", is responsible for accurately retrieving information from long contexts, with deeper evidence that these heads are universal, sparse, intrinsic, dynamically activated, and causal.
Why it matters: With context window lengths increasing, this gives researchers clear direction on where to focus to improve accurate retrieval from long contexts. Further, it gives confidence that a transformer will retrieve a fact from its input rather than hallucinate it. But the work is also important for what it isn't. Many model providers claim high performance on the "needle in a haystack" test of retrieving a small fact from a much larger corpus of data. Close reading of the research shows the needle is in fact quite different from the rest of the input data. More research will be needed on retrieving a fact from among similar data, or when retrieval requires reasoning.
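For teams building their own evaluations, here is a rough sketch of a stricter needle-in-a-haystack style check where the filler text resembles the needle; `call_model` is a hypothetical stand-in for whatever inference client you use, and the helper names are illustrative only.

```python
def build_needle_prompt(filler_sentences, needle, depth_fraction):
    """Insert the needle at a chosen depth within the distractor text."""
    idx = int(len(filler_sentences) * depth_fraction)
    doc = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(doc)

def run_check(call_model, filler_sentences, needle, question, expected,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Score retrieval accuracy across insertion depths.

    Using filler that is topically similar to the needle (rather than
    obviously different) makes this a harder test than most published ones.
    """
    results = {}
    for depth in depths:
        context = build_needle_prompt(filler_sentences, needle, depth)
        answer = call_model(f"{context}\n\nQuestion: {question}")  # hypothetical client
        results[depth] = expected.lower() in answer.lower()
    return results
```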
High-performance, low-resource embeddings
Gecko: Versatile Text Embeddings Distilled from Large Language Models
Findings: Uses LLMs to generate and label synthetic data for training an embedding model that achieves top-5 performance on the Hugging Face Massive Text Embedding Benchmark (MTEB) despite being much smaller, and thus more efficient, than most competing models.
Why it matters: Embeddings are likely the second-most common machine learning model used in current AI products, behind only the LLM. Specifically, they are used to specialize a model's output with internal data sets not used in training, a pattern known as retrieval augmented generation (RAG). Efficient retrieval without accuracy loss, on data sizes sufficient for most tasks, will improve the overall AI product experience.
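A bare-bones sketch of the retrieval step in RAG: embed the corpus and query with the same embedding model and rank by cosine similarity. Here `embed` is a hypothetical wrapper around whichever embedding model (for example, a Gecko-sized one) you deploy.

```python
import numpy as np

def top_k_chunks(query, chunks, embed, k=3):
    """Return the k corpus chunks most similar to the query.

    embed: callable mapping a list of strings to an (n, d) array of vectors.
    """
    doc_vecs = embed(chunks)                          # (n, d)
    q_vec = embed([query])[0]                         # (d,)
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_vec = q_vec / np.linalg.norm(q_vec)
    sims = doc_vecs @ q_vec                           # cosine similarity
    best = np.argsort(-sims)[:k]
    return [(chunks[i], float(sims[i])) for i in best]
```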
Responsible AI
The capabilities of generative AI are advancing at an accelerating pace, and that pace has sparked intense debate around safe development practices. Concerns range from protecting intellectual property and individual privacy rights, to forecasting potential employment impacts, and even existential risks to humanity itself. As generative AI's capability grows, so too does the imperative to better understand the black box of transformer models and align their development with human values and ethical principles. Rigorous research into interpretability, robustness, and value alignment will be critical to ensuring we harness AI's immense potential for good while mitigating hazardous consequences, intended or unintended.
Model explainability in vector-space
Representation Engineering: A Top-Down Approach to AI Transparency
Findings: Demonstrates the effectiveness of representation engineering (RepE) in improving model observability and control across various safety-relevant problems, using a top-down approach that focuses on representations rather than neurons or circuits, distinguishing it from previous work emphasizing mechanistic interpretability and saliency maps.
Why it matters: Model interpretability through vector-space techniques shows a lot of promise. Anthropic recently published Scaling Monosemanticity, which is worthy of a dedicated article by itself. The RepE paper was selected because it is one of the earlier examples of success in this research domain and remains additive to other ongoing work such as Anthropic's: it probes the vector space, the mathematical representation of data used by transformers, to understand and explain how the model is functioning. Additionally, the organization leading this research, EleutherAI, is a non-profit, open research community, which is commendable for its ability to organically form and drive research projects such as this one.
Model exploits via APIs
Findings: Highlights the risks of API-level attacks beyond direct model interaction by identifying vulnerabilities in APIs, including fine-tuning, function calling, and knowledge retrieval, that can be exploited to bypass safety measures, generate harmful outputs, and execute arbitrary functions.
Why it matters: Security is cited as one of the top reasons businesses and users have not adopted AI more widely, yet many CTOs and CIOs struggle to describe even the most basic exploits in direct model interaction, such as jailbreaking and prompt injection. This paper goes deeper, documenting other forms of exploits possible through indirect interaction, e.g., fine-tuning, and through more sensitive actions such as function calling that interacts directly with enterprise systems and data. It establishes that model security and robustness must be multi-layered across the entire lifecycle.
Evaluating language models
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Findings: An open-source evaluator language model that achieves the highest correlation and agreement with both human and proprietary language model evaluators on both direct assessment and pairwise ranking.
Why it matters: Demonstrating a language model that evaluates other language models and scores closely to human judgment is an important step in alignment and responsible AI development. Because the evaluator is open source, users can adjust the scoring criteria to fit their development needs. And because evaluation is delivered as a language model itself, trust and verification can be better automated for newly released model versions, improving transparency and trust in the overall ecosystem.
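A rough sketch of how an evaluator model can slot into a release pipeline for pairwise ranking follows; `judge` is a hypothetical wrapper around an evaluator such as Prometheus 2, and the rubric text and prompt format are illustrative only.

```python
RUBRIC = ("Score which response better follows the instruction, "
          "is factually accurate, and is safe.")

def pairwise_rank(judge, instruction, response_a, response_b, rubric=RUBRIC):
    """Ask the evaluator model to pick the better of two candidate responses.

    judge: callable taking a prompt string and returning the evaluator's text.
    Returns 'A' or 'B', or None if the verdict cannot be parsed.
    """
    prompt = (
        f"Rubric: {rubric}\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is better? Answer with 'A' or 'B' and a short justification."
    )
    verdict = judge(prompt)
    for token in ("A", "B"):
        if verdict.strip().upper().startswith(token):
            return token
    return None
```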
Agents
Agents - AI systems capable of autonomously taking actions on behalf of humans - represent an area of intense focus for leading AI companies and research labs. To delegate increasingly complex and consequential tasks to AI agents, building user trust will be paramount. Rigorously demonstrating that these agents can reliably match or exceed human-level performance on nuanced goals is crucial. Empirically measuring agent effectiveness, safety, and alignment with specified user objectives will be critical for AI agents to graduate from narrow task automation to broader decision-making autonomy.
Multi-agent learning and performance
Scaling Instructable Agents Across Many Simulated Worlds
Findings: Demonstrates the generalization capabilities of a multi-agent architecture that outperforms environment-specialized agents and extends to new environments, achieved by training AI agents to follow arbitrary language instructions across diverse 3D environments, including both research settings and commercial video games, through a language-driven, human-like interface.
Why it matters: This research previews agents delivering on the promise of AI to solve previously unseen problems without additional training. It is exciting that this multi-agent architecture with generalized capabilities can outperform environment-specific models when trained on environment-specific data plus data from other environments. It also shows that agents transfer learning of similar capabilities to new environments. Overall, I expect multi-agent architectures with varying capabilities to become increasingly popular, giving users and developers a choice of the right cost-versus-outcome tradeoff for a given task.
Closing remarks
As we look back on the research highlighted here, a few areas personally stand out: more automated and scalable reinforcement learning, and breakthroughs enabling greater model interpretability through vector-space representations. The list is curated based on conversations with researchers, founders, investors, and business executives, but it is impossible to predict the impacts with 100% accuracy. Where might the research covered in this article be overestimated?
Additionally, there is a significant amount of emerging research, and it is impossible to cover every significant piece, especially as the state of the art is advancing rapidly. Areas such as fully-native multimodal architectures and long-context extension via RoPE scaling regrettably aren't included but are very exciting, among many others. What research are you excited about?
The potential unlocked by this cutting-edge work is exciting, but even more important is building AI systems that are robust, aligned with human values, and focused on beneficial real-world impacts.
1. https://papercopilot.com/statistics/neurips-statistics/
2. https://papercopilot.com/statistics/iclr-statistics/
3. https://news.crunchbase.com/venture/monthly-global-funding-recap-february-2024/
Thank you to my colleague, Sohrab Rahimi (https://www.linkedin.com/in/sohrab-rahimi/) for sharing his insights on xLSTM.