What Happens at NeurIPS in Canada, Appears on My Newsletter
Inside the NeurIPS 2024 “hallway track”: why LLMs aren’t enough, RL’s growing role, and what top recruiters really seek in AI talent.
NeurIPS 2024 felt different: more grounded and reflective. Yes, large-scale language models were on everyone’s mind, and no one denies that LLMs will play a central role going forward.
But this year, the conversations painted a richer, more complicated picture of what’s needed to push machine learning to the next level. There’s a collective sense that LLMs alone, as impressive as they’ve become, are insufficient for truly complex tasks. Instead, researchers and practitioners are emphasizing the importance of reinforcement learning, more nuanced model architectures, and a rethinking of how we approach intelligence in these systems.
A Move Beyond Clean Architectures and Simple Benchmarks
In multiple hallway chats, I heard a recurring refrain: our models might be too neat, too regularized, and too focused on standardized benchmarks that fail to reflect the messy nature of reality. As Jeff Dean put it: “It’s not just about scaling more parameters. We need architectures that embrace complexity—messier neural connections, richer topologies—to capture the unexpected patterns that emerge in the real world.”
Embracing that messiness might mean revisiting older ideas or developing new forms of regularization that better mimic how biological brains piece together information.
This line of thought dovetails neatly with the surge of interest in reinforcement learning (RL) on display. RL isn’t new, of course, but there’s a renewed excitement around integrating RL with large models and advanced reasoning capabilities. Several participants pointed out that while LLMs excel at producing fluent language and code, RL frameworks might help them learn to select better strategies dynamically.
Side note: there is a lot of excitement about recent results showing that training an LLM on coding tasks improves its reasoning on non-coding tasks.
Instead of just predicting the next token, models trained with RL can navigate sequences of decisions, seek goals, and adapt their policies, potentially even in embodied settings. This could make future AI systems more agent-like, capable of acting on the world rather than just describing it.
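To make that shift concrete, here is a deliberately tiny sketch, not anyone’s production method: a REINFORCE-style bandit that learns a preference over high-level strategies from reward, standing in for a model learning which approach to take rather than which token comes next. The strategy names and the reward function are invented for illustration.

```python
import math
import random

STRATEGIES = ["answer_directly", "write_code", "call_tool"]
prefs = [0.0] * len(STRATEGIES)  # learnable preferences (logits)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward(strategy):
    # Toy environment: pretend tool use pays off most on this task.
    return 1.0 if strategy == "call_tool" else 0.3 * random.random()

LR = 0.1
for _ in range(500):
    probs = softmax(prefs)
    i = random.choices(range(len(STRATEGIES)), weights=probs)[0]
    r = reward(STRATEGIES[i])
    # REINFORCE: move logits along the gradient of log pi(i), scaled by reward.
    for j in range(len(prefs)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        prefs[j] += LR * r * grad

print({s: round(p, 2) for s, p in zip(STRATEGIES, softmax(prefs))})
```

After a few hundred steps, the probability mass concentrates on whichever strategy the toy environment rewards. That selection-from-feedback loop is exactly what RL adds on top of next-token prediction.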
There were also lots of conversations about how current benchmarking is failing us. Time and again, people lamented that while models can look spectacular on neatly packaged tasks or standardized leaderboards, these tests don’t capture the sprawling complexities, ambiguities, and emergent properties we encounter in real-world domains.
The problem, as several attendees put it, is that benchmarks often measure what is easy to quantify rather than what truly matters. They encourage superficial gains—improving a number here, beating a baseline there—without demanding that models genuinely reason, adapt, or demonstrate deeper understanding.
As a result, impressive results on these benchmarks don’t necessarily translate into robust, nuanced capabilities when models leave the controlled conditions of the lab and enter messy, unpredictable environments.
One speaker mentioned that their model scored over 95% on a benchmark yet couldn’t actually do the useful thing the benchmark was meant to measure. D. Sculley gave a full talk about how benchmarks must evolve so we don’t continuously fool ourselves with issues like data leakage. He is excited about the new generation of evaluation techniques, such as SWE-bench.
Agents That Do, Not Just Say
It’s not enough for a model to generate correct code snippets or polished paragraphs; we want it to solve non-trivial tasks and engage with tools—potentially executing actions in the world.
A participant from a well-known lab illustrated this with an example of integrating a model into a broader system: “We can give it the power to call APIs or update inventory databases. Combined with RL, the model can learn which action to take next rather than just producing a static answer.”
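As a rough illustration of the loop that participant described, here is a minimal, hypothetical sketch. The `query_model` stub, the tool names, and the canned decision logic are stand-ins I made up, not any lab’s actual API.

```python
import json

def query_model(context: str) -> dict:
    # Stand-in for an LLM call that returns a JSON action.
    # The canned logic below exists only to show the loop's shape.
    if "check_inventory ->" not in context:
        return {"action": "check_inventory", "args": {"sku": "A-123"}}
    return {"action": "finish",
            "args": {"answer": "Ordered 3 more units of A-123."}}

# Hypothetical tools the agent is allowed to call.
TOOLS = {
    "check_inventory": lambda sku: {"sku": sku, "in_stock": 7},
    "update_inventory": lambda sku, delta: {"sku": sku, "ok": True},
}

def run_agent(task: str, max_steps: int = 5) -> str:
    context = f"Task: {task}"
    for _ in range(max_steps):
        decision = query_model(context)
        if decision["action"] == "finish":
            return decision["args"]["answer"]
        result = TOOLS[decision["action"]](**decision["args"])
        # Feed the observation back so the next decision can adapt to it.
        context += f"\n{decision['action']} -> {json.dumps(result)}"
    return "Step budget exhausted."

print(run_agent("Restock SKU A-123 if fewer than 10 units remain."))
```

The interesting design question is where RL enters: in a real system, a reward signal (task completed, inventory correct) would shape which action the model proposes at each step, rather than the hard-coded branching above.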
Fei-Fei Li’s talk also focused heavily on 3D perception and understanding from images.
The ultimate vision? Agents that plan and adapt, rather than merely respond. This isn’t just academic speculation; people hinted at internal prototypes where models orchestrate sequences of calls and checks, stepping away from linear text generation into something more dynamic and contextually aware.
The Practicalities of Hiring and Credibility
All this talk of advanced RL and more complex, agent-like architectures is exciting, but what does it mean for people entering or shifting roles within the industry?
I had the opportunity to chat with a recruiter from a major FAANG company—who wishes to stay anonymous—about what hiring looks like in this evolving landscape. “Yes, everyone wants experienced machine learning engineers and data scientists, but now we’re placing a premium on those who can navigate these integrated systems,” they told me. “People who understand that LLMs alone can’t solve all problems, and who can think in terms of reinforcement learning loops, evaluation strategies, and designing less rigid architectures—that’s who stands out.”
Other recruiters echoed this sentiment. They mentioned that having a strong publishing record at places like NeurIPS can set you apart. “Don’t underestimate the signal that a NeurIPS publication sends,” said one. “It’s not just academic clout. It shows you’ve dealt with cutting-edge methods and can push beyond cookie-cutter solutions.” (If this is of interest to you, make sure to catch my video about my time at NeurIPS 2024).
The other way a conference like this can accelerate your career is in-person networking with recruiters. There were plenty of them on site, and it is pretty clear that there are still many unfilled jobs in ML today.
Beyond the Valley of Hype
Amid the intellectual buzz, another undercurrent was a sober look at business models. The well-worn joke that big tech companies thrive because they “sell ads” while newer AI labs struggle to find a stable monetization strategy was repeated with a hint of anxiety.
In other words, clever RL approaches and fancy architectures won’t matter if companies can’t figure out how to pay the bills. Teams are starting to look at domain-specific solutions—embedding advanced models (reinforced by RL and better architectural design) into tools that solve very particular problems. When you solve a real problem, revenue tends to follow.
The consensus: first-mover advantage and pure hype are not enough. The industry must ground breakthroughs in tangible, sustainable applications. To this point, the publications at this year’s NeurIPS were more focused on applied ML than at any edition I have attended.
Looking Ahead
So where does this leave us? The conversations at NeurIPS suggest that the next big push in ML will come from blending large language models with RL-based decision-making and more nuanced, “messy” neural architectures.
Hiring and career paths will reflect this shift, favoring those who can think beyond neat stacks and classical training sets.
And industry, once dazzled by pure scale, is now pressing for stable business models and real-world impact.