It's clear that LLMs+RL+<Insert ML/AI Approach> is leading to some pretty impressive AI tools that can solve problems¹ in novel ways. LLMs have also shown that when the task or thinking is best defined through language, code, or mathematics, these AI tools can outperform humans at some of those tasks. Software engineering benchmarks [1] show rapidly advancing capabilities in autonomous issue resolution, though resolved fractions on real-world repositories remain far below typical human developers. However, even though the internal mechanisms or state representations are rich and capture something we humans would call "intelligence", these AI tools do not, as far as I know², maintain a reliable internal "simulator" of state and dynamics out of the box. They can roll out scenarios in language/code/math (chain-of-thought, tree search, and similar), but those rollouts are often inconsistent or ungrounded when you need faithful prediction over many steps of inputs, information, constraints, goals, and actions [2]. This seems to me to be a clear limitation of the current SOTA in AI, but keep in mind I'm not a researcher in this space, :shrug:.
Note
There are a lot of references for this post and I did not read most of them in great detail. Instead I "ran" through them and then asked Gemini and other LLMs questions to clarify my understanding.
If I take a moment here to think about how a human thinks internally, it seems we excel at abstraction and planning in a model space of the world we've experienced. Take the scenario where you or I think about a ball rolling down a hill: we have a mental model for what the ball will do based on what we have observed of the environment around us. Now true, an LLM does have knowledge of the governing physics and the possible initial/boundary conditions, and thus can set up a numerical simulation of the physics, but that is not what we do. We often do this by approximate forward simulation in the head, like an "intuitive physics engine" [3]. This knowledge can be shaped early by maturation and perception³, and it is directly reinforced by our continued observations of the same interactions and patterns. We humans have a world model [4] that we use to make decisions and predictions about the world around us; the same idea was formalized in AI as a learned simulator of environment dynamics [5]. Our brains likely evolved for this because it allows us to make informed decisions towards our actual actions and goals ... not a researcher here, just thinking out loud.
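To make "approximate forward simulation" a bit more concrete, here is a minimal sketch of my own, not something taken from [3]. It assumes a deliberately crude model: the ball is treated as a point mass sliding on a straight incline with an effective friction coefficient (rotational inertia is ignored), and the "intuition" is just running that simulator many times with noisy guesses of slope and friction. The function names and noise levels are hypothetical choices for illustration.

```python
import math
import random
import statistics


def simulate_roll(slope_deg, friction, hill_length=10.0, dt=0.01, g=9.81, max_time=60.0):
    """Crude forward simulation down a straight incline.

    Returns the time in seconds to reach the bottom, or None if the ball
    never gets going (or takes longer than max_time).
    """
    theta = math.radians(slope_deg)
    # Point-mass approximation: gravity along the slope minus friction.
    accel = g * (math.sin(theta) - friction * math.cos(theta))
    if accel <= 0:
        return None
    pos, vel, t = 0.0, 0.0, 0.0
    while pos < hill_length:
        vel += accel * dt
        pos += vel * dt
        t += dt
        if t > max_time:
            return None
    return t


def intuitive_prediction(slope_guess, friction_guess, n_samples=200, seed=0):
    """Run the simulator under noisy guesses of the scene (assumed noise levels)."""
    rng = random.Random(seed)
    times, stuck = [], 0
    for _ in range(n_samples):
        slope = rng.gauss(slope_guess, 3.0)                    # eyeballed slope, +/- a few degrees
        friction = max(0.0, rng.gauss(friction_guess, 0.05))   # rough feel for the surface
        t = simulate_roll(slope, friction)
        if t is None:
            stuck += 1
        else:
            times.append(t)
    median_time = statistics.median(times) if times else None
    return median_time, stuck / n_samples


median_time, frac_stuck = intuitive_prediction(slope_guess=20.0, friction_guess=0.1)
print(f"median time to bottom: {median_time:.2f} s, fraction that never arrive: {frac_stuck:.2f}")
```

The point is not accuracy. The answer comes back as a rough distribution ("about three seconds, almost certainly reaches the bottom") rather than one precise trajectory, which is closer to how the mental picture feels.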
Again, just stop and think about it: when you are preparing for, say, a trip or a presentation, you mentally play out the scenarios that may occur. Or think about when you are doing any lab or simulation work, how you mentally scope various scenarios and hypothesize about the outcomes. As you do these things you narrow down the possible set of actions and the goals you want to achieve based on how your mental picture of them forms.
I still don't know if I truly grasp what a world model is, but I think what it tries to represent is the abstract planning of actions and goals based on the latent (i.e., largely unconscious) knowledge we have acquired. The better the predictive model, the better our real-world actions and goals are often defined and executed. Though "world model" is an analogy here, it seems Craik's internal models, episodic simulation of the future, and motor internal models in the brain are related ideas, not one single thing [6].
Take riding a bike. Most people who have learned to ride a bike will tell you that if they rode for 5 years, stopped for 10 years, and then tried again, they recovered their riding ability within 10 minutes or so. This is the procedural memory for skilled actions that is stored separately from everyday declarative memory [7]. The truth, though, seems to be that the persistence is usually explained by your implicit motor circuits (basal ganglia, cerebellum, and related systems), not by consciously rehearsing the physics in your mind's eye before you mount, although mental imagery does help. Still, I argue that there is a motor "model" of riding a bike that encodes balance, pedaling, and steering in a given context. If we did the same thing but instead of riding the bike on a road we told you to ride it on a tightrope or snow, your learned skills for riding on a road would still be invoked (i.e., you would think about how it might feel to ride a bike on a tightrope), but because that latent representation was never tuned or updated with actual feedback from those environments (i.e., you never actually rode a bike on a tightrope or snow), you would likely struggle [8]. Interestingly though, you could imagine that some might do better than others because their prior experience and internal forward models are richer, so mind-planning in the tightrope action space is more robust.
LeCun's World Model
Yann LeCun has been fairly vocal about world models as part of a broader cognitive architecture for embodied AI. He sees it as an orthogonal direction to LLMs. My understanding of his position is that systems that only predict the next token (or pixel) in some observation space will struggle to reach the kind of sample-efficient reasoning and planning that animals show. The hypothesis is that they waste capacity on unpredictable surface detail instead of learning the proper abstract state and dynamics. The recipe he and collaborators have worked on is a configurable predictive world model: encode what is observed, then roll forward in a compact latent space (optionally conditioned on actions), and use those "imagined" trajectories to choose goals and controls [9].
The way, to my understanding, that this idea departs from classic supervised vision is exactly what I was gesturing at in the cat example. Don't train a network to output the label "cat" from pixels, and don't ask the model to reconstruct every RGB value either. Instead, have encoders that map observations into a high-dimensional latent representation, and a predictor that forecasts how that representation should evolve. For just static images, the "action" conditioning the predictor can be absent (e.g., I-JEPA⁴ [10]). From this we get self-supervised prediction in representation space, which avoids forcing two augmented views of a cat to share one embedding via contrastive learning. In ref. [9] it seems he actually argues against leaning on contrastive losses and favors regularized, predictive objectives instead, but I'm not sure I understand, per usual.
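To check my own understanding of "predict in representation space", here is a toy sketch of the objective. It is not the I-JEPA implementation: the encoders are tiny random linear maps instead of vision transformers, there is no masking strategy, and the "gradient step" is faked with a constant nudge because I am not implementing backprop here. All names (`latent_prediction_loss`, `ema_update`, the dimensions) are my own illustrative choices. What it does show is the two ingredients from the footnote: the error is measured between latents, never pixels, and the target encoder is a slow exponential moving average of the context encoder.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def latent_prediction_loss(context_enc, target_enc, predictor, context_patch, target_patch):
    """Mean squared error between predicted and target latents (no pixel reconstruction)."""
    z_context = matvec(context_enc, context_patch)
    z_target = matvec(target_enc, target_patch)   # treated as a fixed target, no gradient
    z_pred = matvec(predictor, z_context)
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target)) / len(z_target)


def ema_update(target_enc, context_enc, momentum=0.99):
    """Move the target encoder a small step towards the context encoder."""
    return [
        [momentum * t + (1.0 - momentum) * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(target_enc, context_enc)
    ]


rng = random.Random(0)
patch_dim, latent_dim = 8, 3

context_enc = random_matrix(latent_dim, patch_dim, rng)
target_enc = [row[:] for row in context_enc]      # target starts as a copy
predictor = identity(latent_dim)

context_patch = [rng.random() for _ in range(patch_dim)]  # visible part of the image
target_patch = [rng.random() for _ in range(patch_dim)]   # masked part to be predicted

loss = latent_prediction_loss(context_enc, target_enc, predictor, context_patch, target_patch)
print(f"latent prediction loss: {loss:.4f}")

# Stand-in for an optimizer step on the context encoder (assumption: no real training here).
context_enc = [[w + 0.1 for w in row] for row in context_enc]
target_enc = ema_update(target_enc, context_enc)
print(f"target weight moved by: {target_enc[0][0] - (context_enc[0][0] - 0.1):.4f}")
```

With a momentum of 0.99 the target weight moves by only 1% of the change in the context encoder, which is the "slow-moving" part that keeps the targets from chasing the predictor into a trivial solution.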
So what does JEPA (Joint-Embedding Predictive Architecture) look like? Rather than me going through the details, for which my visuals may not be as clear, I'd point you to the interview with LeCun on Welch Labs, which has some really good visuals and explanations.
There are many variations of JEPA, each built on the same overall idea of mapping observed data into latent representations, pairing the encoder with a predictor, and training via prediction errors in latent space rather than on raw outputs [9]. There is I-JEPA for images, V-JEPA for video, action-conditioned versions for control tasks, and hierarchical JEPA for planning at multiple time scales [9, 11].
The LeWorldModel [12] is a recent implementation of a small action-conditioned encoder-predictor whose training loss includes a SIGReg term that keeps the latents from collapsing. There is no reconstruction of pixels for future frames; everything is done in latent space. At test time it encodes start and goal images, rolls the predictor forward under candidate actions, and picks the sequence whose final latent is closest to the goal. The paper's claim to fame is that planning (i.e., rolling out the action-conditioned predictor) in latent space is cheap compared to using pixels. An interesting thing here is that nothing in the self-supervised loss contains any bias towards known physics (i.e., how 3D classical mechanics should work).
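The test-time procedure described above can be sketched in a few lines. To be clear about what is assumed: the `encode` and `predict` functions below are hand-written stand-ins for the learned encoder and predictor, the observations are 2-D points rather than images, and the candidate actions are drawn by plain random shooting, which is my simplification and not necessarily the search the paper uses [12].

```python
import random


def encode(observation):
    """Hypothetical stand-in for a learned encoder: (x, y) observation -> 2-D latent."""
    x, y = observation
    return [0.5 * x, 0.5 * y]


def predict(latent, action):
    """Hypothetical stand-in for the action-conditioned predictor in latent space."""
    dx, dy = action
    return [latent[0] + 0.5 * dx, latent[1] + 0.5 * dy]


def rollout(latent, actions):
    """Imagine a trajectory: apply the predictor once per action, never decoding to pixels."""
    for action in actions:
        latent = predict(latent, action)
    return latent


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def plan(start_obs, goal_obs, horizon=5, n_candidates=500, seed=0):
    """Pick the candidate action sequence whose final latent lands closest to the goal latent."""
    rng = random.Random(seed)
    z_start, z_goal = encode(start_obs), encode(goal_obs)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(horizon)]
        cost = sq_dist(rollout(z_start, actions), z_goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


best_actions, best_cost = plan(start_obs=(0.0, 0.0), goal_obs=(2.0, -1.0))
print(f"best latent distance to goal: {best_cost:.4f}")
print("first action to execute:", best_actions[0])
```

Each candidate costs only `horizon` tiny vector updates, which is the intuition behind "planning in latent space is cheap": the expensive encoder runs twice (start and goal), and everything else happens on small vectors.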
How does a world model connect to physics? It is probably not a drop-in replacement for a classical PDE solver on a macroscopic grid or an atomistic MLIP. Still, the idea of a JEPA-style scientific world model makes sense to me because it matches what I've seen from learned simulators and neural operators, such as Fourier Neural Operators [13]: learn a compact state, predict its evolution under physics biases, and roll it forward without reconstructing every emergent degree of freedom at each step.
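As a hand-built analogy for "compact state, rolled forward without reconstructing everything", consider the 1-D heat equation on a periodic domain, where each Fourier mode simply decays on its own. The sketch below is not a Fourier Neural Operator and nothing in it is learned: the per-mode multiplier is the known analytic decay, whereas an FNO would learn its spectral weights from data [13]. The grid size, number of retained modes, and diffusivity are arbitrary values I picked for illustration.

```python
import cmath
import math


def dft_low_modes(samples, n_modes):
    """Fourier coefficients for wavenumbers 0..n_modes-1 of a real periodic signal."""
    n = len(samples)
    return [
        sum(u * cmath.exp(-2j * math.pi * k * j / n) for j, u in enumerate(samples)) / n
        for k in range(n_modes)
    ]


def step(coeffs, nu, dt):
    """Advance the compact state: for u_t = nu * u_xx, mode k decays by exp(-nu * k^2 * dt)."""
    return [c * math.exp(-nu * k * k * dt) for k, c in enumerate(coeffs)]


def reconstruct(coeffs, n_points):
    """Decode back to the grid only when a picture of the field is actually needed."""
    field = []
    for j in range(n_points):
        x = 2.0 * math.pi * j / n_points
        value = coeffs[0].real + 2.0 * sum(
            (c * cmath.exp(1j * k * x)).real for k, c in enumerate(coeffs) if k >= 1
        )
        field.append(value)
    return field


n_points, n_modes = 32, 4
nu, dt, n_steps = 0.1, 0.1, 10

grid = [2.0 * math.pi * j / n_points for j in range(n_points)]
u0 = [math.sin(x) + 0.5 * math.sin(3.0 * x) for x in grid]

state = dft_low_modes(u0, n_modes)   # "encode": 32 grid values -> 4 complex numbers
for _ in range(n_steps):
    state = step(state, nu, dt)      # roll forward entirely in the compact space
u_final = reconstruct(state, n_points)

print(f"state size: {len(state)} modes for {n_points} grid points")
print(f"peak amplitude went from {max(u0):.3f} to {max(u_final):.3f}")
```

The rollout loop never touches the 32 grid values; it only updates four numbers, and the fine-scale `sin(3x)` component dies off much faster than the smooth one. A learned scientific world model would replace both the encoding and the per-step update with trained components, ideally with this kind of physics bias built in.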
Footnotes
1. Frontier LLMs have shown, in mathematics and perhaps elsewhere, solutions that had not yet been thought of by humans. The clearest demonstration of this is the Erdős problems, where solutions that experts state are novel have been produced by frontier models from OpenAI and Anthropic [14–15].
2. Keep in mind, I am not a frontier researcher in AI in any shape or form, but more of a domain specialist applying AI tooling.
3. I use "bootstrapping" to mean that, without any prescribed inductive biases, the correct mental model exists and can be used to make the correct prediction or action. I don't know if this is the correct wording. A good example is the visual cliff experiment [16]: crawling infants are placed on a glass floor where one half looks solid and safe while the other looks like a cliff. Most refuse to cross to their mother when she calls from the cliff side, consistent with a bootstrapped visual model of depth.
4. Take the visible patches of an image as context and train a predictor to match the latent embeddings of other masked regions in the same image, where the target embeddings come from a slow-moving target encoder (an exponential moving average of the context encoder) so the problem does not collapse to a trivial constant latent representation.
References
…