|𝔻⟩irac's Student: May 2026

It's clear that LLMs+RL+<Insert ML/AI Approach> is leading to some pretty impressive AI tools that can solve problems¹ in novel ways. LLMs have also shown that when the task or thinking is best defined through language, code, or mathematics, these AI tools can outperform humans at some of these tasks. Software engineering benchmarks [1] show rapidly advancing capabilities in autonomous issue resolution, though the resolved fractions on real-world repositories remain far below typical human developers. However, although the internal mechanisms or state representations are clearly rich and capture something we humans would call "intelligence", these AI tools do not, as far as I know², maintain a reliable internal "simulator" of state and dynamics out of the box. They can roll out scenarios in language/code/math (chain-of-thought, tree search, and similar), but those rollouts are often inconsistent or ungrounded when you need faithful prediction over many steps of inputs, information, constraints, goals, and actions [2]. This seems to me to be a clear limitation of the current SOTA, but keep in mind I'm not a researcher in this space, 🤷‍♂️.

Note

There are a lot of references for this post and I did not read most of them in great detail, but rather "ran" through them and then asked questions to clarify my understanding using Gemini and other LLMs.

If I take a moment here to think about how a human thinks internally, it seems we excel at abstraction and planning in a model space of the world we've experienced. Take the scenario where you or I think about a ball rolling down a hill: we have a mental model for what the ball will do based on our knowledge of the environment around us that we have observed. Now true, an LLM does have knowledge of the governing physics and possible initial/boundary conditions, and thus can numerically simulate the physics, but that is not what we do. We often do this by approximate forward simulation in the head, like an "intuitive physics engine" [3]. This knowledge can be shaped early by maturation and perception³, and it is directly reinforced by our continued observations of the same interactions and patterns. We humans have a world model [4] that we use to make decisions and predictions about the world around us; the same idea was formalized in AI as a learned simulator of environment dynamics [5]. Our brains likely evolved for this because it allows us to make informed decisions towards our actual actions and goals ... not a researcher here, just thinking out loud.

Again, just stop and think about it: when you are preparing for, say, a trip or a presentation, you mentally play out the scenarios that may occur. Or think about when you are doing any lab or simulation work, and how you mentally scope various scenarios and hypothesize about the outcomes. As you do these things you narrow down the possible set of actions and targeted goals you want to achieve based on how your mental picture of them forms.

I still don't know if I truly grasp what a world model is, but I think what it tries to represent is the abstract planning of actions and goals based on the latent (i.e., largely unconscious) knowledge we have acquired. The better the predictive model, the better our real-world actions and goals are often defined and executed ... though "world model" is an analogy here. It seems Craik's internal models, episodic simulation of the future, and motor internal models in the brain are clearly related ideas, not one single thing [6].

Take riding a bike. Anyone who has learned to ride a bike will tell you that if they rode for 5 years, stopped for 10 years, and then tried again, they would often recover their riding ability within 10 minutes or so. This is procedural memory for skilled actions, which is stored separately from everyday declarative memory [7]. The truth, though, seems to be that this persistence is usually explained by your implicit motor circuits (basal ganglia, cerebellum, and related systems), not by consciously rehearsing the physics in your mind's eye before you mount, although mental imagery does help. Still, I argue that there is a motor "model" of riding a bike that encodes balance, pedaling, and steering in a given context. If we did the same thing but, instead of riding the bike on a road, we told you to ride it on a tightrope or snow, your learned skills for riding on a road would still be invoked (i.e., you would think about how it might feel to ride a bike on a tightrope), but because that latent representation was never tuned or updated with actual feedback from those environments (i.e., you never actually rode a bike on a tightrope or snow), you would likely struggle [8]. Interestingly though, you could imagine that some might do better than others because their prior experience and internal forward models are richer, so mind-planning in the tightrope action space is more robust.

LeCun's World Model

Yann LeCun has been fairly vocal about world models as part of a broader cognitive architecture for embodied AI. He sees it as an orthogonal direction to LLMs. My understanding of his position is that systems that only predict the next token (or pixel) in some observation space will struggle to reach the kind of sample-efficient reasoning and planning that animals show. The hypothesis is that they waste capacity on unpredictable surface detail instead of learning the proper abstract state and dynamics. The recipe he and collaborators have worked on is a configurable predictive world model: encode what is observed, then roll forward in a compact latent space (optionally conditioned on actions), and use those "imagined" trajectories to choose goals and controls [9].

The way this idea departs from classic supervised vision, to my understanding, is exactly what I was gesturing at in the cat example. Don't train a network to output the label "cat" from pixels, and don't ask the model to reconstruct every RGB value either. Instead, have encoders that map observations into a high-dimensional latent representation, and a predictor that forecasts how that representation should evolve. For static images the "action" conditioning the predictor can be absent (e.g., I-JEPA⁴ [10]). This gives self-supervised prediction in representation space and avoids forcing two augmented views of a cat to share one embedding via contrastive learning. In ref. [9] it seems he actually argues against leaning on contrastive losses and favors regularized, predictive objectives instead, but I'm not sure I understand, per usual.

So what does JEPA (Joint-Embedding Predictive Architecture) look like? Rather than me going through the details, for which my visuals may not be as clear, I'd point to the interview with LeCun on Welch Labs, which had some really good visuals and explanations.

There are many variations of JEPA, each built on the same overall idea of mapping observed data into latent representations with a predictor and training via prediction errors in latent space rather than on raw outputs [9]. There is I-JEPA for images, V-JEPA for video, action-conditioned versions for control tasks, and hierarchical JEPA for planning at multiple time scales [9],[11].
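To make "prediction errors in latent space" a bit more concrete, here is a toy sketch of the two ingredients described for I-JEPA in footnote 4: a loss computed between latents (not pixels) and a target encoder updated as an exponential moving average of the context encoder. Everything here is an assumption for illustration: the linear `encode`, the identity predictor, the patch values, and the `tau` value are hypothetical and are not taken from any of the papers.

```python
import random

def encode(weights, x):
    """Toy linear encoder: each latent coordinate is a weighted sum of the inputs."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def latent_mse(predicted, target):
    """Mean squared error measured between latents, not between pixels."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def ema_update(target_weights, context_weights, tau=0.99):
    """Move the target encoder slowly toward the context encoder (no gradients)."""
    return [
        [tau * t + (1.0 - tau) * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(target_weights, context_weights)
    ]

rng = random.Random(0)
context_weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
target_weights = [row[:] for row in context_weights]  # start as a copy

visible_patch = [0.2, 0.1, 0.0, 0.4]  # hypothetical context input
masked_patch = [0.3, 0.0, 0.1, 0.5]   # hypothetical masked target input

z_context = encode(context_weights, visible_patch)
z_target = encode(target_weights, masked_patch)  # treated as a constant target
predicted = z_context                            # identity predictor, for brevity
loss = latent_mse(predicted, z_target)

# Pretend an optimizer step changed the context encoder, then let the
# target encoder follow it slowly via the moving average.
context_weights = [[w + 0.1 for w in row] for row in context_weights]
target_weights = ema_update(target_weights, context_weights, tau=0.9)
```

A real implementation would use a learned predictor and gradient descent on the context encoder; the point of the sketch is only that the loss never touches raw pixels and the target encoder lags behind the context encoder.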

LeWorldModel [12] is a recent implementation of a small action-conditioned encoder-predictor whose training loss includes a SIGReg term that keeps the latents from collapsing. There is no reconstruction of the pixels for future frames; everything is done in latent space. At test time it encodes start and goal images, rolls the predictor forward under candidate actions, and picks the sequence whose final latent is closest to the goal. The paper's claim to fame is that planning (i.e., rolling out the action-conditioned predictor) in latent space is cheap compared to using pixels. An interesting thing here is that nothing in the self-supervised loss contains any bias towards known physics (i.e., how 3D classical mechanics should work).
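The test-time procedure described above (encode start and goal, roll the predictor forward under candidate actions, keep the sequence whose final latent is closest to the goal) can be sketched in a few lines. This is not the paper's code: `encode` and `predict` are hypothetical stand-ins (the "latent" is just a 2D position), and exhaustive search over a tiny action set stands in for whatever optimizer a real system would use.

```python
import itertools

def encode(observation):
    """Hypothetical encoder: here the 'latent' is simply the 2D position."""
    return tuple(observation)

def predict(latent, action):
    """Hypothetical action-conditioned predictor: one latent step per action."""
    return (latent[0] + action[0], latent[1] + action[1])

def rollout(latent, actions):
    """Apply the predictor repeatedly; no observation is ever reconstructed."""
    for action in actions:
        latent = predict(latent, action)
    return latent

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def plan(start_obs, goal_obs, candidate_actions, horizon):
    """Score every action sequence by the distance of its final latent to the goal latent."""
    z_start, z_goal = encode(start_obs), encode(goal_obs)
    return min(
        itertools.product(candidate_actions, repeat=horizon),
        key=lambda seq: squared_distance(rollout(z_start, seq), z_goal),
    )

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # hypothetical discrete controls
best = plan(start_obs=(0, 0), goal_obs=(2, 1), candidate_actions=ACTIONS, horizon=3)
```

The cost of scoring a candidate is a handful of latent-space predictor calls, which is the intuition behind the "cheap compared to pixels" claim; with a continuous or larger action space the exhaustive `itertools.product` would have to be replaced by a sampling-based search.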

How does a world model connect to physics? It is probably not a drop-in replacement for a classical PDE solver on a macroscopic grid or an atomistic MLIP, but the idea of a JEPA-style scientific world model makes sense to me because it matches what I've seen from learned simulators and neural operators, such as Fourier Neural Operators [13]: learn a compact state, predict its evolution under physics biases, and roll forward without reconstructing every emergent degree of freedom at each step.

Footnotes

¹ Frontier LLMs have clearly shown, in mathematics and perhaps elsewhere, solutions that had yet to be thought of by humans. The clear demonstration of this is the Erdős problems, where solutions that experts state are novel have been produced by frontier models from OpenAI and Anthropic [14–15].
² Keep in mind, I am not a frontier researcher in AI in any shape or form, but more of a domain specialist applying AI tooling.
³ I use "bootstrapping" to mean that, without any prescribed inductive biases, the correct mental model exists and can be used to make the correct prediction or action. I don't know if this is the correct wording. A good example is the visual cliff experiment [16]: crawling infants are placed on a glass floor where one half looks solid and safe while the other looks like a cliff. Most refuse to cross to their mother when she calls from the cliff side, consistent with a bootstrapped visual model of depth.
⁴ Take the visible patches of an image as context and train a predictor to match the latent embeddings of other masked regions in the same image, where the target embeddings come from a slow-moving target encoder (exponential moving average of the context encoder) so the problem does not collapse to a trivial constant latent representation.

References

[1] C.E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, K.R. Narasimhan, SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, in: International Conference on Learning Representations, 2024. https://doi.org/10.48550/arXiv.2310.06770.

[2] S. Hao, Y. Gu, H. Ma, J.J. Hong, Z. Wang, D.Z. Wang, Z. Hu, Reasoning with Language Model is Planning with World Model, in: Empirical Methods in Natural Language Processing, 2023. https://doi.org/10.18653/v1/2023.emnlp-main.507.

[3] P.W. Battaglia, J.B. Hamrick, J.B. Tenenbaum, Simulation as an Engine of Physical Scene Understanding, Proceedings of the National Academy of Sciences. 110 (2013) 18327--18332. https://doi.org/10.1073/pnas.1306572110.

[4] K.J.W. Craik, The Nature of Explanation, Cambridge University Press, Cambridge, 1943. https://books.google.com/books?id=wT04AAAAIAAJ.

[5] J. Schmidhuber, Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments, (1990). https://www.idsia.ch/~juergen/FKI-126-90.pdf.

[6] D. McNamee, D.M. Wolpert, Internal Models in Biological Control, Annual Review of Control, Robotics, and Autonomous Systems. 2 (2019) 339--364. https://doi.org/10.1146/annurev-control-060117-105206.

[7] N.J. Cohen, L.R. Squire, Preserved Learning and Retention of Pattern-Analyzing Skill in Amnesia: Dissociation of Knowing How and Knowing That, Science. 210 (1980) 207--210. https://doi.org/10.1126/science.7414331.

[8] K.E. Adolph, Specificity of Learning: Why Infants Fall Over a Veritable Cliff, Psychological Science. 11 (2000) 290--295. https://doi.org/10.1111/1467-9280.00258.

[9] Y. LeCun, A Path Towards Autonomous Machine Intelligence, (2022). https://openreview.net/forum?id=BZ5a1r-kVsf.

[10] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, N. Ballas, Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. https://doi.org/10.48550/arXiv.2301.08243.

[11] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F.R. Hogan, D. Dugas, P. Bojanowski, V. Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y. Li, X. Ma, S. Chandar, F. Meier, Y. LeCun, M. Rabbat, N. Ballas, V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning, arXiv Preprint arXiv:2506.09985. (2025). https://doi.org/10.48550/arXiv.2506.09985.

[12] L. Maes, Q.L. Lidec, D. Scieur, Y. LeCun, R. Balestriero, LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels, arXiv Preprint arXiv:2603.19312. (2026). https://doi.org/10.48550/arXiv.2603.19312.

[13] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, A. Anandkumar, Fourier Neural Operator for Parametric Partial Differential Equations, in: International Conference on Learning Representations, 2021. https://doi.org/10.48550/arXiv.2010.08895.

[14] Google DeepMind, Advancing Mathematics Research with AI-Driven Formal Proof Search, (2026). https://doi.org/10.48550/arXiv.2605.22763.

[15] OpenAI, An OpenAI model has disproved a central conjecture in discrete geometry, (2026). https://openai.com/index/model-disproves-discrete-geometry-conjecture/.

[16] E.J. Gibson, R.D. Walk, The Visual Cliff, Scientific American. 202 (1960) 64--71. https://doi.org/10.1038/scientificamerican0460-64.

Reuse and Attribution

Glass-ceramics are one of those materials that seem obvious after you take a few materials engineering courses. You start with a glass, which is useful because it can be melted, shaped, and processed. Then instead of avoiding crystallization, you force the glass to crystallize in a controlled way.

That is the key distinction. In normal glass processing, crystallization is usually a defect. It means the glass devitrified. In glass-ceramics, crystallization is the product. So you want something like:

$$ \text{glass} \xrightarrow{\text{controlled heat treatment}} \text{glass-ceramic} $$

Typically, the more useful state is a glass-ceramic with one or more crystalline phases embedded in a residual glassy matrix. The parent glass composition, nucleation process, growth process, crystal size, and residual glass are all essential aspects [1].

The thing I like about glass-ceramics is that they are a very clean process-structure-property example where kinetics matter. You are not just choosing a chemistry. You are choosing a chemistry that can first form a vitreous state, and then later crystallize into the phases you actually want. The workflow is then something like:

Composition → Glass → Crystallization Path → Microstructure → Properties

This controlled process drives structure-property relationships that lead to glass-ceramics that have applications in cookware, cooktops, dental restorations, optical mirror substrates, sealants, and bioactive materials. They look unrelated, but the trick is the same, i.e., controlled crystallization gives access to property combinations that are hard to get from either ordinary glass or conventional sintered ceramics [2].

Figure 1. Corning Pyroceram glass-ceramic. Wikimedia Commons, public domain.

CorningWare (see Figure 1) was a glass-ceramic that turned out to have outstanding thermal shock properties and hence was used for cookware. Many silica-rich glasses (for example soda-lime window glass) tolerate rapid temperature changes poorly, with borosilicate glass being a common exception. The low-expansion glass-ceramic handles thermal shock much better because the crystallized microstructure changes the effective thermal expansion response.

The classical glass-ceramic idea is tied to S. Donald Stookey's work on catalyzed crystallization of glass [3]. The important point is not simply that glasses can crystallize. I mean, every vitreous material tends towards the thermodynamically stable state, which usually is the crystalline phase. As stated above, the important point is that the crystallization process is very carefully controlled. The way to think of this is that there are two events: (1) crystallite nucleation and (2) growth/distribution of the crystallites. If you assume two rates for nucleation and growth, what you try to control is something like,

$$ \begin{equation} I(T) \not\equiv U(T) \label{eq:nucleation_growth} \end{equation} $$

where $I(T)$ is the nucleation rate and $U(T)$ is the crystal growth rate (different physical quantities with different dimensions; $\not\equiv$ means their $T$-dependences differ). The expression is illustrative, not empirical. In practice you need the right density of nuclei, and proper crystal growth. If there are too few nuclei, the crystals can grow too large. If the crystals grow too much, the material can lose the desired properties. Ideally one is looking to find the right crystal size and number density to get useful glass-ceramic properties.
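The "too few nuclei means crystals that are too large" point follows from simple geometry. As a rough sketch, assume the crystals are equal-size spheres (an idealization of mine, not something the references claim) occupying a fixed volume fraction $V_c$ with number density $N$; then $V_c = N \tfrac{4}{3}\pi r^3$, so $r = (3 V_c / 4 \pi N)^{1/3}$. The volume fraction and number densities below are hypothetical values chosen only to show the scaling.

```python
import math

def crystal_radius(volume_fraction, number_density):
    """Radius (m) of equal-size spherical crystals for a given crystal volume
    fraction and number density (crystals per cubic metre)."""
    return (3.0 * volume_fraction / (4.0 * math.pi * number_density)) ** (1.0 / 3.0)

volume_fraction = 0.7  # hypothetical crystallized fraction
for number_density in (1e15, 1e18, 1e21):  # hypothetical nuclei per m^3
    radius_nm = crystal_radius(volume_fraction, number_density) * 1e9
    print(f"N = {number_density:.0e} m^-3 -> radius ~ {radius_nm:.0f} nm")
```

Because the radius scales as $N^{-1/3}$, a thousand times more nuclei at the same crystallized fraction gives crystals ten times smaller, which is why the nucleation step matters so much for the final microstructure.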

The other thing to think about is the actual glass-ceramic microstructure. The useful properties of glass-ceramics come from details like:

the crystalline phase,
the crystal volume fraction,
the crystal size,
the crystal shape,
the residual glass composition,
the elastic and thermal mismatch between phases.

For low thermal expansion glass-ceramics, the idea is often that the glassy phase has positive thermal expansion and the crystalline phase has negative thermal expansion. If the balance is right, those contributions nearly cancel. So you have some linear mixing like:

$$ \begin{equation} \alpha_{\text{eff}} \approx V_g \alpha_g + V_c \alpha_c \nonumber \label{eq:cte_mixture} \end{equation} $$

where $V_g$ and $V_c$ are the glass and crystal volume fractions, and $\alpha_g$ and $\alpha_c$ are the thermal expansion coefficients of the glassy and crystalline phases. This is obviously too simple because real microstructures have elastic constraint and anisotropy.
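Even though the linear rule is too simple, it is handy for a back-of-the-envelope check. Assuming the two fractions sum to one ($V_g + V_c = 1$, i.e., no porosity, which is my assumption), setting $\alpha_{\text{eff}} = 0$ gives $V_c = \alpha_g / (\alpha_g - \alpha_c)$. The expansion coefficients below are hypothetical round numbers, not measured values for any real material.

```python
def effective_cte(v_crystal, alpha_glass, alpha_crystal):
    """Linear rule of mixtures, assuming glass and crystal fractions sum to one."""
    v_glass = 1.0 - v_crystal
    return v_glass * alpha_glass + v_crystal * alpha_crystal

def zero_expansion_fraction(alpha_glass, alpha_crystal):
    """Crystal volume fraction at which the linear mixture gives zero expansion."""
    return alpha_glass / (alpha_glass - alpha_crystal)

alpha_glass = 4.0e-6     # 1/K, hypothetical positive-expansion residual glass
alpha_crystal = -1.0e-6  # 1/K, hypothetical negative-expansion crystal phase

v_c = zero_expansion_fraction(alpha_glass, alpha_crystal)
residual = effective_cte(v_c, alpha_glass, alpha_crystal)
print(f"crystal fraction for zero expansion: {v_c:.2f}")
```

The formula only gives a physical answer (a fraction between 0 and 1) when the two phases have expansion coefficients of opposite sign, which matches the picture of a positive-expansion glass balanced by a negative-expansion crystal.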

One example is ZERODUR. It is a lithium aluminosilicate glass-ceramic used in precision optical applications because it can have extremely low thermal expansion. The low expansion comes from balancing the positive expansion of the residual glass with the negative expansion of the crystalline phase [4].

Figure 2. SCHOTT CERAN glass-ceramics. Wikimedia Commons, CC BY-SA 4.0

Cooktops (see Figure 2) are the less exotic version of the same basic idea. You want a smooth surface, thermal shock resistance, chemical durability, and enough mechanical robustness for normal use. Dental materials are another good example because a dental restoration needs strength, chemical durability, wear resistance, processability, and translucency. A glass might look good but not be strong enough, and a polycrystalline ceramic is strong but may be too opaque or difficult to process. A glass-ceramic can give the right balance, as in lithium disilicate dental glass-ceramics. They contain interlocked lithium disilicate crystals in a glassy matrix, which helps improve mechanical properties while still preserving required optical properties [5].

References

[1] J. Deubener, et al., Updated Definition of Glass-Ceramics, Journal of Non-Crystalline Solids. 501 (2018) 3--10. https://doi.org/10.1016/j.jnoncrysol.2018.01.033.

[2] M.J. Davis, E.D. Zanotto, Glass-Ceramics and Realization of the Unobtainable: Property Combinations That Push the Envelope, MRS Bulletin. 42 (2017) 195--199. https://doi.org/10.1557/mrs.2017.27.

[3] S.D. Stookey, Catalyzed Crystallization of Glass in Theory and Practice, Industrial & Engineering Chemistry. 51 (1959) 805--808. https://doi.org/10.1021/ie50595a022.

[4] P. Hartmann, R. Jedamzik, A. Carre, J. Krieg, T. Westerhoff, Glass Ceramic ZERODUR: Even Closer to Zero Thermal Expansion: A Review, Journal of Astronomical Telescopes, Instruments, and Systems. 7 (2021) 020902. https://doi.org/10.1117/1.JATIS.7.2.020902.

[5] L. Fu, H. Engqvist, W. Xia, Glass--Ceramics in Dentistry: A Review, Materials. 13 (2020) 1049. https://doi.org/10.3390/ma13051049.

Sunday, May 24, 2026

What in the World is a World Model?

Sunday, May 3, 2026

Glass-Ceramics
