
Friday, June 27, 2025

Synthetic Datasets & Model Distillation

Say you're given an all-knowing oracle machine that you can query for ground-truth data, but this oracle system is fairly expensive1 to query. Furthermore, say you're not really interested in all the ground truths the oracle has to offer; you just want it for a specific set of inquiries. What would you do? The most direct approach would be to carefully curate a set of the most valuable inquiries, spend your resources getting their answers, and use those to teach a sage system. The idea is not new and is a form of model compression/distillation [1] and the teacher-student paradigm [2].

Oracle 🧙‍♂️ > Sage 🧑‍🏫

If our oracle is an all-knowing system, then a sage system is one with equivalently deep wisdom and knowledge in specific area(s) but inferior in all other areas. Our oracle is all-knowing but slow (think of an old professor 🧙‍♂️), while our sage is more youthful but less experienced (i.e., an assistant professor 🧑‍🏫).

For curating the inquiries, one could use a Bayesian optimization approach that iteratively selects the most valuable inquiries to query the oracle, based on an acquisition function the user chooses. This could favor exploration or exploitation; that is, do you want to find a broad set of valuable inquiries or a specific set? Then there are more traditional statistical sampling methods. I won't touch on this since it's not the focus of this post, but keep it in mind.
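As a toy illustration of that kind of active selection (not something from the post itself), here is a minimal sketch of an exploration-style acquisition loop using a Gaussian-process surrogate; the oracle function, candidate pool, and query budget are all placeholders I made up.

```python
# Minimal sketch: pick oracle queries with an exploration acquisition
# (predictive uncertainty of a Gaussian-process surrogate).
# `expensive_oracle` and the candidate pool are hypothetical placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_oracle(x):
    """Stand-in for the costly ground-truth query."""
    return np.sin(3 * x) + 0.1 * x**2

rng = np.random.default_rng(0)
candidates = np.linspace(-3, 3, 200).reshape(-1, 1)   # pool of possible inquiries
X = candidates[rng.choice(len(candidates), 3)]        # a few seed queries
y = expensive_oracle(X).ravel()

gp = GaussianProcessRegressor(normalize_y=True)
for _ in range(10):                                    # fixed query budget
    gp.fit(X, y)
    _, std = gp.predict(candidates, return_std=True)  # exploration acquisition
    x_next = candidates[[np.argmax(std)]]              # most uncertain inquiry
    X = np.vstack([X, x_next])
    y = np.append(y, expensive_oracle(x_next).ravel())
```

Swapping the acquisition (e.g., expected improvement instead of raw uncertainty) is what shifts the loop from exploration toward exploitation.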

This is the idea behind synthetic dataset generation: creating inquiries and answers (i.e., labels) to build a dataset that is derived from the oracle/teacher but is not a direct copy of its corpus of knowledge/truths. This is a form of data augmentation and data compression.

Once you have the set of inquiries and answers, you can faithfully teach a sage system. This leads to what is called model distillation. Note this isn't model compression, since we didn't take the underlying knowledge structure of the oracle and modify it; we just extracted stochastic answers from the oracle. Typically the distilled model is more efficient and computationally cheaper to both train and infer with.
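To make this concrete, here is a minimal sketch of the oracle-to-sage workflow under my own toy assumptions: the oracle is a stand-in function, and the sage is a small scikit-learn regressor rather than any particular model.

```python
# Minimal distillation sketch: label a curated set of inputs with an
# expensive "oracle" and train a cheap "sage" on the synthetic dataset.
# The oracle below is a placeholder, not an actual foundation model.
import numpy as np
from sklearn.neural_network import MLPRegressor

def oracle(x):
    """Expensive, all-knowing teacher (stand-in)."""
    return np.sin(2 * x) * np.exp(-0.1 * x**2)

# 1) Curate inquiries covering the domain we actually care about.
X_train = np.linspace(-4, 4, 500).reshape(-1, 1)

# 2) Spend the budget: query the oracle once to build the synthetic dataset.
y_train = oracle(X_train).ravel()

# 3) Teach the sage: a small, cheap-to-evaluate student model.
sage = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
sage.fit(X_train, y_train)

# The sage now answers (approximately) without touching the oracle again.
print(sage.predict(np.array([[1.5]])))
```

The point is just the shape of the workflow: the oracle is queried once to build the synthetic dataset, and afterwards only the cheap sage is evaluated.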

In the wild

The recent paper by Gardner et al. [3] provides a nice demonstration of synthetic dataset generation and model distillation for creating fast and accurate, chemical-system-specific, interatomic ML potentials. Their work offers a general, architecture-agnostic protocol for distilling knowledge from large, expensive "foundation models" (our Oracle 🧙‍♂️) into smaller, more efficient "student" models (our Sage 🧑‍🏫).

The process is essentially a three-step workflow:

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#4f46e5','primaryTextColor':'#ffffff','primaryBorderColor':'#3730a3','lineColor':'#6b7280','sectionBkgColor':'#f8fafc','altSectionBkgColor':'#e2e8f0','gridColor':'#e5e7eb','tertiaryColor':'#f1f5f9'}}}%%
graph LR;
  subgraph S1 ["Step 1: Fine-tuning"]
    A[/"Small DFT dataset"/] --> C;
    B["Foundation Model"] --> C{"Fine-tune"};
    C --> D(("Oracle"));
  end
  subgraph S2 ["Step 2: Synthetic Data"]
    D --> E{"Generate & Label"};
    E --> F[/"Large Synthetic Dataset"/];
  end
  subgraph S3 ["Step 3: Distillation"]
    F --> G{"Train Student"};
    H["Fast Architecture"] --> G;
    G --> I(("Sage"));
  end
```

  1. Fine-tuning the Teacher: They start with a pre-trained foundation model (like MACE-MP-0). These models are powerful in that they support chemical systems of essentially any composition, covering atomic numbers 1-89 (or so). The downside is that the architecture is slow to evaluate and therefore not suitable for large-scale atomistic simulations (e.g., hundreds of thousands of atoms). As part of Gardner et al.'s protocol, they fine-tune the teacher/oracle2 model on a very small set of high-quality, domain-specific calculations (e.g., ~25 DFT calculations for liquid water). This turns3 the general model into a specialized "teacher" for that specific system without the high cost of creating a large DFT dataset from scratch, while the underlying foundation model remains general to all other chemical systems.
  2. Generating Synthetic Data: For the liquid water system, the fine-tuned teacher is used to generate a large dataset of ~1,000 atomic configurations, whose energies and forces are labeled by that same fine-tuned teacher. For sampling, they employ a fairly efficient "rattle-and-relax" scheme that constructs a "family tree" of configurations: at each step, a "parent" structure is selected from the tree, a new "child" structure is generated from it, and this child is added back to the tree (see Figure 1). This is much faster than running molecular dynamics simulations to explore the potential energy surface. The scheme works by taking a starting structure and iteratively (see the sketch after this list):

    • Rattling: Displace the atomic positions \( \mathbf{R} \) and unit cell \( \mathbf{C}_0 \) by applying random perturbations, where \( \mathbf{A} \) is a small random matrix (a strain-like perturbation) and \( \mathbf{B} \) is a set of random atomic displacements:
      • \( \mathbf{R}' \leftarrow [(\mathbf{A} + \mathbf{I}) \times \mathbf{R}] + \mathbf{B} \)
      • \( \mathbf{C}' \leftarrow (\mathbf{A} + \mathbf{I}) \times \mathbf{C}_0 \)
    • Relaxing: Nudge the atoms in the direction of the forces predicted by the teacher model to find a new, stable configuration. At each relaxation step \( x \), the atomic positions \( \mathbf{R} \) are updated according to:

      \[ \mathbf{R}' \leftarrow \mathbf{R} + \frac{\sigma_B}{x} \cdot \frac{\mathbf{F}}{||\mathbf{F}||} \]

      where \( \mathbf{F} \) represents the forces predicted by the teacher model, and \( \sigma_B \) is a hyperparameter controlling the rattle intensity that is also used to scale the relaxation step size. This iterative process (a Robbins-Monro-style update with a decaying step size) allows for an efficient exploration of the local energy landscape to generate new structures.

  3. Distilling the Student (Sage): Finally, they train a much smaller, computationally cheaper model, the "student" (our Sage 🧑‍🏫), on this large synthetic dataset. Because the student model architecture (e.g., PaiNN or ACE) is simpler and has fewer parameters, both training and inference are much faster.
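Below is a rough numpy sketch of a single rattle-and-relax move as I read the update rules above; the force model is a toy stand-in for the fine-tuned teacher, and the scales of \( \mathbf{A} \) and \( \mathbf{B} \) are assumptions on my part.

```python
# Rough sketch of one "rattle-and-relax" move, following the update rules
# above. `teacher_forces` is a placeholder for the fine-tuned teacher model,
# and the magnitudes of A and B are assumed, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

def teacher_forces(positions):
    """Stand-in for forces predicted by the fine-tuned teacher."""
    return -0.1 * positions  # toy harmonic restoring force

def rattle(R, C0, sigma_B=0.1, sigma_A=0.01):
    """Randomly perturb atomic positions R and unit cell C0."""
    A = sigma_A * rng.standard_normal((3, 3))    # small random strain-like matrix
    B = sigma_B * rng.standard_normal(R.shape)   # random atomic displacements
    R_new = R @ (A + np.eye(3)).T + B            # R' <- [(A + I) R] + B
    C_new = (A + np.eye(3)) @ C0                 # C' <- (A + I) C0
    return R_new, C_new

def relax(R, sigma_B=0.1, n_steps=10):
    """Nudge atoms along the teacher's forces with a 1/x decaying step size."""
    for x in range(1, n_steps + 1):
        F = teacher_forces(R)
        R = R + (sigma_B / x) * F / np.linalg.norm(F)  # R' <- R + (sigma_B/x) F/||F||
    return R

R0 = rng.random((64, 3)) * 10.0   # 64 atoms in a 10 Angstrom toy box
C0 = 10.0 * np.eye(3)
R1, C1 = rattle(R0, C0)
R1 = relax(R1)                    # a new "child" configuration for the family tree
```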

Figure 1. Synthetic data generation process (Fig. 6 from Gardner et al.).

As a proof-of-concept for liquid water, their results suggest that you can get massive speed-ups with only a minor hit to accuracy.


Results from distilling a fine-tuned MACE-MP-0 foundation model for liquid water.

| Model | Type | Relative Speed | Force MAE (meV/Å) |
|---|---|---|---|
| MACE-MP-0b3 | Teacher (fine-tuned, Oracle 🧙‍♂️) | 1x | 32 |
| TensorNet | Student (Sage 🧑‍🏫) | > 10x | 37 |
| PaiNN | Student (Sage 🧑‍🏫) | > 30x | 39 |
| ACE | Student (Sage 🧑‍🏫) | > 100x | 51 |

The student/sage models (PaiNN, TensorNet, and ACE) are more computationally efficient for a couple of key reasons (see Figure 2). The first is lower model complexity, i.e., fewer parameters and layers. The second is that their computational cost scales dramatically with the interaction cut-off radius \( r \), often as \( \mathcal{O}(r^3) \). The big foundation models need a large radius (e.g., 6 Å) to remain general and capture many-body interactions. Sage/student models, however, can get away with a smaller radius (e.g., 4.5 Å for PaiNN) without losing much accuracy for the specific system they're trained on. This hyperparameter adjustment can lead to large reductions in computational cost, which is how the authors get >10x to >100x speedups. The students are still "good" because they have learned the essential physics for that one system from the huge synthetic dataset provided by the all-knowing (but slow) teacher.
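As a back-of-the-envelope check of that scaling argument, shrinking the cutoff from 6 Å to 4.5 Å reduces the number of neighbors inside each atom's cutoff sphere by roughly

\[ \left(\frac{6.0~\text{Å}}{4.5~\text{Å}}\right)^3 \approx 2.4, \]

so each evaluation touches about 2.4x fewer neighbors per atom, before counting any savings from the smaller architecture itself.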

Figure 2. Computational efficiency of distilled models (Fig. 2 from Gardner et al.).

Outlook

The synthetic dataset generation and distillation protocol isn't just a numerical trick; the resulting sage models can produce physically meaningful simulations. For liquid water, Gardner et al. report that the distilled models reproduce the structural properties of the much more expensive teacher model well. The distilled PaiNN model, for example, almost perfectly reproduces the radial distribution function (RDF).

Other structural and dynamical properties, such as ring-size distributions and tetrahedral order parameters, which describe medium-range ordering and local geometry, show that the PaiNN student behaves similarly to the teacher. The ACE model, which is very fast, leads to a slightly more ordered, crystal-like water structure, but still captures the key chemical/material physics.

Broader Applicability

The Gardner et al. paper [3] appears to describe a versatile distillation method. Demonstrated systems include:

  • Metallic Hydrogen: Distilled models seem to reproduce the DFT equation of state for hydrogen at extreme temperatures and pressures.
  • Porous Amorphous Silica: The distilled models correctly captured the structure factor of porous amorphous silica.
  • Hybrid Perovskites: The distilled models replicate the rotational dynamics of the molecular cations inside the perovskite framework.
  • Organic Reactions: Distilled models successfully reproduced the intended SN2 reaction mechanism. However, long-timescale simulations led to unphysical products.

Prospects

This fairly broad range of applications showcases that the distillation protocol is a powerful and general tool for creating more computationally efficient, specialized potentials for higher-throughput research. I'm bullish on this approach and think this is one direction atomistic modeling and simulation is headed when explicit electronic effects aren't needed; although, who knows, these models might get good enough to predict charge transfer and polarization effects.

Footnotes


  1. Expensive here can mean time to get the answer or cost in terms of resources. 

  2. The all-knowing oracle in this fine-tuning process would be the DFT calculations, although DFT doesn't truly fit the notion of a black box since we know the physics being solved, in contrast with GNN potentials, where we only kind of know what embeddings they are learning. So in the protocol of Gardner et al., what we have is the all-knowing DFT oracle and a semi-oracle (the fine-tuned model) that then teaches the sage.

  3. Fine-tuning is a technique in machine learning where a pre-trained model is adapted to a specific task by updating its parameters on a smaller dataset, often improving performance on that task. While this helps the student model significantly, I feel the better approach is for the community to create larger, more accurate foundation models and just use those as a true oracle.

References

[1] G. Hinton, O. Vinyals, J. Dean, Distilling the Knowledge in a Neural Network, (2015). DOI.

[2] C. Buciluǎ, R. Caruana, A. Niculescu-Mizil, Model compression, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Philadelphia PA USA, 2006: pp. 535–541. DOI.

[3] J.L.A. Gardner, et al., Distillation of atomistic foundation models across architectures and chemical domains, (2025). DOI.




Sunday, June 22, 2025

DFT is getting its share of AI

I made a post on LinkedIn the other day sharing a recent preprint [1] by Microsoft AI that introduces a significant advancement in exchange-correlation (XC) functionals for DFT. This seemed like important news to me, given that much of the data used to train the recent explosion of foundational interatomic potentials for atomistics comes from massive DFT datasets.

The biggest achievement seems to be the ability of Microsoft AI's DFT XC functional, called Skala, to achieve near chemical accuracy (i.e., 1 kcal/mol) for molecular systems at reasonable computational cost. What this means is that we may get a new rung added to chemistry's Jacob's ladder of DFT accuracy. I wanted to illustrate this, so I made a diagram of the chemistry Jacob's ladder with the new Skala XC added, shown in Figure 1. I got the idea to annotate it from Keiran Rowell's blog [2].

The idea of Jacob's ladder in chemistry was, I believe, first introduced by J. Perdew around 2013(?) [3]. The primary idea is that, as in the biblical story where Jacob dreams of a ladder reaching from earth to heaven with God standing above it promising divine blessing and protection, we scientists dream of reaching chemical accuracy by climbing our ladder of DFT approximations. Why is this important? Chemical accuracy with DFT would enable predictions accurate enough, in reasonable computational time, to improve the discovery, design, and understanding of chemicals and materials. In particular, understanding reaction mechanisms and kinetics would become much more tractable.

Figure 1. Computational chemistry Jacob's ladder for DFT with the Microsoft AI Skala XC added.

DFT XC Overview

First, let me briefly summarize some basics of density functional theory (DFT) to explain why the exchange-correlation (XC) functional is so important. The Nobel-worthy Hohenberg–Kohn theorems and the formulation of the Kohn–Sham equations established that the total ground-state (i.e., 0 K) energy of a many-body electronic system can be written as:

\begin{equation} E[\rho] = T_s[\rho] + V_{ext}[\rho] + J[\rho] + E_{xc}[\rho] \label{eq:dft_energy} \end{equation}

The first three terms are the (non-interacting) kinetic energy, the external potential energy, and the classical Coulomb repulsion energy. The last term is the XC energy, which is the only term we don't know exactly; it is crucial because it encapsulates the many-body quantum mechanical effects of the electrons, whereas the others are effectively single-particle terms.

Throughout the successful history of DFT, different approximations have been used for the XC term. The first was the local density approximation (LDA), with the idea being that the XC energy is simply a function of the local electron density. For all its simplicity, LDA worked reasonably well for some systems, but its shortcomings were quickly discovered.

Then the generalized gradient approximation (GGA) was introduced, where the XC energy is a function not only of the local density but also of its gradient. This introduced a first step beyond strict locality, the semi-local functionals (i.e., the energy now depends on how the density changes with position, rather than just on its value at that position).

After GGA, what followed was a "flavor-of-the-month" approach to improving XC functionals. There were hybrid functionals that mix DFT exchange with Hartree-Fock exchange. Then meta-GGAs were introduced, with the idea of adding higher-order derivatives of the density. On top of that come expensive XC functionals that fold in perturbative quantum chemistry methods. All have their improvements and shortfalls. You can see the major players in the rungs of the ladder in Figure 1, which shows how different XC functionals improve accuracy as you climb up the ladder.

These are just the formalisms; there are also ad-hoc tweaks and specializations that get added based on your domain use and expertise. The limitations of such approaches are:

  1. Handcrafted Features: Most functionals rely on fixed analytic forms and known constraints.
  2. Non-locality Deficiency: Electron correlation is inherently non-local; local/semi-local functionals cannot fully capture this.
  3. Empirical Tuning vs. Physical Justification: Empirical functionals may generalize poorly or violate constraints.
  4. Slow Progress at Higher Rungs: Despite decades of work, hybrid/double-hybrid methods still fall short of universal chemical accuracy (error < 1 kcal/mol).

Skala Neural XC

The efforts by Microsoft AI represent a significant leap in XC functional design. Unlike traditional functionals that have relied on hand-crafted mathematical forms, Skala incorporates deep learning (i.e., Neural Networks) to achieve near-chemical accuracy while maintaining computational efficiency comparable to meta-GGA functionals[1]. Skala is particularly impressive because it navigates the trade-off between accuracy and computational cost. The Microsoft AI team has designed it to bridge the gap between semi-local functionals (fast but less accurate) and hybrid/double-hybrid functionals (accurate but computationally expensive).

Architecture and Design Philosophy

As mentioned, Skala XC is a neural network architecture trained to learn the non-local electron density interactions without requiring the full computational burden of exact exchange calculations. Like meta-GGA, it starts with seven semi-local density-derived features, but the design then employs what the authors call a "coarse-fine grid structure" that captures long-range density correlations consistent with multipole-like behavior.

For the exchange, the authors use LDA but incorporate a neural enhancement factor $f_\theta$:

\begin{equation} E_{xc}^\theta[\rho] = -\frac{3}{4}\left(\frac{6}{\pi}\right)^{1/3} \int \left(\rho_\uparrow(r)^{4/3} + \rho_\downarrow(r)^{4/3}\right) f_\theta[x[\rho]](r) \, dr \label{eq:skala_xc} \end{equation}

Not knowing too much about XC design, this does seem clever because $f_\theta$ operates on a feature vector $x[\rho]$ that encodes both local and non-local density information while maintaining computational efficiency (I would like to understand this better, but it is above my head for now). This is reminiscent of delta-learning in atomistic ML, though Skala performs a direct functional approximation rather than a residual correction atop an existing functional.
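For intuition only, here is a small numpy sketch in the spirit of the equation above: LDA-like exchange scaled pointwise by a learned enhancement factor. The feature vector and the tiny random MLP are placeholders of my own, not Skala's actual features or architecture.

```python
# Toy sketch of LDA exchange with a neural enhancement factor f_theta,
# in the spirit of the Skala expression above. The MLP weights and the
# feature construction are placeholders, not the actual Skala model.
import numpy as np

rng = np.random.default_rng(0)

# Spin densities sampled on a uniform 1D grid (toy Gaussians, for simplicity).
grid = np.linspace(-5.0, 5.0, 201)
dV = grid[1] - grid[0]                      # 1D "volume" element
rho_up = np.exp(-grid**2)
rho_dn = 0.8 * np.exp(-(grid - 0.5)**2)

# Placeholder features x[rho](r): local densities and their gradients.
features = np.stack([rho_up, rho_dn,
                     np.gradient(rho_up, grid),
                     np.gradient(rho_dn, grid)], axis=-1)

# A tiny random MLP standing in for f_theta (kept positive via softplus).
W1, b1 = 0.1 * rng.standard_normal((4, 16)), np.zeros(16)
W2, b2 = 0.1 * rng.standard_normal((16, 1)), np.zeros(1)

def f_theta(x):
    h = np.tanh(x @ W1 + b1)
    return np.log1p(np.exp(h @ W2 + b2)).ravel()

# E_xc = -(3/4)(6/pi)^(1/3) * integral of (rho_up^{4/3} + rho_dn^{4/3}) f_theta dr
prefactor = -0.75 * (6.0 / np.pi) ** (1.0 / 3.0)
integrand = (rho_up ** (4 / 3) + rho_dn ** (4 / 3)) * f_theta(features)
E_xc = prefactor * np.sum(integrand) * dV
print(f"toy E_xc = {E_xc:.4f} (arbitrary units)")
```

In the real functional the enhancement factor would be trained (self-consistently) against reference data; the sketch only shows where the learned piece sits inside the LDA-like exchange expression.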

What data was used?

The Microsoft AI team used a quantum chemistry dataset that, it seems, only they could have curated: something like 150,000 data points spanning thermochemistry, conformational energies, noncovalent interactions, and ionization potentials, all generated with gold-standard wavefunction methods like CCSD(T) 🤯. Their MSR-ACC/TAE dataset alone includes about 80,000 total atomization energies with errors below 1 kcal/mol, which is staggering when you consider the computational expense of generating such reference data. The datasets alone might be useful to others.

Performance Metrics

I'm not too familiar with benchmarks in this space, but they mention the challenging W4-17 benchmark, on which Skala achieves a mean absolute error (MAE) of around 1.0 kcal/mol, outperforming more established functionals. On the GMTKN55 benchmark, it scores a WTMAD-2 of 3.89 kcal/mol, which puts it in competition with the best hybrid functionals while requiring significantly fewer computational resources.

Another thing that is super interesting: for large atomic numbers that are out of distribution for the trained Skala XC, they show that it maintains very good accuracy. This is usually a failure point for data-driven ML models, where out-of-distribution data is not handled particularly well [4]. The authors also mention that self-consistent field (SCF) convergence is stable, which is crucial for practical application in DFT, since, unlike post-SCF models, Skala is trained and used self-consistently.

Redefining Jacob's Ladder?

So, going back to the Jacob's ladder of DFT functionals in Figure 1: whether Skala constitutes a sixth rung or a bypass of the traditional ladder obviously remains open to interpretation, but it clearly represents a shift. I'm not a veteran in this space, so it's hard for me to be definitive, but it probably represents a departure in how non-locality is built up from semi-local features, effectively bypassing the traditional way of doing things.

It will be interesting to see how the "Kings of DFT" view Skala XC. Based on the Microsoft AI preprint, this functional shows systematic improvement through the use of high-quality data and training procedures, and the authors claim it can encode known physics through appropriate constraints and design choices. On the other hand, the lack of the theoretical transparency and interpretability that traditional functional forms offer will probably be critiqued, although existing functionals are not perfect either.

I'm just interested to see how this gets used for downstream applications in solid-state physics and materials science. Will we get DFT training sets for Materials Project structures that run into the millions of structures, all with chemical accuracy? If so, that could in turn make these foundation models even better for materials discovery and classical atomistic modeling.


References

[1] G. Luise, C.-W. Huang, T. Vogels, D.P. Kooi, S. Ehlert, S. Lanius, K.J.H. Giesbertz, A. Karton, D. Gunceler, M. Stanley, W.P. Bruinsma, L. Huang, X. Wei, J.G. Torres, A. Katbashev, B. Máté, S.-O. Kaba, R. Sordillo, Y. Chen, D.B. Williams-Young, C.M. Bishop, J. Hermann, R. van den Berg, P. Gori-Giorgi, Accurate and scalable exchange-correlation with deep learning, (2025). DOI.

[2] K. Rowell, An Ersatz Ansatz, Blog (2023). https://keiran-rowell.github.io/guide/2023-04-12-compchem-methods-basics (accessed June 21, 2025).

[3] J.P. Perdew, Climbing the ladder of density functional approximations, MRS Bull. 38 (2013) 743–750. DOI.

[4] K. Li, A.N. Rubungo, X. Lei, D. Persaud, K. Choudhary, B. DeCost, A.B. Dieng, J. Hattrick-Simpers, Probing out-of-distribution generalization in machine learning for materials, Commun Mater 6 (2025) 1–10. DOI.


