|𝔻⟩irac's Student: LSCM Dataset

Early last year I started playing around with the CrystaLLM package, which I've also mentioned in previous posts, to gauge the utility of these generative tools for structure creation. CrystaLLM is an autoregressive model that generates crystal structures by conditioning on a prompt and populating a CIF-format document [1]. In other words, it writes the CIF file given the chemistry and, optionally, the spacegroup and/or unit replicate factor. I'm not going to go into the technical details of the model architecture and training data here, as that's a whole post I still need to write on generative AI for structures. The main thing is that I used my newish personal desktop with an RTX 5070 Ti to do the inference and ended up with 7,889 structures that are distinct¹. It did take quite some time to configure CrystaLLM and generate the structures², since I modified the code to verify that the CIF files were valid and matched the target spacegroup.

In addition to using CrystaLLM to generate the structures, I decided beforehand that I would wrap in a labeling step to compute the total energy, forces, and stress of each crystal structure. I figured, why not use an ensemble of the pre-trained foundation models listed on Matbench Discovery to do this? For no particular reason other than familiarity, I selected seven foundation models to label each structure. This produced the final dataset of 7,889 structures, each labeled by the seven foundation models, yielding an ASE database of 53,749 entries.

Figure 1. Element Distribution

Figure 2. Spacegroup Distribution

The resulting distribution of elements, along with the fractions of binary, ternary, quaternary, and quinary structures, can be seen in Figure 1. From my perspective the element and component distributions seem reasonable. I also looked at the spacegroup distribution, shown in Figure 2. I'm less familiar with what to expect there, but again it seems reasonable that the majority are orthorhombic or tetragonal.
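The tallies behind Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration and not the script used for the figures: the function names and example inputs are hypothetical, and it assumes each structure is summarized by a simple formula string and an international space group number (1 to 230).

```python
import re
from collections import Counter

# Largest international space group number belonging to each crystal system.
SYSTEM_UPPER_BOUNDS = [
    (2, "triclinic"),
    (15, "monoclinic"),
    (74, "orthorhombic"),
    (142, "tetragonal"),
    (167, "trigonal"),
    (194, "hexagonal"),
    (230, "cubic"),
]

COMPONENT_NAMES = {1: "unary", 2: "binary", 3: "ternary", 4: "quaternary", 5: "quinary"}


def crystal_system(spacegroup_number):
    """Map an international space group number (1-230) to its crystal system."""
    if not 1 <= spacegroup_number <= 230:
        raise ValueError(f"space group number out of range: {spacegroup_number}")
    # The bounds are sorted, so the first one that is large enough is the match.
    return next(name for upper, name in SYSTEM_UPPER_BOUNDS if spacegroup_number <= upper)


def component_class(formula):
    """Classify a simple formula such as 'LiFePO4' by its number of distinct elements."""
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    return COMPONENT_NAMES.get(len(elements), f"{len(elements)}-component")


def fractions(labels):
    """Return the fraction of entries carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up inputs, not taken from the dataset.
formulas = ["NaCl", "BaTiO3", "LiFePO4", "MgO"]
spacegroups = [225, 99, 62, 225]
print(fractions(component_class(f) for f in formulas))
print(fractions(crystal_system(n) for n in spacegroups))
```

The regular expression only handles plain formulas without parentheses or charges, which should be enough for reduced formulas but is an assumption about how the chemistry is stored.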

For now, due to limited personal time, I decided to make the dataset available on Zenodo upon request, since I don't think I'll have much time on weekends to work on the analysis I was hoping to do with it. By the end of the year I plan to write a blog post³ on the dataset and analysis. The main question I was trying to answer is: can the variance across the ensemble of pre-trained foundation models serve as a proxy for epistemic uncertainty, and so help identify which unknown/novel CrystaLLM-generated crystal structures are physically legitimate and which are incorrect? Answering this also requires considering that the foundation models are trained on mostly the same datasets, so systematic biases or shared epistemic limitations might exist across all of them. Ensemble agreement could therefore be a false positive, where the models agree because they share the same blind spots rather than because the prediction is right. That potentially limits the extent to which ensemble variance purely reflects epistemic uncertainty about novel structures.
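To make the question concrete, the disagreement between models can be expressed as the spread of their per-atom energy predictions for a single structure. The sketch below is only an illustration under stated assumptions: the model names and energies are made up, the function names are hypothetical, and the population standard deviation of the energy per atom is just one possible choice of uncertainty proxy, not a finished analysis.

```python
from statistics import mean, pstdev


def ensemble_spread(energies_by_model, n_atoms):
    """Return (mean, standard deviation) of energy per atom across the models."""
    if n_atoms <= 0:
        raise ValueError("n_atoms must be positive")
    if len(energies_by_model) < 2:
        raise ValueError("need at least two models to measure disagreement")
    # Normalize by atom count so structures of different size are comparable.
    per_atom = [energy / n_atoms for energy in energies_by_model.values()]
    return mean(per_atom), pstdev(per_atom)


def rank_by_disagreement(structures):
    """Sort structure ids from largest to smallest ensemble spread.

    `structures` maps a structure id to (energies_by_model, n_atoms).
    """
    spreads = {
        structure_id: ensemble_spread(energies, n_atoms)[1]
        for structure_id, (energies, n_atoms) in structures.items()
    }
    return sorted(spreads, key=spreads.get, reverse=True)


# Hypothetical total energies in eV; none of these values come from the dataset.
example = {
    "structure-a": ({"model_1": -40.0, "model_2": -40.4, "model_3": -39.6}, 8),
    "structure-b": ({"model_1": -20.0, "model_2": -24.0, "model_3": -16.0}, 4),
}
print(rank_by_disagreement(example))
```

A small spread only says that the models agree with each other. Because they share much of their training data, they can agree and still be wrong, so a low value here should not be read as confirmation that a structure is legitimate.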

The Zenodo entry, which I call the Labeled Synthetic Crystal Material (LSCM) dataset [2], can be found via the DOI in reference [2]. If you would like to obtain the dataset, I kindly ask that you request it through the Zenodo entry and I will be happy to provide access. The dataset is in ASE SQLite3 format. I'm not providing any guarantees on the quality, as this is a raw dataset generated purely by the workflow using CrystaLLM and ASE calculators for the foundation model checkpoints. As to whether I'll add to the dataset in the future, probably not, as it ties up my GPU considerably and I need it for other things.
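Because an ASE database in this format is an ordinary SQLite file, it can be inspected with Python's standard library before loading it with ASE. The sketch below lists the tables and their row counts without assuming anything about the schema. The function name and the file name `lscm.db` are placeholders, not the actual name of the file on Zenodo.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path):
    """Map each table name in a SQLite file to its number of rows."""
    # Open read-only so the dataset file is never modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


# Example call with a placeholder file name:
# print(table_row_counts("lscm.db"))
```

For actual use, `ase.db.connect` is the more convenient entry point, since the rows it returns should carry the structure together with the stored energy, forces, and stress.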

Footnotes

1. The generated structures are distinct within the dataset, i.e., no chemistry and spacegroup combination is repeated, but I haven't yet checked them against the training datasets or against known structures in databases like ICSD or COD.
2. I think I started running this on my personal machine in March 2025 and stopped in July 2025, but it was not a continuous process. I really only ran things on the weekends.
3. If the analysis results turn out to be particularly important and impactful, for example if several generated structures are legitimate and unknown, then I would probably lean more toward writing a formal research paper. This would of course require a lot more time and effort, since I would probably have to do some DFT calculations and scour the literature. It could be that LLM tools make this feasible enough that it becomes a viable option for me to do on my own time.

References

[1] L.M. Antunes, K.T. Butler, R. Grau-Crespo, Crystal structure generation with autoregressive large language modeling, Nature Communications. 15 (2024). https://doi.org/10.1038/s41467-024-54639-7.

[2] S. Bringuier, Labeled Synthetic Crystal Material Dataset, (2026). https://doi.org/10.5281/zenodo.18135201.
