Saturday, September 13, 2025

Conformal Prediction

An area that has caught my attention lately is the idea of quantifying how confident a model is in its prediction. There are various techniques for doing this; the familiar Bayesian methods are some of the most popular, while others are still new to me. The one I've been reading about this weekend is conformal prediction [1]. So what is conformal prediction? Conformal prediction is a method for uncertainty quantification where, instead of giving a single prediction, we provide a set of likely values (e.g., $\{y_1, y_2, \ldots, y_n\}$) with a guarantee about how frequently the true value will be in that set. To put it even more simply, the model spits out a bunch of possible answers for a given input and we just ask, “Is the real answer somewhere in this pile, and if so, are you 95% sure?”.

This idea is more aligned with frequentist thinking: we don't concern ourselves with a prior or the full posterior distribution, but only ask how often the true value lands in the predicted set.

The prediction interval when doing regression is then some range that we expect the true value to fall in. Say it's the bandgap we are predicting: we specify, say, 3.0 to 3.5 eV and attach a confidence level that the true bandgap value will be in that range. If we are instead assigning category labels, it would be something like "the crystal structure is either rocksalt, zincblende, or perovskite," and again the confidence that the true label will be in that set is given some value (e.g., 90%).

Important Distinction

These are prediction intervals, not confidence intervals. Prediction intervals quantify uncertainty about individual predictions, while confidence intervals quantify uncertainty about distribution parameters like the mean.

Another aspect to keep in mind is that conformal prediction uses calibration residuals from a held-out set to size the prediction interval. This differs from Bayesian or bootstrap methods, which model the data-generating process (e.g., via likelihoods) to derive uncertainty.

Let's put the conformal prediction words into math. First we define $P(\cdot)$ as the probability of some event occurring (e.g., $P(E_b=3.1\,\text{eV})$), and we define the true value as $E_b^{n+1}$, where $n+1$ indexes the new input. Next we define the prediction interval as $\Gamma(X_{n+1})$, where $X_{n+1}$ is the new input; this is a set-valued function that, given an input, returns a set of possible values. For regression, this is typically a continuous interval like $[3.1, 3.5]$ eV for the bandgap prediction. Then we have to define our confidence level, i.e., quantify how confident we want to be that the true value is in the set. To do this we use $1-\alpha$, where $\alpha$ is the allowed error (miscoverage) rate. If $\alpha = 0$ we are saying that with 100% confidence the true value will be in the prediction interval; if $\alpha = 0.05$ we are saying that with 95% confidence the true value will be in the prediction interval, and so on. Now we can write the probability statement as:

$$ \begin{equation} P(E_b^{n+1} \in \Gamma(X_{n+1})) \geq 1 - \alpha. \label{eq:conformal_prediction} \end{equation} $$

Reading through this: the probability that the true value $E_b^{n+1}$ (i.e., the true bandgap at the next input $X_{n+1}$) falls in the prediction set $\Gamma(X_{n+1})$ is at least $1 - \alpha$.

The thing to note is that this is a guarantee statement: if we choose the confidence level and construct the interval set accordingly, then we can make a statement about how likely the true value is to be in the interval. We never need to assume anything about the data distribution (beyond exchangeability) to make the statement in $\eqref{eq:conformal_prediction}$, which is why this method is called "distribution-free".

We also need a way to measure how "bad" the model's guess was compared to the truth, that is, to compute some loss-like function, in other words a nonconformity score. There are several different ways to define the nonconformity score depending on the type of prediction we are making. I'll just show the L1 norm (absolute residual) as an example for regression, which would be

$$ \begin{equation} s_i = \vert E_{b}^{i} - \hat{E}_{b}^{i} \vert \nonumber \end{equation} $$

where $\hat{E}_b^{i}$ is the model's prediction for the bandgap given input $X_{i}$, $E_b^{i}$ is the true value, and $s_i$ is the nonconformity score for the $i$-th prediction. For example, a nonconformity score of 0.5 eV means the prediction missed the true value by 0.5 eV. This is standard in regression. Importantly, the nonconformity scores are computed on a calibration set, that is, for $i \in \{1, 2, \ldots, m\}$, where this set is held out and not used for training.
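As a minimal sketch (with made-up bandgap values, not from any real model), the calibration scores are just the absolute residuals:

```python
import numpy as np

# Hypothetical calibration data: true bandgaps and model predictions (eV)
E_true = np.array([3.2, 1.1, 0.0, 2.5, 4.0])
E_pred = np.array([3.0, 1.4, 0.3, 2.4, 3.5])

# L1 nonconformity scores s_i = |E_b^i - E_b_hat^i|
scores = np.abs(E_true - E_pred)
print(scores)  # [0.2 0.3 0.3 0.1 0.5]
```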

So how do we use this in practice?

Switching from bandgap to forces, consider a GNN trained to predict atomic forces from atomic structures. We hold out a calibration set and compute nonconformity scores $s_i = |F_i^{true} - \hat{F}_i|$ for each force component, where $F_i^{true}$ is the true force and $\hat{F}_i$ is the model prediction.

We now build the prediction interval. Choose a desired coverage level, say $1-\alpha = 0.9$. Sort the $m$ calibration scores $s_{(1)} \le \cdots \le s_{(m)}$ and take $r = \lceil (m+1)(1-\alpha) \rceil$. The threshold is the $r$-th smallest score, $\hat q = s_{(r)}$. This is the standard split-conformal finite-sample rule.
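A minimal sketch of that rule, assuming the calibration scores are already collected in a NumPy array:

```python
import numpy as np

def conformal_quantile(scores: np.ndarray, alpha: float) -> float:
    """Split-conformal threshold q_hat = s_(r) with r = ceil((m + 1) * (1 - alpha)).

    If r exceeds m (calibration set too small for the requested coverage),
    fall back to the largest score.
    """
    m = len(scores)
    r = int(np.ceil((m + 1) * (1 - alpha)))
    return float(np.sort(scores)[min(r, m) - 1])

# e.g. 90% target coverage on some illustrative calibration scores
q_hat = conformal_quantile(np.abs(np.random.randn(500)), alpha=0.1)
```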

Now we take a new atomic structure $X_{n+1}$, compute the model prediction $\hat{F}^{n+1}$, and form the prediction interval:

$$ \begin{equation} \Gamma(X_{n+1}) = [\hat{F}^{n+1} - \hat q, \hat{F}^{n+1} + \hat q] \label{eq:conformal_prediction_updated} \end{equation} $$

With this prediction interval ($\hat q$ chosen as above) we can then make a statement about how likely the true value is to fall in the interval in $\eqref{eq:conformal_prediction_updated}$.
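Continuing the sketch, once $\hat q$ has been computed from the force calibration scores, the interval for a new structure is (all values here are hypothetical):

```python
q_hat = 0.12      # eV/Å, hypothetical threshold from the force calibration scores
f_hat_new = 0.80  # eV/Å, hypothetical model prediction for one force component

# Prediction interval: with probability >= 1 - alpha the true force
# component lies in this range.
interval = (f_hat_new - q_hat, f_hat_new + q_hat)
```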

Again, the concepts here are pretty easy to follow; the point is that the theory guarantees the prediction interval/set will contain the true value at the desired confidence level. This makes it simpler than Bayesian methods, where you have to "try" to compute the full posterior distribution through likelihood and prior probability distributions, all of which require reasonably good choices.

To further illustrate, let me work through a simple example of computing conformal prediction intervals for forces using the orb-v3 pre-trained models on structures from the load-atoms GST dataset. The basic steps are to define the calibration and test sets, compute the nonconformity scores, build the prediction interval/set, and then make a statement about how likely the true value is to be in the interval. Below is a code snippet that produces the conformal prediction intervals for the forces along with the model predictions.

Code

# /// script
# requires-python = ">=3.8"
# dependencies = [
#   "numpy",
#   "matplotlib",
#   "scikit-learn",
#   "orb-models",
#   "load-atoms",
# ]
# ///
import math
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from matplotlib.patches import Patch
from sklearn.neural_network import MLPRegressor
from load_atoms import load_dataset
from orb_models.forcefield import pretrained
from orb_models.forcefield.calculator import ORBCalculator

data = load_dataset("GST-GAP-22")
calibration, tests = data.random_split([500, 100], seed=42)

orb = pretrained.orb_v3_direct_20_omat(device="cpu")
calc = ORBCalculator(orb, device="cpu")


def pred_forces(atoms):
    atoms.calc = calc
    return atoms.get_forces()


# Locally adaptive split-conformal prediction using normalized residuals
# Note: These are prediction intervals, not confidence intervals
calibration_true = np.concatenate([a.arrays["forces"].ravel() for a in calibration])
calibration_pred = np.concatenate([pred_forces(a).ravel() for a in calibration])
scores = np.abs(calibration_true - calibration_pred)  # L1 nonconformity scores
z_calibration = np.abs(calibration_pred)  # feature: force magnitude


def fit_cqr_model(X, y, alpha=0.001):
    """Residual-scale model for locally adaptive conformal prediction.

    Despite the "cqr" name, this is not conformalized quantile regression: it fits
    an MLP that maps force magnitude to the typical absolute residual, which is
    then used to normalize the nonconformity scores and scale the interval
    half-widths. Note that `alpha` here is the MLP's L2 regularization strength,
    not the conformal miscoverage level.
    """
    model = MLPRegressor(
        hidden_layer_sizes=(100, 50, 25),
        activation="relu",
        solver="adam",
        alpha=alpha,
        learning_rate="adaptive",
        learning_rate_init=0.001,
        max_iter=1000,
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        random_state=42,
    )
    model.fit(X, y)
    return model


# Fit adaptive scale model and compute normalized scores
cqr_scale = fit_cqr_model(z_calibration.reshape(-1, 1), scores, alpha=0.01)
sigma_calibration = np.clip(cqr_scale.predict(z_calibration.reshape(-1, 1)), 1e-8, None)
scores = np.sort(scores / sigma_calibration)

test_true = np.concatenate([a.arrays["forces"].flatten() for a in tests])
test_pred = np.concatenate([pred_forces(a).flatten() for a in tests])

z_test = np.abs(test_pred)
sigma_test = np.clip(cqr_scale.predict(z_test.reshape(-1, 1)), 1e-8, None)

levels = {
    "2σ": math.erf(2 / np.sqrt(2)),
    "3σ": math.erf(3 / np.sqrt(2)),
}

# ---- Plotting ----
lo, hi = float(min(test_true.min(), test_pred.min())), float(
    max(test_true.max(), test_pred.max())
)
ax_lim = max(abs(lo), abs(hi))
lo, hi = -ax_lim, ax_lim
idx = np.argsort(test_true)
xs, ys = test_true[idx], test_pred[idx]

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)
hexbin_for_colorbar = None
for ax, (tag, p) in zip(axes, levels.items()):
    delta_n = scores[
        min(max(int(np.ceil((len(scores) + 1) * p)) - 1, 0), len(scores) - 1)
    ]
    halfw = delta_n * sigma_test
    hw_s = halfw[idx]

    inside_band = np.abs(test_pred - test_true) <= halfw
    ax.plot([lo, hi], [lo, hi], "k-", lw=0.2)
    inside_band_sorted = np.abs(ys - xs) <= hw_s

    main_mask = (xs >= -1.5) & (xs <= 1.5) & (ys >= -1.5) & (ys <= 1.5)
    main_xs = xs[main_mask]
    main_ys = ys[main_mask]
    main_hw_s = hw_s[main_mask]
    main_inside = inside_band_sorted[main_mask]

    # Plot adaptive prediction interval boundaries
    ax.plot(main_xs, main_xs - main_hw_s, "b-", lw=0.2, alpha=0.6, zorder=2)
    ax.plot(main_xs, main_xs + main_hw_s, "b-", lw=0.2, alpha=0.6, zorder=2)

    confident_xs = main_xs[main_inside]
    confident_ys = main_ys[main_inside]
    if len(confident_xs) > 0:
        hb = ax.hexbin(
            confident_xs, confident_ys, gridsize=25, cmap="Greens", alpha=0.7, mincnt=1
        )

    uncertain_xs = main_xs[~main_inside]
    uncertain_ys = main_ys[~main_inside]
    if len(uncertain_xs) > 0:
        ax.scatter(
            uncertain_xs,
            uncertain_ys,
            s=1.5,
            alpha=0.5,
            color="red",
            marker="x",
            zorder=1,
        )

    coverage = np.mean(inside_band) * 100
    ax.set_title(
        f"{tag}: Test Coverage = {coverage:.1f}%", fontsize=14, fontweight="bold"
    )
    ax.set_xlabel("True Forces (eV/Å)", fontsize=12)
    if ax is axes[0]:
        ax.set_ylabel("Predicted Forces (eV/Å)", fontsize=12)
    ax.tick_params(labelsize=11)
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect("equal")

    inset = inset_axes(ax, width="35%", height="35%", loc="upper left")
    inset.plot([lo, hi], [lo, hi], "k-", lw=0.2)
    inset.plot(xs, xs - hw_s, "b-", lw=0.15, alpha=0.6, zorder=2)
    inset.plot(xs, xs + hw_s, "b-", lw=0.15, alpha=0.6, zorder=2)

    confident_xs_full = xs[inside_band_sorted]
    confident_ys_full = ys[inside_band_sorted]
    if len(confident_xs_full) > 0:
        hb_inset = inset.hexbin(
            confident_xs_full,
            confident_ys_full,
            gridsize=20,
            cmap="Greens",
            alpha=0.7,
            mincnt=1,
        )
        hexbin_for_colorbar = hb_inset

    uncertain_xs_full = xs[~inside_band_sorted]
    uncertain_ys_full = ys[~inside_band_sorted]
    if len(uncertain_xs_full) > 0:
        inset.scatter(
            uncertain_xs_full,
            uncertain_ys_full,
            s=1.0,
            alpha=0.5,
            color="red",
            marker="x",
            zorder=1,
        )

    inset.set_xlim(lo, hi)
    inset.set_ylim(lo, hi)
    inset.set_aspect("equal")
    inset.tick_params(labelsize=8)

plt.suptitle(
    "Adaptive Conformal Prediction for Force Prediction Intervals",
    fontsize=16,
    fontweight="bold",
)

legend_elements = [
    Patch(facecolor="red", label="Model Uncertain"),
    Patch(facecolor="blue", label="Model Adaptive PI"),
]

fig.legend(
    handles=legend_elements,
    loc="center",
    bbox_to_anchor=(0.35, 0.15),
    ncol=1,
    fontsize=10,
    framealpha=0.9,
)

if hexbin_for_colorbar is not None:
    cbar = fig.colorbar(
        hexbin_for_colorbar,
        ax=axes,
        orientation="vertical",
        shrink=0.6,
        aspect=10,
        pad=0.15,
    )
    cbar.set_label("Point Density", fontsize=9, labelpad=-60)
    cbar.ax.tick_params(labelsize=8, labelleft=False, labelright=True)
    cbar.ax.set_position([0.52, 0.15, 0.02, 0.7])

    ticks = cbar.get_ticks()
    formatted_labels = []
    for tick in ticks:
        if tick >= 1000:
            formatted_labels.append(f"{tick/1000:.0f}K")
        elif tick >= 100:
            formatted_labels.append(f"{tick/1000:.1f}K")
        else:
            formatted_labels.append(f"{tick:.0f}")
    cbar.set_ticklabels(formatted_labels)

plt.tight_layout()
plt.subplots_adjust(bottom=0.1, wspace=0.30, top=0.85)
plt.savefig("force_parity_adaptive.png", dpi=600, bbox_inches="tight")

Figure 1. Force Parity plot for test data including adaptive conformal prediction bands.

The results for the prediction interval levels of 95.3% (2$\sigma$) and 99.6% (3$\sigma$) are shown in the figure above for 500 calibration structures and 100 test structures. I use a locally adaptive split-conformal method: fit a function $\sigma(|\hat F|)$ on the calibration set to model residual scale, compute normalized scores $s_i/\sigma_i$, select the threshold $\hat q$ on these normalized scores as above, and then set the half-width $h(X)=\hat q\,\sigma(|\hat F|)$. This provides heteroscedastic, input-dependent intervals.
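Distilled down, the adaptive interval is just the plain split-conformal interval with a per-input scale; a sketch (where `sigma_fn` stands in for the fitted scale model above):

```python
def adaptive_interval(f_hat, q_hat, sigma_fn):
    """Locally adaptive interval with half-width h(X) = q_hat * sigma(|f_hat|)."""
    h = q_hat * sigma_fn(abs(f_hat))
    return f_hat - h, f_hat + h
```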

The adaptive conformal prediction bands reveal several details about model performance. The achieved coverage (95.3% for 2$\sigma$, 99.6% for 3$\sigma$) closely matches the theoretical expectations (95.45% and 99.73%, respectively), so we can say the prediction intervals are well-calibrated. Points outside the prediction bands represent cases where the uncertainty estimate was insufficient, i.e., these are the "hard" cases where the model struggles to make correct predictions. While 3$\sigma$ bands provide higher confidence (99.6% vs 95.3%), they are wider and less informative. The ideal model would provide narrow bands with high confidence, indicating both precision and reliability.

Notice in the inset that the bands become much wider in the large-force regime. This suggests the model may be unable to properly capture regions where the potential energy surface exhibits large gradients.

Beyond Distribution-Free Guarantees

Exact split-conformal validity holds when the score function (and any normalization) is fixed before seeing the calibration labels. Here the scale function $\sigma(\cdot)$ is fit on the calibration set itself, which can break exchangeability of the scores and thus the finite-sample guarantee. The trade-off often yields tighter, more efficient intervals, but coverage should be validated empirically. Using a separate proper-training set to learn $\sigma(\cdot)$ and reserving calibration solely for ranking scores restores the classical guarantee.
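A sketch of that variant, where the scale model never sees the calibration labels used for ranking (the function name and split are my own illustration, not part of the script above):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def split_normalized_conformal(X_train, r_train, X_cal, r_cal, alpha):
    """Learn the residual scale sigma(.) on a proper-training split, then use the
    calibration split only to rank normalized scores, preserving the exact
    finite-sample guarantee.

    X_* are 2-D feature arrays (e.g. |f_hat| as a column); r_* are the
    corresponding absolute residuals |y - y_hat|.
    """
    scale = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
    scale.fit(X_train, r_train)  # sigma(.) never touches the calibration labels
    sigma_cal = np.clip(scale.predict(X_cal), 1e-8, None)
    s = np.sort(r_cal / sigma_cal)  # normalized nonconformity scores
    r = min(int(np.ceil((len(s) + 1) * (1 - alpha))), len(s))
    return scale, float(s[r - 1])  # (scale model, q_hat)
```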

There is an inherent tension in uncertainty quantification between confidence and precision. Higher sigma values (3$\sigma$) provide greater confidence that the true value lies within the band, but at the cost of wider, less informative intervals. Conversely, lower sigma values (2$\sigma$) provide narrower, more precise intervals but with lower confidence guarantees.

The other aspect is that we can assess the honesty of the model by looking at the test coverage, i.e., how often the true value is in the prediction interval. In Figure 1 we see the test coverage is close to the nominal confidence level for both 2$\sigma$ and 3$\sigma$.

I think this is an interesting approach to model uncertainty that is simple in a lot of ways. So what are the downsides? The main one is that it requires a lot of calibration data to compute reliable prediction intervals; if you have a small dataset, you may not be able to estimate the prediction interval/set well. Another issue is that conformal prediction gives you marginal coverage, not conditional coverage, meaning the guarantees hold on average over your whole dataset but not uniformly for every input, so you might have a model that's well-calibrated overall but terrible for specific material classes or rare chemistries. The method also assumes your data are i.i.d. or exchangeable; if you have time dependence, selection bias, or covariate shift, the guarantees can break down completely. The efficiency depends heavily on your choice of nonconformity score, so if you pick a poor one you'll get conservative, overly wide prediction intervals. It also doesn't give you per-point probabilities, just sets with frequency guarantees, so if you need likelihoods you'll have to look elsewhere. Distribution shift can inflate your residual quantiles, giving you uninformative bands, and for structured targets like energy+forces+stress, joint coverage can be quite conservative.

Where does one go from here?

The prediction bands provide a natural criterion for active learning: points with wide bands (or, on labeled hold-out data, points falling outside the bands) represent high-uncertainty cases that would benefit from additional training data. The idea then is to use conformal prediction as the uncertainty quantification for the ML model and, within an active learning framework, decide how to query the most informative samples. With conformal prediction you would propose a bunch of unlabeled candidate inputs, pass them through the model, and use the width of the prediction interval as the measure of uncertainty in each prediction. Then, based on your sampling strategy, you would query an oracle to label the selected data point(s), retrain/fine-tune the model with the new data, update the calibration set, and repeat the process. A rough sketch of such a loop is given below.
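To make the loop concrete, here is a rough, hypothetical sketch; `model`, `fit_model`, `interval_width`, and `oracle_label` are illustrative placeholders (e.g., the oracle could be a DFT calculation), not functions from any library used above:

```python
import numpy as np

def active_learning_loop(model, fit_model, interval_width, oracle_label,
                         train, calibration, pool, alpha=0.1, n_rounds=5, batch=10):
    """Conformal-prediction-driven active learning sketch.

    interval_width(model, calibration, x, alpha) should return the conformal
    half-width for candidate x; oracle_label(x) returns the true label;
    fit_model(train) returns a refreshed (model, calibration) pair.
    """
    for _ in range(n_rounds):
        # Score every unlabeled candidate by its conformal interval width.
        widths = np.array([interval_width(model, calibration, x, alpha) for x in pool])
        # Query the oracle for the widest (most uncertain) candidates.
        query_idx = set(np.argsort(widths)[-batch:].tolist())
        train = train + [(pool[i], oracle_label(pool[i])) for i in query_idx]
        pool = [x for i, x in enumerate(pool) if i not in query_idx]
        # Retrain/fine-tune and refresh the calibration set before the next round.
        model, calibration = fit_model(train)
    return model
```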

Uncertainty quantification isn't about being right, it's about being right with the right amount of confidence. A model that's always right but with huge prediction bands is less useful than one that's right 95% of the time with narrow, informative bands. Our results demonstrate a well-calibrated model that appropriately increases uncertainty for challenging cases, which is exactly what you want in a reliable uncertainty quantification system.


References

[1] A.N. Angelopoulos, S. Bates, A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, (2022). https://doi.org/10.48550/arXiv.2107.07511.


