Saturday, September 13, 2025

Conformal Prediction

An area that has caught my attention lately is the idea of quantifying how confident a model is in its prediction. There are various techniques for doing this; the familiar Bayesian methods are some of the most popular, while others are still new to me. The one I've been reading about this weekend is conformal prediction [1]. So what is conformal prediction? Conformal prediction is a method for uncertainty quantification where, instead of giving a single prediction, we provide a set of likely values (e.g., $\{y_1, y_2, \ldots, y_n\}$) with a guarantee about how frequently the true value will be in that set. To put it even more simply, the model spits out a bunch of possible answers for a given input and we just ask, “Is the real answer somewhere in this pile, and if so, are you 95% sure?”.

This idea is more aligned with frequentist thinking: we don't concern ourselves with a prior or the full posterior distribution, but only ask how often the true value lands in the predicted set.

The prediction interval when doing regression is then some range that we expect the true value to fall in. Say it's the bandgap we are predicting: we specify, say, 3.0 to 3.5 eV and attach a confidence level that the true bandgap value will be in that range. If we are instead assigning category labels, it would be something like "the crystal structure is either rocksalt, zincblende, or perovskite," and again the confidence that the true label will be in that set is given some value (e.g., 90%).

Important Distinction

These are prediction intervals, not confidence intervals. Prediction intervals quantify uncertainty about individual predictions, while confidence intervals quantify uncertainty about distribution parameters like the mean.

Another aspect to keep in mind is that conformal prediction uses calibration residuals from a held-out set to size the prediction interval. This differs from Bayesian or bootstrap methods, which model the data-generating process (e.g., via likelihoods) to derive uncertainty.

Let's put the conformal prediction words into math. First we define $P(\cdot)$ as the probability of some event occurring (e.g., $P(E_b=3.1\,\text{eV})$), and we define the true value as $E_b^{n+1}$, where $n+1$ indexes the new input. Next we define the prediction interval as $\Gamma(X_{n+1})$, where $X_{n+1}$ is the new input; this is a set-valued function that, given an input, returns a set of possible values. For regression, this is typically a continuous interval like $[3.1, 3.5]$ eV for the bandgap prediction. Then we have to define our confidence level, i.e., quantify how confident we want to be that the true value is in the set. To do this we use $1-\alpha$, where $\alpha$ is the allowed error (miscoverage) rate. If $\alpha = 0$ we are saying that with 100% confidence the true value will be in the prediction interval; if $\alpha = 0.05$ we are saying that with 95% confidence the true value will be in the prediction interval, and so on. Now we can write the probability statement as:

$$ \begin{equation} P(E_b^{n+1} \in \Gamma(X_{n+1})) \geq 1 - \alpha. \label{eq:conformal_prediction} \end{equation} $$

Reading through this: the probability that the true value $E_b^{n+1}$ (i.e., the true bandgap at the next input $X_{n+1}$) falls in the prediction set $\Gamma(X_{n+1})$ is at least $1 - \alpha$.

The thing to note is that this is a guarantee statement: if we choose the confidence level and construct the interval set accordingly, then we can make a statement about how likely the true value is to be in the interval. We never need to assume anything about the data distribution (beyond exchangeability) to make the statement in $\eqref{eq:conformal_prediction}$, which is why this method is called "distribution-free".

We also need a way to measure how "bad" the model's guess was compared to the truth, that is, to compute some loss-like function, in other words a nonconformity score. There are several different ways to define the nonconformity score depending on the type of prediction we are making. I'll just show the L1 norm (absolute residual) as an example for regression, which would be

$$ \begin{equation} s_i = \vert E_{b}^{i} - \hat{E}_{b}^{i} \vert \nonumber \end{equation} $$

where $\hat{E}_b^{i}$ is the model's prediction for the bandgap given input $X_{i}$, $E_b^{i}$ is the true value, and $s_i$ is the nonconformity score for the $i$-th prediction. For example, a nonconformity score of 0.5 eV means the prediction missed the true value by 0.5 eV. This is standard in regression. Importantly, the nonconformity scores are computed on a calibration set, that is, for $i \in \{1, 2, \ldots, m\}$, where this set is held out and not used for training.
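As a minimal sketch (with made-up bandgap values, not from any real model), the calibration scores are just the absolute residuals:

```python
import numpy as np

# Hypothetical calibration data: true bandgaps and model predictions (eV)
E_true = np.array([3.2, 1.1, 0.0, 2.5, 4.0])
E_pred = np.array([3.0, 1.4, 0.3, 2.4, 3.5])

# L1 nonconformity scores s_i = |E_b^i - E_b_hat^i|
scores = np.abs(E_true - E_pred)
print(scores)  # [0.2 0.3 0.3 0.1 0.5]
```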

So how do we use this in practice?

Switching from bandgap to forces, consider a GNN trained to predict atomic forces from atomic structures. We hold out a calibration set and compute nonconformity scores $s_i = |F_i^{true} - \hat{F}_i|$ for each force component, where $F_i^{true}$ is the true force and $\hat{F}_i$ is the model prediction.

We now build the prediction interval. Choose a desired coverage level, say $1-\alpha = 0.9$. Sort the $m$ calibration scores $s_{(1)} \le \cdots \le s_{(m)}$ and take $r = \lceil (m+1)(1-\alpha) \rceil$. The threshold is the $r$-th smallest score, $\hat q = s_{(r)}$. This is the standard split-conformal finite-sample rule.
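A minimal sketch of that rule, assuming the calibration scores are already collected in a NumPy array:

```python
import numpy as np

def conformal_quantile(scores: np.ndarray, alpha: float) -> float:
    """Split-conformal threshold q_hat = s_(r) with r = ceil((m + 1) * (1 - alpha)).

    If r exceeds m (calibration set too small for the requested coverage),
    fall back to the largest score.
    """
    m = len(scores)
    r = int(np.ceil((m + 1) * (1 - alpha)))
    return float(np.sort(scores)[min(r, m) - 1])

# e.g. 90% target coverage on some illustrative calibration scores
q_hat = conformal_quantile(np.abs(np.random.randn(500)), alpha=0.1)
```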

Now we take a new atomic structure $X_{n+1}$, compute the model prediction $\hat{F}^{n+1}$, and form the prediction interval:

$$ \begin{equation} \Gamma(X_{n+1}) = [\hat{F}^{n+1} - \hat q, \hat{F}^{n+1} + \hat q] \label{eq:conformal_prediction_updated} \end{equation} $$

With this prediction interval ($\hat q$ chosen as above) we can then make a statement about how likely the true value is to fall in the interval in $\eqref{eq:conformal_prediction_updated}$.
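Continuing the sketch, once $\hat q$ has been computed from the force calibration scores, the interval for a new structure is (all values here are hypothetical):

```python
q_hat = 0.12      # eV/Å, hypothetical threshold from the force calibration scores
f_hat_new = 0.80  # eV/Å, hypothetical model prediction for one force component

# Prediction interval: with probability >= 1 - alpha the true force
# component lies in this range.
interval = (f_hat_new - q_hat, f_hat_new + q_hat)
```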

Again, the concepts here are pretty easy to follow; the point is that the theory guarantees the prediction interval/set will contain the true value at the desired confidence level. This makes it simpler than Bayesian methods, where you have to "try" to compute the full posterior distribution through likelihood and prior probability distributions, all of which require reasonably good choices.

To further illustrate, let me work through a simple example of computing conformal prediction intervals for forces using the orb-v3 pre-trained models on structures from the load-atoms GST dataset. The basic steps are to define the calibration and test sets, compute the nonconformity scores, build the prediction interval/set, and then make a statement about how likely the true value is to be in the interval. Below is a code snippet that produces the conformal prediction intervals for the forces along with the model predictions.

Code

# /// script
# requires-python = ">=3.8"
# dependencies = [
#   "numpy",
#   "matplotlib",
#   "scikit-learn",
#   "orb-models",
#   "load-atoms",
# ]
# ///
import math
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from matplotlib.patches import Patch
from sklearn.neural_network import MLPRegressor
from load_atoms import load_dataset
from orb_models.forcefield import pretrained
from orb_models.forcefield.calculator import ORBCalculator

data = load_dataset("GST-GAP-22")
calibration, tests = data.random_split([500, 100], seed=42)

orb = pretrained.orb_v3_direct_20_omat(device="cpu")
calc = ORBCalculator(orb, device="cpu")


def pred_forces(atoms):
    atoms.calc = calc
    return atoms.get_forces()


# Locally adaptive split-conformal prediction using normalized residuals
# Note: These are prediction intervals, not confidence intervals
calibration_true = np.concatenate([a.arrays["forces"].ravel() for a in calibration])
calibration_pred = np.concatenate([pred_forces(a).ravel() for a in calibration])
scores = np.abs(calibration_true - calibration_pred)  # L1 nonconformity scores
z_calibration = np.abs(calibration_pred)  # feature: force magnitude


def fit_cqr_model(X, y, alpha=0.001):
    """Residual-scale model for locally adaptive conformal prediction.

    Despite the "cqr" name, this is not conformalized quantile regression: it fits
    an MLP that maps force magnitude to the typical absolute residual, which is
    then used to normalize the nonconformity scores and scale the interval
    half-widths. Note that `alpha` here is the MLP's L2 regularization strength,
    not the conformal miscoverage level.
    """
    model = MLPRegressor(
        hidden_layer_sizes=(100, 50, 25),
        activation="relu",
        solver="adam",
        alpha=alpha,
        learning_rate="adaptive",
        learning_rate_init=0.001,
        max_iter=1000,
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        random_state=42,
    )
    model.fit(X, y)
    return model


# Fit adaptive scale model and compute normalized scores
cqr_scale = fit_cqr_model(z_calibration.reshape(-1, 1), scores, alpha=0.01)
sigma_calibration = np.clip(cqr_scale.predict(z_calibration.reshape(-1, 1)), 1e-8, None)
scores = np.sort(scores / sigma_calibration)

test_true = np.concatenate([a.arrays["forces"].flatten() for a in tests])
test_pred = np.concatenate([pred_forces(a).flatten() for a in tests])

z_test = np.abs(test_pred)
sigma_test = np.clip(cqr_scale.predict(z_test.reshape(-1, 1)), 1e-8, None)

levels = {
    "2σ": math.erf(2 / np.sqrt(2)),
    "3σ": math.erf(3 / np.sqrt(2)),
}

# ---- Plotting ----
lo, hi = float(min(test_true.min(), test_pred.min())), float(
    max(test_true.max(), test_pred.max())
)
ax_lim = max(abs(lo), abs(hi))
lo, hi = -ax_lim, ax_lim
idx = np.argsort(test_true)
xs, ys = test_true[idx], test_pred[idx]

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)
hexbin_for_colorbar = None
for ax, (tag, p) in zip(axes, levels.items()):
    delta_n = scores[
        min(max(int(np.ceil((len(scores) + 1) * p)) - 1, 0), len(scores) - 1)
    ]
    halfw = delta_n * sigma_test
    hw_s = halfw[idx]

    inside_band = np.abs(test_pred - test_true) <= halfw
    ax.plot([lo, hi], [lo, hi], "k-", lw=0.2)
    inside_band_sorted = np.abs(ys - xs) <= hw_s

    main_mask = (xs >= -1.5) & (xs <= 1.5) & (ys >= -1.5) & (ys <= 1.5)
    main_xs = xs[main_mask]
    main_ys = ys[main_mask]
    main_hw_s = hw_s[main_mask]
    main_inside = inside_band_sorted[main_mask]

    # Plot adaptive prediction interval boundaries
    ax.plot(main_xs, main_xs - main_hw_s, "b-", lw=0.2, alpha=0.6, zorder=2)
    ax.plot(main_xs, main_xs + main_hw_s, "b-", lw=0.2, alpha=0.6, zorder=2)

    confident_xs = main_xs[main_inside]
    confident_ys = main_ys[main_inside]
    if len(confident_xs) > 0:
        hb = ax.hexbin(
            confident_xs, confident_ys, gridsize=25, cmap="Greens", alpha=0.7, mincnt=1
        )

    uncertain_xs = main_xs[~main_inside]
    uncertain_ys = main_ys[~main_inside]
    if len(uncertain_xs) > 0:
        ax.scatter(
            uncertain_xs,
            uncertain_ys,
            s=1.5,
            alpha=0.5,
            color="red",
            marker="x",
            zorder=1,
        )

    coverage = np.mean(inside_band) * 100
    ax.set_title(
        f"{tag}: Test Coverage = {coverage:.1f}%", fontsize=14, fontweight="bold"
    )
    ax.set_xlabel("True Forces (eV/Å)", fontsize=12)
    if ax is axes[0]:
        ax.set_ylabel("Predicted Forces (eV/Å)", fontsize=12)
    ax.tick_params(labelsize=11)
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect("equal")

    inset = inset_axes(ax, width="35%", height="35%", loc="upper left")
    inset.plot([lo, hi], [lo, hi], "k-", lw=0.2)
    inset.plot(xs, xs - hw_s, "b-", lw=0.15, alpha=0.6, zorder=2)
    inset.plot(xs, xs + hw_s, "b-", lw=0.15, alpha=0.6, zorder=2)

    confident_xs_full = xs[inside_band_sorted]
    confident_ys_full = ys[inside_band_sorted]
    if len(confident_xs_full) > 0:
        hb_inset = inset.hexbin(
            confident_xs_full,
            confident_ys_full,
            gridsize=20,
            cmap="Greens",
            alpha=0.7,
            mincnt=1,
        )
        hexbin_for_colorbar = hb_inset

    uncertain_xs_full = xs[~inside_band_sorted]
    uncertain_ys_full = ys[~inside_band_sorted]
    if len(uncertain_xs_full) > 0:
        inset.scatter(
            uncertain_xs_full,
            uncertain_ys_full,
            s=1.0,
            alpha=0.5,
            color="red",
            marker="x",
            zorder=1,
        )

    inset.set_xlim(lo, hi)
    inset.set_ylim(lo, hi)
    inset.set_aspect("equal")
    inset.tick_params(labelsize=8)

plt.suptitle(
    "Adaptive Conformal Prediction for Force Prediction Intervals",
    fontsize=16,
    fontweight="bold",
)

legend_elements = [
    Patch(facecolor="red", label="Model Uncertain"),
    Patch(facecolor="blue", label="Model Adaptive PI"),
]

fig.legend(
    handles=legend_elements,
    loc="center",
    bbox_to_anchor=(0.35, 0.15),
    ncol=1,
    fontsize=10,
    framealpha=0.9,
)

if hexbin_for_colorbar is not None:
    cbar = fig.colorbar(
        hexbin_for_colorbar,
        ax=axes,
        orientation="vertical",
        shrink=0.6,
        aspect=10,
        pad=0.15,
    )
    cbar.set_label("Point Density", fontsize=9, labelpad=-60)
    cbar.ax.tick_params(labelsize=8, labelleft=False, labelright=True)
    cbar.ax.set_position([0.52, 0.15, 0.02, 0.7])

    ticks = cbar.get_ticks()
    formatted_labels = []
    for tick in ticks:
        if tick >= 1000:
            formatted_labels.append(f"{tick/1000:.0f}K")
        elif tick >= 100:
            formatted_labels.append(f"{tick/1000:.1f}K")
        else:
            formatted_labels.append(f"{tick:.0f}")
    cbar.set_ticklabels(formatted_labels)

plt.tight_layout()
plt.subplots_adjust(bottom=0.1, wspace=0.30, top=0.85)
plt.savefig("force_parity_adaptive.png", dpi=600, bbox_inches="tight")

Figure 1. Force Parity plot for test data including adaptive conformal prediction bands.

The results for the prediction interval levels of 95.3% (2$\sigma$) and 99.6% (3$\sigma$) are shown in the figure above for 500 calibration structures and 100 test structures. I use a locally adaptive split-conformal method: fit a function $\sigma(|\hat F|)$ on the calibration set to model residual scale, compute normalized scores $s_i/\sigma_i$, select the threshold $\hat q$ on these normalized scores as above, and then set the half-width $h(X)=\hat q\,\sigma(|\hat F|)$. This provides heteroscedastic, input-dependent intervals.
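Distilled down, the adaptive interval is just the plain split-conformal interval with a per-input scale; a sketch (where `sigma_fn` stands in for the fitted scale model above):

```python
def adaptive_interval(f_hat, q_hat, sigma_fn):
    """Locally adaptive interval with half-width h(X) = q_hat * sigma(|f_hat|)."""
    h = q_hat * sigma_fn(abs(f_hat))
    return f_hat - h, f_hat + h
```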

The adaptive conformal prediction bands reveal several details about model performance. The achieved coverage (95.3% for 2$\sigma$, 99.6% for 3$\sigma$) closely matches the theoretical expectations (95.45% and 99.73%, respectively), so we can say the prediction intervals are well-calibrated. Points outside the prediction bands represent cases where the uncertainty estimate was insufficient, i.e., these are the "hard" cases where the model struggles to make correct predictions. While 3$\sigma$ bands provide higher confidence (99.6% vs 95.3%), they are wider and less informative. The ideal model would provide narrow bands with high confidence, indicating both precision and reliability.

Notice in the inset that the bands become much wider in the large-force regime. This suggests the model may be unable to properly capture regions where the potential energy surface exhibits large gradients.

Beyond Distribution-Free Guarantees

Exact split-conformal validity holds when the score function (and any normalization) is fixed before seeing the calibration labels. Here the scale function $\sigma(\cdot)$ is fit on the calibration set itself, which can break exchangeability of the scores and thus the finite-sample guarantee. The trade-off often yields tighter, more efficient intervals, but coverage should be validated empirically. Using a separate proper-training set to learn $\sigma(\cdot)$ and reserving calibration solely for ranking scores restores the classical guarantee.
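A sketch of that variant, where the scale model never sees the calibration labels used for ranking (the function name and split are my own illustration, not part of the script above):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def split_normalized_conformal(X_train, r_train, X_cal, r_cal, alpha):
    """Learn the residual scale sigma(.) on a proper-training split, then use the
    calibration split only to rank normalized scores, preserving the exact
    finite-sample guarantee.

    X_* are 2-D feature arrays (e.g. |f_hat| as a column); r_* are the
    corresponding absolute residuals |y - y_hat|.
    """
    scale = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
    scale.fit(X_train, r_train)  # sigma(.) never touches the calibration labels
    sigma_cal = np.clip(scale.predict(X_cal), 1e-8, None)
    s = np.sort(r_cal / sigma_cal)  # normalized nonconformity scores
    r = min(int(np.ceil((len(s) + 1) * (1 - alpha))), len(s))
    return scale, float(s[r - 1])  # (scale model, q_hat)
```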

There is an inherent tension in uncertainty quantification between confidence and precision. Higher sigma values (3$\sigma$) provide greater confidence that the true value lies within the band, but at the cost of wider, less informative intervals. Conversely, lower sigma values (2$\sigma$) provide narrower, more precise intervals but with lower confidence guarantees.

The other aspect is that we can assess the honesty of the model by looking at the test coverage, i.e., how often the true value is in the prediction interval. In Figure 1 we see the test coverage is close to the nominal confidence level for both 2$\sigma$ and 3$\sigma$.

I think this is an interesting approach to model uncertainty that is simple in a lot of ways. So what are the downsides? The main one is that it requires a lot of calibration data to compute reliable prediction intervals; if you have a small dataset, you may not be able to estimate the prediction interval/set well. Another issue is that conformal prediction gives you marginal coverage, not conditional coverage, meaning the guarantees hold on average over your whole dataset but not uniformly for every input, so you might have a model that's well-calibrated overall but terrible for specific material classes or rare chemistries. The method also assumes your data are i.i.d. or exchangeable; if you have time dependence, selection bias, or covariate shift, the guarantees can break down completely. The efficiency depends heavily on your choice of nonconformity score, so if you pick a poor one you'll get conservative, overly wide prediction intervals. It also doesn't give you per-point probabilities, just sets with frequency guarantees, so if you need likelihoods you'll have to look elsewhere. Distribution shift can inflate your residual quantiles, giving you uninformative bands, and for structured targets like energy+forces+stress, joint coverage can be quite conservative.

Where does one go from here?

The prediction bands provide a natural criterion for active learning: points with wide bands (or, on labeled hold-out data, points falling outside the bands) represent high-uncertainty cases that would benefit from additional training data. The idea then is to use conformal prediction as the uncertainty quantification for the ML model and, within an active learning framework, decide how to query the most informative samples. With conformal prediction you would propose a bunch of unlabeled candidate inputs, pass them through the model, and use the width of the prediction interval as the measure of uncertainty in each prediction. Then, based on your sampling strategy, you would query an oracle to label the selected data point(s), retrain/fine-tune the model with the new data, update the calibration set, and repeat the process. A rough sketch of such a loop is given below.
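To make the loop concrete, here is a rough, hypothetical sketch; `model`, `fit_model`, `interval_width`, and `oracle_label` are illustrative placeholders (e.g., the oracle could be a DFT calculation), not functions from any library used above:

```python
import numpy as np

def active_learning_loop(model, fit_model, interval_width, oracle_label,
                         train, calibration, pool, alpha=0.1, n_rounds=5, batch=10):
    """Conformal-prediction-driven active learning sketch.

    interval_width(model, calibration, x, alpha) should return the conformal
    half-width for candidate x; oracle_label(x) returns the true label;
    fit_model(train) returns a refreshed (model, calibration) pair.
    """
    for _ in range(n_rounds):
        # Score every unlabeled candidate by its conformal interval width.
        widths = np.array([interval_width(model, calibration, x, alpha) for x in pool])
        # Query the oracle for the widest (most uncertain) candidates.
        query_idx = set(np.argsort(widths)[-batch:].tolist())
        train = train + [(pool[i], oracle_label(pool[i])) for i in query_idx]
        pool = [x for i, x in enumerate(pool) if i not in query_idx]
        # Retrain/fine-tune and refresh the calibration set before the next round.
        model, calibration = fit_model(train)
    return model
```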

Uncertainty quantification isn't about being right, it's about being right with the right amount of confidence. A model that's always right but with huge prediction bands is less useful than one that's right 95% of the time with narrow, informative bands. Our results demonstrate a well-calibrated model that appropriately increases uncertainty for challenging cases, which is exactly what you want in a reliable uncertainty quantification system.


References

[1] A.N. Angelopoulos, S. Bates, A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, (2022). https://doi.org/10.48550/arXiv.2107.07511.


