TxPert: using multiple knowledge graphs for prediction of transcriptomic perturbation effects – Nature Biotechnology
TxPert architecture
TxPert predicts the transcriptional response \({\bf{y}}\in {\mathcal{Y}}\subset {{\mathbb{R}}}^{n}\) given a set of perturbation tokens \(P\subset {\mathcal{P}}:=\{1,\ldots ,N\}\) and a basal state representation derived from a control expression profile \({\bf{x}}\in {\mathcal{X}}\subset {{\mathbb{R}}}^{n}\), which has been aligned with the predicted cell with respect to cell line and batch effect. Here, \(n\in {\mathbb{N}}\) denotes the number of experimentally measured genes and \(N\in {\mathbb{N}}\) denotes the total number of observed perturbations in the data. These perturbation tokens (or identifiers) are used to select node representations from a biological gene (product)–gene (product) interaction KG whose embeddings are integrated with the basal state to produce the perturbed expression profile.
To combine the information of the basal state and the perturbations, we first learn latent representations of both, that is, \({\bf{x}}\mapsto {\bf{s}}\in {{\mathbb{R}}}^{d}\) and \(P\mapsto \{{{\bf{z}}}_{p}\in {{\mathbb{R}}}^{d}:p\in P\}\) for a chosen latent dimension \(d\in {\mathbb{N}}\). Then, we combine the information through latent shift, where a learned decoder gϕ predicts the perturbation effect from the given context, that is, \(\widehat{{\bf{y}}}={g}_{\phi }({\bf{s}}+{\sum }_{p\in P}{{\bf{z}}}_{p})\), using the mean squared error (MSE):
$${\mathcal{L}}({\bf{y}},\widehat{{\bf{y}}})=\frac{1}{n}\parallel {\bf{y}}-\widehat{{\bf{y}}}{\parallel }_{2}^{2}$$
This setup naturally integrates with single, double and more general multiperturbation settings of order m: = ∣P∣ through the additive and compositional design. More sophisticated combination functions may be used to learn transition functions \({{\bf{s}}}^{{\prime} }={T}_{\psi }({\bf{s}},{\bf{z}})\) that allow sequential latent cell state modeling of subsequently applied perturbations, for example, \(\widehat{{\bf{y}}}={g}_{\phi }({T}_{\psi }(\ldots {T}_{\psi }({\bf{s}},{{\bf{z}}}_{{{\bf{p}}}_{{\bf{1}}}})\ldots ,{{\bf{z}}}_{{{\bf{p}}}_{{\bf{m}}}}))\). To obtain s and zp, p ∈ P, we use learned encoders, namely the basal state model and the perturbation model, which are discussed below.
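For illustration, the latent-shift formulation can be sketched as follows (a minimal stand-in, not the released TxPert implementation; a plain embedding table replaces the GNN-derived perturbation embeddings, and all module names and dimensions are our own):

```python
import torch
import torch.nn as nn

class LatentShift(nn.Module):
    """Toy latent-shift model: y_hat = g_phi(s + sum_p z_p)."""

    def __init__(self, n_genes: int, d: int, n_perts: int):
        super().__init__()
        self.basal = nn.Sequential(nn.Linear(n_genes, d), nn.ReLU(), nn.Linear(d, d))
        self.pert_emb = nn.Embedding(n_perts, d)     # z_p (a GNN output in the real model)
        self.decoder = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n_genes))

    def forward(self, x, pert_idx):
        s = self.basal(x)                            # basal latent state s
        z = self.pert_emb(pert_idx).sum(dim=1)       # additive composition over |P| tokens
        return self.decoder(s + z)                   # latent shift, then decode

torch.manual_seed(0)
model = LatentShift(n_genes=50, d=16, n_perts=10)
x = torch.randn(4, 50)                     # batch of control expression profiles
perts = torch.randint(0, 10, (4, 2))       # double perturbations (m = 2)
y_hat = model(x, perts)
loss = nn.functional.mse_loss(y_hat, torch.randn(4, 50))
```

Because the perturbation embeddings are summed before decoding, the same module applies unchanged to single, double and higher-order perturbation sets.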
Basal state model
The basal state encoder is designed to capture intrinsic cellular attributes, such as cell-cycle stage, cell type and other baseline phenotypic features, by mapping a control cell’s gene expression profile \({\bf{x}}\in {{\mathbb{R}}}^{n}\) to a compact, low-dimensional embedding, \({\bf{s}}={f}_{{\mathtt{basal}}}({\bf{x}})\in {{\mathbb{R}}}^{d}\). The basal state model was subject to hyperparameter tuning, with an MLP performing best for the unseen and double tasks (Supplementary Note 1; Supplementary Fig. 7 for the cross-cell-type task). The MLP learns a direct deterministic mapping from high-dimensional input gene expression data to a fixed-size embedding space. The MLP architecture offers a simple and computationally efficient framework for representation learning while still retaining the capacity to model complex, nonlinear dependencies inherent in gene expression data.
Basal information matching and aggregation
An important aspect of modeling the basal state is the alignment of the control with the predicted perturbed cell. Note that experimental protocols can vary widely between different data sources. Therefore, we randomly sample this control according to the same cell line and dataset or experimental protocol. Furthermore, we explore basal state matching, where the control cell is selected to closely match the batch metadata of the perturbed sample. As this matching is not unique, we randomly sample one such appropriate control. Lastly, we experiment with basal state averaging, where instead of a single control, we compute the average expression profile across all matching controls for a given cell line and/or batch. This produces a more stable estimate of the basal state. Both strategies consistently improved model performance in our experiments.
Encoders
Beyond MLP encoders on raw gene expression profiles, we explored multiple transcriptomics foundation model embeddings to obtain a latent representation of the basal state. Specifically, we experiment with scGPT and with scVI pretrained on the Joung dataset39. We also include a variant with no basal state encoder, where the (latent) basal state space is represented directly by the raw expression profile (that is, s: = x). In this configuration, the perturbation decoder learns a Δ vector from the perturbation embedding, which is added to the control profile \({\bf{x}}\in {{\mathbb{R}}}^{n}\); that is, \(\widehat{{\bf{y}}}={\bf{x}}+{g}_{\phi }({\sum }_{p\in P}{{\bf{z}}}_{p})\). This resembles the formulation of the general baseline, with a trained model predicting the Δ instead of a hand-crafted heuristic. Unsurprisingly, this variant performs best in settings with limited data availability for learning robust basal state representations, for example, perturbation effect prediction in cell lines without seen perturbations.
Perturbation model
We rely on GNNs that use biological KGs capturing gene (product)–gene (product) interactions to learn informative embeddings for gene perturbations. The GNN learns a matrix of node embeddings associated with the perturbation tokens \(\{1,\ldots ,N\}\mapsto {\bf{Z}}\in {{\mathbb{R}}}^{N\times d}\), where \(N\in {\mathbb{N}}\) is the number of perturbations relevant to the investigated task. Each row (or node representation) of this matrix represents the latent encoding of a specific perturbation; that is, \({{\bf{z}}}_{p}\in {{\mathbb{R}}}^{d}\) required for the latent shift is the pth row of Z. More specifically, we first associate each perturbation p ∈ {1, …, N} with a randomly initialized input node embedding \({{\bf{h}}}_{p}^{0}\in {{\mathbb{R}}}^{{d}_{0}},{d}_{0}\in {\mathbb{N}}\); these embeddings are consolidated in the input node feature matrix \({{\bf{H}}}^{0}\in {{\mathbb{R}}}^{N\times {d}_{0}}\). During training, these node features are (1) treated as model parameters that are learned using backpropagation and (2) subsequently refined through the message passing of an L-layer GNN, that is, \({{\bf{H}}}^{\ell }={{\mathtt{LAYER}}}_{{\theta }_{\ell }}({{\bf{H}}}^{\ell -1}),1\le \ell \le L\) and Z: = HL. This allows the model to characterize perturbation effects on the basis of known relationships from KGs such as GO19, STRING18 or proprietary data sources.
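The learnable input embeddings refined by message passing can be sketched as below (a simplified mean-aggregation layer stands in for the GAT and Exphormer layers used in practice, and an identity matrix stands in for a real row-normalized KG adjacency):

```python
import torch
import torch.nn as nn

class MeanAggLayer(nn.Module):
    """Simplified message-passing layer (stand-in for GAT/Exphormer layers)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(2 * d_in, d_out)

    def forward(self, H, adj):
        msg = adj @ H                                   # aggregate neighbor features
        return torch.relu(self.lin(torch.cat([H, msg], dim=-1)))

torch.manual_seed(0)
N, d0, d = 6, 8, 4
H0 = nn.Parameter(torch.randn(N, d0))                   # learnable input node embeddings h_p^0
adj = torch.eye(N)                                      # placeholder KG adjacency
layers = nn.ModuleList([MeanAggLayer(d0, d), MeanAggLayer(d, d)])
H = H0
for layer in layers:                                    # L rounds of message passing
    H = layer(H, adj)
Z = H                                                   # Z := H^L, one row per perturbation
```

Both the input embeddings `H0` and the layer weights receive gradients, matching the two roles of the node features described above.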
Real-world KGs present inherent imperfections; they often contain noisy or incorrect edges (false positives), suffer from missing connections (false negatives) and may originate from diverse sources offering multiple, sometimes conflicting, perspectives. The effective use of such complex data necessitates GNN architectures specifically chosen to address these challenges.
To this end, we select and adapt two fundamental GNN approaches. First, to handle noisy edges, we use attention-based models such as the GAT40,41. The ability of GAT to dynamically (re)weight neighbor importance provides much needed robustness by effectively downweighting less credible connections, which is a major difference relative to non-attention-based methods such as the simple graph convolution42 used in GEARS. Second, to address graph incompleteness and capture long-range dependencies, we use GTs, specifically Exphormer16,17. Its capacity for attention beyond immediate graph neighbors allows it to potentially model implicit relationships and bridge structural gaps.
Furthermore, it proved crucial for the presented tasks to use multiple KGs that offer complementary and reinforcing perspectives on the task-related biology. For this, we explore architectures designed for synergistic learning from diverse sources. This includes extending GAT to GAT-Hybrid (allowing for node-level attention weighting of information from different KGs), introducing our provenance-aware Exphormer-MG variant and developing GAT-MLG, a multilayer extension of GAT that uses a supra-adjacency representation to effectively integrate information across multiple biological networks.
In Supplementary Note 3, we provide rigorous details on the relevant graph representation learning used for encoding perturbations, the proposed models and the techniques applied to take advantage of complementary information from multiple KGs. In our experimental setup, the learned embeddings for the perturbed gene(s) from the GNN were extracted and combined with a basal state representation to predict the resulting gene expression profile. This comparative analysis allowed us to investigate how different GNN strategies (attention, flexible connectivity and multigraph fusion) perform when learning from the complexities of biological KGs.
Training and evaluation framework
Algorithm 1
TxPert training algorithm
Require: Pert. cells \({\mathcal{Y}}\), control cells \({\mathcal{X}}\), biological prior graph G = (V, E) with perturbations \({\mathcal{P}}\subset V\)
Ensure: Minimize MSE loss between predicted and true perturbed cell measurements
1: Initialize input perturbation embeddings \({\{{{\bf{h}}}_{v}^{0}\}}_{v\in V}\) randomly
2: for each training step do
3: Sample a batch of perturbed cell profiles: \({\{{{\bf{y}}}_{i}\}}_{i=1}^{B}\subset {\mathcal{Y}}\)
4: Sample corresponding control cells from the same experimental batches: \({\{{{\bf{x}}}_{i}\}}_{i=1}^{B}\subset {\mathcal{X}}\)
5: Enrich perturbation embeddings using graph prior: \({\{{{\bf{z}}}_{v}\}}_{v\in V}\leftarrow {\mathtt{GNN}}(G,{\{{{\bf{h}}}_{v}^{0}\}}_{v\in V})\)
6: for each sample in batch do
7: Encode control cells into basal latent space: bi ← MLPbasal(xi)
8: Retrieve perturbations \({P}_{i}\subset {\mathcal{P}}\) associated with target yi
9: Combine control and perturbation embeddings: \({\widehat{{\bf{z}}}}_{i}\leftarrow {\mathtt{COM}}({{\bf{b}}}_{i},\{{{\bf{z}}}_{p}:p\in {P}_{i}\})\)
10: Decode to predicted perturbed profile: \({\widehat{{\bf{y}}}}_{i}\leftarrow {{\mathtt{MLP}}}_{{\rm{dec}}}({\widehat{{\bf{z}}}}_{i})\)
11: Compute loss for each sample: \({{\mathcal{L}}}_{i}\leftarrow {\mathtt{MSE}}({\widehat{{\bf{y}}}}_{i},{{\bf{y}}}_{i})\)
12: end for
13: Compute total loss over batch: \({\mathcal{L}}\leftarrow {\sum }_{i=1}^{B}{{\mathcal{L}}}_{i}\)
14: Backpropagate and update model parameters
15: end for
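Algorithm 1 can be condensed into a runnable sketch (synthetic data; the GNN enrichment of line 5 is reduced to an identity mapping and single linear layers stand in for the actual MLPs):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, N, B = 20, 8, 5, 16                       # genes, latent dim, perturbations, batch size
H0 = nn.Parameter(torch.randn(N, d))            # line 1: random input perturbation embeddings
basal = nn.Linear(n, d)                         # MLP_basal (one layer for brevity)
dec = nn.Linear(d, n)                           # MLP_dec
opt = torch.optim.Adam([H0, *basal.parameters(), *dec.parameters()], lr=1e-2)

X = torch.randn(100, n)                         # control cells
Y = torch.randn(100, n)                         # perturbed cells
P = torch.randint(0, N, (100,))                 # perturbation token per perturbed cell

for step in range(50):                          # line 2: training loop
    idx = torch.randint(0, 100, (B,))           # lines 3-4: batch of cells + matched controls
    Z = H0                                      # line 5: GNN enrichment reduced to identity
    b = basal(X[idx])                           # line 7: encode controls
    z_hat = b + Z[P[idx]]                       # line 9: COM = additive latent shift
    y_hat = dec(z_hat)                          # line 10: decode
    loss = nn.functional.mse_loss(y_hat, Y[idx])  # lines 11-13: MSE loss over batch
    opt.zero_grad()
    loss.backward()
    opt.step()                                  # line 14: update parameters
```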
Data splits
The data were split into training, validation and test sets by grouping by perturbation, such that distinct sets of unseen perturbations were reserved for both the validation and test sets, with target ratios of 0.5625, 0.1875 and 0.25 for the training, validation and test sets, respectively. Moreover, for the cross-cell-type task, the test set was a reserved cell type with only control cells included during training, with a breakdown into seen and unseen perturbations therein. As an exception, for the doubles task on the Norman dataset, predefined splits were loaded from the GEARS setup.
Optimal hyperparameters for each model were selected based on the validation Pearson Δ metric. Only metrics on the test set are reported.
Metric definitions
All metrics are reported as weighted averages, that is, the mean of the mean across cells subjected to each unique perturbation, unless otherwise specified.
Expression value representations and Δ
Where not otherwise specified, expression values \({\bf{x}}\in {\mathcal{X}}\cup {\mathcal{Y}}\) are represented as log1p-transformed and library size-normalized counts (with a target library size of 4,000); that is, for a raw count \({{\bf{x}}}_{{\rm{raw}}}\in {{\mathbb{R}}}^{n}\), we define
$${\bf{x}}:=\log \left(1+4000\cdot \frac{{{\bf{x}}}_{\mathrm{raw}}}{\parallel {{\bf{x}}}_{\mathrm{raw}}{\parallel }_{1}}\right).$$
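In NumPy, this transformation reads (a direct transcription of the formula, with `target` being the 4,000-count library size):

```python
import numpy as np

def normalize(x_raw, target=4000.0):
    """Library-size normalize to `target` total counts, then log1p."""
    return np.log1p(target * x_raw / x_raw.sum())

x = normalize(np.array([10.0, 30.0, 60.0]))   # raw counts with library size 100
```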
The other representation used is a Δ representation, which is centered on batch-matched controls. Specifically, for each perturbed cell expression \({{\bf{y}}}_{i}\in {\mathcal{Y}}\) with cell line c and batch b, the expression is transformed to
$${\delta }_{i}:={{\bf{y}}}_{i}-{\overline{{\bf{x}}}}_{c,b},$$
where \({\overline{{\bf{x}}}}_{c,b}\) represents the mean expression of control cells \({\bf{x}}\in {\mathcal{X}}\) with batch b and cell line c.
Pearson Δ
Slightly modified from the metric ‘Pearson correlation (Δ expression)’ from the GEARS manuscript, Pearson Δ calculates the correlation between predicted and observed log fold change versus batch-matched control mean,
$${\text{Pearson}}\Delta (p):=\mathrm{Pearson}({\widehat{\delta }}_{p},{\delta }_{p}),$$
where \({\widehat{\delta }}_{p}\) and δp are the batch-matched control centerings of the prediction and ground truth, respectively, averaged across replicates of a given perturbation \(p\in {\mathcal{P}}\). For simplicity, we define this and the following metrics for single perturbations \(p\in {\mathcal{P}}\) but note that analogous formulations are appropriate for multiple perturbations \(P\subset {\mathcal{P}}\). The results across all predicted perturbation effects are then averaged to obtain an overall performance estimate.
Note that, for the GEARS model only, we report the exact ‘Pearson correlation (Δ expression)’ from the GEARS code base instead. We confirmed that any differences between ‘Pearson correlation (Δ expression)’ and our ‘Pearson Δ’ were much smaller in practice than the differences between models.
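A minimal sketch of the metric for a single perturbation (control centering followed by Pearson correlation; the averaging across replicates and perturbations is omitted):

```python
import numpy as np

def pearson_delta(y_hat, y, ctrl_mean):
    """Pearson correlation of control-centered predicted vs observed expression."""
    d_hat, d_obs = y_hat - ctrl_mean, y - ctrl_mean
    return np.corrcoef(d_hat, d_obs)[0, 1]

ctrl = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.0, 4.0, 4.5])
r_perfect = pearson_delta(y, y, ctrl)            # exact prediction
r_anti = pearson_delta(2 * ctrl - y, y, ctrl)    # perfectly anticorrelated prediction
```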
Retrieval
We use two variants of the retrieval rank metric that score a prediction’s similarity to the ground truth not in absolute terms but relative to other perturbations. These metrics are the same as the rank average from PerturBench10, except that they focus on similarity, with a perfect score of 1, a random score of 0.5 and a score of 0 for a perfectly anticorrelated prediction:
$$\begin{array}{ll}\mathrm{Retrieval}: & =\displaystyle\frac{1}{N}\sum _{p\in {\mathcal{P}}}\,\mathrm{rank}\,({\hat{\delta }}_{p}),\\ \mathrm{rank}\,({\hat{\delta }}_{p}): & =\displaystyle\frac{1}{N-1}\mathop{\sum}\limits_{{q\in {\mathcal{P}}}\atop {q\ne p}}{{\bf{1}}}_{\{\mathrm{Pearson}({\hat{\delta }}_{p},{\delta }_{p})\ge \mathrm{Pearson}({\hat{\delta }}_{p},{\delta }_{q})\}}.\end{array}$$
For ‘normalized’ retrieval, the perturbation count \(N:=| {\mathcal{P}}|\) matches the original experiment, whereas, for ‘fast retrieval’, for computational efficiency, a seeded random reference set of only 100 perturbations is used, with the addition of the query perturbant p when not in the reference set (thus, N ∈ {100, 101}). Similar to Pearson Δ, we report the averaged performance across all perturbations.
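The rank computation can be sketched as follows (illustrative NumPy only; `deltas` holds one observed Δ profile per reference perturbation):

```python
import numpy as np

def retrieval_rank(d_hat, deltas, p):
    """Fraction of reference perturbations q != p whose similarity to the
    prediction does not exceed that of the true perturbation p."""
    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]
    ref = corr(d_hat, deltas[p])
    others = [q for q in range(len(deltas)) if q != p]
    return np.mean([ref >= corr(d_hat, deltas[q]) for q in others])

rng = np.random.default_rng(0)
deltas = rng.normal(size=(5, 30))                  # observed delta per perturbation
rank_self = retrieval_rank(deltas[2], deltas, 2)   # a perfect prediction ranks first
```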
Nonlearned general baseline
To establish a performance floor, we implement a nonlearned general baseline model that predicts expression profiles using mean values observed in the training data. This baseline uses an additive approach that combines the following:
- The mean test cell type control expression profile
- Either the perturbation-specific mean changes (for seen perturbations) or the global perturbation mean (for unseen perturbations)
- When multiple cell lines are present in the training set, we either use a weighted average according to the number of samples per cell line or the perturbation-specific mean changes from the most similar cell line (nearest-cell-line baseline). Here, similarity is determined on the basis of mean correlation of shared perturbation Δ values between the test and candidate cell line.
Consider a multiset of training samples \(\,\mathrm{Train}\,\subset {{\mathcal{P}}}_{\mathrm{train}}\times {{\mathcal{C}}}_{\mathrm{train}}\times {{\mathcal{B}}}_{\mathrm{train}}\) consisting of combinations of perturbation(s), cell line(s) and batch effect(s), with a multiset Test defined analogously. Consider a perturbation p such that (p, cp, bp) ∈ Test with cell line cp and batch effect bp. Implicitly, each pair (cp, bp) is associated with a set of control cell profiles observed in that context.
If there exists (p, c, b) ∈ Train, we have
$$\begin{array}{ll}{\widehat{{\bf{y}}}}_{(p,{c}_{p},{b}_{p})} & =\\ & {\bar{{\bf{x}}}}_{({c}_{p},{b}_{p})}+\displaystyle\frac{1}{| \{(q,c,b)\in \,\mathrm{Train}:q=p\}| }\mathop{\sum }\limits_{{(q,c,b)\in \,\mathrm{Train}}\atop{q=p}}\left({{\bf{y}}}_{(q,c,b)}-{\bar{{\bf{x}}}}_{(c,b)}\right).\end{array}$$
Otherwise, we use the global Δ across perturbations observed in the training set, that is,
$${\widehat{{\bf{y}}}}_{(p,{c}_{p},{b}_{p})}={\bar{{\bf{x}}}}_{({c}_{p},{b}_{p})}+\frac{1}{| \,\mathrm{Train}| }\mathop{\sum }\limits_{(q,c,b)\in \mathrm{Train}}\left({{\bf{y}}}_{(q,c,b)}-{\bar{{\bf{x}}}}_{(c,b)}\right).$$
For multiple perturbations, this baseline is implemented to initially attempt to use samples where the exact perturbation configuration is present. Otherwise, the perturbation is split into its components and each component is sequentially added to the test control mean according to the above method, adding a local Δ estimate if available and resorting to a global Δ otherwise.
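The seen/unseen logic of the single-perturbation baseline can be sketched as (illustrative only; per-cell-line weighting and the nearest-cell-line variant are omitted):

```python
import numpy as np

def baseline_predict(p, ctrl_mean_test, train_deltas):
    """train_deltas: dict mapping perturbation -> list of per-sample delta vectors."""
    if p in train_deltas:                           # seen: perturbation-specific mean delta
        delta = np.mean(train_deltas[p], axis=0)
    else:                                           # unseen: global mean delta over training
        all_d = [d for ds in train_deltas.values() for d in ds]
        delta = np.mean(all_d, axis=0)
    return ctrl_mean_test + delta

train = {"A": [np.array([1.0, 0.0])], "B": [np.array([0.0, 2.0])]}
ctrl = np.array([5.0, 5.0])
seen = baseline_predict("A", ctrl, train)
unseen = baseline_predict("C", ctrl, train)
```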
Experimental reproducibility estimation: split-half validation and sample-based extension
As Perturb-seq is a destructive assay, we cannot observe the same cell in both perturbed and unperturbed states. This necessitates focusing on distribution means rather than individual cell accuracies. To approximate experimental reproducibility, we first use a split-half validation approach:
For each combination of perturbation(s), cell line context and batch, we apply three operations:
1. Divide test cells into two roughly equal halves
2. Calculate mean expression profiles for each half
3. Measure the agreement between these means using various metrics
To account for the randomness in choosing the half-split, we repeat the experiment across multiple seeded runs and report average performance. This provides a performance benchmark analogous to human-level reproducibility, which is called accuracy in other machine learning domains.
Consider the set of expression profiles \({\mathcal{S}}\subset {\mathcal{Y}}\) for a fixed perturbation, cell line context and batch (p, c, b) in the test set:
$$\begin{array}{rcl}{{\mathcal{S}}}^{{\prime} } & \subseteq & {\mathcal{S}}:\,| {{\mathcal{S}}}^{{\prime} }| \approx | {\mathcal{S}}| /2\\ {\bar{{\mathcal{S}}}}_{1} & = & \frac{1}{| {{\mathcal{S}}}^{{\prime} }| }\mathop{\sum }\limits_{{\bf{y}}\in {{\mathcal{S}}}^{{\prime} }}{\bf{y}}\\ {\bar{{\mathcal{S}}}}_{2} & = & \frac{1}{| {\mathcal{S}}\backslash {{\mathcal{S}}}^{{\prime} }| }\mathop{\sum }\limits_{{\bf{y}}\in {\mathcal{S}}\backslash {{\mathcal{S}}}^{{\prime} }}{\bf{y}}.\end{array}$$
We then report
$$\mathrm{Reproduce}(p,c,b)=\mathrm{Metric}({\bar{{\mathcal{S}}}}_{1},{\bar{{\mathcal{S}}}}_{2}),$$
where Metric represents any of our evaluation metrics, for example, Pearson Δ, Retrieval or MSE. Theoretically, the split-half experimental reproducibility is not expected to establish an upper bound for the performance of all models at test time because it operates on a different test set (using only one half each for prediction and evaluation). However, it empirically proves to be a useful, competitive (but still theoretically reachable) mark to beat.
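A sketch of the split-half computation for one (p, c, b) context (Pearson correlation as the agreement metric; one random split is shown, whereas we average over multiple seeded splits):

```python
import numpy as np

def split_half(S, rng):
    """Split the cells of one (p, c, b) context in two and compare half-means."""
    idx = rng.permutation(len(S))
    half = len(S) // 2
    m1 = S[idx[:half]].mean(axis=0)
    m2 = S[idx[half:]].mean(axis=0)
    return np.corrcoef(m1, m2)[0, 1]      # Pearson as the agreement metric

rng = np.random.default_rng(0)
# 200 synthetic cells x 40 genes, with a gene-wise signal plus noise
S = rng.normal(loc=np.linspace(0.0, 3.0, 40), scale=0.1, size=(200, 40))
r = split_half(S, rng)
```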
As split-half reproducibility is likely an underestimate because of a reduced (halved) number of replicates, we also introduce a sample-based approach that gives an estimate for the reproducibility of the full-size dataset. Using the original count matrix, we calculate the per-batch probability distribution over genes (multinomial distribution maximum-likelihood estimator) for each perturbation. We then sample these distributions to generate two datasets that have the same number of observations (that is, cells) as the original dataset, but with stochastically resampled counts. These are then log1p-normalized and subset to the HVGs in the same way as the original dataset, before calculating the experiment reproducibility as described above. A comparison of split-half and sampled reproducibility is shown in Supplementary Table 4.
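The sample-based estimate can be sketched as follows (synthetic counts; the multinomial maximum-likelihood estimate is simply the normalized gene-wise count sum, and HVG subsetting is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic raw counts for one perturbation: 100 cells x 30 genes
counts = rng.poisson(lam=np.linspace(1.0, 20.0, 30), size=(100, 30))
p_hat = counts.sum(axis=0) / counts.sum()         # multinomial MLE over genes
p_hat = p_hat / p_hat.sum()                       # guard against floating-point drift
depth = counts.sum(axis=1)                        # per-cell library sizes

def resample(rng):
    """Draw a synthetic dataset with the original cell count and depths."""
    return np.vstack([rng.multinomial(d, p_hat) for d in depth])

a = np.log1p(resample(rng)).mean(axis=0)
b = np.log1p(resample(rng)).mean(axis=0)
r = np.corrcoef(a, b)[0, 1]                       # sampled reproducibility estimate
```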
Data
Perturb-seq data sources
We demonstrate the efficacy of our approach across a range of datasets, including CRISPRi (gene knockdown) of ~2,000 essential genes in K562 and RPE1 cell lines from a previous study13 (also used in GEARS4) and similarly designed CRISPRi experiments in Jurkat and HEPG2 cell lines from another previous study24. Furthermore, we use the Norman15 dataset, comprising 94 unique single and 110 unique double CRISPRa (gene overexpression) perturbations in the K562 cell line.
Graphs: sourcing and processing
The graphs used as inductive bias in this work can be classified into two main categories: (1) curated publicly available biological knowledge and (2) large-scale perturbation screens.
The curated graphs from category 1 include the GO graph, first used by GEARS, which is constructed by assigning edges between nodes that have a high Jaccard Index in their GO terms19, the STRING graph18 and Reactome43.
Category 2 graphs are generated from large-scale perturbation screens including DepMap44 and Perturb-seq45. These are extensive datasets linking genetic perturbation to either morphological or transcriptomic outcomes, which can offer particularly crucial insights into cellular responses to stimuli. To translate these experimental screens into graphs, we use derived embeddings to represent the genes and cell lines in a high-dimensional space, allowing for the analysis of relationships and identification of dependencies.
To curate these graphs, we first compute the pairwise similarity score between all combinations of genes. This means that, for each pair of genes (gi, gj), we compute the cosine similarity between their (aggregated) embeddings \({{\bf{x}}}_{{g}_{i}}\) and \({{\bf{x}}}_{{g}_{\!j}}\). Cosine similarity is computed as follows:
$$\begin{array}{rcl}\,{{\rm{cosine}}\; {\rm {similarity}}}\,({{\bf{x}}}_{{g}_{i}},{{\bf{x}}}_{{g}_{\!j}}) & = & \frac{{{\bf{x}}}_{{g}_{i}}\cdot {{\bf{x}}}_{{g}_{\!j}}}{\parallel {{\bf{x}}}_{{g}_{i}}\parallel \parallel {{\bf{x}}}_{{g}_{\!j}}\parallel }\\ & = & \frac{{\sum }_{k=1}^{n}{{\bf{x}}}_{{g}_{i}k}{{\bf{x}}}_{{g}_{\!j}k}}{\sqrt{{\sum }_{k=1}^{n}{{\bf{x}}}_{{g}_{i}k}^{2}}\cdot \sqrt{{\sum }_{k=1}^{n}{{\bf{x}}}_{{g}_{\!j}k}^{2}}}\end{array}$$
where
- \({{\bf{x}}}_{{g}_{i}}\cdot {{\bf{x}}}_{{g}_{\!j}}\) represents the dot product of the vectors
- \(\parallel {{\bf{x}}}_{{g}_{i}}\parallel\) and \(\parallel {{\bf{x}}}_{{g}_{\!j}}\parallel\) represent the Euclidean norms (magnitudes) of vectors \({{\bf{x}}}_{{g}_{i}}\) and \({{\bf{x}}}_{{g}_{\!j}}\), respectively
- \({{\bf{x}}}_{{g}_{i}k}\) and \({{\bf{x}}}_{{g}_{\!j}k}\) are the individual components of vectors \({{\bf{x}}}_{{g}_{i}}\) and \({{\bf{x}}}_{{g}_{\!j}}\), respectively.
These cosine similarities are converted to their absolute values because strongly negative and strongly positive similarities both indicate a strong relationship; the sign does not translate directly into the weight of an edge in the graph.
We additionally use proprietary data from internal genome-wide perturbation screens, where we measure the similarity of perturbation effect using both microscopy imaging and transcriptomics in various cell types.
Filtering configurations were optimized empirically. We found that the most performant configuration involved selecting the top 1% of edges by (absolute) weight for screen-based graphs. For all other graph types, we (additionally) filtered for no more than 20 incoming edges per target node. Edge direction was assigned arbitrarily for undirected edges.
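The similarity-to-graph conversion can be sketched as (illustrative NumPy; random embeddings stand in for the screen-derived ones):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))                 # one embedding per gene (screen-derived in practice)

unit = E / np.linalg.norm(E, axis=1, keepdims=True)
sim = np.abs(unit @ unit.T)                   # pairwise absolute cosine similarity
np.fill_diagonal(sim, 0.0)                    # no self-edges

i, j = np.triu_indices(50, k=1)               # candidate undirected edges
thresh = np.quantile(sim[i, j], 0.99)         # keep the top 1% of edges by weight
edges = [(a, b, sim[a, b]) for a, b in zip(i, j) if sim[a, b] >= thresh]
```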
Data understanding
Additional methods related to specific analyses are described below.
Pharos knowledge rank
The Pharos initiative consolidates a variety of statistics relating to how researched and well known specific genes are26. Starting from this, we ranked knowledge levels as the mean of the rank of the Pharos PubMed score and the rank of the Pharos negative log novelty score to create a single Pharos knowledge rank. We used this rank to break down and compare model performance and to understand potential bias. The ‘knowledge levels’ 0–4 correspond to the following bins of the Pharos knowledge rank:
- knowledge level 0 (least characterized): 0–0.2
- knowledge level 1: 0.2–0.4
- knowledge level 2: 0.4–0.6
- knowledge level 3: 0.6–0.8
- knowledge level 4 (most characterized): 0.8–1.
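The mean-of-ranks construction can be sketched as (the scores below are made-up illustrative values, not actual Pharos data):

```python
import numpy as np

# hypothetical Pharos-style scores for five genes (illustrative values only)
pubmed = np.array([5.0, 50.0, 500.0, 5000.0, 50000.0])   # PubMed score
novelty = np.array([0.9, 0.5, 0.1, 0.01, 0.001])         # novelty score

def frac_rank(v):
    """Fractional rank in [0, 1]; the largest value gets rank 1 (unique values assumed)."""
    return v.argsort().argsort() / (len(v) - 1)

# mean of the PubMed-score rank and the negative-log-novelty rank
knowledge_rank = (frac_rank(pubmed) + frac_rank(-np.log(novelty))) / 2
```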
Within versus across
In investigating the correlations between controls and mean baselines, we compared ‘within’-context correlation to ‘across’-context correlation. Generally, before calculating either, all examples were first split into two mutually exclusive halves, A and B, where within-context correlation is a comparison of A versus B, in each context, while across-context correlation is a comparison of an arbitrary half of one context to another context. The only exception is across-batch controls in Fig. 1a, for which, to make a conservative estimate of across-batch variance, full batches were aggregated without splitting. For batch comparison, individual control cells were split and aggregated; for the mean baselines, the δ of perturbant replicate cells was preaggregated and then split (such that the halves had nonoverlapping perturbations).
Functional enrichment
To achieve a descriptive biological summary of the actual gene expression changes in the mean baseline, we first calculated a meta mean baseline (mean of all cell types and datasets, using the intersect of provided expressed genes) and defined upregulated and downregulated genes as having a remaining δ > 0.05 or δ < −0.05, respectively. We then ran functional enrichment testing separately on each of these sets (versus the background of all genes in the dataset intersect) using the GOATOOLS package46.
Metric selection through retrieval
To avoid confusion and distraction caused by reporting many similarly performing or perhaps slightly contradicting metrics, we first performed an empirical ‘test of the test metrics’. We adapted the evaluation method pioneered by a previous study12 to work for our data and task. In short, this method uses cross-context retrieval of a perturbation as a way to judge whether a representation and metric together allow the retention and comparison of details necessary to distinguish perturbations. In our case, we modified the method to work on nonaggregated single-cell expression profiles (as this is the input and output of our model) and ran retrieval across the essential perturbation set on core cell types (K562, RPE1, HEPG2 and Jurkat). For each retrieval calculation, instead of aggregating, we first randomly sampled one cell of each perturbation. We ran three replicates on each of the cell type × cell type pairings for n = 3 × 6 = 18 total estimates per perturbation. We report the 0.9 quantile to focus on active perturbants; however, similar patterns were observed at other thresholds.
Note that the choice to focus on single cells excluded use of the representation selected by a previous study12, namely the signed P value. To derisk this, we ran a preliminary analysis on the exact setup described previously12. We were only able to reproduce their results when making choices that would have limited the extensibility of our data and training setup; in particular, we found the high performance of the signed P value to rely on performing a global fit (and, thus, using a global estimate for gene-wise variance) across all contexts for determining differential expression.
We focused our selection on representations and metrics especially common in the perturbational transcriptomics literature but acknowledge the current omission of count-based representations and metrics.
General
Model development and analysis were performed with Python 3.12 and PyTorch 2.6.0. Plotting was performed with a combination of seaborn 0.13.2 and Matplotlib 3.10.8. Box-and-whisker plots used seaborn defaults, where the box represents the 0.25–0.75 quantiles and the center line the median. The whiskers extend to the furthest observed data point within 1.5× the interquartile range of the nearest quartile.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.