Multiomics integration aims to combine data from multiple molecular layers — DNA, RNA, protein, metabolites, epigenome — into a unified analysis. Done well, it reveals biology that no single layer can. Done poorly, it produces complicated visualizations that don’t actually mean anything.
Why integration is hard
Each omics layer has a different scale, distribution, sparsity, and noise structure. Genomic data is mostly invariant across cells. Transcriptomic data is noisy and sparse, especially at single-cell resolution. Proteomic data typically covers fewer features than transcriptomic data. Metabolomic data is often the noisiest of all. No single statistical model fits all layers naturally.
Three categories of integration
Early integration (concatenation)
Stack features from all omics layers into one matrix and analyze them together. This is simple, but it ignores layer-specific structure and is heavily affected by feature-count imbalance: transcriptomics dominates proteomics simply by having more features.
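As a minimal sketch of early integration, the Python below z-scores each feature, optionally down-weights each layer by the square root of its feature count so that every layer contributes the same total variance, and then concatenates the layers. The function names, the weighting scheme, and the toy values are illustrative assumptions, not a method prescribed by any particular tool.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Z-score each feature (column) of a samples x features list of lists."""
    scaled_cols = []
    for col in zip(*matrix):
        mu, sd = mean(col), pstdev(col)
        # Constant features carry no information; map them to zeros.
        scaled_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def concatenate_layers(layers, balance=True):
    """Early integration: place the features of all layers side by side.

    layers:  dict of layer name -> samples x features matrix, with rows in
             the same sample order for every layer.
    balance: if True, divide each block by sqrt(n_features) so each layer
             contributes the same total variance (an assumed, simple
             weighting scheme; other schemes are possible).
    """
    n_samples = len(next(iter(layers.values())))
    combined = [[] for _ in range(n_samples)]
    for name, matrix in layers.items():
        if len(matrix) != n_samples:
            raise ValueError(f"layer {name!r} has a different number of samples")
        scaled = zscore_columns(matrix)
        weight = 1.0 / sqrt(len(scaled[0])) if balance else 1.0
        for row_out, row_in in zip(combined, scaled):
            row_out.extend(v * weight for v in row_in)
    return combined


# Toy data (made up): 4 samples, 6 RNA features, 2 protein features.
rna = [[5, 0, 3, 8, 1, 2], [7, 1, 2, 9, 0, 4], [1, 4, 6, 2, 5, 7], [0, 5, 7, 1, 6, 9]]
protein = [[0.2, 1.5], [0.3, 1.4], [1.1, 0.4], [1.3, 0.2]]
joint = concatenate_layers({"rna": rna, "protein": protein})
```

With `balance=False`, the RNA block in this toy example carries three times the total variance of the protein block purely because it has three times as many features, which is the imbalance described above.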
Intermediate integration (joint modeling)
Model each layer with appropriate statistics, then jointly learn shared latent representations. Most successful modern approaches sit here.
- MOFA / MOFA+: Multi-omics factor analysis — finds latent factors explaining variation across layers
- Multi-Omics Factor Analysis with structured priors: Models layer-specific noise distributions
- scVI / totalVI / multiVI: Variational autoencoder–based methods for single-cell multi-omics
- Seurat WNN (weighted nearest neighbor): Combines modalities by learning per-cell modality weights
Late integration (post-hoc combination)
Analyze each omics layer separately and combine results at the conclusion stage (e.g., gene lists, pathway enrichments). Easiest to implement, sometimes most interpretable, but misses cross-layer interactions.
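One possible way to combine per-layer results is to merge per-gene p-values with Fisher's method. The text above does not prescribe this; it is an illustrative choice, and it assumes the per-layer tests are independent, which is only approximately true when layers come from the same samples. The gene names and p-values below are made up.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form used below.
    """
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(log(p) for p in pvalues)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


def combine_layers(per_layer):
    """Late integration: combine each gene's p-values across the layers
    in which that gene was tested."""
    genes = set().union(*(results.keys() for results in per_layer.values()))
    return {
        gene: fisher_combined_pvalue(
            [results[gene] for results in per_layer.values() if gene in results]
        )
        for gene in sorted(genes)
    }


# Hypothetical per-layer results (gene -> p-value).
per_layer = {
    "rna": {"GENE_A": 0.01, "GENE_B": 0.40, "GENE_C": 0.03},
    "protein": {"GENE_A": 0.02, "GENE_B": 0.55},
}
combined = combine_layers(per_layer)
```

A gene measured in only one layer (here `GENE_C`) simply keeps its single-layer p-value, so the combined list still needs multiple-testing correction afterwards.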
Common questions and how integration helps
| Question | Approach |
|---|---|
| Which DNA variants drive expression? | eQTL: variant + RNA |
| Which methylation changes drive expression? | eQTM: methylation + RNA association (mQTL instead links variants to methylation) |
| How does chromatin accessibility shape expression? | scATAC + scRNA-seq joint modeling |
| Which transcript-level signals persist at the protein level? | RNA + proteomics correlation |
| What’s the molecular subtype of a tumor? | Multi-omics clustering (MOFA, iCluster) |
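
To make the first row of the table concrete, the sketch below tests a single variant-gene pair by regressing expression on genotype dosage (0, 1, or 2 copies of the alternate allele). It shows only the core association. Real eQTL analyses add covariates such as ancestry and batch and correct for multiple testing, none of which is shown here. The function name and the toy values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean


def eqtl_association(dosages, expression):
    """Ordinary least-squares slope and Pearson r for expression ~ dosage."""
    if len(dosages) != len(expression) or len(dosages) < 3:
        raise ValueError("need matched vectors with at least 3 samples")
    mx, my = mean(dosages), mean(expression)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype or expression is constant across samples")
    slope = sxy / sxx             # expression change per extra allele copy
    r = sxy / sqrt(sxx * syy)     # strength of the linear association
    return slope, r


# Toy data (made up): six samples, one variant, one gene.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, r = eqtl_association(dosages, expression)
```
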
Bulk vs single-cell multi-omics
Bulk multi-omics: TCGA, ICGC, and similar consortia have generated WGS + WES + RNA-seq + methylation + proteomics + clinical data on thousands of samples. Standard tools (MOFA, iCluster, NEMO) handle this well.
Single-cell multi-omics: Increasingly common. CITE-seq combines transcriptomics with surface protein. 10x Multiome captures RNA + chromatin accessibility from the same cell. Spatial multi-omics is emerging.
Practical workflow
- QC each layer independently — bad data in any one layer corrupts the integration
- Normalize each layer with method-appropriate techniques (for example, DESeq2 size factors or a variance-stabilizing transform for bulk RNA-seq, log-CPM or TF-IDF for chromatin accessibility, robust scaling for proteomics)
- Decide on the integration approach based on your biological question
- Run the integration; interpret latent factors or clusters biologically
- Validate cross-layer findings with independent data or experiments
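
The normalization step can be sketched with two of the transforms named above: log-CPM for count data and robust (median and interquartile range) scaling for continuous abundances. This is a simplified illustration with assumed function names and a fixed pseudocount. Dedicated packages such as DESeq2 or edgeR implement more careful versions.

```python
from math import log2
from statistics import median, quantiles


def log_cpm(counts, prior_count=1.0):
    """Log2 counts-per-million for one sample's count vector.

    Dividing by the library size removes sequencing-depth differences;
    prior_count avoids taking the log of zero.
    """
    library_size = sum(counts)
    if library_size <= 0:
        raise ValueError("sample has no counts")
    return [log2(c / library_size * 1e6 + prior_count) for c in counts]


def robust_scale(values):
    """Scale one feature across samples: center on the median and divide
    by the interquartile range, so outliers have limited influence."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    center = median(values)
    return [(v - center) / iqr if iqr > 0 else 0.0 for v in values]


# Two toy samples with the same composition but 10x different depth.
shallow = log_cpm([10, 90])
deep = log_cpm([100, 900])

# One toy protein feature measured in five samples, with one outlier.
scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 100.0])
```

After `log_cpm`, the shallow and deep samples have the same values because only their sequencing depth differed, not their composition.
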
Common pitfalls
- Batch effects: If different layers were collected in different batches, batch effects can dominate the integration. Apply batch correction that respects layer structure (for example Harmony or MNN), and check that it does not also remove real biological differences
- Feature imbalance: Without weighting, the layer with more features dominates. Use weighting or layer-balanced methods
- Over-interpretation: Latent factors are not always biologically meaningful — validate
- Missing data: Not every sample has every layer. Choose methods that handle missingness gracefully
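
For the missing-data pitfall, a useful first step is to tabulate which samples have which layers before choosing a method. The sketch below builds such an availability table and lists the complete cases. The data layout (a dict of layer name to per-sample measurements) and all names are assumptions for illustration.

```python
def layer_availability(layers):
    """Report which samples are present in which layers.

    layers: dict of layer name -> dict of sample ID -> measurements.
    Returns (mask, complete), where mask[sample][layer] is True when the
    sample was measured in that layer and complete lists the samples
    measured in every layer.
    """
    all_samples = sorted(set().union(*(set(data) for data in layers.values())))
    mask = {
        sample: {name: sample in data for name, data in layers.items()}
        for sample in all_samples
    }
    complete = [s for s in all_samples if all(mask[s].values())]
    return mask, complete


# Hypothetical cohort: three samples, three layers, uneven coverage.
layers = {
    "rna": {"S1": [5, 0, 3], "S2": [7, 1, 2], "S3": [1, 4, 6]},
    "protein": {"S1": [0.2, 1.5], "S3": [1.1, 0.4]},
    "methylation": {"S2": [0.8, 0.1], "S3": [0.7, 0.3]},
}
mask, complete = layer_availability(layers)
```

If the complete-case list is much shorter than the full cohort, as it is here, restricting the analysis to complete cases discards most of the data, which is a reason to prefer a method that tolerates missing layers.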
Useful resources
- MOFA / MOFA+ tutorial
- Single-Cell Best Practices (online book covering integration in depth)
- Cancer Genome Atlas (TCGA) and ICGC for example bulk multi-omics datasets
- Human Cell Atlas datasets for single-cell multi-omics
Multiomics integration is most valuable when you have a biological question that spans layers, not when you integrate simply because the data are available. Start with the question, choose the integration method that addresses it, and validate cross-layer findings with orthogonal data.



