Next-generation sequencing (NGS) is now standard in most molecular biology labs, but the end-to-end workflow is still a black box for many users. Understanding what happens at each stage helps you design better experiments and read sequencing reports critically.
Stage 1: Sample extraction and QC
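Whatever you sequence, you start with clean nucleic acid. DNA extraction kits work for genomic DNA; RNA preps need DNase treatment to remove residual genomic DNA, and often poly-A selection or rRNA depletion so that ribosomal RNA does not dominate the reads. Quantify with a fluorometric method (Qubit): UV absorbance (NanoDrop) also picks up RNA, free nucleotides, and other contaminants that absorb at 260 nm, so it tends to overestimate concentration and is not reliable for planning library prep input. Assess integrity on a Bioanalyzer or TapeStation; RNA quality is reported as an RNA integrity number (RIN) on a 1–10 scale, and most protocols require RIN > 7.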
Stage 2: Library preparation
This is where input nucleic acid is converted into a sequencing-ready library. Core operations:
- Fragmentation — mechanical, enzymatic, or transposon-based (Tn5)
- End repair and A-tailing — preparing fragment ends for adapter ligation
- Adapter ligation — adapters contain flow cell sequences, indices for multiplexing, and primer sites
- PCR amplification — usually 8–15 cycles; minimize cycles to reduce duplication
- Size selection and cleanup — typically with SPRI beads
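QC the final library on a Bioanalyzer to confirm the size distribution, and quantify by qPCR, which measures only adapter-ligated (sequenceable) fragments and so gives the most reliable loading concentration and cluster density.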
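Loading concentrations are specified in molar units, so a mass concentration has to be converted using the mean fragment size from the Bioanalyzer trace. The sketch below shows that arithmetic, assuming double-stranded DNA at roughly 660 g/mol per base pair (a standard approximation). The function names and the example values are illustrative, not taken from any kit protocol.

```python
# Approximate molar mass of one double-stranded DNA base pair, in g/mol.
DSDNA_G_PER_MOL_PER_BP = 660.0


def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM).

    mean_fragment_bp is the mean library fragment length, adapters included.
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment length > 0")
    return conc_ng_per_ul * 1e6 / (DSDNA_G_PER_MOL_PER_BP * mean_fragment_bp)


def dilution_volumes(stock_nm: float, target_nm: float, final_ul: float):
    """Return (library_ul, diluent_ul) needed to reach target_nm in final_ul."""
    if stock_nm <= 0 or target_nm > stock_nm:
        raise ValueError("target must not exceed a positive stock concentration")
    library_ul = target_nm * final_ul / stock_nm
    return library_ul, final_ul - library_ul


# Hypothetical library: 2 ng/uL with a 400 bp mean fragment size.
print(round(library_molarity_nm(2.0, 400), 2))  # 7.58
print(dilution_volumes(8.0, 4.0, 20.0))         # (10.0, 10.0)
```

A mass-based estimate like this counts every DNA fragment, including ones without adapters, so it is best treated as a cross-check on the qPCR value rather than a replacement for it.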
Stage 3: Sequencing
Most short-read sequencing runs on Illumina platforms using sequencing by synthesis with reversible-terminator chemistry. Libraries are loaded onto a flow cell, amplified into clusters via bridge amplification (or exclusion amplification on patterned flow cells), and sequenced one base at a time with fluorescent nucleotides imaged each cycle.
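Long-read platforms (PacBio HiFi, Oxford Nanopore) sequence single molecules, with no clonal amplification on the instrument, and produce much longer reads (10 kb to 100+ kb). Accuracy has improved substantially: PacBio HiFi reads are consensus reads with at least 99% (Q20) accuracy, and recent Nanopore chemistries report raw read accuracy of around 99%.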
Stage 4: Primary data processing
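The sequencer outputs raw signals, which are basecalled and written to FASTQ files containing the read sequences plus per-base quality scores (Q-scores, on the Phred scale). Q30 corresponds to a 1-in-1,000 chance that a base call is wrong (99.9% accuracy), and the percentage of bases at or above Q30 is the typical benchmark for a successful Illumina run.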
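The Phred scale is defined as Q = -10 · log10(P), where P is the probability that the base call is wrong. The sketch below converts between Q and P and decodes a FASTQ quality line. It assumes the Phred+33 ASCII encoding used by current Illumina FASTQ files; the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality_line(line: str, offset: int = 33):
    """Decode a FASTQ quality string into integer Q-scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in line]


def fraction_at_least(scores, threshold: int = 30) -> float:
    """Fraction of bases whose Q-score is at or above the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
scores = decode_quality_line("IIII5555")
print(scores)                     # [40, 40, 40, 40, 20, 20, 20, 20]
print(fraction_at_least(scores))  # 0.5
```

Because the scale is logarithmic, each 10-point step is a tenfold change in error rate: Q20 is 1 error in 100, Q30 is 1 in 1,000, and Q40 is 1 in 10,000.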
Stage 5: Bioinformatics analysis
This stage varies most by application:
- WGS: Align to reference (BWA), call variants (GATK)
- RNA-seq: Align (STAR, HISAT2) or pseudoalign (Salmon, kallisto), quantify, run differential expression (DESeq2, edgeR)
- ChIP-seq: Align, call peaks (MACS2), motif analysis
- scRNA-seq: Cell Ranger or STARsolo, then Seurat or Scanpy
Choosing depth and read length
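Depth (coverage) and read length are the main cost drivers. Choose by application: ~30× for human germline WGS, ~100× or more for somatic variant calling, where a variant may be present in only a fraction of the cells, and 20–30M reads per sample for bulk RNA-seq. Single-cell experiments are budgeted per cell (commonly tens of thousands of reads per cell), so the total per sample is usually far higher. Short reads work for most quantification tasks; long reads are the better choice for de novo assembly, structural variant detection, and complex genome regions.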
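Mean coverage follows from simple arithmetic: total sequenced bases divided by genome size. The sketch below applies that relationship in both directions. The 3.1 Gb human genome size and the 2×150 bp run configuration are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Nominal mean coverage: total sequenced bases / genome size.

    n_reads counts individual reads, so a read pair contributes two.
    """
    return n_reads * read_length_bp / genome_size_bp


def read_pairs_needed(target_coverage: float, read_length_bp: int,
                      genome_size_bp: int) -> int:
    """Read pairs required to reach a target nominal coverage (paired-end)."""
    return math.ceil(target_coverage * genome_size_bp / (2 * read_length_bp))


HUMAN_GENOME_BP = 3_100_000_000  # approximate size, an assumption for this sketch

print(read_pairs_needed(30, 150, HUMAN_GENOME_BP))        # 310000000
print(mean_coverage(620_000_000, 150, HUMAN_GENOME_BP))   # 30.0
```

This is nominal coverage. Duplicates, unmapped reads, and adapter or low-quality trimming all reduce usable depth, so it is common to order somewhat more than the calculated minimum.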
Common pitfalls
- Using NanoDrop concentrations to plan library inputs
- Over-amplifying libraries, leading to duplicate-heavy data
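- Skipping batch design: randomize samples across prep batches and sequencing lanes, and never process each condition as its own batch, or batch effects become indistinguishable from biology
- Underpowering: many “negative” RNA-seq results trace back to too few biological replicates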
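For the batch design pitfall, one simple approach is stratified randomization: shuffle samples within each condition, then deal them across batches in turn so that every batch contains a similar mix of conditions. The sketch below is a minimal version of that idea, not a full experimental design tool; the function name, sample IDs, and seed are hypothetical.

```python
import random
from collections import Counter, defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Assign samples to batches, balancing conditions across batches.

    samples: dict mapping sample ID -> condition label.
    Returns a dict mapping sample ID -> batch index (0-based).
    """
    rng = random.Random(seed)  # fixed seed so the plan is reproducible
    by_condition = defaultdict(list)
    for sample_id, condition in samples.items():
        by_condition[condition].append(sample_id)

    assignment = {}
    slot = 0
    for condition in sorted(by_condition):
        ids = sorted(by_condition[condition])
        rng.shuffle(ids)  # random order within each condition
        for sample_id in ids:
            assignment[sample_id] = slot % n_batches  # deal round-robin
            slot += 1
    return assignment


samples = {f"ctrl_{i}": "control" for i in range(6)}
samples.update({f"trt_{i}": "treated" for i in range(6)})
plan = assign_batches(samples, n_batches=3, seed=1)

# Each batch ends up with two control and two treated samples.
print(Counter((plan[s], cond) for s, cond in samples.items()))
```

Recording the seed alongside the sample sheet makes the layout reproducible, and the batch index can later be included as a covariate in the differential expression model.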
NGS is a chain of steps where each one constrains what’s possible downstream. Pay close attention to extraction QC and library prep — those upstream decisions determine the ceiling on your data quality.


