NGS Workflow Explained: From Sample to Sequence Data

Next-generation sequencing (NGS) is now standard in most molecular biology labs, but the end-to-end workflow is still a black box for many users. Understanding what happens at each stage helps you design better experiments and read sequencing reports critically.

Stage 1: Sample extraction and QC

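Whatever you sequence, you start with clean nucleic acid. Standard extraction kits work for genomic DNA; RNA preps usually include a DNase treatment to remove contaminating genomic DNA, and RNA-seq libraries often need poly-A selection or rRNA depletion as well. Quantify with a fluorometric method (Qubit). UV absorbance (NanoDrop) also picks up free nucleotides, other nucleic acid species, and some contaminants, so it tends to overestimate concentration and is not reliable for planning library prep inputs. Assess integrity on a Bioanalyzer or TapeStation; RNA quality is reported as an RNA integrity number (RIN, on a 1–10 scale), and most protocols require RIN > 7.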

Stage 2: Library preparation

This is where input nucleic acid is converted into a sequencing-ready library. Core operations:

  • Fragmentation — mechanical, enzymatic, or transposon-based (Tn5)
  • End repair and A-tailing — preparing fragment ends for adapter ligation
  • Adapter ligation — adapters contain flow cell sequences, indices for multiplexing, and primer sites
  • PCR amplification — usually 8–15 cycles; minimize cycles to reduce duplication
  • Size selection and cleanup — typically with SPRI beads

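QC the final library on a Bioanalyzer or TapeStation to confirm the size distribution, and quantify by qPCR. qPCR counts only fragments that carry adapters, so it gives a better basis for loading the flow cell at the right cluster density.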
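
Loading concentrations are normally specified in nM, while fluorometric readings come in ng/µL, so a conversion is needed at this step. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA; the function names and the example numbers are illustrative assumptions, not values from any particular kit protocol.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp, da_per_bp=660.0):
    """Convert a dsDNA library concentration (ng/uL) to molarity (nM).

    Assumes an average mass of ~660 g/mol per base pair and uses the mean
    fragment length (adapters included) from the Bioanalyzer/TapeStation trace.
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment length > 0")
    return conc_ng_per_ul * 1e6 / (da_per_bp * mean_fragment_bp)


def dilution_volumes(stock_nm, target_nm, final_ul):
    """Return (stock_ul, diluent_ul) for a C1*V1 = C2*V2 dilution."""
    if target_nm > stock_nm:
        raise ValueError("target concentration exceeds stock concentration")
    stock_ul = target_nm * final_ul / stock_nm
    return stock_ul, final_ul - stock_ul


# Hypothetical library: 2 ng/uL with a 400 bp mean fragment size
molarity = library_molarity_nm(2.0, 400)
print(f"{molarity:.2f} nM")

# Hypothetical dilution: 20 uL at 4 nM from that stock
print(dilution_volumes(molarity, 4.0, 20.0))
```

The result is only as good as the mean fragment size fed into it, which is one reason the size-distribution check and the quantification step are done together.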

Stage 3: Sequencing

Most short-read sequencing runs on Illumina platforms using sequencing by synthesis with reversible-terminator chemistry. Libraries are loaded onto a flow cell, amplified into clusters via bridge amplification, and sequenced one base at a time with fluorescent nucleotides imaged each cycle.

Long-read platforms (PacBio HiFi, Oxford Nanopore) use single-molecule sequencing: no clonal amplification on the instrument and much longer reads (roughly 10 kb to 100+ kb, depending on platform and library), with modern chemistries achieving high accuracy.

Stage 4: Primary data processing

The sequencer outputs raw signal, which is basecalled and written to FASTQ files containing the sequences plus per-base quality scores (Q-scores, on the Phred scale). Q30 corresponds to a 1-in-1,000 chance of a wrong base call (99.9% accuracy), and the percentage of bases at or above Q30 is the typical benchmark for a successful Illumina run.
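
The Phred scale defines Q = -10 · log10(P), where P is the estimated probability that a base call is wrong. As a minimal sketch, the functions below convert between Q and P and compute the fraction of bases at or above Q30 for one FASTQ quality string. It assumes the Phred+33 ASCII encoding used in current Illumina FASTQ files; the function names and the example string are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score for a given error probability (0 < p <= 1)."""
    return -10 * math.log10(p)


def fraction_at_or_above(quality_string, threshold=30, offset=33):
    """Fraction of bases whose quality is >= threshold.

    quality_string is the fourth line of a FASTQ record; each character
    encodes one score as chr(Q + offset). offset=33 is an assumption
    (Phred+33 encoding).
    """
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(q >= threshold for q in scores) / len(scores)


print(phred_to_error_prob(30))       # error probability at Q30
print(fraction_at_or_above("?I5!"))  # scores decode to 30, 40, 20, 0
```

In practice this summary comes from the instrument software or a QC tool rather than hand-written code, but the arithmetic is the same.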

Stage 5: Bioinformatics analysis

This stage varies most by application:

  • WGS: Align to reference (BWA), call variants (GATK)
  • RNA-seq: Align (STAR, HISAT2) or pseudoalign (Salmon, kallisto), quantify, run differential expression (DESeq2, edgeR)
  • ChIP-seq: Align, call peaks (MACS2), motif analysis
  • scRNA-seq: Cell Ranger or STARsolo, then Seurat or Scanpy

Choosing depth and read length

Depth (coverage) and read length are the main cost drivers. Choose by application: ~30× for human WGS, ~100× for somatic variant calling, 20–30M reads per sample for bulk RNA-seq, and far more reads in total for single-cell experiments, where depth is budgeted per cell. Short reads work for most quantification tasks; long reads are essential for de novo assembly, structural variant detection, and complex genome regions.
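
For genome sequencing, average coverage follows from a simple relation: coverage = (number of reads × read length) / genome size. The sketch below applies it in both directions. The genome size of 3.1 Gb and the 2×150 bp read configuration are assumptions for illustration, and the result is raw coverage before losses to duplicates, low-quality reads, or unmapped reads.

```python
import math

HUMAN_GENOME_BP = 3_100_000_000  # approximate human genome size (assumption)


def expected_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average raw coverage from a read count (count both mates of a pair)."""
    return n_reads * read_length_bp / genome_size_bp


def read_pairs_for_coverage(target_coverage, read_length_bp, genome_size_bp):
    """Paired-end read pairs needed to reach a target average raw coverage."""
    bases_needed = target_coverage * genome_size_bp
    return math.ceil(bases_needed / (2 * read_length_bp))


# Hypothetical plan: 30x human WGS with 2x150 bp reads
pairs = read_pairs_for_coverage(30, 150, HUMAN_GENOME_BP)
print(f"{pairs:,} read pairs")
print(expected_coverage(2 * pairs, 150, HUMAN_GENOME_BP))
```

Because some reads are always lost downstream, it is common to order somewhat more than this minimum.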

Common pitfalls

  • Using NanoDrop concentrations to plan library inputs
  • Over-amplifying libraries, leading to duplicate-heavy data
  • Skipping batch design: randomize samples across prep and sequencing batches so that batch is never confounded with condition
  • Underpowering: many “negative” RNA-seq results trace back to too few biological replicates
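
One way to avoid grouping by condition is to randomize within each condition and then deal samples out across batches, so every batch gets a similar mix. The sketch below is one possible approach under that assumption, not a standard tool; the function name, sample IDs, and batch count are hypothetical.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Assign samples to batches, balancing conditions across batches.

    samples maps sample ID -> condition label. Samples are shuffled within
    each condition, then dealt round-robin across batches. Returns a dict
    mapping sample ID -> batch index (0-based). A fixed seed keeps the
    assignment reproducible.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples.items():
        by_condition[condition].append(sample_id)

    assignment = {}
    slot = 0
    for condition in sorted(by_condition):
        ids = sorted(by_condition[condition])
        rng.shuffle(ids)
        for sample_id in ids:
            assignment[sample_id] = slot % n_batches
            slot += 1
    return assignment


# Hypothetical design: 6 control and 6 treated samples, 3 prep batches
samples = {f"ctrl_{i}": "control" for i in range(6)}
samples.update({f"trt_{i}": "treated" for i in range(6)})
for sample_id, batch in sorted(assign_batches(samples, 3).items()):
    print(sample_id, batch)
```

With this design each batch holds two control and two treated samples, so a batch effect cannot masquerade as a treatment effect.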

NGS is a chain of steps where each one constrains what’s possible downstream. Pay close attention to extraction QC and library prep — those upstream decisions determine the ceiling on your data quality.
