Getting started with rnaseq-nf
rnaseq-nf is a basic Nextflow pipeline for RNA-Seq analysis that performs quality control, transcript quantification, and result aggregation. The pipeline processes paired-end FASTQ files, generates quality control reports with FastQC, quantifies transcripts with Salmon, and produces a unified report with MultiQC.
This tutorial describes the architecture of the rnaseq-nf pipeline and provides instructions on how to run it.
Pipeline architecture
The pipeline is organized into modular workflows and processes that coordinate data flow from input files through analysis steps to final outputs.
Entry workflow
The entry workflow orchestrates the entire pipeline by coordinating input parameters and data flow:
Data flow:
The transcriptome and reads parameters are passed to the RNASEQ subworkflow, which performs indexing, quality control, and quantification.
The outputs from RNASEQ, along with the multiqc configuration, are passed to the MULTIQC module, which aggregates results into a unified HTML report.
The outdir parameter defines where all results are published.
RNASEQ
The RNASEQ subworkflow coordinates three processes, some of which run in parallel and some in sequence:
Inputs (take:):
transcriptome: Reference transcriptome file
read_pairs_ch: Channel of paired-end read files
Process execution (main:):
INDEX creates a Salmon index from the transcriptome (runs once)
FASTQC analyzes the read_pairs_ch in parallel (runs independently for each sample)
QUANT quantifies transcripts using both the index from INDEX and the read_pairs_ch (runs for each sample after INDEX completes)
Outputs (emit:):
All outputs from FASTQC and QUANT are collected and emitted for downstream processing
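The ordering described above can be sketched as a dry run that only prints tool invocations. This is an illustration under stated assumptions, not the pipeline's actual process scripts: the function name plan_rnaseq, the file naming scheme (<ID>_1.fq, <ID>_2.fq), and the exact Salmon and FastQC options are chosen here for demonstration.

```shell
# Dry-run sketch: prints commands in dependency order, executes nothing.
# plan_rnaseq and the <ID>_1.fq / <ID>_2.fq naming are hypothetical.
plan_rnaseq() {
    transcriptome=$1
    shift

    # INDEX: runs once, before any quantification
    echo "salmon index -t $transcriptome -i index"

    for id in "$@"; do
        # FASTQC: one task per sample, does not depend on the index
        echo "mkdir -p fastqc_${id}_logs && fastqc -o fastqc_${id}_logs ${id}_1.fq ${id}_2.fq"
        # QUANT: one task per sample, needs the index from INDEX
        echo "salmon quant -i index -l A -1 ${id}_1.fq -2 ${id}_2.fq -o ${id}"
    done
}

plan_rnaseq transcriptome.fa ggal_gut ggal_liver
```

A shell loop is strictly sequential, so the sketch shows only the dependency order. In the pipeline itself, Nextflow schedules the per-sample FASTQC and QUANT tasks in parallel where their inputs allow.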
MULTIQC
The MULTIQC module aggregates all quality control and quantification outputs into a comprehensive HTML report.
Inputs:
RNASEQ outputs: All collected outputs from the RNASEQ subworkflow (FastQC reports and Salmon quantification files)
MultiQC config: Custom configuration files and branding (logo, styling)
Process execution:
MULTIQC scans all input files, extracts metrics and statistics, and generates a unified report
Outputs:
multiqc_report.html: A single consolidated HTML report providing an overview of general stats, the Salmon fragment length distribution, FastQC quality control, and software versions
Pipeline parameters
The pipeline behavior can be customized using command-line parameters to specify input data, output locations, and configuration files.
The pipeline accepts the following command-line parameters:
--reads: Path to paired-end FASTQ files (default: data/ggal/ggal_gut_{1,2}.fq)
--transcriptome: Path to reference transcriptome FASTA (default: data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa)
--outdir: Output directory for results (default: results)
--multiqc: Path to MultiQC configuration directory (default: multiqc)
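When several parameters are overridden together, it can be convenient to keep them in a file and pass it with Nextflow's -params-file option instead of repeating them on the command line. The following is a minimal sketch: the file name rnaseq-params.yaml and the placeholder paths are assumptions, and the script prints the run command rather than executing it.

```shell
# Write parameter overrides to a YAML file.
# The file name and the /path/to/... values are placeholders.
cat > rnaseq-params.yaml <<'EOF'
reads: "/path/to/reads/*_{1,2}.fastq.gz"
transcriptome: "/path/to/transcriptome.fa"
outdir: "my_results"
EOF

# Print the command that would use the file (not executed here)
echo "nextflow run nextflow-io/rnaseq-nf -params-file rnaseq-params.yaml -with-docker"
```

Keys in the file are the parameter names without the leading double dash. Parameters left out of the file keep their defaults.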
Execution profiles
Execution profiles allow you to customize how and where the pipeline runs by specifying the -profile flag. Multiple profiles can be combined by separating them with commas. Profiles are defined in the nextflow.config file in the pipeline's base directory.
Container profiles
Container profiles specify which containerization technology to use for running the pipeline tools:
standard: Use the default Docker container
docker: Explicitly use Docker
singularity: Use Singularity containers
wave: Use Wave container provisioning with Conda
wave-mirror: Use Wave container mirroring strategy
Note
The respective container tools must be installed to use these profiles.
Environment profiles
Environment profiles manage software dependencies through package managers or specify architecture requirements:
conda: Use Conda environment management
mamba: Use Micromamba for faster dependency resolution
arm64: Use ARM64 architecture support
Note
The respective environment tools must be installed to use these profiles.
Cloud and HPC profiles
Cloud and HPC profiles enable execution on distributed computing infrastructure and cloud storage:
slurm: Run on SLURM-managed HPC clusters
batch: Run on AWS Batch compute environments
google-batch: Run on Google Cloud Batch
azure-batch: Run on Azure Batch compute pools
s3-data: Use input data stored in AWS S3
gs-data: Use input data stored in Google Cloud Storage
Note
To use the Cloud and HPC profiles, you must configure credentials, resource pools, and storage paths before execution.
Other profiles
all-reads: Process all FASTQ files matching ggal_*_{1,2}.fq
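Because -profile takes a single comma-separated value, a small helper can assemble it and catch misspelled names before a run starts. This is a sketch only: the helper name build_profile_arg is hypothetical, and the list of known profiles is copied from the descriptions above rather than read from nextflow.config.

```shell
# Profile names as listed in this tutorial (not read from nextflow.config)
KNOWN_PROFILES="standard docker singularity wave wave-mirror conda mamba arm64 slurm batch google-batch azure-batch s3-data gs-data all-reads"

# build_profile_arg (hypothetical helper): validate names, join with commas
build_profile_arg() {
    joined=""
    for p in "$@"; do
        case " $KNOWN_PROFILES " in
            *" $p "*) ;;  # name is in the list
            *) echo "unknown profile: $p" >&2; return 1 ;;
        esac
        # Add a comma only between names, not before the first one
        joined="${joined:+$joined,}$p"
    done
    printf '%s\n' "$joined"
}

# Print the resulting command (not executed here)
profiles=$(build_profile_arg all-reads docker) &&
    echo "nextflow run nextflow-io/rnaseq-nf -profile $profiles"
```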
Test data
The pipeline includes test data located in the data/ggal/ directory for demonstration and validation purposes:
Paired-end FASTQ files: Four tissue samples (gut, liver, lung, spleen) from Gallus gallus (chicken)
ggal_gut_{1,2}.fq - Default sample used when running with standard parameters
ggal_liver_{1,2}.fq
ggal_lung_{1,2}.fq
ggal_spleen_{1,2}.fq
Reference transcriptome:
ggal_1_48850000_49020000.Ggal71.500bpflank.fa - A subset of the chicken genome
Tip
Use the all-reads profile to process all four tissue samples instead of just the default gut sample. See Execution profiles for more information.
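Before pointing --reads at your own data, it can help to confirm that every sample has both mates, since the pipeline expects paired-end input. The sketch below assumes the same _1.fq / _2.fq suffix convention as the test data. The function name check_pairs is hypothetical, and the demonstration uses empty placeholder files in a temporary directory rather than real reads.

```shell
# check_pairs (hypothetical helper): report whether each *_1.fq has a *_2.fq mate
check_pairs() {
    dir=$1
    rc=0
    for r1 in "$dir"/*_1.fq; do
        # If the glob matched nothing, it is left as a literal string
        [ -e "$r1" ] || { echo "no *_1.fq files in $dir" >&2; return 1; }
        r2="${r1%_1.fq}_2.fq"
        id=$(basename "${r1%_1.fq}")
        if [ -e "$r2" ]; then
            echo "$id: paired"
        else
            echo "$id: missing mate" >&2
            rc=1
        fi
    done
    return $rc
}

# Demonstration with empty placeholder files named like the test data
demo=$(mktemp -d)
for tissue in gut liver lung spleen; do
    : > "$demo/ggal_${tissue}_1.fq"
    : > "$demo/ggal_${tissue}_2.fq"
done
check_pairs "$demo"
```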
Quick start
rnaseq-nf is a runnable pipeline. This section provides examples for running the pipeline with different configurations.
Basic execution
Run the pipeline with default parameters using Docker:
nextflow run nextflow-io/rnaseq-nf -with-docker
Configuring individual parameters
Override default parameters to use custom input files and output locations:
nextflow run nextflow-io/rnaseq-nf \
    --reads '/path/to/reads/*_{1,2}.fastq.gz' \
    --transcriptome '/path/to/transcriptome.fa' \
    --outdir 'my_results' \
    -with-docker
Using profiles
Specify execution profiles to customize runtime environments and data sources:
# Use Conda for dependency management
nextflow run nextflow-io/rnaseq-nf -profile conda
# Run on a SLURM cluster
nextflow run nextflow-io/rnaseq-nf -profile slurm
# Combine multiple profiles: process all reads using Docker
nextflow run nextflow-io/rnaseq-nf -profile all-reads,docker
Tip
See Execution profiles for more information about profiles.
Expected outputs
The rnaseq-nf pipeline generates the following outputs in the results directory:
results/
├── fastqc_<SAMPLE_ID>_logs/ # FastQC quality reports per sample
│ ├── <SAMPLE_ID>_1_fastqc.html
│ ├── <SAMPLE_ID>_1_fastqc.zip
│ ├── <SAMPLE_ID>_2_fastqc.html
│ └── <SAMPLE_ID>_2_fastqc.zip
└── multiqc_report.html # Aggregated QC and Salmon report
The MultiQC report (multiqc_report.html) can be viewed in a web browser.
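After a run, a quick check can confirm that the files shown in the tree above are present. This is a sketch: the function name check_outputs is hypothetical, and the sample ID ggal_gut used in the example call is an assumption based on the default reads file name.

```shell
# check_outputs (hypothetical helper): verify the expected files for one sample
check_outputs() {
    outdir=$1
    sample=$2
    missing=0
    for f in \
        "multiqc_report.html" \
        "fastqc_${sample}_logs/${sample}_1_fastqc.html" \
        "fastqc_${sample}_logs/${sample}_1_fastqc.zip" \
        "fastqc_${sample}_logs/${sample}_2_fastqc.html" \
        "fastqc_${sample}_logs/${sample}_2_fastqc.zip"
    do
        if [ ! -f "$outdir/$f" ]; then
            echo "missing: $outdir/$f" >&2
            missing=1
        fi
    done
    return $missing
}

# Example call for the default output directory; ggal_gut is an assumed sample ID
if check_outputs results ggal_gut; then
    echo "all expected outputs present"
fi
```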