6. Automating data analysis
Everything required (software, scripts) is available in the Singularity container. Go to the preparation chapter to see how to acquire this container.
In the previous chapters, you learned how to perform each step of the amplicon data analysis pipeline manually. While this is a valuable learning experience, it’s not practical for analyzing large datasets or for ensuring reproducibility in the long term.
We can make use of a tool called Snakemake to automate the previous steps into a single pipeline. With Snakemake, you can define the steps of your analysis in a Snakefile
and then let Snakemake handle the execution, dependency management, and error handling.
6.1 Preparing to run the workflow
To run the automated workflow, you’ll need to make sure that your project directory is set up correctly.
To make the project setup process even easier, we’ve created a simple command-line tool called amplicon_project.py
. This tool automates the creation of the project directory and the sample configuration file (sample.tsv
), guiding you through each step with clear prompts and error checking.
The amplicon_project.py tool is built into the singularity container image. Instead of using singularity shell
, we can use singularity exec
to directly execute commands. Try accessing amplicon_project.py:
singularity exec naam_workflow.sif python /amplicon_project.py --help
usage: amplicon_project.py [-h] [-p PROJECT_DIR] -n STUDY_NAME -d RAW_FASTQ_DIR --virus-config VIRUS_CONFIG --sample-map SAMPLE_MAP
Interactive tool for setting up a multi-virus amplicon analysis project.
options:
-h, --help show this help message and exit
-p PROJECT_DIR, --project_dir PROJECT_DIR
Project directory path (default: current directory)
-n STUDY_NAME, --study_name STUDY_NAME
Name of the study
-d RAW_FASTQ_DIR, --raw_fastq_dir RAW_FASTQ_DIR
Directory containing raw FASTQ barcode subdirectories
--virus-config VIRUS_CONFIG
Path to the virus configuration YAML file.
--sample-map SAMPLE_MAP
Path to the sample map CSV file. (CSV must have 'barcode_dir' and 'virus_id' columns)
You can prepare your project directory with amplicon_project.py as follows:
singularity exec \
--bind /mnt/viro0002-data:/mnt/viro0002-data \ # Use whatever --bind is required for your context
--bind $HOME:$HOME \
--bind $PWD:$PWD \
\
naam_workflow.sif \
python /amplicon_project.py -p {project.folder} \
-n {name} \
-d {reads} \
--virus-config {virus_config.yaml} \
--sample-map {sample_map.csv}
Please use absolute paths for the reads, virus_config.yaml and sample_map.csv so that they can always be located.
{project.folder}
is your project folder. This is where you run your workflow and store results.{name}
is the name of your study, no spaces allowed.{reads}
is the folder that contains your barcode directories (e.g. barcode01, barcode02).{virus_config.yaml}
is a yaml file specifying run parameters for your viruses.{sample_map.csv}
is a comma separated file, specifying sample + virus combination.
Binding directories
The --bind
arguments are needed to explicitly tell Singularity to mount the necessary host directories into the container. The part before the colon is the path on the host machine that you want to make available. The path after the colon is the path inside the container where the host directory should be mounted.
As a default, Singularity often automatically binds your home directory ($HOME
) and the current directory ($PWD
). We also explicitly bind /mnt/viro0002-data
in this example. If your input files (reads, reference, databases) or output project directory reside outside these locations, you MUST add specific --bind /host/path:/container/path
options for those locations, otherwise the container won’t be able to find them.
Virus config and sample map
The virus config file is the central “database” of all supported viruses and their specific parameters. This file contains information about file paths for references and primers, as well as analysis parameters for each virus.
Example of a virus config:
sars-cov-2:
# Paths to reference and primer files
reference_genome: /path/to/reference.fasta
primer: /path/to/primer.fasta
primer_reference: /path/to/primer_reference.fasta
# Required analysis parameters
min_length: 250
coverage: 30
# Optional workflow steps
run_nextclade: true
nextclade_dataset: 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' # Official Nextclade maintained dataset
measles:
reference_genome: /path/to/reference.fasta
primer: /path/to/primer.fasta
primer_reference: /path/to/primer_reference.fasta
min_length: 100
coverage: 30
run_nextclade: true
nextclade_dataset: '/path/to/custom/measles/dataset' # Custom user created dataset
mpox:
reference_genome: /path/to/reference.fasta
primer: /path/to/primer.fasta
primer_reference: /path/to/primer_reference.fasta
min_length: 1000
coverage: 30
run_nextclade: false # Nextclade will not run for this virus
nextclade_dataset: null
# ... add other viruses if needed.
Key parameters within each virus entry include:
reference_genome, primer, primer_reference
: Absolute paths to the respective FASTA files.min_length
: The minimum read length to keep after QC. Must be below the expected amplicon size.coverage
: The minimum read depth required for consensus calling. 30x is a common minimum.run_nextclade
: Set to true to enable the Nextclade analysis for this virus, false otherwise.nextclade_dataset
: Path to a Nextclade dataset. This can be an official dataset name (the workflow will download it) or an absolute path to a custom dataset you have locally.
The sample map is a simple file to link each barcode/sample to a virus from the config. For each sample, amplicon_project.py uses its virus_id to pull the correct parameters (paths, min_length, etc.) from the virus config and place them into a sample.tsv.
Example of a sample map:
barcode_dir,virus_id
barcode01,sars-cov-2
barcode02,sars-cov-2
barcode03,measles
barcode04,measles
barcode05,mpox
barcode_dir
: name of barcode directory containing raw fastq.gz files.virus_id
: virus name, must match with names in the virus config.
After running amplicon_project.py, move to your newly created project directory with cd
, check where you are with pwd
.
Next, use the ls
command to list the files in the project directory and check if the following files are present: sample.tsv
and Snakefile
.
The sample.tsv file is automatically generated by the setup script. It contains all the per-sample information the workflow needs to run. Each row represents one sample, with the following columns:
unique_id
: a unique identifier for the sample, generated from the barcode directory name (e.g., BC01)sequence_name
: the final name for the consensus sequence, formatted as {study_name}_{unique_id}.fastq_path
: the location of the raw *.fastq.gz files.virus_id
: name of the virus.reference_genome
: the location of the reference sequence for the consensus generation.primer
: the location of the file containing the primer sequences.primer_reference
: the location of the reference sequence for primer trimming.min_length
: minimum length required.coverage
: minimum coverage required.run_nextclade
: true or false, determines if nextclade should be run.nextclade_dataset
: path to the nextclade dataset. This field can be empty if you’re not running Nextclade.
The Snakefile is the “recipe” for the workflow, describing all the steps we have done by hand, and it is most commonly placed in the root directory of your project (you can open the Snakefile with a text editor and have a look).
6.2 Running the workflow
After setting everything up, we can redo the analysis for all samples in a single step. First we will test out a dry run to see if any errors appear. A dry run will not execute any of the commands but will instead display what would be done. This will help identify any errors in the Snakemake file.
Run inside of your project directory:
singularity exec \
--bind /mnt/viro0002-data:/mnt/viro0002-data \
--bind $HOME:$HOME \
--bind $PWD:$PWD \
\
naam_workflow.sif --snakefile Snakefile \
snakemake --cores {threads} \
--dryrun
{threads}
: sets the maximum number of cores Snakemake can use in parallel. If you want a specific step of the workflow to utilize more threads, then you can manually edit the Snakefile.
If no errors appear, then remove the --dryrun
argument and run it again to fully execute the workflow.