6. Automating data analysis

Important

Everything required (software, scripts) is available in the Singularity container. Go to the preparation chapter to see how to acquire this container.

In the previous chapters, you learned how to perform each step of the amplicon data analysis pipeline manually. While this is a valuable learning experience, it’s not practical for analyzing large datasets or for ensuring reproducibility in the long term.

We can make use of a tool called Snakemake to automate the previous steps into a single pipeline. With Snakemake, you can define the steps of your analysis in a Snakefile and then let Snakemake handle the execution, dependency management, and error handling.

6.1 Preparing to run the workflow

To run the automated workflow, you’ll need to make sure that your project directory is set up correctly.

To make the project setup process even easier, we’ve created a simple command-line tool called amplicon_project.py. This tool automates the creation of the project directory, the sample configuration file (sample.tsv), and the general settings configuration file (config.yaml), guiding you through each step with clear prompts and error checking.

The amplicon_project.py tool is built into the singularity container image. Instead of using singularity shell, we can use singularity exec to directly execute commands. Try accessing amplicon_project.py:

singularity exec naam_workflow.sif python /amplicon_project.py --help
usage: amplicon_project.py [-h] [-p PROJECT_DIR] -n STUDY_NAME -d RAW_FASTQ_DIR -P PRIMER -r PRIMER_REFERENCE -R REFERENCE_GENOME [-m MIN_LENGTH] [-c COVERAGE] [-t THREADS] [--use_sars_cov_2_workflow]
                           [--nextclade_dataset NEXTCLADE_DATASET]

Interactive tool for setting up a Snakemake project.

options:
  -h, --help            show this help message and exit
  -p PROJECT_DIR, --project_dir PROJECT_DIR
                        Project directory path (default: current directory)
  -n STUDY_NAME, --study_name STUDY_NAME
                        Name of the study
  -d RAW_FASTQ_DIR, --raw_fastq_dir RAW_FASTQ_DIR
                        Directory containing raw FASTQ files
  -P PRIMER, --primer PRIMER
                        Fasta file containing primer sequences
  -r PRIMER_REFERENCE, --primer_reference PRIMER_REFERENCE
                        Fasta file containing primer reference sequence
  -R REFERENCE_GENOME, --reference_genome REFERENCE_GENOME
                        Fasta file containing reference genome
  -m MIN_LENGTH, --min_length MIN_LENGTH
                        Minimum read length (default: 1000)
  -c COVERAGE, --coverage COVERAGE
                        Minimum coverage required for consensus (default: 30)
  -t THREADS, --threads THREADS
                        Maximum number of threads for the Snakefile (default: 8)
  --use_sars_cov_2_workflow
                        Add this parameter if you want to analyze SARS-CoV-2 data
  --nextclade_dataset NEXTCLADE_DATASET
                        Path to a custom Nextclade dataset directory, OR an official Nextclade dataset name (e.g., 'nextstrain/sars-cov-2/wuhan-hu-1/orfs'). Check official nextclade datasets with `nextclade
                        dataset list`.

Now prepare your project directory with amplicon_project.py as follows:

singularity exec \
  --bind /mnt/viro0002:/mnt/viro0002 \
  --bind $HOME:$HOME \
  --bind $PWD:$PWD \
  naam_workflow.sif \
  python /amplicon_project.py \
    -p {project.folder} \
    -n {name} \
    -d {reads} \
    -m {min_length} \
    -c {coverage} \
    -P {primer} \
    -r {primer.reference} \
    -R {reference} \
    -t {threads} \
    --nextclade_dataset {nextclade.dataset}

REQUIRED ARGUMENTS:

  • {project.folder} is your project folder. This is where you run your workflow and store results.
  • {name} is the name of your study, no spaces allowed.
  • {reads} is the folder that contains your barcode directories (e.g. barcode01, barcode02).
  • {min_length} is the minimum length required for the reads to be accepted. This must be below the expected size of the amplicon, for example, for the 2500nt mpox amplicon we use a threshold of 1000
  • {coverage} is the minimum coverage required, anything lower than 30 is not recommended, for low accuracy basecalling, higher coverage is recommended.
  • {primer} is the file containing the primer sequences
  • {primer.reference} is the reference sequence .fasta file used for primer trimming.
  • {reference} is the reference sequence .fasta file used for the consensus generation.

OPTIONAL ARGUMENTS:

  • {nextclade.dataset} the path to an official or custom nextclade dataset. A list official nextclade datasets can be checked with the following command: singularity exec naam_workflow.sif nextclade dataset list. If you are using a self made custom nextclade dataset, then please provide the absolute path to the dataset.
Important

Please use absolute paths for the reads, primers and references so that they can always be located.

The --bind arguments are needed to explicitly tell Singularity to mount the necessary host directories into the container. The part before the colon is the path on the host machine that you want to make available. The path after the colon is the path inside the container where the host directory should be mounted.

As a default, Singularity often automatically binds your home directory ($HOME) and the current directory ($PWD). We also explicitly bind /mnt/viro0002 in this example. If your input files (reads, reference, databases) or output project directory reside outside these locations, you MUST add specific --bind /host/path:/container/path options for those locations, otherwise the container won’t be able to find them.

Once the setup is completed, move to your newly created project directory with cd, check where you are with pwd.

Next, use the ls command to list the files in the project directory and check if the following files are present: sample.tsv, config.yaml and Snakefile.

  • The sample.tsv should have 9 columns:

    • unique_id: the unique sample name that’s generated based on the barcode directories.
    • sequence_name: the name given to the consensus sequence at the end of the pipeline. It’s generated with the following template: {study_name}_{unique_id}.
    • fastq_path: the location of the raw .fastq.gz files per sample.
    • reference: the location of the reference sequence for the consensus generation.
    • primers: the location of the file containing the primer sequences.
    • primer_reference: the location of the reference sequence for primer trimming.
    • coverage: minimum coverage required.
    • min_length: minimum length required.
    • nextclade_db: path to the nextclade dataset. This column can be empty if you’re not including Nextclade.
  • The config.yaml determines if the SARS_CoV_2 section of the workflow is enabled and the amount of default threads to use.

  • The Snakefile is the “recipe” for the workflow, describing all the steps we have done by hand, and it is most commonly placed in the root directory of your project (you can open the Snakefile with a text editor and have a look).

6.2 Running the workflow

After setting everything up, we can redo the analysis for all samples in a single step. First we will test out a dry run to see if any errors appear. A dry run will not execute any of the commands but will instead display what would be done. This will help identify any errors in the Snakemake file.

Run inside of your project directory:

singularity exec \
  --bind /mnt/viro0002:/mnt/viro0002 \
  --bind $HOME:$HOME \
  --bind $PWD:$PWD \
  naam_workflow.sif \
  snakemake --snakefile Snakefile \
  --cores {threads} \
  --dryrun

If no errors appear, then remove the --dryrun argument and run it again to fully execute the workflow.