4. Comparing sequence to reference sequences

4.1 Creating multiple sequence alignment

We can now create a multiple sequence alignment (MSA). (Not to be confused with a read alignment .bam file).

We can use a reference based multiple sequence alignment approach with minimap2 and gofasta. This is very fast and works well even for large genomes (e.g. 200kb+) or many sequences (10,000+). However, gofasta does not perform a “real” multiple alignment, because it ignores insertions in the sequences compared to the reference and removes them. Therefore if insertions are expected and present in the sequences, they will have to be added manually. On the positive side, phylogenetic analysis tools, such as IQTREE2, also ignore any insertions, so for the phylogenetic analysis the removal of insertions does not matter.

minimap2 -t {threads} -a \
-x asm20 \
--sam-hit-only \
--secondary=no \
--score-N=0 \
{reference} \
{input} \
-o tmp.sam

gofasta sam toMultiAlign \
-s tmp.sam \
-o {output}
  • {input} here is the .fasta file containing all consensus sequences and references you would like to align.
  • {output} is the name of the aligned fasta file (e.g. consensus_with_ref_aligned.fasta).
  • {reference} is the reference used to performe a reference based multiple alignment, use the same reference as we used for read mapping before.

(tmp.sam can be deleted)

4.2 Generate useful stats

You can generate read stats with seqkit stats -T for the raw, QC, trimmed and mapped reads.

seqkit stats -T {input} > {output}

If want to check the read stats for a mapping file, you can use the following:

for file in {input}; do
    samtools fastq $file | seqkit stats -T --stdin-label $file | tail -1
done > {output}
  • {input} is your fastq or bam file.
  • {output} is a tab separated (.tsv) file with the read stats.
Note

If you are not analyzing SARS-CoV-2 data, then you can skip the next chapter and go to the final chapter to automate all of the steps we’ve previously discussed.