Module: Single Nucleotide Variant Analysis ¶

The Single Nucleotide Variant (SNV) module has two commands:

Single-sample allele tallying with the run_snps command;
SNV calling across all the samples, with merge_snps command.

The first step can be run in parallel. We assume users have already completed the species module and have a species profile (e.g. midas2_output/sample1/species/species_profile.tsv) ready for each sample. Alternatively, advanced users can pass a pre-built representative genome index ready for single-sample SNV analysis.

Single-Sample Analysis ¶

Conceptually, a typical invocation of the run_snps command proceeds by

selecting sample-specific abundant species based on statistics calculated in the species module;
compiling these representative genomes and building a sample-specific bowtie2 index;
mapping reads to this index with bowtie2;
outputting a read mapping summary and pileup result for each representative genome (i.e. each species).

MIDAS2 purposely holds off any filtering or identification of variant positions until the subsequent cross-sample analysis.

(In this document, we continue to use the data from the Quickstart as an example.)

midas2 run_snps \
  --sample_name sample1 \
  -1 reads/sample1_R1.fastq.gz \
  --midasdb_name uhgg \
  --midasdb_dir my_midasdb_uhgg \
  --select_by median_marker_coverage,unique_fraction_covered \
  --select_threshold=2,0.5 \
  --num_cores 8 \
  midas2_output

See the documentation on selecting abundant species for more information about the --select_by and --select_threshold flags.

Tip

This step can be parallelized over samples (e.g. using shell background processes).

Note

In MIDAS2, run_snps can automatically download the reference genomes for the selected species.

Warning

(Race condition) If starting multiple calls to run_snps simultaneously, be sure that reference genomes have already been downloaded. Otherwise multiple redundant downloads may be started.

Cross-Samples Analysis ¶

After running all samples individually in this way, users can then compute population SNVs across samples using the merge_snps command.

midas2 merge_snps \
  --samples_list list_of_samples.tsv \
  --midasdb_name uhgg \
  --midasdb_dir my_midasdb_uhgg \
  --num_cores 8 \
  midas2_output/merge

Key Outputs ¶

Single-Sample ¶

Unlike the species and CNV modules, the single-sample outputs from the SNV module are less interesting than the merged results (at least with the default mode).

Users may, however, find several files useful.

A summary of read alignment and pileups for each of the genomes included in the (usually sample-specific) bowtie2 index is reported in midas2_output/sample1/snps/snps_summary.tsv.

species_id	genome_length	covered_bases	total_depth	aligned_reads	mapped_reads	fraction_covered	mean_coverage
102506	5339468	2373275	8045342	468667	224553	0.444	3.390
102337	2749621	2566404	47723458	1479479	1010530	0.933	18.595

Where each columns has the following meaning:

species_id:       six-digit species id
genome_length:    genome length
covered_bases:    number of bases covered by at least one post-filtered reads
total_depth:      total read depth across all covered_bases
aligned_reads:    total read counts across covered_bases before post-alignment filter
mapped_reads:     total read counts across covered_bases after post-alignment filter
fraction_covered: fraction of covered_bases (aka horizontal genome coverage)
mean_coverage:    mean read depth across all covered_bases (aka vertical genome coverage)

For each sample and species—e.g. here sample1 and species 102506 (E. coli)—the per-species read pileup is found in midas2_output/sample1/snps/102506.snps.tsv.lz4. Positions are filtered to only sites in the reference genome covered by at least two reads.

Note

Large output files are compressed with LZ4 to minimize storage requirements.

When uncompressed, the contents of this file should look like the following CSV:

ref_id	ref_pos	ref_allele	depth	count_a	count_c	count_g	count_t
gnl\|Prokka\|UHGG144544_1	881435	T	11	0	0	0	11
gnl\|Prokka\|UHGG144544_1	881436	T	13	0	5	0	8
gnl\|Prokka\|UHGG144544_1	881437	T	12	0	6	0	6

Where the columns have the following meaning:

ref_id:     scaffold/contig id
ref_pos:    reference position
ref_allele: reference nucleotide
depth:      number of post-filtered reads
count_a:    post-filtered read counts of A allele
count_c:    post-filtered read counts of C allele
count_g:    post-filtered read counts of G allele
count_t:    post-filtered read counts of T allele

Across-Samples ¶

A number of outputs result from the multiple samples population SNV analysis.

A merged pileup summary is found in midas2_output/merge/snps/snps_summary.tsv.

sample_name	species_id	genome_length	covered_bases	total_depth	aligned_reads	mapped_reads	fraction_covered	mean_coverage
sample1	100122	2560878	2108551	10782066	248700	207047	0.823	5.113
sample2	100122	2560878	2300193	39263110	1180505	820736	0.898	17.069

The reported columns from genome_length to mean_coverage are the same as from the single-sample SNV summary.

For each species, information about SNVs identified across samples is written to midas2_output/merge/snps/102506.snps_info.tsv.lz4.

site_id

major_allele

minor_allele

sample_counts

snp_type

rc_A

rc_C

rc_G

rc_T

sc_A

sc_C

sc_G

sc_T

locus_type

gene_id

site_type

amino_acids

gnl|Prokka|UHGG000587_14|34360|A

A

C

2

bi

26

10

0

2

0

CDS

UHGG000587_02083

4D

T,T,T,T

gnl|Prokka|UHGG000587_11|83994|T

G

T

2

bi

0

11

45

0

2

IGR

None

Where columns have the following meaning:

site_id:       unique site id, composed of ref_id|ref_pos|ref_allele
major_allele:  most common/prevalent allele in metagenomes
minor_allele:  second most common/prevalent allele in metagenomes
sample_counts: number of relevant samples where metagenomes is found
snp_type:      the number of alleles observed at site (mono,bi,tri,quad)
rc_A:          accumulated read counts of A allele in metagenomes
rc_C:          accumulated read counts of C allele in metagenomes
rc_G:          accumulated read counts of G allele in metagenomes
rc_T:          accumulated read counts of T allele in metagenomes
sc_A:          accumulated sample counts of A allele in metagenomes
sc_C:          accumulated sample counts of C allele in metagenomes
sc_G:          accumulated sample counts of G allele in metagenomes
sc_T:          accumulated sample counts of T allele in metagenomes
locus_type:    CDS (site in coding gene), RNA (site in non-coding gene), IGR (site in intergenic region)
gene_id:       gene identified if locus type is CDS, or RNA
site_type:     indicates degeneracy: 1D, 2D, 3D, 4D
amino_acids:   amino acids encoded by 4 possible alleles

A site-by-sample minor allele frequency matrix is written to midas2_output/merge/snps/102506.snps_freq.tsv.lz4.

site_id	sample1	sample2
gnl\|Prokka\|UHGG000587_11\|83994\|T	0.692	0.837
gnl\|Prokka\|UHGG000587_14\|34360\|A	0.300	0.269

A site-by-sample read depth matrix is written to midas2_output/merge/snps/102506.snps_freq.tsv.lz4.

Note

This table only accounts for the alleles matching the population major and/or minor allele. Other bases are dropped.

site_id	sample1	sample2
gnl\|Prokka\|UHGG000587_11\|83994\|T	13	43
gnl\|Prokka\|UHGG000587_14\|34360\|A	10	26

Advanced SNV Calling ¶

Single-Sample Post-alignment Filter ¶

Users can adjust post-alignment filters via the following command-line options (default values indicated):

--mapq >= 10: discard read alignment with alignment quality < 10
--mapid >= 0.94: discard read alignment with alignment identity < 0.94
--aln_readq >= 20: discard read alignment with mean quality < 20
--aln_cov >= 0.75: discard read alignment with alignment coverage < 0.75
--aln_baseq >= 30: discard bases with quality < 30
--paired_only: only recruit properly aligned read pairs for post-alignment filter and pileup
--fragment_length 5000: maximum fragment length for paired-end alignment. Incorrect fragment length would affect the number of proper-aligned read pairs

Single-Sample Advanced SNV Calling ¶

In recognition of the need for single-sample variant calling, we provided --advanced option to users for single-sample variant calling for all the species in the rep-genome index with run_snps command.

In the --advanced mode, per-species pileup results will also report major allele and minor allele for all the genomic sites covered by at least two post-filtered reads, upon which custom variant calling filter can be applied by the users. Users are advised to use the setting --ignore_ambiguous to avoid falsely calling major/minor alleles for sites with tied read counts.

midas2 run_snps
  --sample_name sample1 \
  -1 reads/sample1_R1.fastq.gz \
  --midasdb_name uhgg \
  --midasdb_dir my_midasdb_uhgg \
  --select_by median_marker_coverage,unique_fraction_covered \
  --select_threshold=2,0.5 \
  --fragment_length 2000 --paired_only \
  --advanced --ignore_ambiguous \
  --num_cores 8
  midas2_output

Expected Output¶

In the --advanced mode, per-species pileup results will include five additional columns of the major/minor allele for all the covered genomic sites.

ref_id	ref_pos	ref_allele	depth	count_a	count_c	count_g	count_t	major_allele	minor_allele	major_allele_freq	minor_allele_freq	allele_counts
gnl\|Prokka\|UHGG144544_1	881435	T	11	0	0	0	11	T	T	1.000	0.000	1
gnl\|Prokka\|UHGG144544_1	881436	T	13	0	5	0	8	T	C	0.615	0.385	2
gnl\|Prokka\|UHGG144544_1	881437	T	12	0	6	0	6	C	T	0.500	0.500	2

major_allele: the allele with the most read counts
minor_allele: the allele with the 2nd most read counts; same with major_allele if only one allele is observed
major_allele_freq: allele frequency of major_allele
minor_allele_freq: allele frequency of minor_allele; 0.0 if only one allele is observed
allele_counts: number of alleles observed

Adjust Population SNV Filters ¶

Advanced users can refer to this page for understanding the compute of population SNV. The species, sample, and site filters for the across-samples SNV calling can be customized with command-line options. For example,

We can select species with horizontal coverage > 40%, vertical coverage > 3X and present in more than 2 relevant samples:

--genome_coverage 0.4 --genome_depth 3 --sample_counts 2

We can apply the following site selections: only consider site with read depth >= 5, and read depth <= 3 * genome_depth, and the minimal allele frequency to call an allele present is 0.05.

--site_depth 5 --site_ratio 3 --snp_maf 0.05

We can only report populations SNV meeting the following criteria: bi-allelic, common population SNV (present in more than 80% of the population) from the protein coding genes based on accumulated sample counts.

--snp_type bi --snv_type common --site_prev 0.8 --locus_type CDS --snp_pooled_method prevalence

Now we can put all the above-mentioned filters in one merge_snps command:

midas2 merge_snps
  --samples_list list_of_samples.tsv \
  --midasdb_name uhgg \
  --midasdb_dir my_midasdb_uhgg \
  --genome_coverage 0.4 --genome_depth 3 --sample_counts 2 \
  --site_depth 5 --site_ratio 3 --snp_maf 0.05 \
  --snp_type bi --snv_type common --site_prev 0.8 --locus_type CDS --snp_pooled_method prevalence \
  --num_cores 8 \
  midas2_output/merge