Quickstart¶
Setup¶
Install MIDAS2¶
The fastest way to install all the dependencies MIDAS2 needed is install from
YAML file (midas2.yaml
) using Conda.
On a Linux machine, download a copy of the test data from our GitHub repository release.
$ wget https://github.com/czbiohub/MIDAS2/releases/download/v1.0.0/tests.tar.gz
$ tar -zxf tests.tar.gz
$ cd tests
Install the dependencies.
$ conda env create -n midas2 -f midas2.yml
$ conda activate midas2
$ cpanm Bio::SearchIO::hmmer --force # Temporary fix for Prokka
Install MIDAS2.
$ pip install midas2
Alternative installation procedures are also described elsewhere.
Example Data¶
Assuming you’re in the tests/
directory you just cd
-ed to,
two single-end gzipped FASTQ files are in the folder tests/reads
. If not,
navigate to the tests
directory
$ cd tests
Pre-download SCG Genes¶
Download the universal single copy genes for MIDAS Reference Database (MIDAS DB) of uhgg
to a new subfolder called my_midasdb_uhgg
$ midas2 database --init --midasdb_name uhgg --midasdb_dir my_midasdb_uhgg
Identify Abundant Species¶
We’ll start by searching for reads that align to single-copy, taxonomic marker genes in order to identify abundant species in each sample.
for sample_name in sample1 sample2
do
midas2 run_species \
--sample_name ${sample_name} \
-1 reads/${sample_name}_R1.fastq.gz \
--midasdb_name uhgg \
--midasdb_dir my_midasdb_uhgg \
--num_cores 4 \
midas2_output
done
Single-nucleotide Variant Analysis¶
Identify SNVs in Each Sample¶
We’ll next run the single-sample SNV analysis for each sample.
for sample_name in sample1 sample2
do
midas2 run_snps \
--sample_name ${sample_name} \
-1 reads/${sample_name}_R1.fastq.gz \
--midasdb_name uhgg \
--midasdb_dir my_midasdb_uhgg \
--num_cores 4 \
midas2_output
done
The pileup summary for sample1
is written to
midas2_output/sample1/snps/snps_summary.tsv
.
This file summarizes the read mapping
and pileup results for each of the abundant species determined in the previous
step.
By default, species are selected based on the filter:
median_marker_coverage > 2
. More details about abundant species selection can
be found here.
Compute Population SNVs across multiple samples¶
In order to compute population SNV from multiple single-sample pileup results, we first
need to construct a tab-separated sample manifest file: list_of_samples.tsv
.
This file has a column for the sample_name
and another for
midas_output
, and it is required for multi-sample analyses.
echo -e "sample_name\tmidas_outdir" > list_of_samples.tsv
ls reads | awk -F '_' '{print $1}' | awk -v OFS='\t' '{print $1, "midas2_output"}' >> list_of_samples.tsv
We can take a look at the list_of_samples.tsv
:
cat list_of_samples.tsv
sample_name midas_outdir
sample1 midas2_output
sample2 midas2_output
Based on this output, we can run merge_snps
and MIDAS2 will know to
look at midas2_output/sample1/snps/snps_summary.tsv
for the run_snps
output from sample1.
Now we are ready to compute the population SNVs across these two samples:
midas2 merge_snps \
--samples_list list_of_samples.tsv \
--midasdb_name uhgg \
--midasdb_dir my_midasdb_uhgg \
--genome_coverage 0.7 \
--num_cores 4 \
midas2_output/merge
Users may be interested in the contents of the file
midas2_output/merge/snps_summary.tsv
written in this step.
cat midas2_output/merge/snps_summary.tsv
sample_name species_id genome_length covered_bases total_depth aligned_reads mapped_reads fraction_covered mean_coverage
sample1 102454 2762447 2322823 15271923 145639 131992 0.841 6.575
sample2 102454 2762447 2322823 15270765 145639 131982 0.841 6.574
Other output files and the full output directory structure can be found at Output Files and Directory Layout.
Gene Copy-Number Variant Analysis¶
Identify CNVs in Each Sample¶
Since building bowtie2 indexes for the species pangenomes takes a long time, we
first build the bowtie2 indexes for one species (102454) to a new subfolder bt2_indexes/
:
midas2 build_bowtie2db \
--midasdb_name uhgg --midasdb_dir my_midasdb_uhgg \
--species_list 102454 \
--bt2_indexes_name pangenomes \
--bt2_indexes_dir bt2_indexes \
--num_cores 4
More information about building your own bowtie2 indexes for either representative genome (repgenome) or pangenome can found here.
Now we can run the single-sample CNV analysis for each sample with the existing bowtie2 indexes.
The pileup summary for sample1
will be generated under the directory
midas2_output/sample1/genes/genes_summary.tsv
.
for sample_name in sample1 sample2
do
midas2 run_genes \
--sample_name ${sample_name} \
-1 reads/${sample_name}_R1.fastq.gz \
--midasdb_name uhgg \
--midasdb_dir my_midasdb_uhgg \
--prebuilt_bowtie2_indexes bt2_indexes/pangenomes \
--prebuilt_bowtie2_species bt2_indexes/pangenomes.species \
--num_cores 4 \
midas2_output
done
Compile CNVs across multiple samples¶
Same with the population SNV analysis, multi-sample CNV analysis also requires a tab-separated sample manifest file.
We can then merge the per-sample CNV results:
midas2 merge_genes \
--samples_list list_of_samples.tsv \
--midasdb_name uhgg \
--midasdb_dir my_midasdb_uhgg \
--num_cores 4 \
midas2_output/merge
Users may be interested in the contents of the file
midas2_output/merge/genes_summary.tsv
written in this step.
cat midas2_output/merge/genes_summary.tsv
sample_name species_id pangenome_size covered_genes fraction_covered mean_coverage aligned_reads mapped_reads marker_coverage
sample1 102454 129140 4004 0.031 3.495 162476 28611 3.410
sample2 102454 129140 4199 0.033 3.603 169286 34908 3.410
Other output files and the full output directory structure can be found at Output Files and Directory Layout.