As next-generation sequencing technologies advance and the cost of metagenomic sequencing continues to fall, researchers face an unprecedented volume of microbiome data. This surge drives the development of scalable microbiome data analysis methods and calls for incorporating phylogenetic information to improve accuracy. Tools for constructing phylogenetic trees from 16S rRNA sequencing data are well established: the 16S gene contains a small number of highly conserved regions, which simplifies the identification of marker genes. In contrast, metagenomic or whole-genome shotgun sequencing reads random fragments from the entire genome, which makes it hard to identify consistent marker genes across the vast diversity of genomic regions. Because of these challenges, and because the field is at an early stage of development, robust tools for building phylogenetic trees from shotgun data remain scarce. Moreover, limited knowledge and awareness restrict the integration of phylogenetic tree structures into downstream statistical analyses.

Tree construction based on 16S rRNA sequence data - QIIME 2

To install QIIME 2, follow the installation tutorial at https://docs.qiime2.org/2024.5/install/.
Here, we provide an example dataset from the public data resource QIITA, specifically from study ID 12675 (https://qiita.ucsd.edu/study/description/12675). First, select “16S” from the “Data Types” tab on the left side. From the processing network, select any deblur table from the tips of the processing tree. Here, we choose the representative sequence file from the “deblur final table” at the “Trimmed Demultiplexed 100” point.

## Import fasta file into QIIME 2
qiime tools import --type 'FeatureData[Sequence]' --input-path 170104_all.seqs.fa --output-path 170104_all.qza

## Alignment
qiime alignment mafft --i-sequences 170104_all.qza --o-alignment 170102-aligned.qza

## Masking
qiime alignment mask --i-alignment 170102-aligned.qza --o-masked-alignment masked-170102.qza

## Build the tree
qiime phylogeny fasttree --i-alignment masked-170102.qza --o-tree unrooted-170102.qza 

## Root the tree
qiime phylogeny midpoint-root --i-tree unrooted-170102.qza --o-rooted-tree rooted-170102.qza
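
The alignment, masking, tree-building, and rooting steps above can also be run in one call with the QIIME 2 pipeline `align-to-tree-mafft-fasttree`. The wrapper below is a minimal sketch: the function name `build_rooted_tree` and the output naming scheme are illustrative choices that mirror the file names used in the step-by-step commands, not part of QIIME 2.

```shell
# Run alignment, masking, FastTree, and midpoint rooting in a single call.
# build_rooted_tree is a hypothetical helper; output names mirror the
# step-by-step commands above.
build_rooted_tree() {
  seqs=$1      # FeatureData[Sequence] artifact, e.g. 170104_all.qza
  prefix=$2    # prefix for the output artifacts, e.g. 170102
  qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences "$seqs" \
    --o-alignment "${prefix}-aligned.qza" \
    --o-masked-alignment "masked-${prefix}.qza" \
    --o-tree "unrooted-${prefix}.qza" \
    --o-rooted-tree "rooted-${prefix}.qza"
}

# Usage:
#   build_rooted_tree 170104_all.qza 170102
```

The step-by-step form is still useful when the intermediate artifacts need to be inspected or a step needs non-default parameters.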

## Export the tree
qiime tools export --input-path rooted-170102.qza --output-path rooted-170102

The export creates a directory named rooted-170102 that contains the tree as a Newick file, tree.nwk.
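
As a quick sanity check, the number of tips in the exported tree should match the number of sequences in the input FASTA file. The helpers below are a sketch with hypothetical function names. The tip count relies on the fact that a Newick tree has one more tip than it has commas, which assumes that no tip label contains a comma.

```shell
# Count FASTA records by counting header lines.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Count the tips of a Newick tree: every comma separates two sibling
# subtrees, so tips = commas + 1 (assumes no commas inside labels).
count_newick_tips() {
  echo $(( $(tr -cd ',' < "$1" | wc -c) + 1 ))
}

# Usage (paths follow the commands above):
#   count_fasta_records 170104_all.seqs.fa
#   count_newick_tips rooted-170102/tree.nwk
```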

Tree construction based on shotgun metagenomic sequence data

Tree construction using MetaPhlAn 4

To install MetaPhlAn 4, users can refer to the detailed tutorial available at the following link: https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4.
Here, we again use study 12675 in QIITA as an example. For shotgun metagenomic sequence data, users need to download the raw sequence data manually: on the main page of study 12675, follow the EBI link, https://www.ebi.ac.uk/ena/browser/view/PRJEB42155.

## Run MetaPhlAn 4 on paired-end FASTQ files to generate a species-level profile
metaphlan sample_R1.fastq,sample_R2.fastq --input_type fastq -o metaphlan_output.txt

Download the MetaPhlAn 4 reference tree from https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/mpa_vJun23_CHOCOPhlAnSGB_202403.nwk.
Then read the taxonomic profile into R and subset the reference tree to the detected taxa.

## Install necessary packages
install.packages("ape")  # For reading and subsetting phylogenetic trees
install.packages("dplyr")  # For data manipulation

## Load necessary libraries
library(ape)
library(dplyr)

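## Read the MetaPhlAn 4 output. Its header lines start with "#", so they are
## skipped as comments and the columns are named explicitly.
metaphlan_data <- read.table("metaphlan_output.txt", header = FALSE, sep = "\t",
                             comment.char = "#", quote = "", fill = TRUE,
                             col.names = c("clade_name", "NCBI_tax_id",
                                           "relative_abundance", "additional_species"))

## Read the reference tree in Newick format
reference_tree <- read.tree("mpa_vJun23_CHOCOPhlAnSGB_202403.nwk")

## Subset the tree to keep only the detected taxa.
## Assumption: tips are labeled by SGB ID. Check head(reference_tree$tip.label)
## and adjust the pattern below if the labels differ (e.g. no "SGB" prefix).
sgb_rows <- grepl("\\|t__", metaphlan_data$clade_name)
detected <- sub(".*\\|t__", "", metaphlan_data$clade_name[sgb_rows])
subset_tree <- keep.tip(reference_tree, intersect(detected, reference_tree$tip.label))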
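
Before subsetting, it can help to list the identifiers in the profile that are expected to match tree tips. The sketch below prints the SGB-level entries (the `t__` rank) from a MetaPhlAn 4 profile. It assumes the standard output layout: `#`-prefixed header lines, tab-separated columns, and the full lineage in the first column with ranks joined by `|`. The function name is hypothetical, and whether the tree tips carry the same `SGB` prefix should be checked against the downloaded tree.

```shell
# Print the SGB identifiers (rank "t__") found in a MetaPhlAn 4 profile.
# list_detected_sgbs is a hypothetical helper name.
list_detected_sgbs() {
  awk -F '\t' '
    !/^#/ && $1 ~ /\|t__/ {
      n = split($1, ranks, "|")      # split the lineage into its ranks
      sub(/^t__/, "", ranks[n])      # drop the rank prefix from the last one
      print ranks[n]
    }' "$1"
}

# Usage:
#   list_detected_sgbs metaphlan_output.txt | head
```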

Tree construction using Woltka

To install Woltka, users can refer to the tutorial at the link: https://github.com/qiyunzhu/woltka/blob/main/doc/install.md.
Woltka takes alignment files as input, so raw sequence files must first be aligned with Bowtie 2; installation and usage instructions are at https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml. Users should align against the ‘Web of Life’ (WoL) database so that the results stay consistent with Woltka's reference tree; the tutorial is at https://github.com/qiyunzhu/woltka/blob/main/doc/wol.md. After the Bowtie 2 step, collect the alignment files in one folder, for example ‘bowtie2_out’, and then run Woltka to obtain the feature table for a phyloseq object.

## Bowtie 2 processing code (reference_index is the Bowtie 2 index of the WoL database)
mkdir -p bowtie2_out
bowtie2 -x reference_index -1 sample_R1.fastq -2 sample_R2.fastq -S bowtie2_out/sample.sam
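
With several samples, the Bowtie 2 step can be looped so that every alignment lands in the same folder. The sketch below assumes a hypothetical naming convention, `<sample>_R1.fastq` and `<sample>_R2.fastq`, and a hypothetical function name. By default it only prints the commands so they can be reviewed first; set `DRY_RUN=0` to execute them.

```shell
# Build one Bowtie 2 command per paired-end sample found in a directory.
# Assumed (hypothetical) file naming: <sample>_R1.fastq / <sample>_R2.fastq.
align_all_samples() {
  in_dir=$1; out_dir=$2; index=$3
  runner=echo                                   # default: print, do not run
  if [ "${DRY_RUN:-1}" = 0 ]; then runner=; fi  # empty runner executes bowtie2
  mkdir -p "$out_dir"
  for r1 in "$in_dir"/*_R1.fastq; do
    [ -e "$r1" ] || continue                    # no matching files at all
    sample=$(basename "$r1" _R1.fastq)
    r2="$in_dir/${sample}_R2.fastq"
    if [ ! -e "$r2" ]; then
      echo "skipping $sample: missing $r2" >&2
      continue
    fi
    $runner bowtie2 -x "$index" -1 "$r1" -2 "$r2" -S "$out_dir/$sample.sam"
  done
}

# Usage:
#   align_all_samples raw_fastq bowtie2_out reference_index
#   DRY_RUN=0 align_all_samples raw_fastq bowtie2_out reference_index
```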

## Processing aligned files with Woltka
woltka classify -i bowtie2_out -o table.biom

## Converting BIOM file into a tab-delimited file
biom convert --to-tsv -i table.biom -o table.tsv
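
To relate the table to a reference tree, the feature IDs in the first column of the converted table are needed. The sketch below lists them, assuming the usual layout written by `biom convert --to-tsv`: `#`-prefixed header lines followed by one tab-separated row per feature. The function name is hypothetical. When the alignment was made against WoL, these IDs are expected to be genome IDs that match the tips of the WoL tree, which should be verified against the downloaded tree file.

```shell
# List the feature IDs (first column) of a BIOM table converted to TSV.
# list_feature_ids is a hypothetical helper name.
list_feature_ids() {
  grep -v '^#' "$1" | cut -f 1
}

# Usage:
#   list_feature_ids table.tsv > feature_ids.txt
```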