Chapter 8 file2meco package

In the microtable class part, we showed the basic way about how to create microtable object with the example data. Actually, constructing the microtable object from other tools/platforms (e.g., QIIME, QIIME2, HUMAnN, Kraken2 and phyloseq) can be easily achieved with the package file2meco (https://github.com/ChiLiubio/file2meco). The idea of creating file2meco package comes from a study involved in complex metagenomic analysis (C. Liu et al. 2022). Note that the sample_table parameter in each function of file2meco package supports various metadata input format, including
1) comma seperated file with the suffix .csv or tab seperated file with the suffix .tsv or .txt;
2) Excel file with the suffix .xlsx or .xls;
3) data.frame object in R session.

# install file2meco package (>= 0.9.0)
if(!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if(!require("file2meco")) install.packages("file2meco", repos = BiocManager::repositories())

▲ Trouble shooting:
• The files in the following examples all come from the package file2meco and are found by the function system.file automatically irrespective of Operating System. When the user imports a file, please donot use this function. The first thing should be to make sure that R can find your input file. The user should either provide a full path to the import function or only the file name after putting the file into the R working directory. For the latter one, if the user does not know where the working directory is, please use the function getwd to find the working directory or directly create a new project in the target directory with RStudio [File –> New Project].

8.1 QIIME

The qiime1meco() function can be used to construct the microtable object using the raw OTU file from QIIME 1 (Caporaso et al. 2010).

library(file2meco)
# see the help document
?qiime1meco
# Let's run the examples
# use the raw data files stored inside the package
otu_file_path <- system.file("extdata", "otu_table_raw.txt", package="file2meco")
# csv file of metadata
sample_file_path <- system.file("extdata", "sample_info.csv", package="file2meco")
phylo_file_path <- system.file("extdata", "rep_phylo.tre", package="file2meco")
# if you want to use Tax4Fun2 approach, you need read the representative sequences and add it to the microtable object.
rep_fasta_path <- system.file("extdata", "rep.fna", package="file2meco")
# contruct microtable object
qiime1meco(otu_file_path)
qiime1meco(otu_file_path, sample_table = sample_file_path)
qiime1meco(otu_file_path, sample_table = sample_file_path, phylo_tree = phylo_file_path)
qiime1meco(otu_file_path, sample_table = sample_file_path, phylo_tree = phylo_file_path, rep_fasta = rep_fasta_path)

8.2 QIIME2

The qiime2meco() function is designed to create the microtable object using files from QIIME2 (Bolyen et al. 2019). The example data is the ASV (amplicon sequence variant) abundance table based on DADA2 (Callahan et al. 2016).

library(file2meco)
?qiime2meco
# use data files inside the package which were downloaded from (https://docs.qiime2.org/2022.2/tutorials/pd-mice/).
abund_file_path <- system.file("extdata", "dada2_table.qza", package="file2meco")
# tsv file of metadata
sample_file_path <- system.file("extdata", "sample-metadata.tsv", package="file2meco")
taxonomy_file_path <- system.file("extdata", "taxonomy.qza", package="file2meco")
# construct microtable object
qiime2meco(abund_file_path)
qiime2meco(abund_file_path, sample_table = sample_file_path, taxonomy_table = taxonomy_file_path)
# add phylogenetic tree and fasta for more demonstrations
# please download tree from https://docs.qiime2.org/2022.2/data/tutorials/pd-mice/tree.qza
# the file name is 'tree.qza'; put it into the R working directory
tree_data <- "tree.qza"
# please download fasta from https://docs.qiime2.org/2022.2/data/tutorials/pd-mice/dada2_rep_set.qza
# the file name is 'dada2_rep_set.qza'; put it into the R working directory
rep_data <- "dada2_rep_set.qza"
test1 <- qiime2meco(abund_file_path, sample_table = sample_file_path, taxonomy_table = taxonomy_file_path, phylo_tree = tree_data, rep_fasta = rep_data, auto_tidy = TRUE)
test1

8.3 HUMAnN

Many methods in microeco package can be used not only for the traditional species abundance data, i.e. species-sample table, but also for other data, such as metagenomic data. HUMAnN (Franzosa et al. 2018) is an excellent tool for functional profiling analysis of metagenomes and metatranscriptomes at species-level. The humann2meco() function can be used to create the microtable object using metagenomic analysis files from HUMAnN3 (https://huttenhower.sph.harvard.edu/humann). Certainly, it can also be used for the whole community profile of metabolic pathways when needed. Currently, it supports both the MetaCyc (https://metacyc.org/) and KEGG pathway abundance file input directly.

library(file2meco)
library(microeco)
library(magrittr)
?humann2meco
sample_file_path <- system.file("extdata", "example_metagenome_sample_info.tsv", package="file2meco")
match_file_path <- system.file("extdata", "example_metagenome_match_table.tsv", package="file2meco")

# MetaCyc pathway database based analysis
# use the raw data files stored inside the package for MetaCyc pathway database based analysis
abund_file_path <- system.file("extdata", "example_HUMAnN_MetaCyc_abund.tsv", package="file2meco")
# the default db is "MetaCyc"
humann2meco(abund_file_path, db = "MetaCyc")
humann2meco(abund_file_path, db = "MetaCyc", sample_table = sample_file_path, match_table = match_file_path)
# Let's try more interesting usages with microeco
test <- humann2meco(abund_file_path, db = "MetaCyc", sample_table = sample_file_path, match_table = match_file_path)
test$tidy_dataset()
# rel = FALSE sum original abundance instead of relative abundance
test$cal_abund(select_cols = 1:3, rel = FALSE)
test$taxa_abund$Superclass1 %<>% .[!grepl("unclass", rownames(.)), ]
# use_percentage = FALSE disable percentage for relative abundance
test1 <- trans_abund$new(test, taxrank = "Superclass1", ntaxa = 10, use_percentage = FALSE)
# reassign ylab title instead of default 'Relative Abundance'
test1$ylabname <- "Abundance (RPK)"
# bar_full = FALSE show original abundance instead of normalized 0-1
test1$plot_bar(facet = "Group", bar_full = FALSE)
# select both function and taxa
test$cal_abund(select_cols = c("Superclass1", "Phylum", "Genus"), rel = TRUE)
test1 <- trans_abund$new(test, taxrank = "Phylum", ntaxa = 10, delete_taxonomy_lineage = FALSE)
test1$plot_bar(facet = "Group")
# functional biomarker
test$cal_abund(select_cols = 1:3, rel = TRUE)
test$taxa_abund$Superclass1 %<>% .[!grepl("unclass", rownames(.)), ]
test1 <- trans_diff$new(test, method = "lefse", group = "Group")
test1$plot_diff_bar(use_number = 1:20)
# taxonomic biomarker
test$cal_abund(select_cols = 4:9, rel = TRUE)
test$taxa_abund$Phylum %<>% .[!grepl("unclass", rownames(.)), ]
# p_adjust_method = "none" shut down the p value adjustment
test1 <- trans_diff$new(test, method = "lefse", group = "Group", p_adjust_method = "none")
test1$plot_diff_bar(threshold = 2)

# use KEGG pathway based HUMAnN result
abund_file_path <- system.file("extdata", "example_HUMAnN_KEGG_abund.tsv", package="file2meco")
test <- humann2meco(abund_file_path, db = "KEGG", sample_table = sample_file_path, match_table = match_file_path)
test$tax_table %<>% subset(Level.1 != "unclassified")
test$tidy_dataset()

8.4 MetaPhlAn

MetaPhlAn is an software used for metagenomic taxonomic profiling (Blanco-Míguez et al. 2023). The format of MetaPhlAn classification results is usually called ‘mpa’ format. The mpa2meco function is developed for this format conversion to microtable object. See the following example of Kraken2 part.

8.5 Kraken2/Bracken

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer. Kraken2 (Wood, Lu, and Langmead 2019) is the newest version. Bracken (Lu et al. 2017) can be applied to estimate species abundance following the Kraken analysis. The merged Kraken2/Bracken results can be obtained by merge_metaphlan_tables.py from MetaPhlAn or combine_mpa.py from KrakenTools (https://ccb.jhu.edu/software/krakentools/).

# use the raw data files inside the package
abund_file_path <- system.file("extdata", "example_kraken2_merge.txt", package="file2meco")
sample_file_path <- system.file("extdata", "example_metagenome_sample_info.tsv", package="file2meco")
match_file_path <- system.file("extdata", "example_metagenome_match_table.tsv", package="file2meco")
mpa2meco(abund_file_path)
# 'rel = FALSE' means raw abundance in taxa_abund
test1 <- mpa2meco(abund_file_path, sample_table = sample_file_path, match_table = match_file_path, rel = FALSE, use_level = "s__")
test1$taxa_abund$Kingdom[, 1:3]
# 'rel = TRUE' means relative abundance in taxa_abund
test2 <- mpa2meco(abund_file_path, sample_table = sample_file_path, match_table = match_file_path, rel = TRUE)
test2$taxa_abund$Kingdom[, 1:3]
# The relative abundance in test2 is different with that in test3
# The taxonomic abundances in taxa_abund of test3 is calculated based on the otu_table (species abundance) and tax_table
# So when the user need to use raw or relative taxonomic abundance coming from original file, please donot run cal_abund function.
library(microeco)
test3 <- clone(test2)
test3$cal_abund()
test3$taxa_abund$Kingdom[, 1:3]

8.6 NCycDB/PCycDB

NCycDB database (Tu et al. 2018) is a curated integrative database for fast and accurate metagenomic profiling of nitrogen cycling genes. The ncyc2meco function is designed for construct the microtable object using gene abundance files from NCycDB. This function can also be used to parse the output of PCycDB (Zeng et al. 2022) database benefiting from implemented mapping database from v0.7.0. The ncyc2meco function can identify the database and invoke the internal mapping data automatically according to the gene names of input features.

library(file2meco)
library(microeco)
library(magrittr)
?ncyc2meco
# use the raw data files stored inside the package
abund_file_path <- system.file("extdata", "example_Ncyc_table.tsv", package="file2meco")
sample_file_path <- system.file("extdata", "example_metagenome_sample_info.tsv", package="file2meco")
match_file_path <- system.file("extdata", "example_metagenome_match_table.tsv", package="file2meco")
ncyc2meco(abund_file_path)
ncyc2meco(abund_file_path, sample_table = sample_file_path, match_table = match_file_path)

# Let's try more interesting usages with microeco
test <- ncyc2meco(abund_file_path, sample_table = sample_file_path, match_table = match_file_path)
test$tidy_dataset()
# use split_group = TRUE to calculate the pathway abundance with multipe map correspondance
test$cal_abund(select_cols = 1:2, rel = TRUE, split_group = TRUE, split_column = "Pathway")
test$taxa_abund$Pathway %<>% .[!grepl("unclass", rownames(.)), ]
test1 <- trans_abund$new(test, taxrank = "Pathway")
test1$plot_bar(bar_full = FALSE)
# for gene abundance, no splitting on the pathways
test$cal_abund(select_cols = 1:2, rel = TRUE, split_group = FALSE)
test$taxa_abund$Gene %<>% .[!grepl("unclass", rownames(.)), ]
test1 <- trans_abund$new(test, taxrank = "Gene")
test1$plot_bar(bar_full = FALSE)

8.7 phyloseq

Two functions meco2phyloseq() and phyloseq2meco() were provided for the conversion between microtable object and phyloseq object of phyloseq package (Mcmurdie and Holmes 2013).

# Please first install phyloseq
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("phyloseq")
library(phyloseq)

# from microtable to phyloseq object
data("dataset")
physeq <- meco2phyloseq(dataset)
physeq

# from phyloseq to microtable object
data("GlobalPatterns")
meco_dataset <- phyloseq2meco(GlobalPatterns)
meco_dataset

8.8 PICRUSt2

PICRUSt2 (Douglas et al. 2020) contains an updated and larger database of gene families and reference genomes compared to the original version of PICRUSt. We do not create a special file conversion function for PICRUSt2 as it is very easy to convert the output pathway files of PICRUSt2 to microtable object. Two example files of PICRUSt2 output in file2meco package were used to show the operation.

# MetaCyc pathway output
tmp_file_path <- system.file("extdata", "example_PICRUSt2_MetaCyc_path_abun_unstrat.tsv", package="file2meco")
pathway_table <- read.delim(tmp_file_path, row.names = 1)
data("MetaCyc_pathway_map")
tmp <- microtable$new(otu_table = pathway_table, tax_table = MetaCyc_pathway_map)
tmp$tidy_dataset()
tmp

# KEGG pathway output
tmp_file_path <- system.file("extdata", "example_PICRUSt2_KEGG_path_abun_unstrat.tsv", package="file2meco")
pathway_table <- read.delim(tmp_file_path, row.names = 1)
data("Tax4Fun2_KEGG")
tmp <- microtable$new(otu_table = pathway_table, tax_table = Tax4Fun2_KEGG$ptw_desc)
tmp$tidy_dataset()
tmp

8.9 ViromeScan

ViromeScan (Rampelli et al. 2016) is a tool for metagenomic viral community profiling. The input of vs2meco function must be a folder containing all the directories named by sample names. Each sample directory should have the original output files generated by ViromeScan software.

library(microeco)
library(file2meco)
# use viromescan directory inside the package
dir_path <- system.file("extdata", "viromescan", package="file2meco")
d1 <- vs2meco(dir_path)
d1$cal_abund(rel = TRUE)
# d1$taxa_abund$Family is same with the percentage output of viromescan at 
# Family level, i.e. Family_level_results-%.txt file
d1$cal_abund(rel = FALSE)
# d1$taxa_abund$Family is same with the count output of viromescan at 
# Family level, i.e. Family_level_results-Counts.txt file

References

Blanco-Míguez, Aitor, Francesco Beghini, Fabio Cumbo, Lauren J. McIver, Kelsey N. Thompson, Moreno Zolfo, Paolo Manghi, et al. 2023. “Extending and Improving Metagenomic Taxonomic Profiling with Uncharacterized Species Using MetaPhlAn 4.” Journal Article. Nature Biotechnology 41 (11): 1633–44. https://doi.org/10.1038/s41587-023-01688-w.

Bolyen, E., J. R. Rideout, M. R. Dillon, N. A. Bokulich, C. C. Abnet, G. A. Al-Ghalith, H. Alexander, et al. 2019. “Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2.” Journal Article. Nat Biotechnol 37 (8): 852–57. https://doi.org/10.1038/s41587-019-0209-9.

Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.” Journal Article. Nature Methods 13 (7): 581–83. https://doi.org/10.1038/nmeth.3869.

Caporaso, J. Gregory, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D. Bushman, Elizabeth K. Costello, Noah Fierer, et al. 2010. “QIIME Allows Analysis of High-Throughput Community Sequencing Data.” Journal Article. Nature Methods 7 (5): 335–36. https://doi.org/http://www.nature.com/nmeth/journal/v7/n5/suppinfo/nmeth.f.303_S1.html.

Douglas, G. M., V. J. Maffei, J. R. Zaneveld, S. N. Yurgel, J. R. Brown, C. M. Taylor, C. Huttenhower, and M. G. I. Langille. 2020. “PICRUSt2 for Prediction of Metagenome Functions.” Journal Article. Nat Biotechnol 38 (6): 685–88. https://doi.org/10.1038/s41587-020-0548-6.

Franzosa, Eric A., Lauren J. McIver, Gholamali Rahnavard, Luke R. Thompson, Melanie Schirmer, George Weingart, Karen Schwarzberg Lipson, et al. 2018. “Species-Level Functional Profiling of Metagenomes and Metatranscriptomes.” Journal Article. Nature Methods 15 (11): 962–68. https://doi.org/10.1038/s41592-018-0176-y.

Liu, Chi, Xiangzhen Li, Felipe R. P. Mansoldo, Jiaxing An, Yongping Kou, Xiao Zhang, Junming Wang, Jianxiong Zeng, Alane B. Vermelho, and Minjie Yao. 2022. “Microbial Habitat Specificity Largely Affects Microbial Co-Occurrence Patterns and Functional Profiles in Wetland Soils.” Journal Article. Geoderma 418: 115866. https://doi.org/10.1016/j.geoderma.2022.115866.

Lu, Jennifer, Florian P Breitwieser, Peter Thielen, and Steven L Salzberg. 2017. “Bracken: Estimating Species Abundance in Metagenomics Data.” Journal Article. PeerJ Computer Science 3: e104. https://doi.org/10.7717/peerj-cs.104.

Mcmurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An r Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” Journal Article. Plos One 8 (4): e61217.

Rampelli, Simone, Matteo Soverini, Silvia Turroni, Sara Quercia, Elena Biagi, Patrizia Brigidi, and Marco Candela. 2016. “ViromeScan: A New Tool for Metagenomic Viral Community Profiling.” Journal Article. Bmc Genomics 17 (1): 1–9. https://doi.org/10.1186/s12864-016-2446-3.

Tu, Qichao, Lu Lin, Lei Cheng, Ye Deng, and Zhili He. 2018. “NCycDB: A Curated Integrative Database for Fast and Accurate Metagenomic Profiling of Nitrogen Cycling Genes.” Journal Article. Bioinformatics 35 (6): 1040–48. https://doi.org/10.1093/bioinformatics/bty741.

Wood, D. E., J. Lu, and B. Langmead. 2019. “Improved Metagenomic Analysis with Kraken 2.” Journal Article. Genome Biology 20 (1): 257. https://doi.org/10.1186/s13059-019-1891-0.

Zeng, Jiaxiong, Qichao Tu, Xiaoli Yu, Lu Qian, Cheng Wang, Longfei Shu, Fei Liu, et al. 2022. “PCycDB: a comprehensive and accurate database for fast analysis of phosphorus cycling genes.” Journal Article. Microbiome 10 (1): 101. https://doi.org/10.1186/s40168-022-01292-1.