mbImpute: an accurate and robust imputation method for microbiome data ================ Ruochen Jiang, Wei Vivian Li, and Jingyi Jessica Li 2020-03-14 # mbImpute The goal of mbImpute is to impute false zero counts in microbiome sequencing data, i.e., a sample-by-taxon count matrix, by jointly borrowing information from similar samples, similar taxa and optional metadata including sample covariates and taxon phylogeny. ## Installation Please install the following R packages first. ``` r install.pacakges("glmnet") install.packages("devtools") ``` Then you can use the following R command to directly install the mbImpute package from GitHub: ``` r library(devtools) install_github("ruochenj/mbImpute/mbImpute R package") ``` ## Example We use the microbiome dataset from Karlsson et al (2013) as an example to demonstrate the use of mbImpute: ``` r # Load the R packages library(mbImpute) library(glmnet) #> Loading required package: Matrix #> Loading required package: foreach #> Loaded glmnet 2.0-18 # Display part of the OTU table otu_tab[1:6, 1:6] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 3954419 2602398 #> S118 0 1169731 #> S121 550000 3162050 #> S126 0 986563 #> S127 0 3940520 #> S131 0 502030 #> s__Dialister_invisus s__Dorea_longicatena s__Ruminococcus_obeum #> S112 1671620 1440718 1112728 #> S118 1412991 623343 190968 #> S121 827400 855614 274969 #> S126 2411487 163956 112493 #> S127 0 554154 286616 #> S131 0 159910 802850 #> s__Coprococcus_comes #> S112 947723 #> S118 58660 #> S121 281024 #> S126 148430 #> S127 210698 #> S131 450721 # Display part of the taxon phylogenetic distance matrix, whose rows and columns correspond to the columns in otu_tab D[1:6, 1:6] #> [,1] [,2] [,3] [,4] [,5] [,6] #> [1,] 0 2 9 10 10 8 #> [2,] 2 0 9 10 10 8 #> [3,] 9 9 0 3 3 3 #> [4,] 10 10 3 0 2 4 #> [5,] 10 10 3 2 0 4 #> [6,] 8 8 3 4 4 0 # Display part of the (optional) meta data, i.e., the sample covariate matrix with rows representing samples and corresponding to the rows in otu_tab meta_data[1:6, 1:6] #> study_condition age number_reads triglycerides hba1c #> S112 IGT 1.293993 0.6475183 0.9926486 1.2575721 #> S118 control 2.587987 1.4075527 0.2357540 0.8803004 #> S121 control 1.293993 0.9558486 0.1613054 1.2575721 #> S126 control 1.293993 1.2244933 0.7072621 1.3833293 #> S127 control 2.587987 0.9428587 0.9057918 1.2575721 #> S131 IGT 1.293993 1.0145444 1.8239918 1.5090865 #> ldl #> S112 1.0022890 #> S118 2.8883168 #> S121 0.9915117 #> S126 2.4895566 #> S127 3.2116358 #> S131 1.7351455 # For all the categorical variables (columns) in meta_data, make sure they are converted to numerical variables. For example, meta_data[,1] <- as.numeric(as.factor(meta_data[,1])) # Demo 1: run mbImpute (imputation will be performed within each condition) on a single core imputed_count_mat_list <- mbImpute(condition = meta_data$study_condition, otu_tab = otu_tab, meta_data = meta_data, D = D) #> [1] "condition 2 is imputing" #> [1] "Working on it!" #> [1] "condition 1 is imputing" #> [1] "Working on it!" #> [1] "condition 3 is imputing" #> [1] "Working on it!" #> [1] "Finished." # A glance at the imputed result, which includes three matrices ## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail) imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 5.335660 5.153952 #> S118 4.670297 4.721291 #> S121 4.409327 5.168918 ## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 216599 142544 #> S118 46804 52635 #> S121 25663 147541 ## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 3954404 2602397 #> S118 1040127 1169709 #> S121 549997 3162029 # Demo 2: if you have multiple (e.g., 4) cores and would like to do parallel computing imputed_count_mat_list <- mbImpute(condition = meta_data$study_condition, otu_tab = otu_tab, meta_data = meta_data, D = D, parallel = TRUE, ncores = 4) #> [1] "condition 2 is imputing" #> [1] "Working on it!" #> [1] "condition 1 is imputing" #> [1] "Working on it!" #> [1] "condition 3 is imputing" #> [1] "Working on it!" #> [1] "Finished." # A glance at the imputed result, which includes three matrices ## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail) imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 5.335660 5.153952 #> S118 4.670297 4.721291 #> S121 4.409327 5.168918 ## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 216599 142544 #> S118 46804 52635 #> S121 25663 147541 ## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 3954404 2602397 #> S118 1040127 1169709 #> S121 549997 3162029 # Demo 3: if you do not have meta data or phylogenetic information, and the samples belong to one condition otu_tab_T2D <- otu_tab[meta_data$study_condition == "T2D",] imputed_count_matrix_list <- mbImpute(otu_tab = otu_tab_T2D) #> [1] "Meta data information unavailable" #> [1] "Phylogenentic information unavailable" #> [1] "Finished." # A glance at the imputed result, which includes three matrices ## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail) imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 5.335660 5.153952 #> S118 4.670297 4.721291 #> S121 4.409327 5.168918 ## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 216599 142544 #> S118 46804 52635 #> S121 25663 147541 ## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2] #> s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii #> S112 3954404 2602397 #> S118 1040127 1169709 #> S121 549997 3162029 ``` Reference: Karlsson, F. H., Tremaroli, V., Nookaew, I., Bergström, G., Behre, C. J., Fagerberg, B., … & Bäckhed, F. (2013). Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature, 498(7452), 99-103.