Microbial Diversity (Amplicon/16S rDNA Sequencing): Describing Differential Analysis Methods

I. Differential analyses and what they are for

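a) Random forest model (Random Forest)

A non-linear classifier that captures complex, non-linear dependencies among variables.

Purpose: identify the key OTUs that distinguish two groups of samples.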

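b) Cross-validation

Iterates over combinations of the key OTUs selected by the random forest.

Purpose: build an efficient classifier with the lowest error rate from the smallest possible set of OTUs.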

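c) ROC curve (receiver operating characteristic curve)

A tool for evaluating binary classifiers, that is, for problems with only two classes. It can be used to choose the best discriminant model and the best diagnostic cut-off value.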
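As a rough illustration of how an ROC curve and its AUC are obtained from classifier scores, here is a minimal sketch in plain Python. The function names and the toy labels and scores are hypothetical and chosen only for illustration; in practice a statistics or machine-learning package would normally be used.

```python
def roc_points(labels, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores; higher means "more likely positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Move every sample tied at this score across the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (illustrative values only): 1 = case, 0 = control.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc = auc(roc_points(labels, scores))
print(round(roc_auc, 3))
```

Each point on the curve corresponds to one candidate cut-off, which is why the same curve can be used both to compare models (via AUC) and to pick a diagnostic threshold.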

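d) LEfSe analysis

Compares two or more biological conditions.

Identifies differentially abundant features and the classes they are associated with.

Main computation: the non-parametric factorial Kruskal-Wallis (KW) rank-sum test finds taxa with differential abundance; LDA (linear discriminant analysis) then estimates the effect size of each differentially abundant taxon.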
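The first step of that computation, the Kruskal-Wallis test on one taxon, can be sketched as follows. This is only an illustration of the H statistic, not the LEfSe implementation: it omits the tie correction, the p-value, and the later LDA effect-size step, and the abundance values are made up.

```python
def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical relative abundances (%) of one taxon in three conditions.
abundance = [[1.2, 2.5, 3.1], [4.0, 5.6, 6.3], [7.7, 8.4, 9.9]]
h_stat = kruskal_wallis_h(abundance)
print(round(h_stat, 2))
```

A large H means the rank distributions differ between conditions; H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom to obtain a p-value.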

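e) Wilcoxon rank-sum test (also known as the Mann-Whitney U test)

A non-parametric test for two independent groups of samples.

It can be used to test whether taxa differ significantly in abundance between two groups.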
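A minimal sketch of the test for a single taxon, assuming a two-sided alternative and the large-sample normal approximation without tie or continuity correction (a rough choice for samples this small; exact tables or a statistics package would be preferable in practice). The abundance values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p) for two independent samples.

    U counts the pairs in which the x value exceeds the y value (ties count 0.5).
    The p-value uses the normal approximation, without tie correction.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Hypothetical abundances of one taxon in two groups.
group_1 = [5, 6, 7, 8]
group_2 = [1, 2, 3, 4]
u_stat, p_value = mann_whitney_u(group_1, group_2)
print(u_stat, round(p_value, 4))
```

When many taxa are tested this way, the resulting p-values usually need a multiple-testing correction before any taxon is reported as significant.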

f) Welch's t-test for two groups

Suitable for two groups with unequal variances.

Identifies taxa that differ significantly between the two groups.
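What distinguishes Welch's test from Student's t-test is that the two variances are not pooled and the degrees of freedom are estimated with the Welch-Satterthwaite formula. The sketch below computes only the statistic and the degrees of freedom; the p-value would come from a t distribution, which is left to a statistics package. The data are made up.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, degrees of freedom) for Welch's unequal-variance t-test."""
    n1, n2 = len(x), len(y)
    # Squared standard errors of the two group means (sample variance / n).
    se1 = variance(x) / n1
    se2 = variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical abundances of one taxon in two groups with different spread.
group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_1, group_2)
print(round(t_stat, 3), round(dof, 2))
```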

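g) ANOSIM (analysis of similarities)

A non-parametric test that uses a distance measure (Bray-Curtis, etc.) to assess whether between-group differences are greater than within-group differences.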
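The idea can be sketched in a few functions: compute pairwise Bray-Curtis distances, rank them, compare the mean rank of between-group pairs with that of within-group pairs (the R statistic), and judge significance by permuting the group labels. This is an illustrative sketch with a made-up four-sample OTU table, not a replacement for an established implementation.

```python
import itertools
import random
from statistics import mean


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    """ANOSIM R: (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    n = len(groups)
    pairs = list(itertools.combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    return (mean(between) - mean(within)) / (n * (n - 1) / 4)


def anosim(dist, groups, permutations=999, seed=0):
    """Return (R, permutation p-value)."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    labels = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(labels)
        if anosim_r(dist, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical OTU counts: two samples per group, clearly separated.
otu_table = [[10, 0], [9, 1], [0, 10], [1, 9]]
groups = ["A", "A", "B", "B"]
dist = [[bray_curtis(a, b) for b in otu_table] for a in otu_table]
r_stat, p_value = anosim(dist, groups)
print(round(r_stat, 2))
```

R close to 1 indicates that samples are more similar within groups than between groups; R near 0 indicates no group structure. With only four samples the permutation p-value cannot become small, which is why real analyses need adequate group sizes.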

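h) Adonis (permutational multivariate analysis of variance, PERMANOVA; also called non-parametric multivariate analysis of variance)

Decomposes the total variance using a semimetric (e.g. Bray-Curtis) or metric (e.g. Euclidean) distance matrix, quantifies how much of the variation among samples each grouping factor explains, and uses a permutation test to assess the statistical significance of the partition.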
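For a single grouping factor, the variance partition can be sketched directly from the distance matrix: the total and within-group sums of squares are computed from squared pairwise distances, their difference is the among-group part, and the pseudo-F statistic is tested by permuting labels. The sketch below handles one factor only (Adonis itself supports several) and uses a hypothetical precomputed Bray-Curtis matrix.

```python
import itertools
import random


def partition(dist, labels):
    """Return (pseudo-F, R2) for one grouping factor."""
    n = len(labels)
    ss_total = sum(dist[i][j] ** 2
                   for i, j in itertools.combinations(range(n), 2)) / n
    ss_within = 0.0
    levels = set(labels)
    for level in levels:
        members = [i for i in range(n) if labels[i] == level]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in itertools.combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    a = len(levels)
    pseudo_f = (ss_among / (a - 1)) / (ss_within / (n - a))
    return pseudo_f, ss_among / ss_total


def permanova(dist, labels, permutations=999, seed=0):
    """Return (pseudo-F, R2, permutation p-value)."""
    rng = random.Random(seed)
    observed_f, r2 = partition(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if partition(dist, shuffled)[0] >= observed_f:
            hits += 1
    return observed_f, r2, (hits + 1) / (permutations + 1)


# Hypothetical Bray-Curtis distances among four samples (two per group).
dist = [
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 0.8],
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 0.8, 0.1, 0.0],
]
labels = ["A", "A", "B", "B"]
f_stat, r2, p_value = permanova(dist, labels)
print(round(f_stat, 1), round(r2, 3))
```

R2 is the "explained proportion" usually reported for a grouping factor, and the p-value is the fraction of label permutations whose pseudo-F is at least as large as the observed one.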

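i) Visualizations based on differential analysis

① Box plots of diversity indices (significance computed with the non-parametric Mann-Whitney test)

② Distance-based box plots (significance of between-group differences assessed with multiple Student's two-sample t-tests)

Box plots make outliers and summary values easy to spot.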

II. Describing differential analysis in a paper

a) Random forest model & ROC curve

Example methods description:

Random Forest (RF) analysis was used to find the most discriminatory OTUs between XX with active disease versus XX. As it is unlikely that an OTU present in a minority of samples will have group-related importance, OTUs were only included in the statistical analysis if they were detected in at least 20% of the samples in one of the groups. Prior to actual RF analyses, the microbiome data were transformed via an inverse hyperbolic sine transformation and then mean centered per individual patient. The first step accounts for skewness and can deal with sparse microbiome data. The mean centering per individual diminishes the influence of inter-individual variation.

In the current study, two different RF models were built. The first RF model, based on XX different randomly selected subsets, aimed to find the most discriminatory OTUs between XX and XX. The second RF model was built to demonstrate the contribution of the most discriminatory OTUs in differentiating XX and XX and to test the classification performance of the model in the validation set. The second RF model was based on XX randomly selected subsets. For both RF models, each subset contained all samples from the same individual either in the training set, consisting of 80% of all samples, or in the validation set (the remaining 20%). Thereby, the RF classification model was never trained on part of the measurements of one subject and tested on the remaining measurements of that subject.

The final classification of each sample was determined by a majority of votes (>50%) from XX RF classification models. The final performance of the RF classification model is demonstrated by the receiver operating characteristic (ROC) curve.
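The data-handling steps in the methods text above (prevalence filter, inverse hyperbolic sine transformation, per-individual mean centering, splitting by individual, and majority voting) can be sketched as below. This is one possible reading, not the cited authors' code: it assumes centering subtracts each individual's own per-OTU mean, and it splits individuals rather than samples 80/20, which only approximates an 80/20 sample split when individuals contribute different numbers of samples. All names and data are hypothetical, and the RF training itself is not shown.

```python
import math
import random


def keep_prevalent_otus(counts, groups, min_prevalence=0.2):
    """Indices of OTUs detected in >= min_prevalence of the samples of any one group."""
    kept = []
    for j in range(len(counts[0])):
        for g in sorted(set(groups)):
            rows = [row for row, label in zip(counts, groups) if label == g]
            if sum(row[j] > 0 for row in rows) / len(rows) >= min_prevalence:
                kept.append(j)
                break
    return kept


def asinh_and_center(counts, individuals):
    """asinh-transform counts, then subtract each individual's per-OTU mean."""
    transformed = [[math.asinh(v) for v in row] for row in counts]
    centered = []
    for row, person in zip(transformed, individuals):
        own = [r for r, p in zip(transformed, individuals) if p == person]
        means = [sum(col) / len(own) for col in zip(*own)]
        centered.append([v - m for v, m in zip(row, means)])
    return centered


def split_by_individual(individuals, train_fraction=0.8, seed=0):
    """Sample indices for training/validation; no individual appears in both."""
    rng = random.Random(seed)
    people = sorted(set(individuals))
    rng.shuffle(people)
    train_people = set(people[:round(train_fraction * len(people))])
    train = [i for i, p in enumerate(individuals) if p in train_people]
    valid = [i for i, p in enumerate(individuals) if p not in train_people]
    return train, valid


def majority_vote(votes):
    """votes: one list of 0/1 predictions per model; returns the >50% consensus."""
    return [1 if sum(col) / len(votes) > 0.5 else 0 for col in zip(*votes)]


# Toy data: 4 samples x 3 OTUs from two individuals sampled in two disease states.
counts = [[0, 10, 1], [0, 20, 0], [0, 30, 0], [0, 40, 0]]
groups = ["active", "active", "remission", "remission"]
individuals = ["p1", "p2", "p1", "p2"]

kept = keep_prevalent_otus(counts, groups)
filtered = [[row[j] for j in kept] for row in counts]
centered = asinh_and_center(filtered, individuals)
train_idx, valid_idx = split_by_individual(individuals, train_fraction=0.5)
consensus = majority_vote([[1, 0, 1], [1, 0, 0], [0, 1, 1]])
print(kept, train_idx, valid_idx, consensus)
```

Splitting by individual is the step that prevents leakage: repeated samples from one subject are correlated, so letting them straddle the training and validation sets would inflate the apparent performance.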

Schematic of the ROC analysis

Results description:

We subsequently performed RF analysis to examine whether we could discriminate samples collected during XX and XX based upon the microbiota composition. First, we reduced the data by including only those OTUs (n = XXX) that were present in at least 20% of the XX and/or XX samples. Subsequently, a first RF analysis was used for the selection of the most discriminatory OTUs between XX and XX samples. The RF analysis assigned a variable importance score to each OTU, indicating to what extent the OTUs contributed to the model. Based on the variable importance profile, XX OTUs with the highest variable importance scores were selected. The performance of the RF classification model based on the most discriminatory OTUs resulted in an area under the ROC curve (AUC) of 0.82 for the validation set, corresponding to a sensitivity of 0.79 and a specificity of 0.73. The positive predictive value (PPV) and negative predictive value (NPV) were both 0.76.
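The four performance measures quoted in such a results paragraph all come from the same confusion matrix. A minimal sketch with made-up labels and predictions (not the data behind the numbers above):

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from 0/1 labels and predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),  # cases correctly called positive
        "specificity": tn / (tn + fp),  # controls correctly called negative
        "ppv": tp / (tp + fp),          # positive calls that are true cases
        "npv": tn / (tn + fn),          # negative calls that are true controls
    }


# Hypothetical validation-set labels (1 = case) and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
metrics = diagnostic_metrics(y_true, y_pred)
print(metrics)
```

Unlike sensitivity and specificity, PPV and NPV depend on the proportion of cases in the validation set, so they should be reported together with the group sizes.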

b) Random forest model, cross-validation, ROC curve, Wilcoxon test:

① Methods description:

We mapped reads from the discovery phase, validation phase and independent diagnosis phase against these represented sequences to generate the discovery OTU frequency profile, validation OTU frequency profile and independent diagnosis frequency profile, respectively. The Wilcoxon test was used to determine significance (p<0.05), based on which XX OTU biomarkers were selected for further analysis. Five-fold cross-validation was performed on a random forest model with default parameters using the XX-OTU abundance profile of the training cohort, including XX and XX with liver cirrhosis (assigned as non-HCC cohort) and XX with HCC. Using five trials of the five-fold cross-validation, we then obtained the cross-validation error curve. The point with the minimum cross-validation error was viewed as the cut-off point, and the cut-off value was determined via the minimum error plus the SD at the corresponding point. We listed all sets (≤XX) of OTU markers with an error less than the cut-off value and chose the set with the smallest number of OTUs as the optimal set.

POD index was defined as the ratio between the number of randomly generated decision trees that predicted sample as ‘HCC’ and that of healthy controls. The identified optimal set of OTUs was finally used for the calculation of the POD index for both the training and testing cohort. The receiver operating characteristic (ROC) curve was obtained for the evaluation of the constructed models, and the area under the ROC curve (AUC) was used to designate the ROC effect. The detailed script of microbial marker identification and POD construction can be found in the online supplementary methods.
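The marker-set selection rule described in the methods text (minimum mean cross-validation error plus the SD at that point as the cut-off, then the smallest set below the cut-off) can be sketched as follows. The error values are invented for illustration, and the fallback for a zero SD is an assumption added here, not part of the quoted method.

```python
from statistics import mean, stdev


def optimal_marker_count(cv_errors):
    """Pick the smallest OTU set whose mean CV error is below the cut-off.

    cv_errors maps a number of OTUs to the error rates observed in
    repeated cross-validation trials with that many OTUs.
    """
    mean_error = {k: mean(v) for k, v in cv_errors.items()}
    best = min(mean_error, key=mean_error.get)
    cutoff = mean_error[best] + stdev(cv_errors[best])
    candidates = [k for k, err in mean_error.items() if err < cutoff]
    # Assumption: if the SD is zero nothing is strictly below the cut-off,
    # so fall back to the minimum-error set itself.
    return min(candidates or [best]), cutoff


# Hypothetical error rates from five trials of five-fold cross-validation.
cv_errors = {
    5: [0.30, 0.32, 0.28, 0.31, 0.29],
    10: [0.21, 0.22, 0.20, 0.21, 0.21],
    20: [0.20, 0.22, 0.18, 0.21, 0.19],
    30: [0.21, 0.20, 0.22, 0.23, 0.19],
}
n_otus, cutoff = optimal_marker_count(cv_errors)
print(n_otus, round(cutoff, 4))
```

In this toy example the lowest error is reached with 20 OTUs, but the 10-OTU set is within one SD of it and is therefore preferred as the more parsimonious marker set.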

Results description:

Analysis schematic

To illustrate the diagnostic value of the faecal microbiome for early HCC, we constructed a random forest classifier model that could specifically identify early HCC samples from non-HCC samples. To detect unique OTU markers of early HCC, we conducted a five-fold cross-validation on a random forest model between XX and XX samples in the discovery phase. The result indicated that 30 OTU markers were selected as the optimal marker set. The relative abundance of the 30 OTU markers in each sample from the discovery phase was presented. The corresponding bacterial genera of the 30 OTU markers are listed in the online. The POD index was calculated using the identified optimal 30-OTU set for both the discovery cohort and the validation cohort. In the discovery phase, the POD index achieved an AUC value of 80.64% with 95% CI of 74.47% to 86.8% between early HCC and non-HCC cohorts. The POD value was significantly increased in the early HCC samples versus the non-HCC samples (p = 1.5×10⁻¹⁴). These data suggested that the POD based on microbial OTU markers achieved a powerful diagnostic potential for distinguishing the early HCC cohort from the non-HCC cohort.

c) LEfSe

Schematic of the LEfSe analysis

Figure legend:

Different structures of adenoma tissue and control tissue microbiota. (A) Taxonomic representation of statistically and biologically consistent differences between Adenoma Tissue and Control Tissue. Differences are represented by the color of the most abundant class (red indicating Adenoma Tissue, yellow non-significant and green Control Tissue). The diameter of each circle is proportional to the taxon’s abundance. (B) Histogram of the LDA scores for differentially abundant genera.

Results description:

According to LEfSe analysis, the greatest differences in taxa between the two communities were displayed. …… had more influence in the XX group. ……

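Tips:

Some differential analyses can be described directly in the text without a figure or chart.

When a figure or chart is needed, annotate it according to the results.

Differential analysis involves many methods and many result files, so select and interpret results according to how the manuscript is taking shape and what it actually needs to say.

Some differential analyses and models do not work on the first attempt; the analysis plan has to be adjusted step by step on the basis of existing results to get a satisfactory outcome.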
