Today we continue with the basics and go through some fundamental knowledge and concepts.
Gene enrichment and covariation analysis
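When we analyse TCR data we should, as usual, compare an experimental group against a control group. The most important question is which genes the experimental group preferentially selects during TCR rearrangement after pathogen stimulation. Gene usage preferences were quantified by calculating a normalized Jensen–Shannon divergence (JSD) between the observed gene segment frequencies for each repertoire and background gene frequencies calculated from large-scale repertoire profiling studies. In other words, this measures how far the gene choice in TCR rearrangement of the disease samples deviates from that of normal samples.
The metric used here is the JS divergence, which is the symmetric version of the Kullback-Leibler divergence (see the article "KL divergence, JS divergence and Wasserstein distance"). The JSD values are further normalized by dividing them by the mean Shannon entropy of the two distributions being compared, which helps to correct for variation in total gene number across segments. (Shannon entropy is also called information entropy; see my earlier post "Basic algorithms for 10X single-cell (10X spatial transcriptomics): KL divergence".)
To set lower significance thresholds for the JSD heat maps (that is, the values below which the mapped colour is a uniform dark blue), the authors compared the 2–4 different background repertoire datasets for each chain/organism to one another and took the largest observed JSD value across all comparisons. In our own analysis, the control samples play the role of these background datasets.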
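To make the normalization concrete, here is a minimal sketch of a normalized JSD between two gene-usage count tables. It is not the authors' code: the function names, the choice of log base 2 and the handling of zero entropy are my own assumptions, and the counts in the usage note are made-up toy numbers.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (in bits) of a probability vector; zero entries are skipped."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def normalized_jsd(observed_counts, background_counts):
    """JSD between two gene-usage count dicts, divided by the mean Shannon entropy."""
    genes = sorted(set(observed_counts) | set(background_counts))
    n_obs = sum(observed_counts.values())
    n_bg = sum(background_counts.values())
    p = [observed_counts.get(g, 0) / n_obs for g in genes]
    q = [background_counts.get(g, 0) / n_bg for g in genes]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    h_p, h_q = shannon_entropy(p), shannon_entropy(q)
    mean_entropy = (h_p + h_q) / 2
    jsd = shannon_entropy(m) - mean_entropy  # JSD = H(M) - mean(H(P), H(Q))

    # Assumption: if both distributions use a single gene the entropy is zero,
    # so the ratio is undefined; report 0 for identical usage, inf otherwise.
    if mean_entropy == 0:
        return 0.0 if jsd == 0 else float("inf")
    return jsd / mean_entropy
```

For example, `normalized_jsd({"TRBV19": 8, "TRBV7-9": 2}, {"TRBV19": 2, "TRBV7-9": 8})` gives a positive value, while two identical tables give 0. Because the JSD and the entropy use the same log base, the ratio does not depend on that base.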
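Covariation between gene usage in different segments (that is, whether the gene chosen in one segment is statistically associated with the gene chosen in another, which is not covariance in the strict sense) was quantified using the adjusted mutual information, a variant of the mutual information metric that corrects for the numbers and frequencies of the observed genes (mutual information between pairs of distributions tends to increase with the number of observation classes). That said, this is probably not used very often with single-cell data.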
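As a rough illustration of why the adjustment matters, the sketch below computes the plain mutual information between V and J gene labels and then subtracts a chance-level baseline. Note the swap: the analytic adjusted mutual information uses a closed-form expected value, whereas here the baseline is estimated by shuffling, which is only an approximation. The function names, `n_shuffles` and `seed` are hypothetical. If scikit-learn is available, `sklearn.metrics.adjusted_mutual_info_score` provides the analytic version.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """Plain mutual information (in nats) between two paired label lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def shuffle_adjusted_mi(xs, ys, n_shuffles=200, seed=0):
    """(MI - E[MI]) / (mean entropy - E[MI]), with E[MI] estimated by shuffling.

    Approximation only: the analytic AMI replaces the shuffling step with a
    closed-form expectation.
    """
    rng = random.Random(seed)
    mi = mutual_information(xs, ys)
    shuffled = list(ys)
    expected = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real V-J pairing
        expected += mutual_information(xs, shuffled)
    expected /= n_shuffles
    denom = (entropy(xs) + entropy(ys)) / 2 - expected
    return (mi - expected) / denom if denom > 0 else 0.0
```

With perfectly coupled labels (every V gene always pairs with the same J gene) the adjusted value is 1, and with independent usage it sits at or slightly below 0, whatever the number of genes.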
CDR3 motif discovery.
The authors used a simple, depth-first search procedure to identify over-represented sequence patterns in the CDR3 amino acid sequences of each repertoire. Motifs were represented as fixed-length patterns consisting of fully-specified amino acid positions, wild card positions, and amino acid group positions. The score of a motif was calculated using a chi-squared formalism:
score = (observed − expected)^2 / expected
where 'observed' represents the number of times the motif was observed in the repertoire sequences and 'expected' represents an estimate of the expected number of observations based on a background set of TCR sequences with V and J gene compositions that match the observed repertoire. For single-cell data we set this background to the control samples. This part is the real key.
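A small sketch of the chi-squared style score may help here. It assumes a motif is written as a regular-expression fragment (`.` for a wild card, a character class such as `[KR]` for an amino acid group), which is my own representation and not necessarily the one used in tcr-dist. The pseudocount, the rule of scoring only over-represented motifs and the toy CDR3 sequences are also assumptions. The sketch does not match the background on V and J gene composition as the text describes; it simply rescales the background hit rate to the repertoire size.

```python
import re


def count_matches(motif, cdr3s):
    """Number of CDR3 sequences that contain the motif at least once."""
    pattern = re.compile(motif)
    return sum(1 for seq in cdr3s if pattern.search(seq))


def motif_score(motif, repertoire, background, pseudocount=0.5):
    """(observed - expected)^2 / expected, counted only for over-represented motifs."""
    observed = count_matches(motif, repertoire)
    bg_hits = count_matches(motif, background)
    # Assumption: a pseudocount keeps 'expected' above zero when the motif
    # never occurs in the background set.
    expected = max(bg_hits, pseudocount) / len(background) * len(repertoire)
    if observed <= expected:
        return 0.0  # assumption: depleted motifs are not of interest here
    return (observed - expected) ** 2 / expected


# Toy sequences for illustration only.
repertoire = ["CASSIRSSYEQYF", "CASSIRSAYEQYF", "CASSLGQGYEQYF", "CASSIRAAYEQYF"]
background = ["CASSIRSSYEQYF", "CASSLGQGYEQYF", "CASSPGTGYEQYF", "CASSQDRGNTLYF"]
irs_score = motif_score("IRS", repertoire, background)
```

In the toy data, `IRS` is seen twice in the repertoire and once in the equally sized background, so observed is 2 and expected is 1.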
Starting with two-position motifs scoring above a seed threshold, each motif was iteratively extended by adding new specified positions (that is, replacing an internal wild card or lengthening the motif at either end) that increased the motif score. The set of identified motifs was sorted by motif score and filtered for redundancy. Finally, motifs scoring above a threshold were extended to include near-neighbour TCRs using a stringent distance threshold; this allowed the authors to capture additional pattern instances that were not captured by their limited set of amino acid groupings. The final set of motifs for each repertoire was visualized using the TCR logo representation. (This looks like the right way to approach TCR analysis.)
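The extension step can be sketched as a greedy loop. This is a simplification of the depth-first search described above, not the authors' procedure: it follows only the single best extension in each round, ignores amino acid groups, and skips the seed threshold, redundancy filtering and near-neighbour steps. The motif representation (a string with `.` as wild card), the pseudocount and all names are hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def score(motif, repertoire, background, pseudocount=0.5):
    """Chi-squared style score, counted only when the motif is over-represented."""
    pattern = re.compile(motif)
    observed = sum(1 for s in repertoire if pattern.search(s))
    bg_hits = sum(1 for s in background if pattern.search(s))
    expected = max(bg_hits, pseudocount) / len(background) * len(repertoire)
    return (observed - expected) ** 2 / expected if observed > expected else 0.0


def candidate_extensions(motif):
    """Yield every motif that has exactly one more specified position."""
    for i, ch in enumerate(motif):
        if ch == ".":  # replace an internal wild card
            for aa in AMINO_ACIDS:
                yield motif[:i] + aa + motif[i + 1:]
    for aa in AMINO_ACIDS:  # lengthen at either end
        yield aa + motif
        yield motif + aa


def extend_motif(seed, repertoire, background):
    """Greedily extend a seed motif while the score strictly increases."""
    best, best_score = seed, score(seed, repertoire, background)
    while True:
        top_score, top = max(
            (score(c, repertoire, background), c) for c in candidate_extensions(best)
        )
        if top_score <= best_score:
            return best, best_score
        best, best_score = top, top_score
```

Given a repertoire where three of four toy CDR3s contain `IRS` and a background where `I.S` matches only one unrelated sequence, the seed `I.S` is refined to `IRS`, because fixing the wild card removes the background hit without losing any repertoire hits.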
TCRdiv: measuring diversity (also very important)
To measure diversity, the authors use TCRdiv, which generalizes Simpson's diversity index by accounting for TCR similarity as well as exact identity (for Simpson's diversity index itself, you can look it up on Baidu Baike). Simpson's diversity can be thought of as the probability that two samples drawn independently from a mixed population belong to the same species or category. Put differently, it is the expected value of a function of the two drawn samples that returns 1 if the samples are identical and 0 otherwise. We instead estimate the expected value of a Gaussian function (this really does take a fair amount of maths) of the inter-sample distance that returns 1 if the two samples are identical and exp(−(TCRdist(a,b) / s.d.)^2) otherwise, where the s.d. was taken to be 18.45 for single-chain distances and twice that for paired analyses based on empirical assessments of receptor distance distributions for multiple epitopes. Taking the inverse of this estimate gives a diversity measure (TCRdiv) that can be interpreted as an effective population size for similarity-weighted sharing.
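The definition above can be sketched in a few lines. This is a reading of the text, not the tcr-dist implementation: I assume the expectation is taken over all unordered pairs of distinct receptors in the sample, and `toy_distance` is a hypothetical stand-in for TCRdist (its 12-units-per-mismatch scale is arbitrary) so that the example runs on its own.

```python
import math
from itertools import combinations


def tcrdiv(receptors, distance, sd=18.45):
    """Inverse of the mean Gaussian similarity over all pairs of receptors.

    `distance` is any function returning a TCRdist-like distance between two
    receptors; sd=18.45 is the single-chain value quoted in the text.
    """
    pairs = list(combinations(receptors, 2))
    if not pairs:
        raise ValueError("need at least two receptors")
    mean_similarity = sum(
        math.exp(-((distance(a, b) / sd) ** 2)) for a, b in pairs
    ) / len(pairs)
    # If every pair is extremely distant the mean can underflow to zero.
    return 1.0 / mean_similarity if mean_similarity > 0 else float("inf")


def toy_distance(a, b):
    """HYPOTHETICAL stand-in for TCRdist: 12 units per mismatched or unpaired position."""
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return 12.0 * mismatches
```

A sample of identical receptors gives a TCRdiv of exactly 1 (an effective population of one clone), and the value grows as the receptors become more dissimilar. With a 0/1 similarity in place of the Gaussian, the same calculation reduces to the inverse of Simpson's index.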
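(This part is a little hard to follow, so it takes some patience and study.)
The code for this part lives in tcr-dist. The authors have already packaged everything, so we can simply use it; if you are interested, it is well worth reading through.
That covers the basics. The analyses that follow will take things up a level.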
Life is good, and even better with you.