scATAC数据分析流程（下）

HiChIP meta-virtual 4C (metav4C) analysis for Cicero co-accessibility links

光预测获得links还不行，最好还有Hi-C这类染色质构象数据的验证，是否实际实验结果支持这些links是真实存在的。在此之前，作者曾开发了HiChIP技术捕获染色质构象，因此他们利用了以前HiChIP的结果：

Briefly, we used published H3K27ac HiChIP data from primary T cell subsets (naïve, Th17 and Treg) to support predicted Cicero co-accessibility links.

Overlap of Cicero co-accessibility links with GTEx eQTLs

除了应用HiChIP数据，作者还利用了eQTLs数据证明他们找到的peaks-gene links:

eQTLs from the Genotype-Tissue Expression project were used to support the scATAC-seq-defined Cicero co-accessibility links as previously described

previously described参考Science上的文章https://science.sciencemag.org/content/362/6413/eaav1898

Constructing ATAC-seq pseudo-bulk replicates of maximal variance

降采样、随机抽样的方法生成pseudoreplicates，同时要保证pseudo样本里的变异程度尽可能大

We wanted to perform analyses that treated each cluster as a bulk ATAC-seq sample but required a method that could create replicates that convey close to the true population variance within a cluster and potential batch effects.

Identification of cluster-specific peaks and gene scores through feature binarization

从上一步获得的pseudo样本中捕获具有cluster特异性的peaks作为feature，展现不同cluster的特异性：

Unique_peaks.png

Pseudotime analysis

To order cells in pseudotime, we sought to identify a trajectory and then align single cells across the trajectory

应用UMAP的降维空间，计算每个cluster在各个维度的平均坐标值，过滤掉在各个cluster内那些距离平均坐标点的欧氏距离在top5%范围的细胞。这样一来就可以把细胞mapping回UMAP空间，计算它们到各个cluster平均坐标的距离。

拟时分析需要有一个拟时间序列，也就是pseudotime vector，这个向量的构建是通过每个细胞距离不同cluster的距离占总距离的分位数确定的，同时赋予每个细胞time component。

最后在UMAP空间中绘制continuous trajectory，并使曲线平滑，把每个细胞mapping上去

To further support longer trajectories in pseudotime, we evaluated the significance of the trajectory by its cluster ordering

Pseudotime.png

同时作者还计算了trajectory的显著性：

trajectory_plot_distance.png

不过上面两张图我是用TSNE空间画的，和文章原图有区别，不一定准确，因为参数的选取非常重要，这里我只是按照代码运行的结果

Barnyard mixing analysis

为了评估multiplets在不同cell loading中的比例，作者选择混合human (GM12878)和mouse (A20)的样本进行文库制备，loading分别为500, 1,000, 5,000 以及10,000 cells。细胞的过滤标准和 scATAC数据分析流程（上）一样。

比对时用到的参考基因组为hg19和mm10，如果是multiplets，就很有可能其reads同时比对到hg19和mm10上。

We labeled a scATAC-seq profile as a multiplet if less than 95% of the unique nuclear fragments aligned to either hg19 or mm10.

混合样本是50:50制备的，也会存在同一个multiplet只有人或鼠的DNA的情况，所以作者把能通过比对检测到的multiplets rate乘以2，作为一个总multiplets rate的估计值。

作者通过down sample细胞和unique fragments的方法探究这两种因素对于ATAC-seq peak recovery的影响：

把mixing experiments中GM12878和A20的fragments都混合在一起，再先对细胞降采样然后再对fragments降采样，使得平均每个细胞都有相同的unique fragments数。然后对混合的fragments进行peaks calling

We did this analysis by merging all GM12878 and A20 fragments from the mixing experiments into one fragments file. Next, we down-sampled the fragments file first by the number of cells and then by the number of fragments to make the unique fragments per cell match the desired output.

最后使用countoverlap函数（作者手写的）计算降采样样本的peaks和total fragments的peaks比例，得到recovery率

Analysis of fresh versus frozen PBMCs

比较不同样本制备方法对于scATAC的影响。

To do this, we performed scATAC-seq on PBMCs that were freshly isolated, frozen or frozen and sorted for live cells.

用ROC曲线衡量不同方法的效果

Spike-in analysis

Barnyard mixing analysis中用的是不同物种同一类型细胞探究文库制备效果，这个与之相似，不过是用同一物种的单核细胞和T细胞 mixing experiment，检测scATAC下游分析的准确性和潜在批次效应

We tested the sensitivity and performance of our analysis workflow by performing scATAC-seq on monocyte and T cell mixtures at various loadings

引用：[https://doi.org/10.1038/s41587-019-0206-z]