系列回顾:
ArchR官网教程学习笔记1:Getting Started with ArchR
ArchR官网教程学习笔记2:基于ArchR推测Doublet
ArchR官网教程学习笔记3:创建ArchRProject
ArchR官网教程学习笔记4:ArchR的降维
ArchR官网教程学习笔记5:ArchR的聚类
ArchR官网教程学习笔记6:单细胞嵌入(Single-cell Embeddings)
ArchR官网教程学习笔记7:ArchR的基因评分和Marker基因
ArchR官网教程学习笔记8:定义与scRNA-seq一致的聚类
ArchR官网教程学习笔记9:ArchR的伪批量重复
ArchR官网教程学习笔记10:ArchR的call peak
ArchR官网教程学习笔记11:鉴定Marker峰
ArchR官网教程学习笔记12:Motif和Feature富集
ArchR官网教程学习笔记13:ChromVAR偏差富集
ArchR官网教程学习笔记14:ArchR的Footprinting分析
在开始之前,先要对本章笔记说明一下。本章的中间“Peak2GeneLinkage”部分里的代码,因为按照官网的代码运行总是报错,所以我去ArchR的github上直接问的developer(见https://github.com/GreenleafLab/ArchR/issues/442)。根据developer说的,是代码本身的bug,所以他们在新版ArchR里修复了这个bug。(如何安装新版的ArchR下文里有提到)所以这就是对于任何我们想学习的R包或者软件,一定要自己亲自运行一下,否则即便你使用的官网示例数据和官方代码,仍然有可能报错。
ArchR的主要优势之一是它能够集成多层次的信息,从而提供新的角度。这可以采取ATAC-seq分析的形式,如识别峰共可接近性以预测调控相互作用,或整合scRNA-seq数据的分析,如通过peak-to-gene连锁分析预测增强子活性。在任何一种情况下,ArchR都可以轻松地从你的scATAC-seq数据中获得更深入的见解。
(一)创建低重叠的细胞聚集(Creating Low-Overlapping Aggregates of Cells)
ArchR方便了许多涉及特征相关性的综合分析。用稀疏的单细胞数据进行这些计算会导致相关性分析中大量的背景信号。为了规避这一挑战,我们采用了Cicero的方法,在这些分析之前创建低重叠的单细胞聚集。为了减少偏差,我们过滤与任何其他聚合重叠度超过80%的聚合( aggregates )。为了提高这种方法的速度,我们开发了一个优化的迭代重叠检查过程的实现,并使用“Rcpp”包在c++中实现了快速的特征相关性。这些优化方法在ArchR中用于计算峰共可接近性、peak-to-gene的连锁和其他连锁分析。这些低重叠聚合的使用是在后台进行的,但为了清楚起见,我们在这里提到它。
(二)共可接近性分析
协同可接近性是指跨多个单细胞的两个峰之间的可接近性相关性。换句话说,当A峰在单细胞内可接近时,B峰通常也可接近。下面我们将直观地说明这一概念,显示enhancer E3通常与promoter P可共同接近。
关于共可接近性分析需要注意的一件事是,它通常将特定细胞类型的峰标识为共可接近的。这是因为这些峰通常在单个细胞类型中都可以接近,而在所有其他细胞类型中通常都不能被接近。这产生了很强的相关性,但并不一定意味着这些峰之间存在调控关系。
为了计算ArchR中的协同可接近性,我们使用储存在ArchRProject中协同可接近性峰信息的addCoAccessibility()
函数。
> projHeme5 <- addCoAccessibility(
ArchRProj = projHeme5,
reducedDims = "IterativeLSI"
)
我们可以通过getCoAccessibility()
函数从ArchRProject检索这个协同可接近性信息,如果returnLoops = FALSE
,该函数将返回一个DataFrame对象:
> cA <- getCoAccessibility(
ArchRProj = projHeme5,
corCutOff = 0.5,
resolution = 1,
returnLoops = FALSE
)
DataFrame包含了一些重要的信息。queryHits
和subjectHits
列表示是相关的两个峰的索引。correlation列给出了这两个峰之间的可接近性的数值相关性:
> cA
DataFrame with 66069 rows and 11 columns
queryHits subjectHits seqnames correlation Variability1 Variability2
<integer> <integer> <Rle> <numeric> <numeric> <numeric>
1 7 11 chr1 0.674068131604548 0.00454332740253749 0.0306386713792158
2 11 7 chr1 0.674068131604548 0.0306386713792158 0.00454332740253749
3 20 21 chr1 0.555556855508682 0.0263183019286269 0.00550013127461839
4 21 20 chr1 0.555556855508682 0.00550013127461839 0.0263183019286269
5 44 55 chr1 0.556849339818455 0.0328129362145881 0.0276923110094824
... ... ... ... ... ... ...
66065 140763 140761 chrX 0.544862329728945 0.00105524816663243 0.00103018596537361
66066 140819 140825 chrX 0.52077899590599 0.0082391986288855 0.00123800124356212
66067 140825 140819 chrX 0.52077899590599 0.00123800124356212 0.0082391986288855
66068 140825 140826 chrX 0.536143354146812 0.00123800124356212 0.000926987273521345
66069 140826 140825 chrX 0.536143354146812 0.000926987273521345 0.00123800124356212
TStat Pval FDR VarQuantile1 VarQuantile2
<numeric> <numeric> <numeric> <numeric> <numeric>
1 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39 0.497425136231847 0.922787369881207
2 14.8753870882021 1.57568490812297e-41 2.25369520385051e-39 0.922786833059823 0.497426746695999
3 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28 0.897581458618855 0.553941154713533
4 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28 0.553940081070765 0.897582532261623
5 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29 0.933085751311052 0.906216767401199
... ... ... ... ... ...
66065 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28 0.113733662512206 0.109280192310893
66066 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25 0.664993931234254 0.14330607891167
66067 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25 0.143306615733054 0.66499607851979
66068 11.8316525493 1.4493708075588e-28 5.40902049408846e-27 0.143306615733054 0.092152369234337
66069 11.8316525493 1.44937080755876e-28 5.40902049408846e-27 0.0921534428771049 0.14330607891167
这个共可接近性数据DataFrame还具有一个metadata组件,该metadata组件包含相关峰的GRanges
对象。上面提到的queryHits
和subject
的索引适用于这个GRanges对象:
> metadata(cA)[[1]]
GRanges object with 140865 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
Mono chr1 752495-752995 *
CD4.N chr1 757871-758371 *
CD4.N chr1 762690-763190 *
GMP chr1 773670-774170 *
B chr1 801006-801506 *
... ... ... ...
NK chrX 154807254-154807754 *
Mono chrX 154840907-154841407 *
Mono chrX 154841994-154842494 *
NK chrX 154862043-154862543 *
GMP chrX 154996991-154997491 *
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
上面的代码里,returnLoops = FALSE
。如果我们设置returnLoops = TRUE
,那么getCoAccessibility()
将以循环track的形式返回co-accessibility信息。在这个GRanges
对象中,IRanges
的开始和结束映射到相互作用的两个不同的可共同接近的峰。分辨率参数设置这些循环的碱基对分辨率。当resolution = 1时,会创建连接每个峰中心的loop:
> cA <- getCoAccessibility(
ArchRProj = projHeme5,
corCutOff = 0.5,
resolution = 1,
returnLoops = TRUE
)
> cA[[1]]
GRanges object with 33035 ranges and 9 metadata columns:
seqnames ranges strand | correlation Variability1
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 845576-856623 * | 0.674068131604548 0.00454332740253749
[2] chr1 894703-895398 * | 0.555556855508682 0.0263183019286269
[3] chr1 968510-1004214 * | 0.556849339818455 0.0328129362145881
[4] chr1 974306-975108 * | 0.572146747607589 0.0093636446649745
[5] chr1 1072037-1107432 * | 0.541661438734924 0.00679983990004832
... ... ... ... . ... ...
[33031] chrX 153276031-153306080 * | 0.782117525371192 0.0140283037671134
[33032] chrX 153306080-153342756 * | 0.538505319617638 0.00154044262912208
[33033] chrX 153591527-153594212 * | 0.544862329728945 0.00103018596537361
[33034] chrX 153941234-153959650 * | 0.52077899590599 0.0082391986288855
[33035] chrX 153959650-153960343 * | 0.536143354146812 0.000926987273521345
Variability2 TStat Pval FDR
<numeric> <numeric> <numeric> <numeric>
[1] 0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
[2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
[3] 0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
[4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
[5] 0.013335574910477 11.9534260620727 4.68165878439767e-29 1.83531999911093e-27
... ... ... ... ...
[33031] 0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
[33032] 0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
[33033] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
[33034] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
[33035] 0.00123800124356212 11.8316525493 1.44937080755876e-28 5.40902049408846e-27
VarQuantile1 VarQuantile2 value
<numeric> <numeric> <numeric>
[1] 0.497425136231847 0.922787369881207 0.674068131604548
[2] 0.897581458618855 0.553941154713533 0.555556855508682
[3] 0.933085751311052 0.906216767401199 0.556849339818455
[4] 0.696956276435098 0.581836004288129 0.572146747607589
[5] 0.613238444785505 0.773817288547399 0.541661438734924
... ... ... ...
[33031] 0.783751168257537 0.763801275165515 0.782117525371192
[33032] 0.190906567848586 0.763801275165515 0.538505319617638
[33033] 0.109280729132277 0.113733125690822 0.544862329728945
[33034] 0.664993931234254 0.14330607891167 0.52077899590599
[33035] 0.0921534428771049 0.14330607891167 0.536143354146812
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
相反,如果我们将loop的分辨率降低到resolution = 1000,这将有助于过度绘制协同可接近性相互作用。下面,我们看到GRanges对象中的条目总数比上面的少:
> cA <- getCoAccessibility(
ArchRProj = projHeme5,
corCutOff = 0.5,
resolution = 1000,
returnLoops = TRUE
)
> cA[[1]]
GRanges object with 31594 ranges and 9 metadata columns:
seqnames ranges strand | correlation Variability1
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 845500-856500 * | 0.674068131604548 0.00454332740253749
[2] chr1 894500-895500 * | 0.555556855508682 0.0263183019286269
[3] chr1 968500-1004500 * | 0.556849339818455 0.0328129362145881
[4] chr1 974500-975500 * | 0.572146747607589 0.0093636446649745
[5] chr1 1072500-1079500 * | 0.653852889650734 0.0171476615215405
... ... ... ... . ... ...
[31590] chrX 153276500-153306500 * | 0.782117525371192 0.0140283037671134
[31591] chrX 153306500-153342500 * | 0.538505319617638 0.00154044262912208
[31592] chrX 153591500-153594500 * | 0.544862329728945 0.00103018596537361
[31593] chrX 153941500-153959500 * | 0.52077899590599 0.0082391986288855
[31594] chrX 153959500-153960500 * | 0.536143354146812 0.000926987273521345
Variability2 TStat Pval FDR
<numeric> <numeric> <numeric> <numeric>
[1] 0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
[2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
[3] 0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
[4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
[5] 0.0405524614744011 14.4292755824871 1.50202844679026e-39 1.72121316754705e-37
... ... ... ... ...
[31590] 0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
[31591] 0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
[31592] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
[31593] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
[31594] 0.00123800124356212 11.8316525493 1.44937080755876e-28 5.40902049408846e-27
VarQuantile1 VarQuantile2 value
<numeric> <numeric> <numeric>
[1] 0.497425136231847 0.922787369881207 0.674068131604548
[2] 0.897581458618855 0.553941154713533 0.555556855508682
[3] 0.933085751311052 0.906216767401199 0.556849339818455
[4] 0.696956276435098 0.581836004288129 0.572146747607589
[5] 0.82168833546183 0.961734834930109 0.653852889650734
... ... ... ...
[31590] 0.783751168257537 0.763801275165515 0.782117525371192
[31591] 0.190906567848586 0.763801275165515 0.538505319617638
[31592] 0.109280729132277 0.113733125690822 0.544862329728945
[31593] 0.664993931234254 0.14330607891167 0.52077899590599
[31594] 0.0921534428771049 0.14330607891167 0.536143354146812
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
类似地,如果我们进一步降低分辨率,使resolution = 10000,我们将识别出更少的协同可接近性相互作用:
> cA <- getCoAccessibility(
ArchRProj = projHeme5,
corCutOff = 0.5,
resolution = 10000,
returnLoops = TRUE
)
> cA[[1]]
GRanges object with 21257 ranges and 9 metadata columns:
seqnames ranges strand | correlation Variability1
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 845000-855000 * | 0.674068131604548 0.00454332740253749
[2] chr1 895000 * | 0.555556855508682 0.0263183019286269
[3] chr1 965000-1005000 * | 0.556849339818455 0.0328129362145881
[4] chr1 975000 * | 0.572146747607589 0.0093636446649745
[5] chr1 1075000 * | 0.653852889650734 0.0171476615215405
... ... ... ... . ... ...
[21253] chrX 153275000-153305000 * | 0.782117525371192 0.0140283037671134
[21254] chrX 153305000-153345000 * | 0.538505319617638 0.00154044262912208
[21255] chrX 153595000 * | 0.544862329728945 0.00103018596537361
[21256] chrX 153945000-153955000 * | 0.52077899590599 0.0082391986288855
[21257] chrX 153955000-153965000 * | 0.536143354146812 0.000926987273521345
Variability2 TStat Pval FDR
<numeric> <numeric> <numeric> <numeric>
[1] 0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
[2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
[3] 0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
[4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
[5] 0.0405524614744011 14.4292755824871 1.50202844679026e-39 1.72121316754705e-37
... ... ... ... ...
[21253] 0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
[21254] 0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
[21255] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
[21256] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
[21257] 0.00123800124356212 11.8316525493 1.44937080755876e-28 5.40902049408846e-27
VarQuantile1 VarQuantile2 value
<numeric> <numeric> <numeric>
[1] 0.497425136231847 0.922787369881207 0.674068131604548
[2] 0.897581458618855 0.553941154713533 0.555556855508682
[3] 0.933085751311052 0.906216767401199 0.556849339818455
[4] 0.696956276435098 0.581836004288129 0.572146747607589
[5] 0.82168833546183 0.961734834930109 0.653852889650734
... ... ... ...
[21253] 0.783751168257537 0.763801275165515 0.782117525371192
[21254] 0.190906567848586 0.763801275165515 0.538505319617638
[21255] 0.109280729132277 0.113733125690822 0.544862329728945
[21256] 0.664993931234254 0.14330607891167 0.52077899590599
[21257] 0.0921534428771049 0.14330607891167 0.536143354146812
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
一旦我们在ArchRProject中添加了可接近性信息,我们就可以在绘制browser tracks时使用它作为loop track。我们通过plotBrowserTrack()
函数的loop
参数来实现这一点。这里,我们使用getCoAccessibility()
的默认参数,包括corCutOff = 0.5
、resolution = 1000
和returnLoops = TRUE
:
> markerGenes <- c(
"CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", #B-Cell Trajectory
"CD14", #Monocytes
"CD3D", "CD8A", "TBX21", "IL7R" #TCells
)
> p <- plotBrowserTrack(
ArchRProj = projHeme5,
groupBy = "Clusters2",
geneSymbol = markerGenes,
upstream = 50000,
downstream = 50000,
loops = getCoAccessibility(projHeme5)
)
画图:
> grid::grid.newpage()
> grid::grid.draw(p$CD14)
保存图片:
> plotPDF(plotList = p,
name = "Plot-Tracks-Marker-Genes-with-CoAccessibility.pdf",
ArchRProj = projHeme5,
addDOC = FALSE, width = 5, height = 5)
(三)ArchR的Peak2GeneLinkage分析
这一部分请注意!你需要安装最新版的ArchR,如果你安装的是1.0.0,这一部分是要报错的!安装新版的代码:
devtools::install_github("GreenleafLab/ArchR", ref="release_1.0.1", repos = BiocManager::repositories())
安装后为了确定确实是新版了,检查一下:
packageVersion("ArchR")
[1] ‘1.0.1’
与共可接近性相似,ArchR也可以鉴定所谓的“peak-to-gene links”。peak-to-gene links和共可接近性之间的主要区别是,共可接近性是一种仅ATAC-seq分析,寻找可接近性在两个峰之间的相关性,而peak-to-gene linkage利用整合的scRNA-seq数据来寻找峰可接近性和基因表达之间的相关性。这些代表了一个类似问题的正交方法。然而,由于peak-to-gene链接与scATAC-seq和scRNA-seq数据相关,我们通常认为这些links与基因调控相互作用更相关。
为了识别ArchR中的peak-to-gene links,我们使用addPeak2GeneLinks()
函数:
> projHeme5 <- addPeak2GeneLinks(
ArchRProj = projHeme5,
reducedDims = "IterativeLSI"
)
然后,我们可以检索这些peak-to-gene links,其方式类似于使用getPeak2GeneLinks()
函数检索协同可接近性相互作用。正如我们在前面看到的,这个函数允许用户指定相关性的cutoff和linkages的分辨率,首先我们尝试使用returnLoops = FALSE:
> p2g <- getPeak2GeneLinks(
ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1,
returnLoops = FALSE
)
当returnLoops
设置为false时,该函数将返回一个DataFrame对象,类似于getCoAccessibility()
返回的DataFrame对象。主要区别在于scATAC-seq峰的索引存储在idxATAC
列中,而scRNA-seq基因的索引存储在idxRNA
列中。
这个peak-to-gene linkage的dataframe还有一个metadata组件,该组件包含相关峰的GRanges对象。上述idxATAC的各项指标均适用于该对象。
> metadata(p2g)[[1]]
GRanges object with 140865 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 752495-752995 *
[2] chr1 757871-758371 *
[3] chr1 762690-763190 *
[4] chr1 773670-774170 *
[5] chr1 801006-801506 *
... ... ... ...
[140861] chrX 154807254-154807754 *
[140862] chrX 154840907-154841407 *
[140863] chrX 154841994-154842494 *
[140864] chrX 154862043-154862543 *
[140865] chrX 154996991-154997491 *
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
如果我们设置returnLoops = TRUE
,那么getPeak2GeneLinks()
将返回一个loop track GRanges对象,该对象连接peak和gene。在共可接近性方面,IRanges对象的开始和结束代表了峰和被连接基因的位置。当resolution = 1时,它将峰的中心连接到基因的单碱基TSS上。
> p2g <- getPeak2GeneLinks(
ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1,
returnLoops = TRUE
)
> p2g[[1]]
GRanges object with 151425 ranges and 2 metadata columns:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 752745-901877 * | 0.45482431317995
[2] chr1 752745-935552 * | 0.323287626544222
[3] chr1 752745-948847 * | 0.452241013027625
[4] chr1 762902-845576 * | 0.399410417826805
[5] chr1 762902-846795 * | 0.277657378194161
... ... ... ... . ...
[151421] chrX 154299547-154322956 * | 0.339960030862308
[151422] chrX 154299547-154417246 * | 0.184827972853627
[151423] chrX 154299547-154445180 * | 0.234457969710443
[151424] chrX 154322956-154444701 * | 0.272574364121956
[151425] chrX 154444701-154493813 * | 0.262058933222826
FDR
<numeric>
[1] 3.64903693825748e-25
[2] 1.36105300270992e-12
[3] 7.3782745193054e-25
[4] 3.72871411563353e-19
[5] 1.80208567788527e-09
... ...
[151421] 7.03940344317308e-14
[151422] 9.55454380815343e-05
[151423] 5.1496329871895e-07
[151424] 3.70925182517263e-09
[151425] 1.57206595271312e-08
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
我们也可以通过设置resolution = 1000
来降低这些links的分辨率。这主要用于在browser tracks时绘制links,因为在某些情况下,许多峰的附近都链接到同一基因,这可能很难可视化。
> p2g <- getPeak2GeneLinks(
ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1000,
returnLoops = TRUE
)
> p2g[[1]]
GRanges object with 37320 ranges and 2 metadata columns:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 752500-901500 * | 0.45482431317995
[2] chr1 752500-948500 * | 0.452241013027625
[3] chr1 762500-856500 * | 0.510133486947831
[4] chr1 762500-901500 * | 0.465953111794918
[5] chr1 762500-999500 * | 0.622693553092798
... ... ... ... . ...
[37316] chrX 153881500-153980500 * | 0.686663730276686
[37317] chrX 153941500-153979500 * | 0.461993816785105
[37318] chrX 153946500-153991500 * | 0.493712263022568
[37319] chrX 153980500-153991500 * | 0.54404072487661
[37320] chrX 153991500-154012500 * | 0.454090683118072
FDR
<numeric>
[1] 3.64903693825748e-25
[2] 7.3782745193054e-25
[3] 2.15908329887624e-32
[4] 1.63066517085523e-26
[5] 4.50860678157444e-52
... ...
[37316] 2.08980297150101e-67
[37317] 4.99062691501854e-26
[37318] 4.19132973592781e-30
[37319] 1.54059545203608e-37
[37320] 4.45896733710535e-25
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
继续降低分辨率,甚至可能会减少总的peak-to-gene links鉴定数量。
> p2g <- getPeak2GeneLinks(
ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 10000,
returnLoops = TRUE
)
> p2g[[1]]
GRanges object with 30123 ranges and 2 metadata columns:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 755000-905000 * | 0.45482431317995
[2] chr1 755000-945000 * | 0.452241013027625
[3] chr1 765000-855000 * | 0.510133486947831
[4] chr1 765000-905000 * | 0.465953111794918
[5] chr1 765000-995000 * | 0.622693553092798
... ... ... ... . ...
[30119] chrX 153885000-153985000 * | 0.686663730276686
[30120] chrX 153945000-153975000 * | 0.461993816785105
[30121] chrX 153945000-153995000 * | 0.493712263022568
[30122] chrX 153985000-153995000 * | 0.54404072487661
[30123] chrX 153995000-154015000 * | 0.454090683118072
FDR
<numeric>
[1] 3.64903693825748e-25
[2] 7.3782745193054e-25
[3] 2.15908329887624e-32
[4] 1.63066517085523e-26
[5] 4.50860678157444e-52
... ...
[30119] 2.08980297150101e-67
[30120] 4.99062691501854e-26
[30121] 4.19132973592781e-30
[30122] 1.54059545203608e-37
[30123] 4.45896733710535e-25
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
(1)用browser tracks绘制peak-to-gene links
为了将这些peak-to-gene链接绘制成browser track,我们使用了与前一节中显示的可接近性相同的分析流程。这里我们使用plotBrowserTrack()
函数:
> markerGenes <- c(
"CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", #B-Cell Trajectory
"CD14", #Monocytes
"CD3D", "CD8A", "TBX21", "IL7R" #TCells
)
> p <- plotBrowserTrack(
ArchRProj = projHeme5,
groupBy = "Clusters2",
geneSymbol = markerGenes,
upstream = 50000,
downstream = 50000,
loops = getPeak2GeneLinks(projHeme5)
)
画某一个基因的peak-to-gene links:
> grid::grid.newpage()
> grid::grid.draw(p$CD14)
保存图片:
> plotPDF(plotList = p,
name = "Plot-Tracks-Marker-Genes-with-Peak2GeneLinks.pdf",
ArchRProj = projHeme5,
addDOC = FALSE, width = 5, height = 5)
(2)peak-to-gene links分析的热图
为了可视化所有peak-to-gene links的对应关系,我们可以绘制一个peak-to-gene热图,其中包含两个并排的热图,一个用于我们的scATAC-seq数据,另一个用于我们的scRNA-seq数据。为此,我们使用plotPeak2GeneHeatmap()
:
> p <- plotPeak2GeneHeatmap(ArchRProj = projHeme5, groupBy = "Clusters2")
> p
根据传递给参数“k”的值,使用k-means聚类对heatmap行进行聚类,参数“k”默认为25,如下图所示:
(四)鉴定正向调控分子
ATAC-seq允许对在含有DNA结合motif的位点上染色质可接近性发生巨大变化的TFs进行无偏差的鉴定。然而,当通过位置权重矩阵(PWMs)进行聚合时,TFs家族(例如GATA因子)在其结合motif上具有相似的特征。
这种motif相似性使得识别可能在预测结合位点上引起染色质可接近性变化的特定TFs具有挑战性。为了规避这一难题,我们之前用ATAC-seq和RNA-seq来鉴定TFs的基因表达与其相应motif可接近性的改变呈正相关。我们称这些TFs为“正向调控分子”。然而,这种分析依赖于匹配的基因表达数据,这可能在所有的实验中并不容易得到。为了克服这种依赖性,ArchR可以识别出推断的基因评分与其chromVAR TF偏差z-score相关的TFs。为了实现这一点,ArchR将低重叠细胞聚集的TF motif的染色质偏差z分数与TF基因活性分数关联起来。当使用与ArchR结合的scRNA-seq整合时,可以使用TF的基因表达,而不是推断的基因活性评分。
(1)鉴定偏移的TF motif
识别正向调控的TF的第一部分是识别偏移的TF motif。我们在前一章中进行了此分析,为所有的motifs创建了一个chromVAR偏差和偏差z-score的MotifMatrix。我们可以通过使用getGroupSE()
函数来获得这些按cluster平均的数据,该函数返回一个SummarizedExperiment:
> seGroupMotif <- getGroupSE(ArchRProj = projHeme5, useMatrix = "MotifMatrix", groupBy = "Clusters2")
因为这个SummarizedExperiment
对象来自MotifMatrix
,所以有两个seqname -“deviations”和“z”-对应于原始偏离和来自chromVAR的偏离z-score。
> seGroupMotif
class: SummarizedExperiment
dim: 1740 11
metadata(0):
assays(1): MotifMatrix
rownames(1740): f1 f2 ... f1739 f1740
rowData names(3): seqnames idx name
colnames(11): B CD4.M ... PreB Progenitor
colData names(20): TSSEnrichment ReadsInTSS ... FRIP nCells
我们可以把这个SummarizedExperiment取子集化,只取偏差z-score:
> seZ <- seGroupMotif[rowData(seGroupMotif)$seqnames=="z",]
然后我们可以确定所有cluster之间z-score的最大delta。这将有助于分层motifs的(基于不同cluster之间的偏差程度):
> rowData(seZ)$maxDelta <- lapply(seq_len(ncol(seZ)), function(x){
rowMaxs(assay(seZ) - assay(seZ)[,x])
}) %>% Reduce("cbind", .) %>% rowMaxs
(2)鉴定TF motifs和TF基因分数/表达的相关性
为了识别TFs motif可接近性与其自身基因活性(通过基因得分或基因表达)的相关性,我们使用correlateMatrices()
函数并提供我们感兴趣的两个矩阵,在本例中为GeneScoreMatrix
和MotifMatrix
。如前所述,这些相关性是在reducedDims
参数中指定的低维空间中确定的许多低重叠的细胞集合中确定的:
> corGSM_MM <- correlateMatrices(
ArchRProj = projHeme5,
useMatrix1 = "GeneScoreMatrix",
useMatrix2 = "MotifMatrix",
reducedDims = "IterativeLSI"
)
此函数返回一个DataFrame对象,该对象包含来自每个矩阵的元素以及跨低重叠细胞聚合的相关性:
> corGSM_MM
DataFrame with 825 rows and 14 columns
GeneScoreMatrix_name MotifMatrix_name cor
<array> <array> <numeric>
1 HES4 HES4_95 0.0964222952027804
2 HES5 HES5_98 0.389610876712439
3 PRDM16 PRDM16_211 0.510854729692758
4 TP73 TP73_705 0.528529747990005
5 TP73-AS1 TP73_705 -0.122421749585615
... ... ... ...
821 TFDP3 TFDP3_309 -0.114905134028301
822 ZNF75D ZNF75D_272 -0.0780911768193981
823 ZIC3 ZIC3_215 0.319287832283537
824 SOX3 SOX3_759 -0.0830413458998048
825 MECP2 MECP2_645 0.143645914605916
padj pval GeneScoreMatrix_seqnames
<numeric> <numeric> <Rle>
1 1 0.0330289085440269 chr1
2 2.91738527794111e-16 3.56648566985466e-19 chr1
3 6.00864881004139e-31 7.34553644259338e-34 chr1
4 1.26528717256258e-33 1.54680583442858e-36 chr1
5 1 0.0067196460862283 chr1
... ... ... ...
821 1 0.0109948840206202 chrX
822 1 0.0845143633274378 chrX
823 3.87543709985036e-10 4.73769816607624e-13 chrX
824 1 0.0665364665390267 chrX
825 1 0.00144806861367396 chrX
GeneScoreMatrix_start GeneScoreMatrix_end GeneScoreMatrix_strand
<array> <array> <array>
1 935552 934342 2
2 2461684 2460184 2
3 2985742 3355185 1
4 3569129 3652765 1
5 3663937 3652548 2
... ... ... ...
821 132352376 132350697 2
822 134429965 134419723 2
823 136648346 136654259 1
824 139587225 139585152 2
825 153363188 153287264 2
GeneScoreMatrix_idx GeneScoreMatrix_matchName MotifMatrix_seqnames
<array> <array> <Rle>
1 15 HES4 z
2 74 HES5 z
3 82 PRDM16 z
4 89 TP73 z
5 90 TP73 z
... ... ... ...
821 697 TFDP3 z
822 728 ZNF75D z
823 753 ZIC3 z
824 765 SOX3 z
825 874 MECP2 z
MotifMatrix_idx MotifMatrix_matchName
<array> <array>
1 95 HES4
2 98 HES5
3 211 PRDM16
4 705 TP73
5 705 TP73
... ... ...
821 309 TFDP3
822 272 ZNF75D
823 215 ZIC3
824 759 SOX3
825 645 MECP2
我们可以使用GeneIntegrationMatrix
代替GeneScoreMatrix
进行同样的分析:
> corGIM_MM <- correlateMatrices(
ArchRProj = projHeme5,
useMatrix1 = "GeneIntegrationMatrix",
useMatrix2 = "MotifMatrix",
reducedDims = "IterativeLSI"
)
> corGIM_MM
DataFrame with 798 rows and 14 columns
GeneIntegrationMatrix_name MotifMatrix_name cor
<array> <array> <numeric>
1 HES4 HES4_95 -0.171384934997311
2 HES5 HES5_98 -0.169840367432083
3 PRDM16 PRDM16_211 0.252222396399267
4 TP73 TP73_705 -0.254496075318124
5 HES2 HES2_19 -0.320907300946354
... ... ... ...
794 TFDP3 TFDP3_309 NA
795 ZNF75D ZNF75D_272 0.281512276845826
796 ZIC3 ZIC3_215 NA
797 SOX3 SOX3_759 NA
798 MECP2 MECP2_645 0.189446439244001
padj pval GeneIntegrationMatrix_seqnames
<numeric> <numeric> <Rle>
1 0.0745474147720114 0.000139863817583511 chr1
2 0.0857722946054064 0.000160923629653671 chr1
3 8.31213157194671e-06 1.55949935683803e-08 chr1
4 6.10714657063681e-06 1.14580611081366e-08 chr1
5 1.89680123127737e-10 3.55872651271552e-13 chr1
... ... ... ...
794 NA NA chrX
795 1.24170853148747e-07 2.32965953374759e-10 chrX
796 NA NA chrX
797 NA NA chrX
798 0.0132063045846304 2.47773069130027e-05 chrX
GeneIntegrationMatrix_start GeneIntegrationMatrix_end
<array> <array>
1 935552 934342
2 2461684 2460184
3 2985742 3355185
4 3569129 3652765
5 6484730 6472498
... ... ...
794 132352376 132350697
795 134429965 134419723
796 136648346 136654259
797 139587225 139585152
798 153363188 153287264
GeneIntegrationMatrix_strand GeneIntegrationMatrix_idx
<array> <array>
1 2 8
2 2 53
3 1 59
4 1 64
5 2 81
... ... ...
794 2 562
795 2 576
796 1 595
797 2 602
798 2 680
GeneIntegrationMatrix_matchName MotifMatrix_seqnames MotifMatrix_idx
<array> <Rle> <array>
1 HES4 z 95
2 HES5 z 98
3 PRDM16 z 211
4 TP73 z 705
5 HES2 z 19
... ... ... ...
794 TFDP3 z 309
795 ZNF75D z 272
796 ZIC3 z 215
797 SOX3 z 759
798 MECP2 z 645
MotifMatrix_matchName
<array>
1 HES4
2 HES5
3 PRDM16
4 TP73
5 HES2
... ...
794 TFDP3
795 ZNF75D
796 ZIC3
797 SOX3
798 MECP2
(3)添加Maximum Delta Deviation到相关性Dataframe
对于每一个相关分析,我们可以用我们在步骤1中计算出的cluster间观察到的最大delta注释每个motif。
> corGSM_MM$maxDelta <- rowData(seZ)[match(corGSM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
> corGIM_MM$maxDelta <- rowData(seZ)[match(corGIM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
(4)鉴定正向调控TF
我们可以利用所有这些信息来鉴定正向调控的TF。在下面的例子中,我们认为motif与基因评分(或基因表达)的相关性大于0.5,且调整后的p值小于0.01,且偏差z-score的聚类间最大差异位于前四分位数的TFs为正调控因子。
我们应用这些标准来进行选择,并做一些文本处理来提取TF名称:
> corGSM_MM <- corGSM_MM[order(abs(corGSM_MM$cor), decreasing = TRUE), ]
> corGSM_MM <- corGSM_MM[which(!duplicated(gsub("\\-.*","",corGSM_MM[,"MotifMatrix_name"]))), ]
> corGSM_MM$TFRegulator <- "NO"
> corGSM_MM$TFRegulator[which(corGSM_MM$cor > 0.5 & corGSM_MM$padj < 0.01 & corGSM_MM$maxDelta > quantile(corGSM_MM$maxDelta, 0.75))] <- "YES"
> sort(corGSM_MM[corGSM_MM$TFRegulator=="YES",1])
[1] "ATOH1" "BCL11A" "CEBPA-DT" "CEBPB" "CEBPD" "CREB1"
[7] "CREB3L4" "EBF1" "EGR2" "EOMES" "ERF" "ESR1"
[13] "ETS1" "FUBP1" "GABPA" "GATA1" "GATA2" "GATA5"
[19] "GATA6" "IRF1" "JDP2" "KLF11" "KLF2" "LYL1"
[25] "MECOM" "MITF" "MYOD1" "NFE2" "NFIA" "NFIC"
[31] "NFIX" "PAX5" "POU2F1" "PRDM16" "RUNX2" "SIX5"
[37] "SMAD1" "SP4" "SPI1" "SPIB" "TAL1" "TCF15"
[43] "TCF23" "TFAP2C" "TWIST1" "YY1"
从基因分数和motif偏差z-score中识别出这些正向调控的TF,我们可以在点图中突出它们。
> p <- ggplot(data.frame(corGSM_MM), aes(cor, maxDelta, color = TFRegulator)) +
geom_point() +
theme_ArchR() +
geom_vline(xintercept = 0, lty = "dashed") +
scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
xlab("Correlation To Gene Score") +
ylab("Max TF Motif Delta") +
scale_y_continuous(
expand = c(0,0),
limits = c(0, max(corGSM_MM$maxDelta)*1.05)
)
> p
我们可以对从GeneIntegrationMatrix
中得到的相关进行同样的分析:
> corGIM_MM <- corGIM_MM[order(abs(corGIM_MM$cor), decreasing = TRUE), ]
> corGIM_MM <- corGIM_MM[which(!duplicated(gsub("\\-.*","",corGIM_MM[,"MotifMatrix_name"]))), ]
> corGIM_MM$TFRegulator <- "NO"
> corGIM_MM$TFRegulator[which(corGIM_MM$cor > 0.5 & corGIM_MM$padj < 0.01 & corGIM_MM$maxDelta > quantile(corGIM_MM$maxDelta, 0.75))] <- "YES"
> sort(corGIM_MM[corGIM_MM$TFRegulator=="YES",1])
[1] "BACH1" "CEBPA" "CEBPB" "CEBPD" "CTCF" "EBF1" "EOMES"
[8] "ETS1" "FOS" "FOSB" "FOSL1" "FOSL2" "GATA1" "GATA2"
[15] "IRF1" "IRF2" "IRF9" "JDP2" "KLF2" "MITF" "NFE2"
[22] "NFIA" "NFIB" "NFIC" "NFIX" "NFKB2" "NR4A1" "NRF1"
[29] "PAX5" "RARA" "RUNX1" "SP4" "SPI1" "STAT2" "TCF3"
[36] "TCF4" "TGIF2"
> p <- ggplot(data.frame(corGIM_MM), aes(cor, maxDelta, color = TFRegulator)) +
geom_point() +
theme_ArchR() +
geom_vline(xintercept = 0, lty = "dashed") +
scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
xlab("Correlation To Gene Expression") +
ylab("Max TF Motif Delta") +
scale_y_continuous(
expand = c(0,0),
limits = c(0, max(corGIM_MM$maxDelta)*1.05)
)
> p