The material covered this time gives the calculation formulas in detail, along with R code that verifies them by simulation.
When I process NGS data myself and write my own scripts to remove PCR duplicates from the sequencing data, I usually go by read coordinates. For single-end reads, I look at where the 5' end of each read maps on the reference: of the set of reads mapping to exactly the same position, one read is kept at random and the rest are discarded. For paired-end reads, I check whether the two 5' coordinates of a mapped read pair are identical; of the read pairs whose coordinates match exactly, one pair is kept at random and the others are thrown away.
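As a minimal sketch of this coordinate-based deduplication (assuming the reads have already been parsed into a data.table; the column names pos_fp1, pos_fp2, strand and all values below are made up for illustration):

library(data.table)
# toy example of mapped reads; names and values are hypothetical
reads<- data.table(pos_fp1= c(100, 100, 100, 250),
                   pos_fp2= c(180, 180, 195, 330),
                   strand = c('+', '+', '+', '-'))
# single end: keep one read per identical 5' coordinate (and strand);
# unique() keeps the first row of each group, a random pick would need an extra step
dedup_se<- unique(reads, by= c('pos_fp1', 'strand'))
# paired end: keep one pair per identical pair of 5' coordinates (and strand)
dedup_pe<- unique(reads, by= c('pos_fp1', 'pos_fp2', 'strand'))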
The material introduced below uses a similar idea: it computes the duplication rate directly from the distribution of fragment coordinates on the genome.
Original post: Expected number of read duplicates
Let's look at the formulas first.
Let G be the genome size and N the total number of reads.
For single end reads,
Expected number of PCR duplicates = N - (sum(dpois(1:n, N/G/2) * G) * 2)
- **dpois(1:n, N/G/2)** is the probability that a given position on one strand of the genome is covered by exactly 1, 2, ..., n reads; the Poisson rate is N/G/2 because the N reads are spread over the two strands.
- **dpois(1:n, N/G/2) * G** gives, for one strand, how many positions are covered by exactly 1 read, how many by 2 reads, and so on.
- **sum(dpois(1:n, N/G/2) * G)** is the total number of positions on one strand covered by at least one read; multiplying by 2 counts both strands, and subtracting that from N leaves the expected number of duplicate reads.
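For a feel of the numbers, here is a quick worked example with made-up values (a 3.1 Gb genome and 100 million single-end reads, both purely illustrative):

G<- 3.1e9  # illustrative genome size
N<- 1e8    # illustrative total read count
n<- 50     # truncation point for the Poisson sum; large enough here
# expected number of duplicate reads, and the corresponding duplicate rate
N - (sum(dpois(1:n, N/G/2) * G) * 2)
(N - (sum(dpois(1:n, N/G/2) * G) * 2)) / N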
For paired end reads,
If the fragments obtained after shearing the genomic DNA have a minimum length of a and a maximum length of b, and the fragment lengths are uniformly distributed over [a, b], then to account for fragments of different sizes we need to "extend" the genome:
G2<- G + G * (b - a)
Expected number of PCR duplicates = N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2)
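Reusing the illustrative G, N and n from the single-end example above, and assuming fragment sizes uniform between, say, 200 and 500 bp (again made-up numbers):

a<- 200; b<- 500        # illustrative min/max fragment length
G2<- G + G * (b - a)    # "extended" genome that accounts for fragment sizes
# expected number of duplicate read pairs
N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2)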
What if the fragment lengths instead follow something like a Poisson distribution? Below, the duplicate count is estimated by simulation. The code simulates the uniform case so that the result can be checked against the formulas above; the commented-out rpois() line marks where a Poisson fragment-length model would plug in.
library(data.table)
G<- 10000 #genome size
N<- 20000 #total read count
a<- 1 # Min fragment length
b<- 10 # Max fragment length
L<- 15 # Mean fragment length (lambda), used by the Poisson alternative (rpois) below
n<- 10 # A number sufficiently large (ideally infinite)
set.seed(1)
# generate 5' coordinate of one mapped read
pos_fp1<- sample(1:G, size= N, replace= TRUE)
set.seed(2)
# generate 5' coordinate of the other read of the same fragment,
# drawing fragment lengths uniformly from [a, b]
pos_fp2<- pos_fp1 + sample(a:b, size= N, replace= TRUE)
# Poisson alternative: pos_fp2<- pos_fp1 + rpois(n= N, lambda= L)
# clip coordinates that fall beyond the end of the genome
pos_fp2<- ifelse(pos_fp2 > G, G, pos_fp2)
set.seed(3)
# generate '+' or '-' strand for each read
strand<- sample(c('-', '+'), size= N, replace= TRUE)
# make "reads", from small coord to large
reads<- data.table(pos_fp1, pos_fp2, strand)[order(pos_fp1, pos_fp2),]
## Observed number of reads for SE:
## ================================
# for each 5' coordinate, count the number of reads as 'depth',
# make "depth_se", from small depth to large
depth_se<- reads[, list(depth= .N), by= list(pos_fp1, strand)][order(depth)]
# tabulate how many 5' positions are observed at each depth
dups<- table(depth_se$depth)
dups<- data.table(depth= as.numeric(names(dups)), n_pos= as.vector(dups))
# total reads at each depth, and duplicates (every read beyond the first at a position)
dups$nreads<- dups$depth * dups$n_pos
dups$ndups<- (dups$depth - 1) * dups$n_pos
# observed number of duplicate reads
sum(dups$ndups)
7373
## Analytical way SE
## =================
N - (sum(dpois(1:n, N/G/2) * G) * 2)
7357.589
## Observed number of reads for PE:
## ================================
# for each pair of 5' coordinates, count the number of read pairs as 'depth',
depth_pe<- reads[, list(depth= .N), by= list(pos_fp1, pos_fp2, strand)][order(depth)]
# observed duplicates: total pairs minus distinct (pos_fp1, pos_fp2, strand) combinations
N - nrow(depth_pe)
972
## Analytical way PE
## =================
G2<- G + G * (b - a)
N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2)
967.4836
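To come back to the Poisson-like fragment lengths raised above: the analytical formulas assume a uniform length distribution, so in that case the simulated count itself is the estimate. A sketch of rerunning the simulation with the rpois() alternative, reusing G, N and L defined above (the names reads_pois and depth_pe_pois are just new variables for this run):

set.seed(1)
pos_fp1<- sample(1:G, size= N, replace= TRUE)
set.seed(2)
# fragment lengths drawn from a Poisson distribution with mean L
pos_fp2<- pos_fp1 + rpois(n= N, lambda= L)
pos_fp2<- ifelse(pos_fp2 > G, G, pos_fp2)
set.seed(3)
strand<- sample(c('-', '+'), size= N, replace= TRUE)
reads_pois<- data.table(pos_fp1, pos_fp2, strand)
# observed number of duplicate pairs under Poisson fragment lengths
depth_pe_pois<- reads_pois[, list(depth= .N), by= list(pos_fp1, pos_fp2, strand)]
N - nrow(depth_pe_pois)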