In previous posts we covered a lot of basic concepts and analysis algorithms, but as you may have noticed, everything so far was restricted to TCRαβ sequences, which is a real limitation. Today we broaden the scope a little: we refine the TCR distance analysis, extend it to γδ TCRs, and enable at-scale computation with sparse data representations and parallelized, byte-compiled code. The reference is "TCR meta-clonotypes for biomarker discovery with tcrdist3: identification of public, HLA restricted SARS-CoV-2 associated TCR features". This paper bridges what came before and what comes next, and there really is a lot of material in this topic.
First, let's look at the analysis framework.
1. Experimental antigen enrichment can discover TCRs with biochemically similar neighbors
Searching for identical TCRs within a repertoire - arising either from clonal expansion or convergent nucleotide encoding of amino acids in the CDR3 - is a common strategy for identifying functionally important receptors (and it is essentially the only practical one). However, in the absence of an experimental enrichment procedure, observing T cells with identical amino-acid TCR sequences across bulk samples is rare. For example, among 10,000 β-chain TCRs from an umbilical cord blood sample, fewer than 1% of TCR amino-acid sequences were observed more than once, including likely clonal expansions (disease really does drive antigen-specific TCR expansion, which is the core of this line of research).
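To make the rarity point concrete, here is a toy sketch (hypothetical sequences, not from the paper) of how you might count repeated CDR3β amino-acid sequences with pandas:

```python
import pandas as pd

# Toy repertoire: CDR3 beta amino-acid sequences (hypothetical values)
cdr3s = pd.Series([
    "CASSLAPGATNEKLFF", "CASSLAPGATNEKLFF",   # repeated: expansion or convergence
    "CASSIRSSYEQYF", "CASSPGQGDNEQFF", "CASSYSTGDEQYF",
])

counts = cdr3s.value_counts()
frac_repeated = (counts > 1).mean()   # fraction of unique sequences seen >1 time
print(frac_repeated)                  # 1 of 4 unique sequences -> 0.25
```

In a real unenriched repertoire this fraction is tiny, which is exactly why distance-based neighbor searches are needed.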
Figure: TCR repertoire subsets obtained by single-cell sorting with peptide-MHC tetramers.
2. TCR biochemical neighborhood density is heterogeneous in antigen-enriched repertoires
We next investigated the proportion of unique TCRs with at least one biochemically similar neighbor among TCRs with the same putative antigen specificity. We and others have shown that a single peptide-MHC epitope is often recognized by many distinct TCRs with closely related amino acid sequences (TCRs recognizing a given antigen are diverse: a many-to-one relationship, which complicates things). This is where we must look for similarity between sequences (the TCR distance introduced earlier) to find what they share. We observed the highest-density neighborhoods within repertoires that were sorted based on peptide-MHC tetramer binding (so the effect of antigen enrichment is clear). These observations suggest that biochemical neighborhood density is highly heterogeneous among TCRs, and that it may depend on mechanisms of antigen recognition as well as receptor V(D)J recombination biases (which makes this hard to study).
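As a sketch of what "neighborhood density" means computationally: given any pairwise distance matrix (toy values below, not real tcrdist output), the fraction of sequences with at least one neighbor within a radius can be computed like this:

```python
import numpy as np

def neighbor_fraction(dmat, radius):
    """Fraction of sequences with at least one neighbor
    (excluding self) within `radius` distance units."""
    d = np.asarray(dmat, dtype=float).copy()
    np.fill_diagonal(d, np.inf)                    # ignore self-distances
    return float((d.min(axis=1) <= radius).mean())

# Toy symmetric "tcrdist-like" matrix for 4 TCRs (made-up values)
dmat = np.array([[ 0, 10, 50, 60],
                 [10,  0, 55, 65],
                 [50, 55,  0, 70],
                 [60, 65, 70,  0]])
print(neighbor_fraction(dmat, radius=12))  # only the first pair are neighbors -> 0.5
```

Running this at several radii on enriched versus unenriched repertoires is the simplest way to see the heterogeneity the paper describes.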
3. Meta-clonotype radius can be tuned to balance a biomarker's sensitivity and specificity
The utility of a TCR-based biomarker depends on the antigen specificity of the TCRs. A key constraint on distance-based clustering is the presence of similar TCR sequences that may lack the ability to recognize the target antigen (in plain terms, we need to define a similarity radius). To be useful, a meta-clonotype definition should be broad enough to capture multiple biochemically similar TCRs with shared antigen recognition, but not so broad as to include a high proportion of non-specific TCRs, which might be found in unenriched background repertoires that are largely antigen-naïve (the radius has to be just right). The difficulty is that the similarity density of TCR "neighbors" is heterogeneous.
An ideal radius-defined meta-clonotype would include a high density of TCRs in antigen-experienced individuals, indicative of shared antigen specificity, yet a low density of TCRs among an antigen-naïve background. The next step is to search for antigen-specific TCR sequences. Let's look at the analysis code (tcrdist3).
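The sensitivity/specificity trade-off can be sketched with toy numbers: count, for several candidate radii, how many TCRs fall within the radius in an antigen-enriched set versus an unenriched background (all distances below are made up for illustration):

```python
import numpy as np

# Distances from one candidate "centroid" TCR to an antigen-enriched set
# and to an unenriched background repertoire (all values hypothetical)
d_enriched   = np.array([0, 8, 14, 22, 30, 75, 90])
d_background = np.array([5, 40, 55, 60, 80, 95, 120, 150])

# A wider radius captures more putatively specific TCRs (sensitivity)
# but also starts pulling in background TCRs (loss of specificity)
for radius in (12, 24, 36, 48):
    hits_target = int((d_enriched <= radius).sum())
    hits_bkgd   = int((d_background <= radius).sum())
    print(f"radius={radius}: target hits={hits_target}, background hits={hits_bkgd}")
```

In tcrdist3 the background hit counts come from large synthetic or unenriched repertoires rather than a handful of numbers, but the radius-tuning logic is the same.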
Part 1 of the code: TCRdist
First, the input data format; it is very similar to what we get out of a 10X analysis.
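As a hedged sketch of that format: the column names below follow the tcrdist3 paired-chain convention (CDR3 amino acids plus V/J gene calls per chain, and a count column); the values themselves are invented for illustration:

```python
import pandas as pd

# Hypothetical rows in the paired alpha/beta layout tcrdist3 expects;
# column names follow the tcrdist3 convention, values are invented
cell_df = pd.DataFrame({
    "subject":   ["mouse_1", "mouse_1"],
    "epitope":   ["PA", "PA"],
    "count":     [2, 1],
    "v_a_gene":  ["TRAV7-3*01", "TRAV6-4*01"],
    "j_a_gene":  ["TRAJ33*01", "TRAJ34*02"],
    "cdr3_a_aa": ["CAVSLDSNYQLIW", "CALGSNTNKVVF"],
    "v_b_gene":  ["TRBV13-1*01", "TRBV29*01"],
    "j_b_gene":  ["TRBJ2-3*01", "TRBJ1-1*01"],
    "cdr3_b_aa": ["CASSDFDWGGDAETLYF", "CASSPDRGEVFF"],
})
print(list(cell_df.columns))
```

A DataFrame shaped like this is what gets passed as `cell_df` in the examples below.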
Now let's write the code together.
Default parameters
"""
If you just want a 'tcrdistances' using pre-set default setting.
You can access distance matrices:
tr.pw_alpha - alpha chain pairwise distance matrix
tr.pw_beta - alpha chain pairwise distance matrix
tr.pw_cdr3_a_aa - cdr3 alpha chain distance matrix
tr.pw_cdr3_b_aa - cdr3 beta chain distance matrix
"""
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            db_file = 'alphabeta_gammadelta_db.tsv')
tr.pw_alpha
tr.pw_beta
tr.pw_cdr3_a_aa
tr.pw_cdr3_b_aa
Changing a single default parameter
"""
If you want 'tcrdistances' with changes over some parameters.
For instance you want to change the gap penalty on CDR3s to 5.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')
tr.kargs_a['cdr3_a_aa']['gap_penalty'] = 5
tr.kargs_b['cdr3_b_aa']['gap_penalty'] = 5
tr.compute_distances()
tr.pw_alpha
tr.pw_beta
Full manual control over the distance computation (this demands a bit more coding skill)
"""
If want a 'tcrdistances' AND you want control over EVERY parameter.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')
metrics_a = {
    "cdr3_a_aa" : pw.metrics.nb_vector_tcrdist,
    "pmhc_a_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr2_a_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr1_a_aa" : pw.metrics.nb_vector_tcrdist }
metrics_b = {
    "cdr3_b_aa" : pw.metrics.nb_vector_tcrdist,
    "pmhc_b_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr2_b_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr1_b_aa" : pw.metrics.nb_vector_tcrdist }
weights_a = {
    "cdr3_a_aa" : 3,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1 }
weights_b = {
    "cdr3_b_aa" : 3,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1 }
kargs_a = {
    'cdr3_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty': 4,
        'ntrim': 3,
        'ctrim': 2,
        'fixed_gappos': False},
    'pmhc_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty': 4,
        'ntrim': 0,
        'ctrim': 0,
        'fixed_gappos': True},
    'cdr2_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty': 4,
        'ntrim': 0,
        'ctrim': 0,
        'fixed_gappos': True},
    'cdr1_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty': 4,
        'ntrim': 0,
        'ctrim': 0,
        'fixed_gappos': True}
}
kargs_b = {
    'cdr3_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty': 4,
        'ntrim': 3,
        'ctrim': 2,
        'fixed_gappos': False},
    'pmhc_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty': 4,
        'ntrim': 0,
        'ctrim': 0,
        'fixed_gappos': True},
    'cdr2_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty': 4,
        'ntrim': 0,
        'ctrim': 0,
        'fixed_gappos': True},
    'cdr1_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty': 4,
        'ntrim': 0,
        'ctrim': 0,
        'fixed_gappos': True}
}
tr.metrics_a = metrics_a
tr.metrics_b = metrics_b
tr.weights_a = weights_a
tr.weights_b = weights_b
tr.kargs_a = kargs_a
tr.kargs_b = kargs_b
tr.compute_distances()  # distances were deferred above, so trigger them now
A metric that only counts mismatches
"""
If you want "tcrdistances" using a different metric.
Here we illustrate the use a metric that uses the
Needleman-Wunsch algorithm to align sequences and then
calculate the number of mismatching positions (pw.metrics.nw_hamming_metric)
This method doesn't rely on Numba so it can run faster using multiple cpus.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
import multiprocessing
df = pd.read_csv("dash.csv")
df = df.head(100) # for faster testing
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            use_defaults = False,
            compute_distances = False,
            cpus = 1,
            db_file = 'alphabeta_gammadelta_db.tsv')
metrics_a = {
    "cdr3_a_aa" : pw.metrics.nw_hamming_metric,
    "pmhc_a_aa" : pw.metrics.nw_hamming_metric,
    "cdr2_a_aa" : pw.metrics.nw_hamming_metric,
    "cdr1_a_aa" : pw.metrics.nw_hamming_metric }
metrics_b = {
    "cdr3_b_aa" : pw.metrics.nw_hamming_metric,
    "pmhc_b_aa" : pw.metrics.nw_hamming_metric,
    "cdr2_b_aa" : pw.metrics.nw_hamming_metric,
    "cdr1_b_aa" : pw.metrics.nw_hamming_metric }
weights_a = {
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1 }
weights_b = {
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1 }
kargs_a = {
    'cdr3_a_aa' : {'use_numba': False},
    'pmhc_a_aa' : {'use_numba': False},
    'cdr2_a_aa' : {'use_numba': False},
    'cdr1_a_aa' : {'use_numba': False}
}
kargs_b = {
    'cdr3_b_aa' : {'use_numba': False},
    'pmhc_b_aa' : {'use_numba': False},
    'cdr2_b_aa' : {'use_numba': False},
    'cdr1_b_aa' : {'use_numba': False}
}
tr.metrics_a = metrics_a
tr.metrics_b = metrics_b
tr.weights_a = weights_a
tr.weights_b = weights_b
tr.kargs_a = kargs_a
tr.kargs_b = kargs_b
tr.compute_distances()
tr.pw_cdr3_b_aa
tr.pw_beta
A custom distance metric
"""
If you want a tcrdistance, but you want to use your own metric.
(A valid metric takes two strings and returns a numerical distance).
def my_own_metric(s1,s2):
return Levenshtein.distance(s1,s2)
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
import multiprocessing
import Levenshtein

def my_own_metric(s1, s2):
    # The custom metric from the docstring, defined here so the code actually runs
    return Levenshtein.distance(s1, s2)

df = pd.read_csv("dash.csv")
df = df.head(100)  # for faster testing
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            use_defaults = False,
            compute_distances = False,
            cpus = 1,
            db_file = 'alphabeta_gammadelta_db.tsv')
metrics_a = {
    "cdr3_a_aa" : my_own_metric,
    "pmhc_a_aa" : my_own_metric,
    "cdr2_a_aa" : my_own_metric,
    "cdr1_a_aa" : my_own_metric }
metrics_b = {
    "cdr3_b_aa" : my_own_metric,
    "pmhc_b_aa" : my_own_metric,
    "cdr2_b_aa" : my_own_metric,
    "cdr1_b_aa" : my_own_metric }
weights_a = {
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1 }
weights_b = {
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1 }
kargs_a = {
    'cdr3_a_aa' : {'use_numba': False},
    'pmhc_a_aa' : {'use_numba': False},
    'cdr2_a_aa' : {'use_numba': False},
    'cdr1_a_aa' : {'use_numba': False}
}
kargs_b = {
    'cdr3_b_aa' : {'use_numba': False},
    'pmhc_b_aa' : {'use_numba': False},
    'cdr2_b_aa' : {'use_numba': False},
    'cdr1_b_aa' : {'use_numba': False}
}
tr.metrics_a = metrics_a
tr.metrics_b = metrics_b
tr.weights_a = weights_a
tr.weights_b = weights_b
tr.kargs_a = kargs_a
tr.kargs_b = kargs_b
tr.compute_distances()
tr.pw_cdr3_b_aa
tr.pw_beta
I want tcrdistances, but I hate OOP
"""
If you don't want to use OOP, but you I still want a multi-CDR
tcrdistances on a single chain, using you own metric
def my_own_metric(s1,s2):
return Levenshtein.distance(s1,s2)
"""
import multiprocessing
import pandas as pd
import Levenshtein
from tcrdist.rep_funcs import _pws, _pw

def my_own_metric(s1, s2):
    # Define the metric from the docstring so the example is runnable
    return Levenshtein.distance(s1, s2)

df = pd.read_csv("dash2.csv")
metrics_b = {
    "cdr3_b_aa" : my_own_metric,
    "pmhc_b_aa" : my_own_metric,
    "cdr2_b_aa" : my_own_metric,
    "cdr1_b_aa" : my_own_metric }
weights_b = {
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1 }
kargs_b = {
    'cdr3_b_aa' : {'use_numba': False},
    'pmhc_b_aa' : {'use_numba': False},
    'cdr2_b_aa' : {'use_numba': False},
    'cdr1_b_aa' : {'use_numba': False}
}
dmats = _pws(df = df,
             metrics = metrics_b,
             weights = weights_b,
             kargs = kargs_b,
             cpu = 1,
             uniquify = True,
             store = True)
print(dmats.keys())
CDR3 only
"""
If you hate object oriented programming, just show me the functions.
No problem.
Maybe you only care about the CDR3 on the beta chain.
def my_own_metric(s1,s2):
return Levenshtein.distance(s1,s2)
"""
import multiprocessing
import pandas as pd
import Levenshtein
from tcrdist.rep_funcs import _pws, _pw

def my_own_metric(s1, s2):
    # Define the metric from the docstring so the example is runnable
    return Levenshtein.distance(s1, s2)

df = pd.read_csv("dash2.csv")
dmat = _pw(metric = my_own_metric,
           seqs1 = df['cdr3_b_aa'].values,
           ncpus = 2,
           uniqify = True,
           use_numba = False)
I want tcrdistances but I want to keep my variable names
"""
You want a 'tcrdistance' but you don't want to bother with the tcrdist3 framework.
Note that the columns names are completely arbitrary under this
framework, so one can directly compute a tcrdist on a
AIRR, MIXCR, VDJTools, or other formated file without any
reformatting.
"""
import multiprocessing
import pandas as pd
import pwseqdist as pw
from tcrdist.rep_funcs import _pws, _pw
df_airr = pd.read_csv("dash_beta_airr.csv")
# Choose the metrics you want to apply to each CDR
metrics = { 'cdr3_aa' : pw.metrics.nb_vector_tcrdist,
            'cdr2_aa' : pw.metrics.nb_vector_tcrdist,
            'cdr1_aa' : pw.metrics.nb_vector_tcrdist }
# Choose the weights that are right for you.
weights = { 'cdr3_aa' : 3,
            'cdr2_aa' : 1,
            'cdr1_aa' : 1 }
# Provide arguments for the distance metrics
kargs = { 'cdr3_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 3, 'ctrim': 2, 'fixed_gappos': False},
          'cdr2_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True},
          'cdr1_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True} }
# Here are your distance matrices
dmats = _pws(df = df_airr,
             metrics = metrics,
             weights = weights,
             kargs = kargs,
             cpu = 1,
             store = True)
dmats['tcrdist']
I want to use TCRrep but I want to keep my variable names
"""
If you already have a clones file and want
to compute 'tcrdistances' on a DataFrame with
custom columns names.
Set:
1. Assign TCRrep.clone_df
2. set infer_cdrs = False,
3. compute_distances = False
4. deduplicate = False
5. customize the keys for metrics, weights, and kargs with the lambda
customize = lambda d : {new_cols[k]:v for k,v in d.items()}
6. call .calculate_distances()
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
new_cols = {'cdr3_a_aa':'c3a', 'pmhc_a_aa':'pa', 'cdr2_a_aa':'c2a','cdr1_a_aa':'c1a',
'cdr3_b_aa':'c3b', 'pmhc_b_aa':'pb', 'cdr2_b_aa':'c2b','cdr1_b_aa':'c1b'}
df = pd.read_csv("dash2.csv").rename(columns = new_cols)
tr = TCRrep(cell_df = df,
            clone_df = df,              #(1)
            organism = 'mouse',
            chains = ['alpha','beta'],
            infer_all_genes = True,
            infer_cdrs = False,         #(2)
            compute_distances = False,  #(3)
            deduplicate = False,        #(4)
            db_file = 'alphabeta_gammadelta_db.tsv')
customize = lambda d : {new_cols[k]:v for k,v in d.items()} #(5)
tr.metrics_a = customize(tr.metrics_a)
tr.metrics_b = customize(tr.metrics_b)
tr.weights_a = customize(tr.weights_a)
tr.weights_b = customize(tr.weights_b)
tr.kargs_a = customize(tr.kargs_a)
tr.kargs_b = customize(tr.kargs_b)
tr.compute_distances() #(6)
# Notice that pairwise results now have custom names
tr.pw_c3b
tr.pw_c3a
tr.pw_alpha
tr.pw_beta
I want distances from 1 TCR to many TCRs
"""
If you just want a 'tcrdistances' of some target seqs against another set.
(1) cell_df is asigned the first 10 cells in dash.csv
(2) compute tcrdistances with default settings.
(3) compute rectangular distance between clone_df and df2.
(4) compute rectangular distance between clone_df and any
arbtirary df3, which need not be associated with the TCRrep object.
(5) compute rectangular distance with only a subset of the TCRrep.clone_df
"""
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
df2 = pd.read_csv("dash2.csv")
df = df.head(10) #(1)
tr = TCRrep(cell_df = df,  #(2)
            df2 = df2,
            organism = 'mouse',
            chains = ['alpha','beta'],
            db_file = 'alphabeta_gammadelta_db.tsv')
assert tr.pw_alpha.shape == (10,10)
assert tr.pw_beta.shape == (10,10)
tr.compute_rect_distances() # (3)
assert tr.rw_alpha.shape == (10,1924)
assert tr.rw_beta.shape == (10,1924)
df3 = df2.head(100)
tr.compute_rect_distances(df = tr.clone_df, df2 = df3) # (4)
assert tr.rw_alpha.shape == (10,100)
assert tr.rw_beta.shape == (10,100)
tr.compute_rect_distances(df = tr.clone_df.iloc[0:2,],  # (5)
                          df2 = df3)
assert tr.rw_alpha.shape == (2,100)
assert tr.rw_beta.shape == (2,100)
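Once you have a rectangular matrix like tr.rw_beta, a common next step is to look up each query's nearest reference clone. A minimal numpy sketch with a toy stand-in matrix (made-up values, same rows-are-queries convention):

```python
import numpy as np

# Toy stand-in for a rectangular matrix like tr.rw_beta:
# rows = query clones, columns = reference clones (made-up values)
rw = np.array([[12, 48, 96],
               [60, 24,  0]])

nearest_idx  = rw.argmin(axis=1)   # column index of the closest reference per query
nearest_dist = rw.min(axis=1)      # and the corresponding distance
print(nearest_idx, nearest_dist)
```

The index array can then be used to pull the matching rows out of the reference clone table.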
The level of customization really is high, and it really is hard.
Life is good, and better with you. In the next post we will continue with the tcrdist3 analysis code.