不能直接输入vcf真的很烦啊,氨基酸序列fasta太长了还会直接Segmentation Fault
过滤获得PASS的vcf
我只用了gatk里的af-onl-gnomAD文件过滤应该还可以annovar基于筛选的注释,然后用vcftools之类的东西筛选一下
vcftools --gzvcf /path/to/sample.filter.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > /path/to/sample_PASS_only.vcf.gz
(应该还可以annovar基于筛选的注释再筛选一下,但是那样就需要把annovar生成的vcf用convert2annovar.pl
转换成avinput格式再用annotate_variation.pl
进行refGene注释再coding_change.pl
转换成氨基酸序列,配对vcf包含正常组织和肿瘤组织至少2个样本,convert2annovar.pl
会让选择全都转换成分开的文件还是只转换第1个,肿瘤是第2个及以后的,所以如果这样的话可能需要删掉正常样本那一列?)
cat input.vcf |awk -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$11}' > output.vcf
annovar转换成氨基酸fasta
annovartable_annovar.pl
注释的时候不要选--remove
,就可以保留中间文件sample.exonic_variant_function
coding_change.pl --includesnp --onlyAltering path/to/sample.exonic_variant_function /path/to/annovar/humandb/hg38_refGene.txt /path/to/annovar/humandb/hg38_refGeneMrna.fa --outfile /path/to/outdir/sample.fasta
转换fasta序列
修改序列ID格式
sed -i 's/>line.*NM_/>NM_/g' /path/to/outdir/*fasta
sed -i 's/ /;/g' /path/to/outdir/*fasta
去掉不需要的序列,此处参考NeoPredPipe脚本vcf_manipulate.py,不直接用这个软件是因为报错太多不想处理了-_- |||
#!/usr/bin/env python
import os
from Bio import SeqIO
# 指定需要处理的文件夹路径
folder_path = '/path/to/outdir/fastaFiles_1/'
id_list = ['WILDTYPE', 'immediate-stopgain', 'stop-loss']
# 遍历文件夹中的所有 fasta 文件
for file_name in os.listdir(folder_path):
file_path = os.path.join(folder_path, file_name)
# 打开文件并读取所有序列
records = list(SeqIO.parse(file_path, 'fasta'))
# 遍历每个序列,如果 ID 不包含上述关键词,将其保留到新列表中
filtered_records = []
for record in records:
if not any(keyword in record.id for keyword in id_list):
filtered_records.append(record)
# 将筛选后的序列写回文件
if len(filtered_records) > 0: # 如果有筛选结果才更新文件
with open(file_path, 'w') as f:
SeqIO.write(filtered_records, f, 'fasta')
节选突变附近的序列,此处参考NeoPredPipe脚本NeoPredPipe.py
#!/usr/bin/env python
import sys
import os
import subprocess
from Bio import SeqIO
folder_path = '/public/home/xieruoqi/data/1.CR/neoantigen/fastaFiles_1/'
def ExtractSeq(record, pos, n, frameshift=False):
'''
Extracts the proper range of amino acids from the sequence and the epitope length
:param seq: Sequence of mutated amino acid sequence.
:param pos: Location of altered amino acid.
:param n: Length of epitope for predictions.
:return: Returns an amino acid sequence of appropriate lengths.
'''
seq = str(record.seq)
if pos<n: # Cases where start is less than n
miniseq = seq[0:pos+(n)]
elif len(seq)-pos < n: # Cases where end is less than n
miniseq = seq[pos - (n - 1):len(seq)]
else:
miniseq = seq[pos-(n-1):pos+(n)]
if frameshift:
if pos<n:
miniseq = seq[0:len(seq)] # When start is not n away from pos, we go until the end of the sequence
else:
miniseq = seq[pos-(n-1):len(seq)] # All other cases, we still go until the end of the sequence
return(miniseq)
epitopeLens = [8,9,10,11]
for file_name in os.listdir(folder_path):
file_path = os.path.join(folder_path, file_name)
eps = {n: 0 for n in epitopeLens}
epsIndels = {n: 0 for n in epitopeLens}
for n in epitopeLens:
mySeqs = []
mySeqsIndels = []
for record in SeqIO.parse(file_path, 'fasta'):
if 'silent' not in record.id.lower() and 'startloss' not in record.id.lower():
try:
pos = int(record.id.replace(";;",";").split(";")[5].split('-')[0])-1
except ValueError:
try:
pos = int(record.id.replace(";;",";").split(";")[6].split('-')[0])-1
except IndexError:
sys.exit("ERROR: Could not process fasta line in reformatted fasta: %s" % (record.id))
if 'dup' in record.id.lower() or 'del' in record.id.lower() or 'ins' in record.id.lower() or 'fs' in record.id:
miniseq = ExtractSeq(record, pos, n, True)
mySeqsIndels.append(">"+record.id[0:100]+"\n"+miniseq+"\n")
else:
miniseq = ExtractSeq(record, pos, n)
mySeqs.append(">"+record.id+"\n"+miniseq+"\n")
eps[n] = mySeqs
epsIndels[n] = mySeqsIndels
tmpFiles = {}
for n in epitopeLens:
tmpFasta = file_path.replace(".fasta",".tmp.%s.fasta"%(n))
tmpFastaIndels = file_path.replace(".fasta",".tmp.%s.Indels.fasta"%(n))
tmpFiles.update({n:tmpFasta})
tmpFiles.update({str(n)+'.Indels':tmpFastaIndels})
with open(tmpFasta, 'w') as outFile:
for line in eps[n]:
outFile.write(line)
with open(tmpFastaIndels, 'w') as outFile:
for line in epsIndels[n]:
outFile.write(line)
HLA分型提取及格式转换
先提取可能包含6号染色体HLA区域reads,不要用肿瘤样本,用正常组织样本
然后选一个HLA分型预测软件,这里用的是OptiType
大多数软件都要求输入fq
samtools view -hb /path/to/sample_blood.final.bam \
chr6:28510120-33480577 \
chr6_GL000250v2_alt:1066038-4433734 \
chr6_GL000251v2_alt:1283988-4540572 \
chr6_GL000252v2_alt:1063230-4372611 \
chr6_GL000253v2_alt:1062914-4548533 \
chr6_GL000254v2_alt:1062887-4416229 \
chr6_GL000255v2_alt:1063190-4323464 \
chr6_GL000256v2_alt:1106450-4577757 \
> /path/to/sample_blood.chr6.bam && \
samtools view -hb -@ 8 -f 12 /path/to/sample_blood.final.bam \
> /path/to/sample_blood.unmapped.bam && \
samtools merge /path/to/sample_blood.merge.bam /path/to/sample_blood.chr6.bam /path/to/sample_blood.unmapped.bam && \
samtools sort -n /path/to/sample_blood.merge.bam -@ 8 -o /path/to/sample_blood.hla.bam && \
samtools fastq -@ 8 /path/to/sample_blood.hla.bam \
-1 /path/to/sample_blood.hla.1.fq \
-2 /path/to/sample_blood.hla.2.fq \
-s /dev/null && \
rm /path/to/sample_blood.chr6.bam /path/to/sample_blood.unmapped.bam /path/to/sample_blood.merge.bam && \
OptiTypePipeline.py -i /path/to/sample_blood.hla.1.fq /path/to/sample_blood.hla.2.fq --dna -v -o /path/to/sample
然后把输出的tsv文件转换为netMHCpan
需要的格式
cat /path/to/sample/*/*.tsv |awk -v OFS=',' 'NR==2 {print $2,$3,$4,$5,$6,$7}' > /path/to/sample/sample.HLA.txt && \
sed -i "s/,/,HLA-/g" /path/to/sample/sample.HLA.txt && \
sed -i "s/*//g" /path/to/sample/sample.HLA.txt && \
sed -i "s/^/HLA-/g" /path/to/sample/sample.HLA.txt
然后就可以顺利运行netMHCpan啦,不过我还是不太理解为什么netMHCpan预测的Affinity对不上BindLevel