Reposted from:
https://biozx.top/gdc.html
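The GDC (https://gdc.cancer.gov/) Application Programming Interface, or API for short, is the programmatic interface that GDC exposes to outside users. It offers many functions, including data query, data submission, file download, metadata, annotations, and BAM slicing. This post mainly covers data query and download.
The GDC website has a detailed user guide (https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#authentication); what follows is my own understanding of it.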
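Parameters
A query request typically includes the following parameters:
filters: restricts the query to matching records;
format: sets the format of the response: JSON, TSV, or XML;
fields: selects which fields (columns) the response contains;
size: sets the maximum number of records returned.
Requests can be sent with either HTTP GET or HTTP POST. GET limits the length of the URL, however, so I generally use POST.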
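To make the GET/POST difference concrete, here is a minimal sketch, assuming the files endpoint at https://api.gdc.cancer.gov/files and the four parameters listed above. The helper names (build_filter, get_url, post_body, post_query) are hypothetical and chosen only for illustration. With GET the filter is serialized to a JSON string and placed in the query string, so the URL grows with every ID; with POST the same parameters travel in a JSON body and the filter stays a nested object.

```python
import json
from urllib.parse import urlencode

FILES_ENDPT = "https://api.gdc.cancer.gov/files"


def build_filter(file_ids):
    """Return a GDC 'in' filter that matches any of the given file IDs."""
    return {
        "op": "in",
        "content": {"field": "files.file_id", "value": list(file_ids)},
    }


def get_url(file_ids, fields, fmt="tsv", size=100):
    """Build the full GET URL; the filter is serialized to a JSON string first."""
    params = {
        "filters": json.dumps(build_filter(file_ids)),
        "format": fmt,
        "fields": ",".join(fields),
        "size": str(size),
    }
    return FILES_ENDPT + "?" + urlencode(params)


def post_body(file_ids, fields, fmt="tsv", size=100):
    """Build the POST payload; here the filter stays a nested object."""
    return {
        "filters": build_filter(file_ids),
        "format": fmt,
        "fields": ",".join(fields),
        "size": str(size),
    }


def post_query(file_ids, fields):
    """Send the query with POST (needs network access and the requests package)."""
    import requests  # imported here so the helpers above work without it

    response = requests.post(
        FILES_ENDPT,
        headers={"Content-Type": "application/json"},
        json=post_body(file_ids, fields),
    )
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # Placeholder IDs, used only to show how the GET URL grows with the list.
    fake_ids = ["00000000-0000-0000-0000-%012d" % i for i in range(500)]
    print(len(get_url(fake_ids[:5], ["file_name"])))
    print(len(get_url(fake_ids, ["file_name"])))
```

For a handful of IDs either method works; once the ID list comes from a whole manifest, the POST body is the safer choice.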
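Example
Below are the first few lines of the LAML manifest gdc_manifest.2018-01-08T04_35_55.788687.txt. To find the sample ID, the sample type (tumor or normal), and related information for each htseq.counts.gz file, we can use the GDC API: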
id filename md5 size state
fdf76d41-8909-49be-83fb-d5ce8715b7e9 a3376c90-202c-42f6-a120-98d650e0765d.htseq.counts.gz eaf6cb895b36d591d28e2e153176ca7e 259927 live
3e1f3a96-6d9a-47aa-b319-d0e3e3fd12f9 d398a330-6a57-4172-9f2a-d6187fa2c71d.htseq.counts.gz 309d1459a02bce16766f6fe24e611930 253383 live
5cf34517-66c6-4b12-b83b-4cb71309f68a 4fd2de09-663c-4452-8cc8-1733a78bf71f.htseq.counts.gz 3b7ac338c1d8bf4721c5a95e41832702 259239 live
47c9d58f-d7b4-41fe-b500-d95b537dc21e ccb70fba-ed81-4b83-90e1-99375e0db559.htseq.counts.gz 1d1753b6af24831fd254ea031e33032e 258247 live
fe4f0b9a-46c8-4419-8315-0646418e3591 0d8702cc-d3db-4daf-94a8-b37103d771a6.htseq.counts.gz e18c8cf1ab7cb3bd82196451ab94d5f2 257574 live
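Rather than copying the IDs by hand, the id column can be read straight from the manifest. The following is a minimal sketch, assuming the manifest is tab-separated with the header row shown above; the function name read_manifest_ids is hypothetical.

```python
import csv
import io


def read_manifest_ids(handle):
    """Return the values of the 'id' column from a tab-separated GDC manifest."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["id"] for row in reader]


if __name__ == "__main__":
    # Two rows from the manifest above, tab-separated for the demonstration.
    sample = (
        "id\tfilename\tmd5\tsize\tstate\n"
        "fdf76d41-8909-49be-83fb-d5ce8715b7e9\t"
        "a3376c90-202c-42f6-a120-98d650e0765d.htseq.counts.gz\t"
        "eaf6cb895b36d591d28e2e153176ca7e\t259927\tlive\n"
        "3e1f3a96-6d9a-47aa-b319-d0e3e3fd12f9\t"
        "d398a330-6a57-4172-9f2a-d6187fa2c71d.htseq.counts.gz\t"
        "309d1459a02bce16766f6fe24e611930\t253383\tlive\n"
    )
    print(read_manifest_ids(io.StringIO(sample)))
```

With a real file, open it with open(path, newline="") and pass the handle to read_manifest_ids; the resulting list can be used as the "value" of the filter.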
The Python script is as follows (it uses GET, which is fine for a short list of IDs):
import json

import requests

files_endpt = 'https://api.gdc.cancer.gov/files'

filt = {
    "op": "in",
    "content": {
        "field": "files.file_id",  # file_id is the id column of the GDC manifest
        "value": [
            "fdf76d41-8909-49be-83fb-d5ce8715b7e9",
            "3e1f3a96-6d9a-47aa-b319-d0e3e3fd12f9",
            "5cf34517-66c6-4b12-b83b-4cb71309f68a",
            "47c9d58f-d7b4-41fe-b500-d95b537dc21e",
            "fe4f0b9a-46c8-4419-8315-0646418e3591"
        ]
    }
}

# For GET, the filter must be serialized to a JSON string
params = {
    "filters": json.dumps(filt),
    "format": "tsv",
    "fields": "file_name,cases.samples.sample_type_id,cases.samples.sample_type,cases.samples.submitter_id",
    "size": "100"
}

response = requests.get(files_endpt, params=params)
print(response.content)
# The output is shown below: the sample ID and sample type for each file:
b'file_name cases.0.samples.0.submitter_id cases.0.samples.0.sample_type_id cases.0.samples.0.sample_type id
a3376c90-202c-42f6-a120-98d650e0765d.htseq.counts.gz TCGA-AB-2927-03A 3 Primary Blood Derived Cancer - Peripheral Blood fdf76d41-8909-49be-83fb-d5ce8715b7e9
d398a330-6a57-4172-9f2a-d6187fa2c71d.htseq.counts.gz TCGA-AB-2843-03A 3 Primary Blood Derived Cancer - Peripheral Blood 3e1f3a96-6d9a-47aa-b319-d0e3e3fd12f9
4fd2de09-663c-4452-8cc8-1733a78bf71f.htseq.counts.gz TCGA-AB-2859-03A 3 Primary Blood Derived Cancer - Peripheral Blood 5cf34517-66c6-4b12-b83b-4cb71309f68a
ccb70fba-ed81-4b83-90e1-99375e0db559.htseq.counts.gz TCGA-AB-2931-03A 3 Primary Blood Derived Cancer - Peripheral Blood 47c9d58f-d7b4-41fe-b500-d95b537dc21e
0d8702cc-d3db-4daf-94a8-b37103d771a6.htseq.counts.gz TCGA-AB-2897-03A 3 Primary Blood Derived Cancer - Peripheral Blood fe4f0b9a-46c8-4419-8315-0646418e3591
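The TSV response can then be turned into a per-file lookup. The following is a minimal sketch, assuming tab-separated output with the column names shown above and one sample per file. The tumor/normal label relies on the TCGA sample type code convention (01 to 09 tumor, 10 to 19 normal, 20 to 29 control), which is an assumption layered on top of the original script; the function names are hypothetical.

```python
import csv
import io


def classify_sample_type(sample_type_id):
    """Map a TCGA sample type code to 'tumor', 'normal', 'control' or 'unknown'."""
    code = int(sample_type_id)
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


def parse_sample_table(tsv_text):
    """Return {file_name: (sample submitter_id, sample_type, tumor/normal label)}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    table = {}
    for row in reader:
        table[row["file_name"]] = (
            row["cases.0.samples.0.submitter_id"],
            row["cases.0.samples.0.sample_type"],
            classify_sample_type(row["cases.0.samples.0.sample_type_id"]),
        )
    return table
```

In the script above this would be called as parse_sample_table(response.text), since response.content is bytes rather than text. For the five LAML files every entry is labeled as tumor, because sample type 3 is Primary Blood Derived Cancer - Peripheral Blood.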