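GEO and SRA are linked databases. GEO does not store FASTQ data; it stores only downstream values that the submitters have already quantified, such as FPKM.
For TCGA, the raw FASTQ data require an access application, but count data are available.
Raw data therefore have to be downloaded from one of the three major databases: SRA (NCBI, USA), ENA (Europe), or DDBJ (Japan).
Papers usually list a GEO accession, but the download needs an Aspera link, which GEO does not provide. GEO is linked to SRA, but SRA does not provide that link either, so we turn to ENA, which is linked to SRA.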
Walkthrough with example data
Right-click to open each link in a new tab; both pages are still NCBI databases:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA229998
SRA database:
https://www.ncbi.nlm.nih.gov/sra?term=SRP033351
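Open an entry and look mainly at the Library section.
Open the ENA database:
Enter the BioProject accession (delete any leading space);
Click VIEW to search:
Click Show Column Selection and tick the columns you need:
Click TSV to download: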
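study_accession: project accession;
sample_accession: sample accession;
secondary_sample_accession: the sample's SRA accession, available because the three major nucleic acid databases are interlinked;
experiment_accession: experiment accession;
run_accession: run accession;
tax_id: NCBI taxonomy ID of the species (the species name itself is in scientific_name);
sra_aspera: link for downloading the SRA file with Aspera;
sra_md5: MD5 checksum of the SRA file, used to verify the download.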
(base) Mar23 23:11:03 ~
$ pwd
/trainee2/Mar23
(base) Mar23 23:22:31 ~
$ ln -s /teach/t_rna/data/airway/sra/filereport_read_run_PRJNA229998_tsv.txt ./
(base) Mar23 23:23:25 ~
$ ls
biosoft filereport_read_run_PRJNA229998_tsv.txt project_backup
catfile miniconda3 readme.txt
Data Miniconda3-latest-Linux-x86_64.sh t_linux
database pipline
download project
(base) Mar23 23:23:36 ~
$ ll
total 34380
drwxr-xr-x 13 Mar23 trainee 4096 Apr 6 23:23 ./
drwxr-xr-x 28 root root 4096 Apr 6 23:24 ../
-rw------- 1 Mar23 Mar23 40859 Apr 6 23:23 .bash_history
-rw-r--r-- 1 Mar23 root 4512 Mar 22 12:43 .bashrc
-rw-r--r-- 1 Mar23 Mar23 16384 Mar 27 21:37 .bashrc.swp
drwxrwxr-x 5 Mar23 Mar23 4096 Mar 27 21:54 biosoft/
drwx------ 2 Mar23 Mar23 4096 Mar 20 13:04 .cache/
-rw-rw-r-- 1 Mar23 Mar23 0 Mar 20 22:57 catfile
drwxrwxr-x 2 Mar23 Mar23 4096 Mar 22 12:42 .conda/
-rw-rw-r-- 1 Mar23 Mar23 255 Mar 26 22:51 .condarc
drwxrwxr-x 2 Mar23 Mar23 4096 Mar 25 22:27 .continuum/
drwxr-xr-x 3 Mar23 Mar23 4096 Apr 4 23:19 Data/
drwxrwxr-x 2 Mar23 Mar23 4096 Apr 2 23:15 database/
-rw-rw-r-- 1 Mar23 Mar23 35050467 Mar 27 01:21 download
lrwxrwxrwx 1 Mar23 Mar23 68 Apr 6 23:23 filereport_read_run_PRJNA229998_tsv.txt -> /teach/t_rna/data/airway/sra/filereport_read_run_PRJNA229998_tsv.txt
drwxrwxr-x 18 Mar23 Mar23 4096 Mar 26 22:30 miniconda3/
lrwxrwxrwx 1 Mar23 Mar23 48 Mar 23 19:43 Miniconda3-latest-Linux-x86_64.sh -> /teach/t_linux/Miniconda3-latest-Linux-x86_64.sh
drwx------ 2 Mar23 Mar23 4096 Mar 26 23:24 .ncbi/
drwxrwxr-x 2 Mar23 Mar23 4096 Apr 2 23:15 pipline/
-rw-r--r-- 1 Mar23 root 655 Mar 15 07:18 .profile
drwxrwxr-x 2 Mar23 Mar23 4096 Apr 4 23:16 project/
drwxrwxr-x 2 Mar23 Mar23 4096 Apr 2 23:16 project_backup/
-rw-r--r-- 1 Mar23 root 206 Mar 15 07:18 readme.txt
lrwxrwxrwx 1 Mar23 Mar23 13 Mar 20 20:34 t_linux -> /home/t_linux/
-rw------- 1 Mar23 Mar23 4628 Mar 27 23:03 .viminfo
-rw-rw-r-- 1 Mar23 Mar23 329 Mar 27 01:21 .wget-hsts
(base) Mar23 23:24:13 ~
$ mv filereport_read_run_PRJNA229998_tsv.txt Data/rawdata/sra/
(base) Mar23 23:25:20 ~
$ cd Data/rawdata/sra/
(base) Mar23 23:25:31 ~/Data/rawdata/sra
$ ls
filereport_read_run_PRJNA229998_tsv.txt
(base) Mar23 23:25:33 ~/Data/rawdata/sra
$ head -n 1 filereport_read_run_PRJNA229998_tsv.txt
study_accession sample_accession experiment_accession run_accession tax_id scientific_name fastq_md5 fastq_aspera submitted_ftp sra_bytes sra_md5 sra_ftp sra_aspera
(base) Mar23 23:25:59 ~/Data/rawdata/sra
$ head -n 1 filereport_read_run_PRJNA229998_tsv.txt | tr '\t' '\n'
study_accession
sample_accession
experiment_accession
run_accession
tax_id
scientific_name
fastq_md5
fastq_aspera
submitted_ftp
sra_bytes
sra_md5
sra_ftp
sra_aspera
(base) Mar23 23:26:21 ~/Data/rawdata/sra
$ head -n 1 filereport_read_run_PRJNA229998_tsv.txt | tr '\t' '\n'| cat -n
1 study_accession
2 sample_accession
3 experiment_accession
4 run_accession
5 tax_id
6 scientific_name
7 fastq_md5
8 fastq_aspera
9 submitted_ftp
10 sra_bytes
11 sra_md5
12 sra_ftp
13 sra_aspera
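Numbering the header by hand works, but the column position can also be looked up by name. The following is a minimal sketch, assuming a tab-separated report with the same 13 column names; the helper `col_index`, the temporary file, and the `-` placeholder values are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the ENA report: same 13 column names, one data row.
# Fields shown as "-" are placeholders, not real values.
tsv=$(mktemp)
printf 'study_accession\tsample_accession\texperiment_accession\trun_accession\ttax_id\tscientific_name\tfastq_md5\tfastq_aspera\tsubmitted_ftp\tsra_bytes\tsra_md5\tsra_ftp\tsra_aspera\n' > "$tsv"
printf 'PRJNA229998\t-\t-\tSRR1039508\t-\t-\t-\t-\t-\t-\tb55775f72aa66e2632adf9d5a5bf0e84\t-\tfasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508\n' >> "$tsv"

# col_index NAME FILE: print the 1-based position of column NAME in the header row.
col_index() {
  awk -F'\t' -v name="$1" 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) print i; exit }' "$2"
}

aspera_col=$(col_index sra_aspera "$tsv")
echo "sra_aspera is column $aspera_col"

# Extract that column and drop the header line.
aspera_links=$(cut -f "$aspera_col" "$tsv" | tail -n +2)
echo "$aspera_links"
```

Looking the index up by name keeps the later `cut -f` call correct even if a different set of columns is ticked when exporting the TSV.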
(base) Mar23 23:26:27 ~/Data/rawdata/sra
$ less -S filereport_read_run_PRJNA229998_tsv.txt | cut -f 13
sra_aspera
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039509
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039510
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039511
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039512
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039513
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/004/SRR1039514
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/005/SRR1039515
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/006/SRR1039516
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/007/SRR1039517
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039518
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039519
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039520
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039521
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039522
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039523
(base) Mar23 23:28:39 ~/Data/rawdata/sra
$ less -S filereport_read_run_PRJNA229998_tsv.txt | cut -f 13 | awk 'NR>1{print}'
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039509
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039510
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039511
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039512
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039513
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/004/SRR1039514
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/005/SRR1039515
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/006/SRR1039516
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/007/SRR1039517
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039518
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039519
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039520
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039521
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039522
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039523
(base) Mar23 23:28:56 ~/Data/rawdata/sra
$ less -S filereport_read_run_PRJNA229998_tsv.txt | cut -f 13 | awk 'NR>1{print}' >sra.url
(base) Mar23 23:29:52 ~/Data/rawdata/sra
$ cat -A sra.url
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039509$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039510$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039511$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039512$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039513$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/004/SRR1039514$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/005/SRR1039515$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/006/SRR1039516$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/007/SRR1039517$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039518$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039519$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039520$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039521$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039522$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039523$
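The links in sra.url are meant to be passed to Aspera's `ascp` client. The sketch below only prints the commands (a dry run) rather than running them. It assumes the commonly used ENA login `era-fasp`, port 33001, and a private key under `~/.aspera/connect/etc/`; the key path in particular depends on how Aspera Connect was installed, so treat all of these as assumptions to check against your own setup.

```shell
#!/usr/bin/env bash
# Two entries copied from sra.url above, written to a temporary file.
url_list=$(mktemp)
printf '%s\n' \
  'fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508' \
  'fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039509' > "$url_list"

# ASSUMPTION: location of the Aspera Connect private key; adjust to your install.
key="$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh"

# Dry run: build one ascp command per link instead of executing it.
# -l caps the transfer rate, -P sets the port, -k 1 resumes partial files.
ascp_cmds=$(while IFS= read -r link; do
  echo "ascp -QT -l 300m -P33001 -k 1 -i $key era-fasp@$link ."
done < "$url_list")
echo "$ascp_cmds"
```

Once the printed commands look right, they can be saved to a script and run, ideally in the background or inside a terminal multiplexer because the files are large.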
If something unexpected happens, press Ctrl+C to exit.
In a sed expression such as sed -i "s/\s…:
$: matches the end of the line
^: matches the start of the line
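As an illustration of the end-of-line anchor, here is a small sketch that strips trailing whitespace from a list of links and then inspects the result with `cat -A`, as done for sra.url above. The input lines with stray spaces and tabs are made up for the demo, and the POSIX class `[[:space:]]` is used in place of `\s`.

```shell
#!/usr/bin/env bash
# Hypothetical input: the first link has two trailing spaces, the second a trailing tab.
dirty=$(mktemp)
printf 'fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508  \n' > "$dirty"
printf 'fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039509\t\n' >> "$dirty"

# [[:space:]]*$ matches any run of whitespace sitting right before the end of the line.
cleaned=$(sed 's/[[:space:]]*$//' "$dirty")

# GNU cat -A marks each line end with $, so leftover whitespace would be visible.
printf '%s\n' "$cleaned" | cat -A
```

Adding `-i` to the sed call would edit the file in place instead of printing the cleaned text.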
(base) Mar23 21:16:19 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | cut -f 11,4
run_accession sra_md5
SRR1039508 b55775f72aa66e2632adf9d5a5bf0e84
SRR1039509 436e98885ef79c0c722a8065aedc1bc4
SRR1039510 b7af76fb67fa0424f7fe763bb447e330
SRR1039511 97ee3a81fc3efdb96368ac5e283f31b1
SRR1039512 4498c5b7ecef41896eb86741aa92acde
SRR1039513 83262fe5042240e0f746e4c370e1a3ed
SRR1039514 c88901e8e32fb0b1f1751a6cb73fff64
SRR1039515 813c7d6f3ebb53f39c381ca5c09f70e3
SRR1039516 141e4b140ddd1b45e468255a4edf4609
SRR1039517 5fa424d477310838e1d65073908acb2c
SRR1039518 fc03abf20ea8a455e2595c4e15a6a78c
SRR1039519 8facbd57cafd8d5059ad992e5c027815
SRR1039520 dd724892de776dc5b3e30771d92e0916
SRR1039521 3f6532f491497ab9c7132b0624961a85
SRR1039522 80489e66e342ea35163025c68c2cb7ab
SRR1039523 3f05d4761772965d2d25997ff34db371
md5sum -c expects each line to hold the checksum first and the file name second, so sra_md5 has to come before run_accession:
(base) Mar23 21:16:39 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | awk 'NR>1{print$11"  "$4}'
b55775f72aa66e2632adf9d5a5bf0e84 SRR1039508
436e98885ef79c0c722a8065aedc1bc4 SRR1039509
b7af76fb67fa0424f7fe763bb447e330 SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1 SRR1039511
4498c5b7ecef41896eb86741aa92acde SRR1039512
83262fe5042240e0f746e4c370e1a3ed SRR1039513
c88901e8e32fb0b1f1751a6cb73fff64 SRR1039514
813c7d6f3ebb53f39c381ca5c09f70e3 SRR1039515
141e4b140ddd1b45e468255a4edf4609 SRR1039516
5fa424d477310838e1d65073908acb2c SRR1039517
fc03abf20ea8a455e2595c4e15a6a78c SRR1039518
8facbd57cafd8d5059ad992e5c027815 SRR1039519
dd724892de776dc5b3e30771d92e0916 SRR1039520
3f6532f491497ab9c7132b0624961a85 SRR1039521
80489e66e342ea35163025c68c2cb7ab SRR1039522
3f05d4761772965d2d25997ff34db371 SRR1039523
Data integrity check: verifying MD5 values
(rna) Mar23 21:20:08 ~/Data/rawdata/sra
$ ln -s /teach/t_rna/data/airway/sra/SRR103951* ./ # link the files from the instructor's folder into the current directory
(rna) Mar23 21:21:13 ~/Data/rawdata/sra
$ ls
filereport_read_run_PRJNA229998_tsv.txt SRR1039510 SRR1039512
sra.url SRR1039511
(rna) Mar23 21:21:30 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | awk 'NR>1{print$11"  "$4}' >md5.txt
(rna) Mar23 21:23:22 ~/Data/rawdata/sra
$ cat md5.txt
b55775f72aa66e2632adf9d5a5bf0e84 SRR1039508
436e98885ef79c0c722a8065aedc1bc4 SRR1039509
b7af76fb67fa0424f7fe763bb447e330 SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1 SRR1039511
4498c5b7ecef41896eb86741aa92acde SRR1039512
83262fe5042240e0f746e4c370e1a3ed SRR1039513
c88901e8e32fb0b1f1751a6cb73fff64 SRR1039514
813c7d6f3ebb53f39c381ca5c09f70e3 SRR1039515
141e4b140ddd1b45e468255a4edf4609 SRR1039516
5fa424d477310838e1d65073908acb2c SRR1039517
fc03abf20ea8a455e2595c4e15a6a78c SRR1039518
8facbd57cafd8d5059ad992e5c027815 SRR1039519
dd724892de776dc5b3e30771d92e0916 SRR1039520
3f6532f491497ab9c7132b0624961a85 SRR1039521
80489e66e342ea35163025c68c2cb7ab SRR1039522
3f05d4761772965d2d25997ff34db371 SRR1039523
(rna) Mar23 21:23:37 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | awk 'NR>3&&NR<7{print$11"  "$4}'
b7af76fb67fa0424f7fe763bb447e330 SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1 SRR1039511
4498c5b7ecef41896eb86741aa92acde SRR1039512
(rna) Mar23 21:27:35 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | awk 'NR>3&&NR<7{print$11"  "$4}' >md5.txt
(rna) Mar23 21:30:56 ~/Data/rawdata/sra
$ ls
filereport_read_run_PRJNA229998_tsv.txt sra.url SRR1039511
md5.txt SRR1039510 SRR1039512
(rna) Mar23 21:31:11 ~/Data/rawdata/sra
$ ll
total 20
drwxrwxr-x 2 Mar23 Mar23 4096 Apr 10 21:23 ./
drwxrwxr-x 3 Mar23 Mar23 4096 Apr 4 23:19 ../
lrwxrwxrwx 1 Mar23 Mar23 68 Apr 6 23:23 filereport_read_run_PRJNA229998_tsv.txt -> /teach/t_rna/data/airway/sra/filereport_read_run_PRJNA229998_tsv.txt
-rw-rw-r-- 1 Mar23 Mar23 135 Apr 10 21:30 md5.txt
-rw-rw-r-- 1 Mar23 Mar23 816 Apr 6 23:29 sra.url
lrwxrwxrwx 1 Mar23 Mar23 39 Apr 10 21:21 SRR1039510 -> /teach/t_rna/data/airway/sra/SRR1039510
lrwxrwxrwx 1 Mar23 Mar23 39 Apr 10 21:21 SRR1039511 -> /teach/t_rna/data/airway/sra/SRR1039511
lrwxrwxrwx 1 Mar23 Mar23 39 Apr 10 21:21 SRR1039512 -> /teach/t_rna/data/airway/sra/SRR1039512
(rna) Mar23 21:32:21 ~/Data/rawdata/sra
$ cat md5.txt
b7af76fb67fa0424f7fe763bb447e330 SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1 SRR1039511
4498c5b7ecef41896eb86741aa92acde SRR1039512
(rna) Mar23 21:33:40 ~/Data/rawdata/sra
$ md5sum -c md5.txt
SRR1039510: OK
SRR1039511: OK
SRR1039512: OK
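The row range `NR>3&&NR<7` above was picked by hand to match the three files that are present. As a sketch of an alternative, the list can be filtered to whichever files actually exist in the directory. Everything here runs in a temporary directory with tiny stand-in files; the names `RUN_A`, `RUN_B`, `RUN_C`, `report.tsv`, and `md5_present.txt` are made up for the demo, and GNU `md5sum` is assumed.

```shell
#!/usr/bin/env bash
# Self-contained demo in a temporary directory with two small stand-in files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'demo run A\n' > RUN_A
printf 'demo run B\n' > RUN_B

# Miniature report: run name in column 4, checksum in column 11, like the ENA TSV.
# RUN_C is listed in the report but deliberately absent from the directory.
{
  printf 'c1\tc2\tc3\trun_accession\tc5\tc6\tc7\tc8\tc9\tc10\tsra_md5\n'
  for f in RUN_A RUN_B; do
    printf -- '-\t-\t-\t%s\t-\t-\t-\t-\t-\t-\t%s\n' "$f" "$(md5sum "$f" | cut -d' ' -f1)"
  done
  printf -- '-\t-\t-\tRUN_C\t-\t-\t-\t-\t-\t-\t00000000000000000000000000000000\n'
} > report.tsv

# Keep only rows whose file exists here; write "checksum<two spaces>name" for md5sum -c.
awk -F'\t' 'NR > 1 { print $11 "  " $4 }' report.tsv |
  while read -r sum name; do
    if [ -e "$name" ]; then
      printf '%s  %s\n' "$sum" "$name"
    fi
  done > md5_present.txt

check_output=$(md5sum -c md5_present.txt)
echo "$check_output"
```

With this filter the same command keeps working as more runs are downloaded, without adjusting row numbers.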
MD5 values for uploading data
Generating MD5 values
Printing the MD5 values to the screen
(rna) Mar23 22:30:39 ~/Data/rawdata/sra
$ md5sum filereport_read_run_PRJNA229998_tsv.txt
553c8bb68676be08026e6dc5950c429f filereport_read_run_PRJNA229998_tsv.txt
(rna) Mar23 22:36:39 ~/Data/rawdata/sra
$ md5sum SRR*
b7af76fb67fa0424f7fe763bb447e330 SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1 SRR1039511
4498c5b7ecef41896eb86741aa92acde SRR1039512
Saving the MD5 values to a file
(rna) Mar23 22:37:36 ~/Data/rawdata/sra
$ md5sum SRR* >raw_md5.txt & # the trailing & runs the command in the background
[1] 28129
(rna) Mar23 22:39:23 ~/Data/rawdata/sra
$ jobs
[1]+ Done md5sum SRR* > raw_md5.txt
(rna) Mar23 22:39:51 ~/Data/rawdata/sra
$ ll
total 28
drwxrwxr-x 2 Mar23 Mar23 4096 Apr 10 22:39 ./
drwxrwxr-x 3 Mar23 Mar23 4096 Apr 4 23:19 ../
-rw-rw-r-- 1 Mar23 Mar23 45 Apr 10 22:25 CHECK
lrwxrwxrwx 1 Mar23 Mar23 68 Apr 6 23:23 filereport_read_run_PRJNA229998_tsv.txt -> /teach/t_rna/data/airway/sra/filereport_read_run_PRJNA229998_tsv.txt
-rw-rw-r-- 1 Mar23 Mar23 720 Apr 10 22:30 md5.txt
-rw-rw-r-- 1 Mar23 Mar23 135 Apr 10 22:39 raw_md5.txt
-rw-rw-r-- 1 Mar23 Mar23 816 Apr 6 23:29 sra.url
lrwxrwxrwx 1 Mar23 Mar23 39 Apr 10 21:21 SRR1039510 -> /teach/t_rna/data/airway/sra/SRR1039510
lrwxrwxrwx 1 Mar23 Mar23 39 Apr 10 21:21 SRR1039511 -> /teach/t_rna/data/airway/sra/SRR1039511
lrwxrwxrwx 1 Mar23 Mar23 39 Apr 10 21:21 SRR1039512 -> /teach/t_rna/data/airway/sra/SRR1039512
Generating MD5 values from the parent directory
(rna) Mar23 22:40:08 ~/Data/rawdata/sra
$ cd ../
(rna) Mar23 22:41:30 ~/Data/rawdata
$ ls
sra
(rna) Mar23 22:41:40 ~/Data/rawdata
$ md5sum sra/SRR103951*
b7af76fb67fa0424f7fe763bb447e330 sra/SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1 sra/SRR1039511
4498c5b7ecef41896eb86741aa92acde sra/SRR1039512
(rna) Mar23 22:43:57 ~/Data/rawdata
$ ls sra/SRR103951*
sra/SRR1039510 sra/SRR1039511 sra/SRR1039512
(rna) Mar23 22:44:46 ~/Data/rawdata
$ pwd
/trainee2/Mar23/Data/rawdata
(rna) Mar23 22:48:33 ~/Data/rawdata
$ cd sra/
(rna) Mar23 22:48:38 ~/Data/rawdata/sra
$ ls
CHECK raw_md5.txt SRR1039511
filereport_read_run_PRJNA229998_tsv.txt sra.url SRR1039512
md5.txt SRR1039510
(rna) Mar23 22:48:40 ~/Data/rawdata/sra
$ md5sum SRR103951*
b7af76fb67fa0424f7fe763bb447e330 SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1 SRR1039511
4498c5b7ecef41896eb86741aa92acde SRR1039512
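The two runs above differ only in the path recorded next to each checksum (`sra/SRR1039510` versus `SRR1039510`). That path matters later, because `md5sum -c` resolves it relative to the directory where the check is run. A minimal sketch of the effect, using a temporary directory and a made-up stand-in file `RUN_A`, with GNU `md5sum` assumed:

```shell
#!/usr/bin/env bash
# Temporary layout mirroring rawdata/sra, with one small stand-in file.
top=$(mktemp -d)
mkdir "$top/sra"
printf 'demo run\n' > "$top/sra/RUN_A"

# Checksums generated from the parent directory record the path sra/RUN_A.
cd "$top" || exit 1
md5sum sra/RUN_A > from_parent.txt

# Verifying from the same directory resolves the recorded path.
parent_check=$(md5sum -c from_parent.txt)
echo "$parent_check"

# Verifying from inside sra/ looks for sra/sra/RUN_A, which does not exist.
cd "$top/sra" || exit 1
if md5sum -c ../from_parent.txt >/dev/null 2>&1; then
  inside_check="OK"
else
  inside_check="FAILED"
fi
echo "check from inside sra/: $inside_check"
```

In practice this means a checksum file should be shipped together with a note on which directory it was generated from, or generated inside the data directory so that only bare file names are recorded.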