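GEO: Gene Expression Omnibus
When you submit a manuscript that includes NGS data, you have to upload the raw data to GEO or a similar repository so that they become publicly available. Reviewers can inspect the data during review, but nothing is made public before the paper is accepted and published.
Today I am reading up on how to upload data to GEO and how to apply for an account.
The original page is here 👇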
Submitting high-throughput sequence data to GEO
GEO accepts data types including: quantitative gene expression, gene regulation, epigenomics or other aspects of functional genomics. In other words, ordinary Illumina sequencing data generally qualify. See the GEO page for examples.
GEO processes all components of your study, including the samples, project description, and processed data files, and submits the raw data files to the Sequence Read Archive (SRA) on your behalf.
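Data types/formats for upload:
GEOarchive spreadsheet format
Data have to be submitted using the GEOarchive spreadsheet format, unless your data already live in a database.
In that case you can generate and export SOFT text format from that database instead; GEO has separate instructions for SOFT text format submission.
Question: our data sit on our own department's server, in an archive called LS-Archive. Does that count as a database? Can it export SOFT text format?
Below are the details of the first method, uploading in GEOarchive spreadsheet format:
A GEOarchive submission needs three parts: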
1. a metadata spreadsheet
Download the metadata spreadsheet template and examples here 👇
https://www.ncbi.nlm.nih.gov/geo/info/examples/seq_template_v2.1.xls
Metadata means the descriptive information about your whole study (experimental methods, the reference genome, and the names of the raw data files).
2. Processed data files
This part is required for a GEO submission. Here is how GEO defines final processed data:
The final processed data are defined as the data on which the conclusions in the related manuscript are based. We do not expect standard alignment files (e.g., BAM, SAM, BED) as processed data since conclusions are expected to be based on further-processed data.
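If your final processed data stop at the alignment to the genome, you need to email GEO and explain. Apart from that, there is no special standard that processed data must follow.
Common types of processed data: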
Expression profiling analysis usually generates quantitative data for features of interest.
Features of interest include genes, transcripts, exons, miRNA, or some other genetic entity. There are two kinds of processed data files:
- raw counts of sequencing reads for the features of interest, and/or
- normalized abundance measurements, e.g., output from Cufflinks, Cuffdiff, DESeq, edgeR, etc.
ChIP-Seq data might include peak files with quantitative data, tag density files, etc. Common formats include WIG, bigWig, bedGraph.
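Reference genome: commonly used assemblies are hg38, hg19, mm10, etc. When you submit this part, describe the format and content of the processed files in the processing fields of the metadata spreadsheet.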
3. Raw data files
Raw data are also a required part of a GEO upload. GEO passes them on to SRA.
Definition of a raw data file:
The raw data files should be the original files containing reads and quality scores, as generated by the sequencing instrument (unless the raw files are barcoded/multiplexed, see below for further instructions).
Raw data formats:
FASTQ is the common one.
For other formats, see the SRA file format requirements.
Barcode/Multiplexed Data:
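FASTQ file requirement: Reads should not be trimmed.
The FASTQ files meant here are the de-multiplexed ones, already split by barcode. Before de-multiplexing, the data are in BCL format. Your library barcode information also has to go into the "library construction protocol" field of the metadata spreadsheet.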
Paired-end Experiments:
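Sequencing is either single-end or paired-end. A paired-end run gives you two FASTQ files per sample (R1 and R2). For paired-end data you also have to provide the average insert size of the molecules sequenced (excluding linkers, adapters, etc...), that is, roughly the average library fragment length shown by the Bioanalyzer, minus the adapter sequence.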
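GEO does not ask for this, but before uploading a paired-end sample I would want to know that R1 and R2 really belong together. A minimal sketch, assuming standard four-line FASTQ records; the function names `count_reads` and `check_pair` are my own:

```shell
# Count reads in a FASTQ file (plain text or gzip-compressed).
# Assumes the usual layout of exactly 4 lines per read.
count_reads() {
    case "$1" in
        *.gz) cr_lines=$(gzip -dc "$1" | wc -l) ;;
        *)    cr_lines=$(wc -l < "$1") ;;
    esac
    echo $(( $cr_lines / 4 ))
}

# Print both read counts; succeed only when R1 and R2 match.
check_pair() {
    cp_n1=$(count_reads "$1")
    cp_n2=$(count_reads "$2")
    echo "R1=$cp_n1 reads, R2=$cp_n2 reads"
    [ "$cp_n1" -eq "$cp_n2" ]
}
```

Usage would look like `check_pair sample_R1.fastq.gz sample_R2.fastq.gz || echo "pair mismatch"`, with the file names being placeholders. Equal counts do not prove the pairs are in the same order, so this is only a first sanity check.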
MD5 Checksums: what on earth is this?
GEO asks submitters to provide MD5 checksums for their raw data files. They are used to verify that each file arrived intact and unaltered.
Unix: md5sum <file>
OS X: md5 <file>
Windows: Application required. Many are available for free download.
Data File Compression: compressing raw data
Individual files can be compressed to speed transfer, but this is not required.
Acceptable compression formats are gzip and bzip2 (i.e. files ending with a .gz or .bz2 extension).
Never compress binary files (e.g., BAM, bigWig, bigBed), and DO NOT upload ZIP archives (files with a .zip extension).
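The compression rules can be sketched as a small filter: gzip the plain-text raw files one by one and leave everything else alone. The list of extensions below is my assumption about how such files are usually named; adjust it to your own data:

```shell
# Succeed only for uncompressed plain-text raw read files.
# The extension list is an assumption, not an official GEO list.
is_raw_text() {
    case "$1" in
        *.fastq|*.fq|*_qseq.txt|*.csfasta|*.qual) return 0 ;;
        *) return 1 ;;
    esac
}

# gzip each qualifying file in place; skip binary, already compressed
# and unknown files.
compress_for_upload() {
    for f in "$@"; do
        if is_raw_text "$f"; then
            gzip "$f"                       # replaces f with f.gz
        else
            echo "left as is: $f" >&2
        fi
    done
}
```

Running `compress_for_upload *` in a data folder would turn `sample.fastq` into `sample.fastq.gz` while leaving a BAM or bigWig file untouched. Compute the MD5 checksums after this step, since the checksums should be for the compressed files.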
Uploading your files
Before uploading:
- Note: if your upload will exceed 1 TB, you need to email GEO first,
including a list of files and MD5 checksums before you begin transferring files (if your files are compressed, the checksums should be for the compressed files).
* Before uploading, create a folder on your computer: a folder named using your GEO username (/johndoe) which includes all required submission files. Transfer the folder using the FTP instructions below.
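Preparing that folder is easy to script. A minimal sketch, assuming the files are copied rather than moved; the function name `stage_submission` and the file names in the usage line are hypothetical:

```shell
# Copy all submission files into a folder named after the GEO username.
# Stops with a non-zero status if the folder or any copy fails.
stage_submission() {
    ss_user="$1"
    shift
    mkdir -p "$ss_user" || return 1
    for f in "$@"; do
        cp "$f" "$ss_user"/ || return 1
    done
    ls -1 "$ss_user"    # show what will be transferred
}
```

For example: `stage_submission johndoe metadata.xls counts.txt sample_R1.fastq.gz sample_R2.fastq.gz`. The printed listing doubles as the list of file names you send to GEO after the transfer.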
* Compressing your raw data is strongly recommended:
We strongly recommend that submitters compress their raw data files (e.g., FASTQ, qseq, seq, csfasta, qual) using gzip or bzip2 to shorten the ftp transfer time. Do not compress with WinZip. Do not tar archive single files. Do not compress binary files (e.g., BAM, bigWig, bigBed). (The same advice, repeated once more.)
After uploading: send them an email to let them know
To: geo@ncbi.nlm.nih.gov
Subject: ftp upload
with the following information:
- GEO account username (johndoe);
- Names of the directory and files deposited;
- Public release date (required - up to 3 years from now - see FAQ).
You should expect to receive an e-mail from a curator within 5 business days after you send us the notification (see FAQ).
FTP instructions:
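- Register here: https://www.ncbi.nlm.nih.gov/geo/submitter/
You only need an account when you want to upload; downloading other people's data requires no account.
After registering you will see the FTP upload instructions. The account is the same one as your "My NCBI account".
Registration is not a single step: you have to go back to the link above and fill in detailed Investigator and personal information. From start to finish you receive three emails.
The first two confirm the NCBI account and the GEO account, respectively.
The third arrives once you have filled in your details and confirms that the GEO registration succeeded. If you upload nothing within three months, the account is deleted. Here is the content of the third email: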
Greetings!
You have successfully created a GEO account with the following User ID: runuply
GEO accounts are required only for submitting data. Please be aware that your GEO account will be deleted automatically in three months if data have not been received.
Instruction pages for common submission types:
[1] Affymetrix chip data: https://www.ncbi.nlm.nih.gov/geo/info/geo_affy.html
[2] Agilent array data: https://www.ncbi.nlm.nih.gov/geo/info/geo_agil.html
[3] Illumina beadarray data: https://www.ncbi.nlm.nih.gov/geo/info/geo_illu.html
[4] NimbleGen array data: https://www.ncbi.nlm.nih.gov/geo/info/geo_nimb.html
[5] Next-Generation Sequence (NGS) data: https://www.ncbi.nlm.nih.gov/geo/info/seq.html
[6] RT-PCR data: https://www.ncbi.nlm.nih.gov/geo/info/geo_rtpcr.html
[7] Traditional SAGE data: https://www.ncbi.nlm.nih.gov/geo/info/geo_sage.html
[8] Custom arrays or other data types, and a general discussion of submission options and formats: https://www.ncbi.nlm.nih.gov/geo/info/submission.html
Please consult our FAQ before writing with questions: https://www.ncbi.nlm.nih.gov/geo/info/faq.html
We look forward to receiving your submission.
Sincerely,
The GEO Team
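Item [5] is the one for NGS data.
I followed link [5] from the third email:
https://www.ncbi.nlm.nih.gov/geo/info/seq.html (the account was already logged in at this point).
Submitting data: this is the entry point.
Submitting high-throughput sequence data to GEO: the same page as link [5]. It is still the introduction, identical to the one shown during registration, except that the FTP instructions are now different; parts of them were left out before.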
https://www.ncbi.nlm.nih.gov/geo/info/spreadsheet.html
The README for Linux.
How to get an MD5 checksum on Linux
https://www.tecmint.com/generate-verify-check-files-md5-checksum-linux/
For example:
$ md5sum AThi10009_GGACTCCT_L005_R1_001.fastq.gz
24d16c528ef4db5b189f1a45cebdf2f6 AThi10009_GGACTCCT_L005_R1_001.fastq.gz
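24d16c528ef4db5b189f1a45cebdf2f6 is the MD5 checksum of one of my files. The checksum depends only on the file's contents, so however you rename the file, this string does not change.
On a Mac, just use md5 in place of md5sum.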
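Two quick checks follow from this. They are sketched here for Linux, since both rely on GNU `md5sum` (including its `-c` option); the function names and the manifest file name are my own:

```shell
# Succeed when two files have the same MD5 checksum, whatever their names.
same_md5() {
    sm_a=$(md5sum "$1" | awk '{print $1}')
    sm_b=$(md5sum "$2" | awk '{print $1}')
    [ "$sm_a" = "$sm_b" ]
}

# Re-check every "checksum  filename" line of a manifest against the
# files sitting next to it. Fails if any file is missing or changed.
verify_manifest() {
    ( cd "$(dirname "$1")" && md5sum -c "$(basename "$1")" )
}
```

`same_md5 old_name.fastq.gz new_name.fastq.gz` shows that a renamed copy keeps its checksum. `verify_manifest /path/to/manifest.md5` re-checks a list made earlier with `md5sum *.fastq.gz > manifest.md5`, which is handy right before starting a long FTP transfer.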