Dragon star Day 2 Pt.2 关于基因组组装、人类基因组计划、七桥问题、图论

Dragon star Day 2 Pt.2

关于基因组组装、人类基因组计划、七桥问题、图论

Dragonstar2019 by Kai Wang

  1. Alignment of short/long-read sequencing data
  2. Genome assembly by short/long-read sequencing

Part Ⅱ Genome assembly by short/long-read sequencing

1 genome assembly

1.1 What is genome assembly

  • Genome sequence produced after chromosomes have been fragmented
  • Those fragments have been sequenced
  • The resulting sequences have been put back together

1.2 Why genome assembly

  • Key to understanding genetic variation
  • The more accurate the reference genome, the easier it is to map reads and interpret the functional impacts of genetic mutations
  • Accurate and complete genome assembly is important for human genetic
    studies

1.3 What types of genome assembly?

1.3.1 de novo assembly

The entire assembled sequence is resolved from raw sequence data without comparison to a reference genome sequence.

Assemblers need copious sequencing data and informatic exertion to put the genome back together. Image: Kelly Howe, Lawrence Berkeley Laboratory

Involves sequencing a novel genome from scratch, without the aid of external data.

https://era7bioinformatics.com/en/page.cfm?id=1500

1.3.2 Comparative assembly
  • You have a “reference” genome to guide the assembly process
  • Or sequence a chimp genome using the existing human assembly as a guide

Each read is aligned to the reference genome. In this approach, the new completely different sequences are lost.

1.3.3 De Novo Assembly paradigms

2 Graph theory

  • Graph

    • Mathematical structures used to model pairwise relations between objects
    • A graph is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines).

    图论(英语:Graph theory)是组合数学的一个分支,和其他数学分支,如群论、矩阵论、拓扑学有着密切关系。是图论的主要研究对象。图是由若干给定的顶点及连接两顶点的所构成的图形,这种图形通常用来描述某些事物之间的某种特定关系。顶点用于代表事物,连接两顶点的边则用于表示两个事物间具有这种关系。

    https://zh.wikipedia.org/wiki/图论

    A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically.

    即无序图对称地连接两个边,有序图不对称地连接两个边。

2.1 Directed graph

A directed graph is an ordered pair G = (V, E) comprising:

  • V a set of vertices (also called nodes or points);
  • E ⊆ {(x, y) | (x, y) ∈ V2 ∧ xy} a set of edges

https://en.wikipedia.org/wiki/Graph_theory#Directed_graph

  • G(V, E): a set of vertices (V) and a set of directed edges (E)

  • Vertex and edge may have weights associated with them

  • Here, V= {Read1, Read2, Read3}, E = {(Read1, Read2), (Read2, Read3),
    (Read3, Read1)} with weights of 5, 6 and 4.

2.2 Overlap-based approach (OLC)

  • Overlap graph

    • Nodes are initially formed by the sequences of the individual reads
    • Edges are represented by the sequence overlaps between these reads.
  • Constructing the overlap graph

    • Create the overlap graph for the following string, where overlap means suffix/prefix match of >=5bp

      TTATGCCATGGGATGAA

    • Assume that we generated a few short reads from this string: TTATGCCATG, CCATGGGATG, TGGGATGAA

    • Vertices/nodes are formed by the sequences of the individual reads; edges are represented by the sequence overlaps between these reads

    • Sometimes graphs can be reduced by eliminating transitive edges

3 Comparison of two approaches for large genomes such as human genomes

3.1 U.S. Human Genome Project

  • Primary goals were to

    • Discover the complete set of human genes and make them accessible for further biological study
    • Determine the complete sequence of DNA bases in the human genome
  • Two thoughts: whole-genome shotgun versus physical mapping followed by BAC-by-BAC sequencing

3.2 Draft human genome

两股力量:

  • Public effort: BAC based sequencing - Performed by a consortium of government labs and universities.

    • Break genome into ~100-300kb pieces
    • Create Bacterial Artificial Chromosomes (BACs) from each piece
    • Assemble each piece by Sanger sequencing
    • Then stitch overlapping BACs together to draft genome
  • Private effort: whole genome random shotgun sequencing - Performed by Celera, which is a company.

    • In 1998, Craig Venter announced that he was forming Celera that within three years would sequence human genome

双方发文互怼,竞争之下,2001年分别在Nature (2001.02.15), Science (2001.02.16)发布成果,时间仅相差一天,比原定计划提前了两年。

3.3 How Perl Saved the Human Genome Project

Random web quotes: “Perl and the human genome are almost perfectly matched; both are almost incomprehensible, with no central design, accreted haphazardly over a long time.”

"Perl一夜之间火起来,人类基因组完成后又在一夜之间被抛弃。" 🌚

3.4 Back to today

  • Gigabase genomes can be easily sequenced and assembled using PacBio/Nanopore long-read assemblers
  • Whole-genome shotgun is the predominant approach to be used today for assembling these large genomes

4 Lander-Waterman statistics in genome assembly

  • Let

    N = # of reads,

    L = length of a read (average length),

    G = genome length,

    c = coverage = N∗L/G

  • Main questions to address:

    • What’s the fraction of genome that are covered by reads?
    • How many contigs are generated?

4.1 Coverage at a position can be modelled by Poisson distribution

P(k, λ) = e^{-λ} *λ^k/k!

  • Probability that a base is NOT covered = P(0,c) = e^{-c}
  • For genome size G, # of uncovered bases = G* P(0,c) = G * e^{-c}

4.2 What’s the fraction of genome that are covered by reads?

  • 等同于 "how many positions have coverage>0?"

  • For genome size G, number of positions with read coverage:

    G * (1-P(0,c) ) = G * (1-e^{-c})

4.3 How many contigs are generated?

4.4 Assembling large genomes today

Since 2013, de novo assembly of large genomes has shifted from short-read sequencing to synthetic or true long-read sequencing.

Jung et al, Trends in Plant Science, 2019

4.5 Canu assembler for long-read assembly

基于Celara开发的软件

Koren et al, Genome Research, 2017

4.6 Wtdbg2 assembler for long-read assembly

Wtdbg2 groups 256 base pairs into a bin. (并不是直接用ATCG)

Ruan et al, BioRxiv, 2019

5 Seven Bridges of Königsberg Problem

柯尼斯堡七桥问题(Seven Bridges of Königsberg)是图论中的著名问题。这个问题是基于一个现实生活中的事例:当时东普鲁士柯尼斯堡(今日俄罗斯加里宁格勒)市区跨普列戈利亚河两岸,河中心有两个小岛。小岛与河的两岸有七条桥连接。在所有桥都只能走一遍的前提下,如何才能把这个地方所有的桥都走遍?

https://zh.wikipedia.org/wiki/柯尼斯堡七桥问题

  • In 1736, the Swiss mathematician Leonhard Euler demonstrated that the answer is no 就是高数里欧拉公式的那个欧拉

  • In graph theory

    • An Eulerian trail (or Eulerian path) is a trail in a finite graph which visits every edge exactly once (allowing for revisiting vertices).

      莱昂哈德·欧拉在1735年提出,并没有方法能圆满解决这个问题,他更在第二年发表在论文《柯尼斯堡的七桥》中,证明符合条件的走法并不存在,也顺带提出和解决了一笔画问题

      https://zh.wikipedia.org/wiki/柯尼斯堡七桥问题

    • Similarly, an Eulerian circuit or Eulerian cycle is an Eulerian trail which starts and ends on the same vertex.

5.1 Euler’s Theorem on directed graphs

  • For directed graphs, the cycle will need to follow the direction of the edges

    • Indegree(v) = # edges coming into v
    • Outdegree(v) = # edges leaving v
  • A directed graph has an Eulerian path if and only if:
    indegree(v)=outdegree(v) for all but 2 nodes (x and y), where indegree(x)=outdegree(x)+1, and indegree(y)=outdegree(y)-1.

    至多有一个节点 进>出,至多有一个节点 出>进

  • A directed graph has an Eulerian cycle if and only if indegree(v)=outdegree(v) for all nodes

    所有节点的degree数各自相等

5.2 Euler’s Theorem on undirected graphs

  • For undirected graphs, the degree of a vertex is the number of edges that are incident to the vertex

  • An undirected graph has an Eulerian trail if and only if exactly zero or two vertices have odd degree

    必须有0个或2个节点的degree为奇数

  • An undirected graph has an Eulerian cycle if and only if every vertex has even degree

    所有节点的degree为偶数

5.3 The Königsberg graph is not Eulerian

For the existence of Eulerian trails

  • It is necessary that zero or two vertices have an odd degree ❌ ABCD的degree都为奇数

  • If there are no vertices of odd degree, all Eulerian trails are circuits ❌

  • If there are exactly two vertices of odd degree, all Eulerian trails start at one of them and end at the other ❌

所以欧拉认为七桥问题无解

5.4 Hamiltonian vs Eulerian path

  • Hamiltonian path: a path that travels to every node exactly once and ends at the starting node
  • Eulerian path: a path that visits all edges of a graph exactly once
  • When analyzing sequencing data, instead of assigning each k-mer contained in some read to a node, we will assign each such k-mer to an edge. This allows the construction of a 'de Bruijn graph'
  • The vertices are (k-1)-mers that appear in some read, and edges defined by overlap of k-2 nucleotides

6 Constructing the de Bruijn Graph

Create the de Bruijn graph for the following string, using k=3

TTATGCCATGGGATGAA

6.1 Error correction in de Bruijn graph

  • Erroneous reads (marked in red) introduce k false k-mers to the graph, resulting in additional spurious branches, so-called "bubbles" and "tips"

  • Influence of k-mer length on assembly

    k太大导致两条路径没有重叠 or k太小导致可能的结果太多

  • What is the appropriate k-mer?

    • Tools such as kmerGenie and Velvet Advisor have been developed to predict the optimal value for k for a given read dataset, based on read length and k-mer frequencies
    • Other strategies may include the merging of assemblies produced with different k-mer sizes or the use of a multi k-mer assembly method

6.2 A fun hypothetical case study: Frederick Sanger’s insulin sequencing study

"Sanger是个很有趣的人,没有学校要他,但是他家里很有钱,自己进实验室自己掏钱买试剂,得了诺贝尔奖后这辈子也就这么回事了,也不发文章也不看文献...."

*穷鬼爆笑

用 de Bruijn graph 的方法,也可以将二肽及三肽片段组装出胰岛素contig.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 206,214评论 6 481
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 88,307评论 2 382
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 152,543评论 0 341
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 55,221评论 1 279
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 64,224评论 5 371
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,007评论 1 284
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,313评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,956评论 0 259
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 43,441评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,925评论 2 323
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,018评论 1 333
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,685评论 4 322
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,234评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,240评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,464评论 1 261
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,467评论 2 352
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,762评论 2 345

推荐阅读更多精彩内容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 7,292评论 0 10
  • Introduction What is Bowtie 2? Bowtie 2 is an ultrafast a...
    wzz阅读 5,566评论 0 5
  • 今天要做的事 数学预习,英语作文总结,背单词,写行策论文(定主题),英语套题,商英第二章,天鹅臂加拉伸。主要是英语...
    林木之鹿阅读 252评论 0 0
  • 长大以后,团年是很难的。 母亲这边的亲戚都只能过年见一见,不用说一起聚多久,见一见已是很难。 兄弟姊妹...
    蜀素zz阅读 212评论 0 0
  • 学长,你我以后就是陌生人了。
    烧心阅读 102评论 0 1