Dragon star Day 2 Pt.2
On genome assembly, the Human Genome Project, the Seven Bridges problem, and graph theory
- Alignment of short/long-read sequencing data
- Genome assembly by short/long-read sequencing
Part Ⅱ Genome assembly by short/long-read sequencing
1 genome assembly
1.1 What is genome assembly
- Genome sequence produced after chromosomes have been fragmented
- Those fragments have been sequenced
- The resulting sequences have been put back together
1.2 Why genome assembly
- Key to understanding genetic variation
- The more accurate the reference genome, the easier it is to map reads and interpret the functional impacts of genetic mutations
- Accurate and complete genome assembly is important for human genetic studies
1.3 What types of genome assembly?
1.3.1 de novo assembly
The entire assembled sequence is resolved from raw sequence data without comparison to a reference genome sequence.
Assemblers need copious sequencing data and informatic exertion to put the genome back together. Image: Kelly Howe, Lawrence Berkeley Laboratory
Involves sequencing a novel genome from scratch, without the aid of external data.
1.3.2 Comparative assembly
- You have a "reference" genome to guide the assembly process
- For example, sequencing a chimp genome using the existing human assembly as a guide
Each read is aligned to the reference genome. In this approach, novel sequences that differ completely from the reference are lost.
1.3.3 De Novo Assembly paradigms
- Overlap-layout-consensus (OLC) methods
The overlap-layout-consensus (OLC) method is rather intuitive and consists of a three-step process.
- pairwise overlaps
- layout the reads
- consensus
A General bioinformatics background of sequence assembly. https://doi.org/10.1371/journal.pone.0169662.s001
OLC - String Graph Assemblers
- Construct overlap graph directly from reads, eliminating redundant reads; trace path for assembly
- Examples: SGA, Fermi
- de Bruijn (k-mer) graph
Especially useful for assembly from short reads
Each read is broken into overlapping fragments of a defined size k (the k-mers), and the assembly is built by chaining k-mers that overlap one another.
de Bruijn graphs were originally proposed as a solution to the so-called "superstring problem" in 1946.
A General bioinformatics background of sequence assembly. https://doi.org/10.1371/journal.pone.0169662.s001
- OLC versus de Bruijn graph
- OLC: pairwise overlaps between reads are detected and the overlapping reads are merged
- de Bruijn: reads are decomposed into k-mers, and identical k-mers are merged
2 Graph theory
- Graph
- Mathematical structures used to model pairwise relations between objects
- A graph is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines).
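Graph theory is a branch of combinatorics closely related to other branches of mathematics such as group theory, matrix theory, and topology. Graphs are its main object of study. A graph is a figure made of a given set of vertices and the edges connecting pairs of vertices; it is typically used to describe a particular relationship among certain things. Vertices represent the things, and an edge connecting two vertices indicates that the two things have that relationship.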
A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically.
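That is, an edge in an undirected graph joins its two vertices symmetrically, while an edge in a directed graph joins them asymmetrically.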
2.1 Directed graph
A directed graph is an ordered pair G = (V, E) comprising:
G(V, E): a set of vertices (V) and a set of directed edges (E)
Vertices and edges may have weights associated with them
- Here, V = {Read1, Read2, Read3}, E = {(Read1, Read2), (Read2, Read3), (Read3, Read1)} with weights of 5, 6 and 4.
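A minimal sketch of how the example graph could be held in code. The dictionary-of-edges representation and the helper names are illustrative choices, and the weights 5, 6 and 4 are assumed to belong to the edges in the order they are listed.

```python
# Directed, weighted graph G = (V, E) for the three-read example.
# Assumption: weights 5, 6, 4 are assigned to the edges in listed order.
V = {"Read1", "Read2", "Read3"}
E = {
    ("Read1", "Read2"): 5,
    ("Read2", "Read3"): 6,
    ("Read3", "Read1"): 4,
}

def successors(vertex, edges):
    """Return {target: weight} for every directed edge leaving `vertex`."""
    return {dst: w for (src, dst), w in edges.items() if src == vertex}

def indegree(vertex, edges):
    """Number of edges coming into `vertex`."""
    return sum(1 for (_, dst) in edges if dst == vertex)

def outdegree(vertex, edges):
    """Number of edges leaving `vertex`."""
    return sum(1 for (src, _) in edges if src == vertex)

print(successors("Read1", E))  # {'Read2': 5}
```

Because the edges are ordered pairs, (Read1, Read2) is present but (Read2, Read1) is not, which is exactly what makes the graph directed.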
2.2 Overlap-based approach (OLC)
- Overlap graph
- Nodes are initially formed by the sequences of the individual reads
- Edges are represented by the sequence overlaps between these reads.
- Constructing the overlap graph
- Create the overlap graph for the following string, where overlap means a suffix/prefix match of >= 5 bp
TTATGCCATGGGATGAA
Assume that we generated a few short reads from this string: TTATGCCATG, CCATGGGATG, TGGGATGAA
Vertices/nodes are formed by the sequences of the individual reads; edges are represented by the sequence overlaps between these reads
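The construction can be sketched in a few lines. This is a naive all-pairs comparison for illustration only (real assemblers use indexes to avoid comparing every pair); the function names are hypothetical.

```python
def overlap_length(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.
    Returns 0 if no such match of at least `min_len` bases exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def build_overlap_graph(reads, min_len=5):
    """Return {(read_a, read_b): overlap_length} for all ordered read pairs
    whose suffix/prefix overlap is at least `min_len`."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap_length(a, b, min_len)
                if n:
                    graph[(a, b)] = n
    return graph

reads = ["TTATGCCATG", "CCATGGGATG", "TGGGATGAA"]
for (a, b), n in build_overlap_graph(reads).items():
    print(f"{a} -> {b} (overlap {n})")
```

For these three reads the graph has two edges: the first read overlaps the second by 5 bp (CCATG) and the second overlaps the third by 7 bp (TGGGATG). The first and third reads share only a 2 bp suffix/prefix match, which is below the threshold.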
- Sometimes graphs can be reduced by eliminating transitive edges
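A simplified sketch of transitive edge removal: an edge (u, w) is dropped when some intermediate node v provides the two-step route u -> v -> w. It ignores overlap lengths, which production string graph assemblers do take into account, so treat it as an illustration of the idea rather than a real reduction algorithm.

```python
def remove_transitive_edges(edges):
    """Drop edge (u, w) whenever edges (u, v) and (v, w) both exist
    for some other node v. `edges` is an iterable of (src, dst) pairs."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    reduced = set()
    for (u, w) in edge_set:
        implied = any(
            (u, v) in edge_set and (v, w) in edge_set
            for v in nodes
            if v not in (u, w)
        )
        if not implied:
            reduced.add((u, w))
    return reduced

# A overlaps B, B overlaps C, and A also overlaps C directly.
edges = {("A", "B"), ("B", "C"), ("A", "C")}
print(sorted(remove_transitive_edges(edges)))  # [('A', 'B'), ('B', 'C')]
```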
3 Comparison of two approaches for large genomes such as human genomes
3.1 U.S. Human Genome Project
- Primary goals were to
- Discover the complete set of human genes and make them accessible for further biological study
- Determine the complete sequence of DNA bases in the human genome
Two schools of thought: whole-genome shotgun versus physical mapping followed by BAC-by-BAC sequencing
3.2 Draft human genome
Two competing efforts:
- Public effort: BAC-based sequencing, performed by a consortium of government labs and universities.
- Break genome into ~100-300kb pieces
- Create Bacterial Artificial Chromosomes (BACs) from each piece
- Assemble each piece by Sanger sequencing
- Then stitch overlapping BACs together to draft genome
- Private effort: whole-genome random shotgun sequencing, performed by Celera, a private company.
- In 1998, Craig Venter announced that he was forming Celera, which would sequence the human genome within three years
The two sides sparred in print. Under this competition, both published their results in 2001, in Nature (2001.02.15) and Science (2001.02.16) respectively, only one day apart and two years ahead of the original schedule.
3.3 How Perl Saved the Human Genome Project
Random web quotes: “Perl and the human genome are almost perfectly matched; both are almost incomprehensible, with no central design, accreted haphazardly over a long time.”
"Perl became popular overnight, and was abandoned just as quickly once the human genome was finished." 🌚
3.4 Back to today
- Gigabase genomes can be readily sequenced with PacBio/Nanopore and assembled using long-read assemblers
- Whole-genome shotgun is the predominant approach used today for assembling these large genomes
4 Lander-Waterman statistics in genome assembly
- Let
N = # of reads,
L = length of a read (average length),
G = genome length,
c = coverage = $NL/G$
- Main questions to address:
- What fraction of the genome is covered by reads?
- How many contigs are generated?
4.1 Coverage at a position can be modelled by Poisson distribution
- Probability that a base is NOT covered = $e^{-c}$
- For genome size G, # of uncovered bases = $G e^{-c}$
4.2 What fraction of the genome is covered by reads?
Equivalent to "how many positions have coverage > 0?"
- For genome size G, number of positions with read coverage: $G(1 - e^{-c})$
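A small numerical sketch of these quantities, assuming the standard Lander-Waterman/Poisson results: coverage c = NL/G, probability that a base is uncovered e^(-c), and covered fraction 1 - e^(-c). The function names and the example read counts are illustrative, not taken from the lecture.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Average coverage c = N * L / G."""
    return n_reads * read_len / genome_len

def fraction_covered(c):
    """Expected fraction of positions with coverage > 0 under the Poisson model."""
    return 1.0 - math.exp(-c)

def uncovered_bases(c, genome_len):
    """Expected number of bases with zero coverage: G * e^(-c)."""
    return genome_len * math.exp(-c)

# Hypothetical example: 200 million 150 bp reads on a 3 Gb genome.
G = 3_000_000_000
c = coverage(200_000_000, 150, G)
print(f"c = {c:.1f}")
print(f"fraction covered = {fraction_covered(c):.6f}")
print(f"expected uncovered bases = {uncovered_bases(c, G):.0f}")
```

Even at 10x coverage the model still leaves some bases uncovered on a genome of this size, which is one reason real projects sequence deeper than the bare minimum.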
4.3 How many contigs are generated?
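- Equivalent to "how many gaps are there?" or "how many reads do not have an overlap > θ with any other read?", where θ is the minimum detectable overlap expressed as a fraction of the read length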
https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/sequencing.pdf
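- For each read, we calculate the probability that zero other reads start within its first $(1-\theta)L$ bases, i.e. that the read ends a contig: $e^{-c(1-\theta)}$
This is because read starts occur at rate $N/G$ per base, so the number of reads starting in a window of length $(1-\theta)L$ is Poisson with mean $N(1-\theta)L/G = c(1-\theta)$. The expected number of contigs is therefore $N e^{-c(1-\theta)}$.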
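A short sketch of the contig-count estimate, assuming the standard Lander-Waterman result that the expected number of contigs is N * e^(-c(1-θ)), with c = NL/G and θ the minimum detectable overlap as a fraction of read length. The function name and the example numbers are hypothetical.

```python
import math

def expected_contigs(n_reads, read_len, genome_len, theta):
    """Expected number of contigs: N * exp(-c * (1 - theta)),
    where c = N * L / G is the average coverage."""
    c = n_reads * read_len / genome_len
    return n_reads * math.exp(-c * (1.0 - theta))

# Hypothetical example: 1 Mb genome, 100 bp reads, overlap fraction 0.25.
G, L, theta = 1_000_000, 100, 0.25
for n_reads in (10_000, 50_000, 100_000, 200_000):
    c = n_reads * L / G
    print(f"c = {c:5.1f}  expected contigs = {expected_contigs(n_reads, L, G, theta):10.2f}")
```

At low coverage adding reads increases the number of contigs, because most reads are isolated; past a certain depth the contigs begin to merge and the count falls.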
- Theory is different from reality!
If there were no repeats, no polymorphisms, no regional coverage bias, and no sequencing errors, sequence assembly would be very easy
https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/sequencing.pdf
4.4 Assembling large genomes today
Since 2013, de novo assembly of large genomes has shifted from short-read sequencing to synthetic or true long-read sequencing.
4.5 Canu assembler for long-read assembly
Built on software developed at Celera (Canu is a fork of the Celera Assembler)
4.6 Wtdbg2 assembler for long-read assembly
Wtdbg2 groups 256 base pairs into a bin (it does not work directly on individual A/T/C/G bases).
5 Seven Bridges of Königsberg Problem
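The Seven Bridges of Königsberg is a famous problem in graph theory. It is based on a real-life situation: the city of Königsberg in East Prussia (today Kaliningrad, Russia) straddled both banks of the Pregel (Pregolya) River, with two islands in the middle of the river. Seven bridges connected the islands and the two banks. Is there a walk that crosses every bridge, using each bridge exactly once?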
https://zh.wikipedia.org/wiki/柯尼斯堡七桥问题
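In 1736, the Swiss mathematician Leonhard Euler demonstrated that the answer is no (the same Euler as in Euler's formula from calculus class)
- In graph theory
- An Eulerian trail (or Eulerian path) is a trail in a finite graph which visits every edge exactly once (allowing for revisiting vertices).
Leonhard Euler stated in 1735 that the problem has no satisfactory solution. The following year, in his paper "The Seven Bridges of Königsberg", he proved that no walk meeting the conditions exists, and along the way posed and solved the problem of drawing a figure in a single stroke.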
https://zh.wikipedia.org/wiki/柯尼斯堡七桥问题
- Similarly, an Eulerian circuit or Eulerian cycle is an Eulerian trail which starts and ends on the same vertex.
5.1 Euler’s Theorem on directed graphs
- For directed graphs, the cycle will need to follow the direction of the edges
- Indegree(v) = # edges coming into v
- Outdegree(v) = # edges leaving v
- A connected directed graph has an Eulerian path if and only if:
indegree(v) = outdegree(v) for all but at most 2 nodes (x and y), where indegree(x) = outdegree(x) + 1 and indegree(y) = outdegree(y) - 1. That is, at most one node has more incoming than outgoing edges, and at most one node has more outgoing than incoming edges.
- A connected directed graph has an Eulerian cycle if and only if indegree(v) = outdegree(v) for all nodes
Every node's indegree equals its own outdegree.
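The degree conditions translate directly into a check. This sketch tests only the indegree/outdegree conditions and assumes the graph is connected; the function name and return labels are illustrative.

```python
from collections import Counter

def euler_status_directed(edges):
    """Classify a directed multigraph given as a list of (src, dst) edges.
    Returns 'cycle', 'path', or 'none' from the degree conditions alone.
    Assumption: the graph is connected (this is not verified here)."""
    indeg, outdeg = Counter(), Counter()
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    nodes = set(indeg) | set(outdeg)
    # Keep outdegree - indegree only for the unbalanced nodes.
    diffs = sorted(outdeg[v] - indeg[v] for v in nodes if outdeg[v] != indeg[v])
    if not diffs:
        return "cycle"   # every node is balanced
    if diffs == [-1, 1]:
        return "path"    # one start node (+1) and one end node (-1)
    return "none"

print(euler_status_directed([("A", "B"), ("B", "C"), ("C", "A")]))  # cycle
print(euler_status_directed([("A", "B"), ("B", "C")]))              # path
print(euler_status_directed([("A", "B"), ("A", "C")]))              # none
```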
5.2 Euler’s Theorem on undirected graphs
For undirected graphs, the degree of a vertex is the number of edges that are incident to the vertex
- An undirected graph has an Eulerian trail if and only if exactly zero or two vertices have odd degree
The number of odd-degree nodes must be 0 or 2.
- An undirected graph has an Eulerian cycle if and only if every vertex has even degree
All nodes must have even degree.
5.3 The Königsberg graph is not Eulerian
For the existence of Eulerian trails
It is necessary that zero or two vertices have an odd degree ❌ (all four vertices A, B, C, and D have odd degree)
If there are no vertices of odd degree, all Eulerian trails are circuits ❌
If there are exactly two vertices of odd degree, all Eulerian trails start at one of them and end at the other ❌
Hence Euler concluded that the Seven Bridges problem has no solution.
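The argument can be checked by counting degrees. In this sketch the four land masses are labelled A to D with A as the central island; which letter denotes which land mass is an assumption, but the degree multiset (5, 3, 3, 3) does not depend on the labelling.

```python
from collections import Counter

# Seven bridges as undirected edges between four land masses.
# Assumption: A is the central island; the labelling is illustrative.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

def degrees(edges):
    """Degree of each vertex in an undirected multigraph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def euler_status_undirected(edges):
    """'cycle' if no vertex has odd degree, 'trail' if exactly two do,
    otherwise 'none'. Connectivity is assumed, not checked."""
    odd = [v for v, d in degrees(edges).items() if d % 2 == 1]
    if len(odd) == 0:
        return "cycle"
    if len(odd) == 2:
        return "trail"
    return "none"

print(dict(degrees(BRIDGES)))            # {'A': 5, 'B': 3, 'C': 3, 'D': 3}
print(euler_status_undirected(BRIDGES))  # none
```

All four vertices have odd degree, so neither an Eulerian trail nor an Eulerian cycle exists.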
5.4 Hamiltonian vs Eulerian path
- Hamiltonian path: a path that visits every node exactly once (a Hamiltonian cycle additionally returns to the starting node)
- Eulerian path: a path that visits all edges of a graph exactly once
- When analyzing sequencing data, instead of assigning each k-mer contained in some read to a node, we will assign each such k-mer to an edge. This allows the construction of a 'de Bruijn graph'
- The vertices are the (k-1)-mers that appear in some read, and each edge (a k-mer) joins two (k-1)-mers that overlap by k-2 nucleotides
6 Constructing the de Bruijn Graph
Create the de Bruijn graph for the following string, using k=3
TTATGCCATGGGATGAA
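A minimal sketch of the exercise: every 3-mer becomes an edge from its 2-mer prefix to its 2-mer suffix, and an Eulerian path through those edges spells a candidate assembly. The path search uses Hierholzer's algorithm, which is a swapped-in standard technique rather than something the notes specify, and it assumes an Eulerian path exists. Function names are illustrative.

```python
from collections import defaultdict

def de_bruijn_edges(sequence, k):
    """One edge per k-mer occurrence: prefix (k-1)-mer -> suffix (k-1)-mer."""
    return [(sequence[i:i + k - 1], sequence[i + 1:i + k])
            for i in range(len(sequence) - k + 1)]

def eulerian_path(edges):
    """Hierholzer's algorithm. Assumes an Eulerian path exists."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # outdegree - indegree per node
    for src, dst in edges:
        out_edges[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]

def spell(path):
    """Concatenate a node path into the sequence it spells."""
    return path[0] + "".join(node[-1] for node in path[1:])

edges = de_bruijn_edges("TTATGCCATGGGATGAA", 3)
print(len(edges), "edges,", len({n for e in edges for n in e}), "nodes")
print(spell(eulerian_path(edges)))
```

The string yields 15 edges over 10 distinct 2-mer nodes. Because ATG occurs three times, more than one Eulerian path exists, so the spelled sequence is guaranteed to contain exactly the same 3-mers as the input but need not equal it. This is the repeat ambiguity that makes real assembly hard.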
6.1 Error correction in de Bruijn graph
- Each sequencing error in a read (marked in red) introduces up to k false k-mers to the graph, resulting in additional spurious branches, so-called "bubbles" and "tips"
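A quick illustration of why one error creates up to k false k-mers: every k-mer window that spans the erroneous base changes. The erroneous read below is a hypothetical example (a single T to C substitution at 0-based index 8 of the example string), and the helper names are illustrative.

```python
def kmer_set(seq, k):
    """All distinct k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def false_kmers(true_seq, read, k):
    """k-mers present in `read` but absent from the true sequence."""
    return kmer_set(read, k) - kmer_set(true_seq, k)

true_seq = "TTATGCCATGGGATGAA"
bad_read = "TTATGCCACGGGATGAA"  # hypothetical substitution T -> C at index 8

print(sorted(false_kmers(true_seq, bad_read, 3)))  # ['ACG', 'CAC', 'CGG']
```

With k = 3 the single substitution produces three k-mers that do not occur in the true sequence. In real data such k-mers are typically seen only once or twice, which is why low-frequency k-mers are a common target for error correction.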
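- Influence of k-mer length on assembly
If k is too large, paths fail to overlap and the graph breaks apart; if k is too small, too many possible paths (candidate assemblies) arise
- What is the appropriate k-mer size?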
- Tools such as kmerGenie and Velvet Advisor have been developed to predict the optimal value for k for a given read dataset, based on read length and k-mer frequencies
- Other strategies may include the merging of assemblies produced with different k-mer sizes or the use of a multi k-mer assembly method
6.2 A fun hypothetical case study: Frederick Sanger’s insulin sequencing study
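"Sanger was a very interesting person: no school would take him, but his family was wealthy, so he joined a lab and paid for his own reagents. After winning the Nobel Prize he more or less coasted for the rest of his life, neither publishing papers nor reading the literature...."
Using the de Bruijn graph approach, the dipeptide and tripeptide fragments can likewise be assembled into an insulin contig.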