Dragon star Day 2 Pt.2

On genome assembly, the Human Genome Project, the Seven Bridges problem, and graph theory

Dragonstar2019 by Kai Wang

Alignment of short/long-read sequencing data

Genome assembly by short/long-read sequencing

Part Ⅱ Genome assembly by short/long-read sequencing

1 genome assembly

1.1 What is genome assembly

A genome assembly is the genome sequence produced after chromosomes have been fragmented, those fragments have been sequenced, and the resulting sequences have been put back together.

1.2 Why genome assembly

Key to understanding genetic variation
The more accurate the reference genome, the easier it is to map reads and interpret the functional impacts of genetic mutations
Accurate and complete genome assembly is important for human genetic studies

1.3 What types of genome assembly?

1.3.1 de novo assembly

The entire assembled sequence is resolved from raw sequence data without comparison to a reference genome sequence.

Assemblers need copious sequencing data and informatic exertion to put the genome back together. Image: Kelly Howe, Lawrence Berkeley Laboratory

Involves sequencing a novel genome from scratch, without the aid of external data.

https://era7bioinformatics.com/en/page.cfm?id=1500

1.3.2 Comparative assembly

You have a “reference” genome to guide the assembly process
For example, sequencing a chimp genome using the existing human assembly as a guide

Each read is aligned to the reference genome. In this approach, sequences that are novel or highly divergent from the reference are lost.

1.3.3 De Novo Assembly paradigms

Overlap-layout-consensus (OLC) methods
The overlap-layout-consensus (OLC) method is rather intuitive and consists of a three-step process.
1. pairwise overlaps
2. layout the reads
3. consensus
A General bioinformatics background of sequence assembly. https://doi.org/10.1371/journal.pone.0169662.s001
OLC - String Graph Assemblers
- Construct overlap graph directly from reads, eliminating redundant reads; trace path for assembly
- Examples: SGA, Fermi
https://ocw.mit.edu/courses/biology/7-91j-foundations-of-computational-and-systems-biology-spring-2014/lecture-slides/MIT7_91JS14_Lecture6.pdf
de Bruijn (k-mer) graph
- Especially useful for assembly from short reads
- Stacking overlapping sequences of genomic fragments of a defined size (the k-mer), generated by breaking each read into k-mer size.
de Bruijn graphs were originally proposed as a solution to the so-called "superstring problem" in 1946.

A General bioinformatics background of sequence assembly. https://doi.org/10.1371/journal.pone.0169662.s001
OLC versus de Bruijn graph
- OLC: pairwise alignments between reads are detected and merged
- de Bruijn: reads are decomposed into k-mers and merged
Chaisson et al, Nature Reviews Genetics, 2015

2 Graph theory

Graph
- Mathematical structures used to model pairwise relations between objects
- A graph is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines).
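Graph theory is a branch of combinatorics, closely related to other branches of mathematics such as group theory, matrix theory, and topology. Graphs are its main object of study. A graph is a figure made up of a given set of vertices and the edges connecting pairs of vertices; it is typically used to describe a particular relationship among things. Vertices represent the things, and an edge connecting two vertices indicates that the two things are related in that way.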

https://zh.wikipedia.org/wiki/图论

A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically.

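That is, the edges of an undirected graph connect two vertices symmetrically, while the edges of a directed graph connect two vertices asymmetrically.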

2.1 Directed graph

A directed graph is an ordered pair G = (V, E) comprising:

V a set of vertices (also called nodes or points);

E ⊆ {(x, y) | (x, y) ∈ V² ∧ x ≠ y} a set of edges

https://en.wikipedia.org/wiki/Graph_theory#Directed_graph

G(V, E): a set of vertices (V) and a set of directed edges (E)
Vertex and edge may have weights associated with them
Here, V = {Read1, Read2, Read3}, E = {(Read1, Read2), (Read2, Read3), (Read3, Read1)} with weights of 5, 6 and 4.
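
As a minimal sketch, the example graph can be stored in Python as a dictionary that maps each directed edge to its weight. Pairing the weights 5, 6 and 4 with the edges in the order listed is an assumption, and the helper name `degrees` is hypothetical.

```python
from collections import Counter

# Vertices and weighted directed edges of the example graph.
# Assumption: the weights 5, 6, 4 belong to the edges in the order listed.
V = {"Read1", "Read2", "Read3"}
E = {
    ("Read1", "Read2"): 5,
    ("Read2", "Read3"): 6,
    ("Read3", "Read1"): 4,
}

def degrees(edges):
    """Return (indegree, outdegree) counters for a collection of directed edges."""
    indeg, outdeg = Counter(), Counter()
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return indeg, outdeg

indeg, outdeg = degrees(E)
for v in sorted(V):
    print(v, "in:", indeg[v], "out:", outdeg[v])
```

Keying the dictionary by `(source, target)` tuples keeps the direction of each edge explicit, which matters later for the indegree/outdegree conditions.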

2.2 Overlap-based approach (OLC)

Overlap graph
- Nodes are initially formed by the sequences of the individual reads
- Edges are represented by the sequence overlaps between these reads.
Constructing the overlap graph
- Create the overlap graph for the following string, where overlap means suffix/prefix match of >=5bp
  
  TTATGCCATGGGATGAA
- Assume that we generated a few short reads from this string: TTATGCCATG, CCATGGGATG, TGGGATGAA
- Vertices/nodes are formed by the sequences of the individual reads; edges are represented by the sequence overlaps between these reads
- Sometimes graphs can be reduced by eliminating transitive edges
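
The construction above can be sketched in a few lines of Python. This is a naive illustration rather than how a production assembler computes overlaps; the function names `overlap`, `overlap_graph` and `remove_transitive_edges` are hypothetical, and only exact suffix/prefix matches are considered.

```python
from itertools import permutations

def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)) - 1, max(min_len, 1) - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len):
    """Map each directed edge (a, b) to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            graph[(a, b)] = n
    return graph

def remove_transitive_edges(graph):
    """Drop edge (a, c) whenever edges (a, b) and (b, c) both exist."""
    nodes = {n for edge in graph for n in edge}
    return {
        (a, c): w
        for (a, c), w in graph.items()
        if not any((a, b) in graph and (b, c) in graph for b in nodes)
    }

reads = ["TTATGCCATG", "CCATGGGATG", "TGGGATGAA"]
graph = overlap_graph(reads, min_len=5)
for (a, b), n in graph.items():
    print(f"{a} -> {b} (overlap {n})")
```

With the 5 bp threshold the three reads form a simple chain. Lowering the threshold to 2 bp adds a short edge from the first read to the third, which `remove_transitive_edges` then eliminates because the same connection is already implied by the chain.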

3 Comparison of two approaches for large genomes such as human genomes

3.1 U.S. Human Genome Project

Primary goals were to
- Discover the complete set of human genes and make them accessible for further biological study
- Determine the complete sequence of DNA bases in the human genome
Two competing strategies: whole-genome shotgun versus physical mapping followed by BAC-by-BAC sequencing

3.2 Draft human genome

Two competing efforts:

Public effort: BAC based sequencing - Performed by a consortium of government labs and universities.
- Break genome into ~100-300kb pieces
- Create Bacterial Artificial Chromosomes (BACs) from each piece
- Assemble each piece by Sanger sequencing
- Then stitch overlapping BACs together to draft genome
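Private effort: whole genome random shotgun sequencing - Performed by Celera, a private company.
- In 1998, Craig Venter announced that he was forming Celera, a company that would sequence the human genome within three years

The two sides traded jabs in print. Spurred by the competition, they published their results in February 2001, in Nature (2001.02.15) and Science (2001.02.16) respectively, just one day apart and two years ahead of the original schedule.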

3.3 How Perl Saved the Human Genome Project

Random web quotes: “Perl and the human genome are almost perfectly matched; both are almost incomprehensible, with no central design, accreted haphazardly over a long time.”

"Perl became popular overnight, and was abandoned overnight once the human genome was finished." 🌚

3.4 Back to today

Gigabase genomes can be easily sequenced and assembled using PacBio/Nanopore long-read assemblers
Whole-genome shotgun is the predominant approach used today for assembling these large genomes

4 Lander-Waterman statistics in genome assembly

Let

N = # of reads,

L = length of a read (average length),

G = genome length,

c = coverage = $N L / G$
Main questions to address:
- What fraction of the genome is covered by reads?
- How many contigs are generated?

4.1 Coverage at a position can be modelled by Poisson distribution

$P(k, \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$

Probability that a base is NOT covered = $P(0,c) = e^{-c}$
For genome size G, # of uncovered bases = $G \cdot P(0,c) = G e^{-c}$

4.2 What fraction of the genome is covered by reads?

Equivalent to "how many positions have coverage > 0?"
For genome size G, number of positions with read coverage:

$G \cdot (1 - P(0,c)) = G (1 - e^{-c})$
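
To get a feel for the numbers, here is a small sketch that evaluates the Poisson formulas of 4.1 and 4.2. The values of G, L and N are hypothetical and chosen only so that the coverage comes out to c = 10; the function names are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(k, lambda) = e^(-lambda) * lambda^k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

def coverage(n_reads, read_len, genome_len):
    """c = N * L / G"""
    return n_reads * read_len / genome_len

# Hypothetical numbers, for illustration only.
G = 3_000_000_000   # genome length
L = 150             # (average) read length
N = 200_000_000     # number of reads

c = coverage(N, L, G)
uncovered = G * poisson_pmf(0, c)          # G * e^(-c)
covered_fraction = 1 - poisson_pmf(0, c)   # 1 - e^(-c)

print(f"c = {c:.1f}")
print(f"expected uncovered bases = {uncovered:.0f}")
print(f"expected covered fraction = {covered_fraction:.6f}")
```

Even at this coverage the model predicts that some bases remain uncovered, which is one reason the question of how many contigs result is worth asking separately.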

4.3 How many contigs are generated?

Equivalent to "how many gaps are there?" or "how many reads have no overlap > θ with any other read?" (θ is the minimum overlap, expressed as a fraction of the read length L)

https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/sequencing.pdf
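For each read, we calculate the probability that no other read starts within its first (1-θ)L bases, i.e. that the read ends a contig:

$e^{-(1-\theta)c} = e^{-(1-\theta)(NL/G)}$

Because: $P(0,c) = e^{-c}$, $c = NL/G$

Multiplying by the N reads gives the expected number of contigs: $N e^{-(1-\theta)c}$
Theory is different from reality!

If there were no repeats, no polymorphisms, no region bias, and no sequencing errors, sequence assembly could be very easy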
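
In the Lander-Waterman model, multiplying the probability above by the number of reads N gives the expected number of contigs, $N e^{-(1-\theta)c}$. The sketch below evaluates it for a few read counts; the genome size, read length and θ are hypothetical values, and the function name `expected_contigs` is illustrative.

```python
import math

def expected_contigs(n_reads, read_len, genome_len, theta):
    """Expected contig count: N * e^(-(1 - theta) * c), with c = N * L / G."""
    c = n_reads * read_len / genome_len
    return n_reads * math.exp(-(1 - theta) * c)

# Hypothetical parameters, for illustration only.
G, L, theta = 1_000_000, 100, 0.2

for N in (10_000, 50_000, 100_000):
    c = N * L / G
    print(f"N = {N:>7}, c = {c:4.1f}, expected contigs = {expected_contigs(N, L, G, theta):.1f}")
```

The count first grows with N (more reads, more isolated islands) and then falls as the islands merge, which is the typical shape of the Lander-Waterman curve.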

https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/sequencing.pdf

4.4 Assembling large genomes today

Since 2013, de novo assembly of large genomes has shifted from short-read sequencing to synthetic or true long-read sequencing.

Jung et al, Trends in Plant Science, 2019

4.5 Canu assembler for long-read assembly

Software developed on the basis of the Celera Assembler

Koren et al, Genome Research, 2017

4.6 Wtdbg2 assembler for long-read assembly

Wtdbg2 groups 256 base pairs into a bin (rather than working directly on individual A/T/C/G bases).

Ruan et al, BioRxiv, 2019

5 Seven Bridges of Königsberg Problem

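The Seven Bridges of Königsberg is a famous problem in graph theory. It is based on a real-life setting: the city of Königsberg in East Prussia (today Kaliningrad, Russia) lay on both banks of the Pregolya River, with two small islands in the middle of the river. Seven bridges connected the islands and the two banks. If each bridge may be crossed only once, is there a walk that crosses every bridge?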

https://zh.wikipedia.org/wiki/柯尼斯堡七桥问题

In 1736, the Swiss mathematician Leonhard Euler demonstrated that the answer is no (yes, the same Euler as in Euler's formula from advanced math class)
In graph theory
- An Eulerian trail (or Eulerian path) is a trail in a finite graph which visits every edge exactly once (allowing for revisiting vertices).
  
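  In 1735 Leonhard Euler argued that the problem cannot be solved satisfactorily. The following year he published the paper "The Seven Bridges of Königsberg", proving that no walk meeting the conditions exists, and in passing posed and solved the one-stroke drawing problem.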
  
  https://zh.wikipedia.org/wiki/柯尼斯堡七桥问题
- Similarly, an Eulerian circuit or Eulerian cycle is an Eulerian trail which starts and ends on the same vertex.

5.1 Euler’s Theorem on directed graphs

For directed graphs, the cycle will need to follow the direction of the edges
- Indegree(v) = # edges coming into v
- Outdegree(v) = # edges leaving v
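A connected directed graph has an Eulerian path if and only if:
indegree(v)=outdegree(v) for all nodes, or for all but 2 nodes (x and y), where indegree(x)=outdegree(x)+1, and indegree(y)=outdegree(y)-1.

That is, at most one node has in > out, and at most one node has out > in (the path starts at y and ends at x)
A connected directed graph has an Eulerian cycle if and only if indegree(v)=outdegree(v) for all nodes

That is, every node's indegree equals its own outdegree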
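
The degree conditions for directed graphs can be checked mechanically. The following is a minimal sketch with a hypothetical function name `eulerian_type`; it only inspects degrees and assumes that all edges belong to a single connected component.

```python
from collections import Counter

def eulerian_type(edges):
    """Classify a directed graph given as (src, dst) pairs.

    Returns "cycle", "path" or "none". Assumes the edges form one
    connected component; only the degree conditions are checked.
    """
    balance = Counter()  # indegree(v) - outdegree(v)
    for src, dst in edges:
        balance[src] -= 1
        balance[dst] += 1
    unbalanced = sorted(b for b in balance.values() if b != 0)
    if not unbalanced:
        return "cycle"   # indegree == outdegree everywhere
    if unbalanced == [-1, 1]:
        return "path"    # one start node (out = in + 1), one end node (in = out + 1)
    return "none"

print(eulerian_type([("Read1", "Read2"), ("Read2", "Read3"), ("Read3", "Read1")]))
print(eulerian_type([("a", "b"), ("b", "c")]))
print(eulerian_type([("a", "b"), ("a", "c")]))
```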

5.2 Euler’s Theorem on undirected graphs

For undirected graphs, the degree of a vertex is the number of edges that are incident to the vertex
A connected undirected graph has an Eulerian trail if and only if exactly zero or two vertices have odd degree

That is, the number of nodes with odd degree must be 0 or 2
A connected undirected graph has an Eulerian cycle if and only if every vertex has even degree

That is, every node has even degree

5.3 The Königsberg graph is not Eulerian

For the existence of Eulerian trails

It is necessary that zero or two vertices have an odd degree ❌ all four vertices A, B, C and D have odd degree
If there are no vertices of odd degree, all Eulerian trails are circuits ❌
If there are exactly two vertices of odd degree, all Eulerian trails start at one of them and end at the other ❌

So Euler concluded that the Seven Bridges problem has no solution
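
The degree argument is easy to verify by counting. In the sketch below the seven bridges are written as an undirected multigraph; which land mass gets which of the labels A to D is an assumption made for illustration (A = central island, B = north bank, C = south bank, D = eastern island).

```python
from collections import Counter

# Seven bridges as undirected edges (parallel edges allowed).
# Assumed labelling: A = central island, B = north bank,
# C = south bank, D = eastern island.
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd = sorted(v for v, d in degree.items() if d % 2 == 1)
has_eulerian_trail = len(odd) in (0, 2)

print(dict(degree))
print("odd-degree vertices:", odd)
print("Eulerian trail possible:", has_eulerian_trail)
```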

5.4 Hamiltonian vs Eulerian path

Hamiltonian path: a path that visits every node exactly once (a Hamiltonian cycle additionally ends at the starting node)
Eulerian path: a path that visits all edges of a graph exactly once
When analyzing sequencing data, instead of assigning each k-mer contained in some read to a node, we will assign each such k-mer to an edge. This allows the construction of a 'de Bruijn graph'
The vertices are the (k-1)-mers that appear in some read, and edges are defined by an overlap of k-2 nucleotides between two vertices

6 Constructing the de Bruijn Graph

Create the de Bruijn graph for the following string, using k=3

TTATGCCATGGGATGAA
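
A minimal sketch of this exercise: each 3-mer becomes an edge between its 2-mer prefix and 2-mer suffix, and an Eulerian path through the edges spells out an assembly. Taking the k-mers directly from the string (one edge per occurrence) instead of from reads is a simplification, the function names are hypothetical, and `eulerian_path` uses Hierholzer's algorithm under the assumption that an Eulerian path exists.

```python
from collections import Counter, defaultdict

def kmer_list(seq, k):
    """All k-mers of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn_edges(seq, k):
    """One edge per k-mer occurrence: (k-1)-mer prefix -> (k-1)-mer suffix."""
    return [(kmer[:-1], kmer[1:]) for kmer in kmer_list(seq, k)]

def eulerian_path(edges):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    adj = defaultdict(list)
    balance = defaultdict(int)  # outdegree - indegree
    for u, v in edges:
        adj[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())    # dead end: node is final in the path
    return path[::-1]

genome = "TTATGCCATGGGATGAA"
k = 3
edges = de_bruijn_edges(genome, k)
path = eulerian_path(edges)
assembled = path[0] + "".join(node[-1] for node in path[1:])

print(assembled)
print(Counter(kmer_list(assembled, k)) == Counter(kmer_list(genome, k)))
```

Because ATG occurs three times in the string, the graph has more than one Eulerian path, so the spelled-out sequence is guaranteed to use exactly the same 3-mers but is not guaranteed to equal the original string. This is the repeat problem in miniature.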

6.1 Error correction in de Bruijn graph

A sequencing error in a read (marked in red) introduces up to k false k-mers to the graph, resulting in additional spurious branches, so-called "bubbles" and "tips"
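
The effect of a single error on the k-mer set can be illustrated with a toy example. The erroneous read below is hypothetical (the read CCATGGGATG with one G to A substitution), and the example string from the construction exercise stands in for the true genome.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def false_kmers(read, reference, k):
    """k-mers present in the read but absent from the reference."""
    return kmer_set(read, k) - kmer_set(reference, k)

genome = "TTATGCCATGGGATGAA"
read_ok = "CCATGGGATG"
read_err = "CCATGAGATG"  # hypothetical: one substitution at index 5

for k in (3, 5):
    print(k, sorted(false_kmers(read_err, genome, k)))
```

Every k-mer that spans the erroneous base is affected, so at most k new k-mers appear; some of them may coincide with genuine k-mers, which is why the count can be lower than k for small k.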
Influence of k-mer length on assembly

If k is too large, neighbouring paths no longer overlap; if k is too small, there are too many possible reconstructions
What is the appropriate k-mer?
- Tools such as KmerGenie and Velvet Advisor have been developed to predict the optimal value of k for a given read dataset, based on read length and k-mer frequencies
- Other strategies may include the merging of assemblies produced with different k-mer sizes or the use of a multi k-mer assembly method
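
The trade-off in choosing k can be made concrete on a toy dataset. The sketch below builds a de Bruijn graph from three short reads for several values of k and reports two rough indicators: nodes with more than one distinct successor (ambiguity) and weakly connected components (fragmentation). The function name `graph_stats` and the choice of these two indicators are assumptions for illustration, not what KmerGenie or Velvet Advisor compute.

```python
def graph_stats(reads, k):
    """Return (branching nodes, connected components) of the de Bruijn graph."""
    succ = {}  # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            u, v = read[i:i + k - 1], read[i + 1:i + k]
            succ.setdefault(u, set()).add(v)
            succ.setdefault(v, set())
    branching = sum(1 for targets in succ.values() if len(targets) > 1)

    # Count weakly connected components with a small union-find.
    parent = {n: n for n in succ}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, targets in succ.items():
        for v in targets:
            parent[find(u)] = find(v)
    components = len({find(n) for n in succ})
    return branching, components

reads = ["TTATGCCATG", "CCATGGGATG", "TGGGATGAA"]
for k in (3, 4, 5, 7, 9):
    branching, components = graph_stats(reads, k)
    print(f"k = {k}: branching nodes = {branching}, components = {components}")
```

On these reads, small k gives one connected graph with several branch points, while large k removes the branches but splits the graph into disconnected pieces because neighbouring reads no longer share a (k-1)-mer.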

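6.2 A fun hypothetical case study: Frederick Sanger’s insulin sequencing study

"Sanger was a really interesting guy. No school wanted him, but his family was wealthy, so he joined a lab and paid for his reagents out of his own pocket. After he won the Nobel Prize, that was more or less it for the rest of his life: no more papers, no more reading the literature...."

*The broke among us burst out laughing

Using the de Bruijn graph approach, the dipeptide and tripeptide fragments can likewise be assembled into an insulin contig.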
