TODO: Basics + step by step + TinyURL + Pastebin
Basics:
A/C: 看Doc 理解基本概念,google + 实际工作经验 消化理解。
ETA 7号
实际完成时间:14号
Why System Design Interviews?
Points:
high-level design
guide and move the conversation forward
discussion with interviewer --> core
gather all requirements 因为interviewer 不会告诉你
Leading the conversation: candidate leads the discussion to go broad and deep -> take the interviewer with you step by step
Solving by breaking down: top-down and modularization
6.1 break problems into modules and solve them independently
6.2 each component -> sub-problem -> algorithm
6.3 NOTE:
大多数情况是没有以前6.2这么理想的情况的(得到具体solution)
真正重要的是:how you make progress on 解决问题和采用哪种strategyhandle bottleneck:
7.1 each solution is a kind of trade-off
7.2 talk about these trade-offs and to measure their impact on the system keeping all the constraints and use cases in mindSteps:
8.1 Scoping the problem: No assumptions, Ask, 限制,use cases
8.2 Abstract design: blocks of the system and the relationships between them.
8.3 Identify and address the bottlenecks by using the fundamental principles of scalable system design.Know the preference of the interviewer
9.1 focus on the right things while discussing the problem 心里有大局掌控的同时 要多问面试官是否想要多讨论一下 当前/某个 部分
Must-do.
- 花几分钟跟interviewer搞清楚 the full scope of the system
- 在完成High-level design之后,make sure that interviewer is OK with that. then move on to details --> scale
Super Important:
- NEVER assume things that are not stated!!!
LB:
LB is to distribute load to multiple resources
-
Where to ADD:
2.1 user <--> web server
2.2 web server <--> internal platform layer(e.g. app server or cache server)
2.3 internal platform layer <--> DB
Implement
3.1 Smart Clients
In a word: 纯自己develop,可以在各个layer,包括user和db server
3.2 Hardware Load Balancers
In a word: 纯硬件
3.3 Software Load Balancers
In a word: 用LB的软件,但是有时候需要software + hardware结合(如果如果无法控制user的host的port)
Cache
1. 各种Cache
1.1 Application server cache
request 到哪个node,哪个node就cache一下
pros: straightforward
cons: 如果LB是random分的request,会miss cache
1.2 Distributed cache
增加一个node来cache "cache是否available",利用hashing function,每个node上面只存自己需要负责存的cache。当查到自己这儿有相应的cache的时候,(在接到request的node去database那儿拿data前)发一个request告诉接到这个request 的node他这儿有cache
pros: 解决了application server cache的cons
cons: 如果有Missing node会很麻烦,即使利用"让多份copy of data到不同的nodes上"来解决这个con也会使整个cache变得很复杂
1.3 Global Cache
有两种global cache:
第一种的好处不用说
第二种的对于有的情况会有好处,比如:(1)cache的file很大 (2)cache的东西是static的,不希望被evict
1.4 Content Distribution Network (CDN)
个人理解(To be changed if it's wrong) 本质就是global cache
2. Cache Invalidation
三种strategies:
2.1 Write-through cache
两边(cache + DB)一起更新
cons: 特别慢,load大
2.2 Write-around cache
只存DB
cons: 慢, miss cache
2.3 Write-back cache
只存cache,定期更新
cons: 如果cache在DB更新前突然崩了,data可能会丢失
3. Cache eviction policies
FIFO, LIFO, LRU, MRU, LFU, RR
Sharding or Data Partitioning
-> break up a big database (DB) into many smaller parts
Partition Methods
1.1 Horizontal partitioning: range-based
cons: 分布不均
1.2 Vertical Partitioning: column-based
cons: 一般都得需要再partition
1.3 Directory Based Partitioning: mappingPartitioning Criteria
2.1 Key or Hash-based partitioning:
hash -> consistent hashing
2.2 List partitioning:
Only contains the content with specific values
2.3 Round-robin partitioning
2.4 Composite partitioningCommon Problems of Sharding
a. Joins and Denormalization:
problem: 不能join
solution: denormalization -> con: have to deal with data inconsistency(caused by denormalization)
b. Referential integrity:
problem: not support foreign key constraint
solution: application needs to (1)handle itself (2)clean up dangling references
c. Rebalancing:
problem: 有时候各种原因会导致各个shard间unbalance
solution: Hash-based partitioning with consistent hashing
Index
- used to improve the speed of data retrieval operations on the data store
- It's a data structure
Proxies
Used to filter requests or log requests or transform requests
?As a cache?
Collapse the same requests from a system-wide perspective
Collapse the requests that are 空间相同(spatially close together,如同一个DB)
在(1)high load的情况下(2)cache 有限的情况下 非常有用
Queues
LB有时候也会让request分配不均,产生可以本避免的latency。
用了Queue, client不需要就在那儿等着某个Server的response。
好处:
- asynchronously run tasks
- 可以有更灵活的retry机制, fault tolerance.
Queue的限制: size of data and the number of outstanding requests
Redundancy and Replication
Backup important data or service.
- Failover
- shared-nothing architecture
SQL vs. NoSQL
Differ in (1) the way they were built (2) the kind of info they stored (3) how they store it
SQL: structured, pre-defined schema(例子:电话簿 姓名-电话-地址)
NoSQL: unstructured, distributed, dynamic schema(例子:folder 姓名-一切关于这个人的东西,如地址、电话、FB点赞数)
NoSQL types:
- Key-Value Stores:
Stored in key-value pairs.
e.g. Redis, Dynamo - Document Databases:
Stored in documents. Documents are grouped by collections. Each document 可以有完全不同的数据结构。
e.g. MongoDB - Wide-Column Databases:
Colum families --> container of rows. 不需要提前了解columns, 每个row 不需要有同样的column.
常用于large dataset.
e.g. Cassandra, HBase. - Graph Databases:
用于存那些关系容易用graph来表示的数据
High level differences between SQL and NoSQL
- Storage:
SQL: row --> entity
NoSQL: different models - Schema
SQL: fixed schema <-- whole database
NoSQL: dynamic - Querying
SQL: SQL -> data
NoSQL: focus on collections, different DB --> different syntax - Scalability
SQL: vertically scalable
NoSQL: horizontally scalable - Reliability or ACID Compliancy
SQL: Good
NoSQL: Bad
SQL VS. NoSQL - Which one to use?
- When SQL:
(1) ensure ACID compliance
(2)data is structured and unchanging - When NoSQL:
(1) data with no or little structure.
(2) Used in Cloud Computing. Scalable.
(3) Rapid development.
CAP Theorem
三个中最多满足两个。
Consistent Hashing
如果想不起来了可以参考wiki 或者
https://www.educative.io/collection/page/5668639101419520/5649050225344512/5709068098338816
上面的flash
Long-Polling vs WebSockets vs Server-Sent Events
Ajax Polling
client 发request --> server 收到然后处理发response(不管有没有能回的都要回,所以会产生empty response)
一般都是固定频率的发request
cons: too many requests
Long polling
client 发一个request,但是NOT expect 马上回复 --> server收到request后,每当有东西需要回的时候再回 --> client收到response再马上发一个request
WebSocket
persistent connection between client and server
双方想什么时候发data就什么时候发
Server-Sent Events
只server想什么时候给client发就什么时候发
如果client想给server发data需要用别的protocol