代价估计和计划树
- 了解代价估计和计划树
简单查询的成本估算
Postgres的查询优化基于成本。成本是无量纲值,这些不是绝对的绩效指标,而是指比较操作相对绩效的指标。
主要有3种成本: start-up, run, total
start-up:获取到第一个目标tuple之前的成本
run:获取到所有tuple的成本
total:start-up和run的总和
postgres=# explain select * from a;
QUERY PLAN
-----------------------------------------------------
Seq Scan on a (cost=0.00..32.60 rows=2260 width=8)
(1 row)
如上面的例子,cost=0.00..32.60,其中start-up为0.00,total为32.60。
- 顺序扫描
顺序扫描的成功由函数 cost_seqscan()来估算。
顺序扫描,由于是扫描所有数据页,不需要准备工作,start-up的成本为0,run的成本计算公式为:
'run cost' = 'cpu run cost' + 'disk run cost'
=(cpu_tuple_cost + cpu_operator_cost) × Ntuple + seq_page_cost × Npage
其中seq_page_cost,cpu_tuple_cost,cpu_operator_cost的值在配置文件postgresql.conf中设置,默认值为 1.0, 0.01, and 0.0025。Ntuple,Npage可在pg_class中查询出来。
postgres=# create table test(id int,col int);
CREATE TABLE
postgres=# insert into test select gene
postgres=# insert into test select generate_series(1,100000),generate_series(1,100000);
INSERT 0 100000
postgres=# analyze;
ANALYZE
postgres=# select relpages, reltuples FROM pg_class WHERE relname='test';
relpages | reltuples
----------+-----------
443 | 100000
(1 row)
这里以查询语句 select * from test where id<10000; 为例子来估算顺序扫描的成本,可估算出run cost=(0.01+0.0025) * 100000+1.0 * 443=1693。在命令行查看语句的执行计划:
postgres=# explain select * from test where id<10000;
QUERY PLAN
----------------------------------------------------------
Seq Scan on test (cost=0.00..1693.00 rows=9885 width=8)
Filter: (id < 10000)
(2 rows)
- 索引扫描
索引扫描的成本由函数cost_index()来估算。
// 创建如下表带索引
postgres=# \d+ test
Table "public.test"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
id | integer | | not null | | plain | |
col | integer | | | | plain | |
Indexes:
"test_pkey" PRIMARY KEY, btree (id)
"test_idx" btree (col)
以查询语句 select * from test where col < 240; 来估算索引扫描成本。
由查询条件col<240,可走test_idx索引,查询pg_class可得Nindex,page=276,Nindex,tuple=100000。
postgres=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'test_idx';
relpages | reltuples
----------+-----------
276 | 100000
(1 row)
- start-up成本
索引扫描start-up成本指读取索引页拿到目标表第一个tuple的成本,其估算公式为:
'start-up cost'={ceil(log2(Nindex,tuple)) + (Hindex + 1) x 50} x cpu_operator_cost
其中Hindex为索引树的高度。
// 用pageinspect查询索引高度,level为1,高度为1
postgres=# select * from bt_metap('test_idx');
magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
340322 | 3 | 3 | 1 | 3 | 1 | 0 | -1
(1 row)
cpu_operator_cost默认值为0.0025,则
start-up cost={ceil(log2(100000)) + (1+1) * 50} * 0.0025=0.2925。
- run成本
索引扫描的run成本指表和索引cpu成本、IO成本之和:
'run cost'=('index cpu cost' + 'table cpu cost') + ('index IO cost' + 'table IO cost')
注:仅索引扫描不需估算 'table cpu cost' 和 'table IO cost'。
其中三个成本估算公式如下:
'index cpu cost'=Selectivity x Nindex,tuple x (cpu_index_tuple_cost+qual_op_cost),
'table cpu cost'=Selectivity x Ntuple x cpu_tuple_cost,
'index IO cost'=ceil(Selectivity x Nindex,page) x random_page_cost
其中 cpu_index_tuple_cost, cpu_tuple_cost, random_page_cost 在配置文件postgresql.conf的默认值分别为 0.005, 0.01, 4.0。qual_op_cost,粗略的来讲,是评估索引的成本,这里为0.0025。Selectivity 指where子句的索引的搜索范围的比例,它是从0到1的浮点数,如 (Selectivity x Ntuple) 指读取表tuple的数量, (Selectivity x Nindex,page) 指读取索引页的数量。
Selectivity
查询谓词的选择性使用 histogram_bounds 或 MCV(Most Common Value) 来估算,这两者可在pg_stats中查询出来。
表每个字段的MCV存储在pg_stats的most_common_vals 和 most_common_freqs字段,其中:
most_common_vals:指字段 MCV 列表
most_common_freqs:指MCV的频率列表
如查询语句是:select * from test where col < 240;
SELECT most_common_vals, most_common_freqs FROM pg_stats where tablename='test' and attname='col';
可查出表 test 字段 col 值 240 对应的频率,将该频率作为Selectivity值。
如果 MCV 没有查询出结果,则使用 histogram_bounds 来估算。
histogram_bounds:将字段的值分成近似相等的级别的值列表
查看表test字段col的histogram_bounds:
postgres=# SELECT histogram_bounds FROM pg_stats where tablename='test' and attname='col';
histogram_bounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{2,1030,2030,3016,3967,4950,5839,6887,7880,8930,10053,11095,12058,13050,14052,15031,16055,17013,18042,19101,20216,21246,22275,23219,24254,25285,26293,27321,28441,29448,30569,31608,32
659,33671,34663,35620,36559,37519,38460,39413,40411,41393,42358,43368,44296,45366,46374,47321,48384,49354,50361,51391,52450,53510,54503,55498,56503,57470,58418,59409,60492,61416,62346
,63312,64390,65286,66356,67311,68350,69363,70277,71322,72283,73306,74362,75403,76297,77265,78300,79236,80238,81261,82321,83317,84335,85287,86225,87188,88206,89199,90201,91152,92060,92
999,94019,94978,96002,97009,97935,99004,99999}
(1 row)
histogram_bounds默认分为100个桶,桶的编号从0开始,histogram_bounds的值是桶的边界值,如第0个桶的histogram_bounds为2,则存储在0号桶的tuple最小值为2,第1个桶的histogram_bounds为1030,则存储在1号桶的tuple最小值为1030,则桶0存储的数据为 2 <= value < 1030。
查询条件 col < 240,240 存储在第0个桶,本例中通过线性插值,可计算:
所以,
index cpu cost = 0.002315175* 100000 * (0.005 + 0.0025) = 1.73638125
table cpu cost = 0.002315175* 100000 * 0.01 = 2.315175
index IO cost = ceil(0.002315175* 276) * 4.0 = 4.0
'table IO cost'估算公式为:
'table IO cost'=max_IO_cost+indexCorrelation2x(min_IO_cost-max_IO_cost)
max_IO_cost是IO成本的最坏情况,即随机扫描所有表页的成本,公式如下 :
max_IO_cost=Npage x random_page_cost
其中Npage=443,
max_IO_cost=443 * 4.0 = 1772
min_IO_cost是IO成本的最佳情况,即顺序扫描所选表页的成本,公式如下:
min_IO_cost=1 x random_page_cost + (ceil(Selectivity x Npage) - 1) x seq_page_cost
则:
min_IO_cost=1 * 4.0 + (ceil(0.002315175* 443) - 1) * 1.0 = 5.0
在本例中,indexCorrelation=1。
postgres=# SELECT tablename,attname, correlation FROM pg_stats WHERE tablename = 'test' and attname='col';
tablename | attname | correlation
-----------+---------+-------------
test | col | 1
(1 row)
所以,
table IO cost = 1772 + 12 * (5.0 - 1772) = 5.0
最后,
run cost = (1.73638125 + 2.315175) + (4.0 + 5.0) = 13.05155625
- total成本
total cost = 0.2925 + 13.05155625 = 13.34405625
使用explain查询执行计划为:
postgres=# explain select * from test where col < 240;
QUERY PLAN
------------------------------------------------------------------------
Index Scan using test_idx on test (cost=0.29..13.35 rows=232 width=8)
Index Cond: (col < 240)
(2 rows)
计划树
PostgreSQL Planner执行三个步骤,如下所示:
- 预处理
- 通过估算所有可能的访问路径的成本,获得最廉价的访问路径
- 从最廉价的路径创建计划树
访问路径是用于估计成本的处理单元,例如顺序扫描,索引扫描,排序和各种加入操作具有它们对应的路径。 访问路径仅在Planner使用以创建计划树。 访问路径最基本的数据结构是Path,它对应于顺序扫描,所有其他访问路径都基于它。
要处理上述步骤,Planner内部创建了一个PlannerInfo结构,并保存查询树,查询中包含的relations,访问路径等信息。
// Path 结构
typedef struct Path {
NodeTag type;
NodeTag pathtype; /* tag identifying scan/join method */
RelOptInfo* parent; /* the relation this path can build */
ParamPathInfo* param_info; /* parameterization info, or NULL if none */
/* estimated size/costs for path (see costsize.c for more info) */
double rows; /* estimated number of global result tuples */
double multiple;
Cost startup_cost; /* cost expended before fetching any tuples */
Cost total_cost; /* total cost (assuming all tuples fetched) */
Cost stream_cost; /* cost of actions invoked by stream but can't be parallelled in this path */
List* pathkeys; /* sort ordering of path's output */
List* distribute_keys; /* distribute key, Var list */
char locator_type;
Oid rangelistOid;
int dop; /* degree of parallelism */
/* pathkeys is a List of PathKey nodes; see above */
Distribution distribution;
int hint_value; /* Mark this path if be hinted, and hint kind. */
double innerdistinct; /* join inner rel distinct estimation value */
double outerdistinct; /* join outer rel distinct estimation value */
} Path;
// PlannerInfo 结构
typedef struct PlannerInfo {
NodeTag type;
Query* parse; /* the Query being planned */
PlannerGlobal* glob; /* global info for current planner run */
Index query_level; /* 1 at the outermost Query */
struct PlannerInfo* parent_root; /* NULL at outermost Query */
/*
* simple_rel_array holds pointers to "base rels" and "other rels" (see
* comments for RelOptInfo for more info). It is indexed by rangetable
* index (so entry 0 is always wasted). Entries can be NULL when an RTE
* does not correspond to a base relation, such as a join RTE or an
* unreferenced view RTE; or if the RelOptInfo hasn't been made yet.
*/
struct RelOptInfo** simple_rel_array; /* All 1-rel RelOptInfos */
int simple_rel_array_size; /* allocated size of array */
/*
* List of changed var that mutated during cost-based rewrite optimization, the
* element in the list is "struct RewriteVarMapping", for example:
* - inlist2join
* - pushjoin2union (will implemented)
* _ ...
*
*/
List* var_mappings;
Relids var_mapping_rels; /* all the relations that related to inlist2join */
/*
* simple_rte_array is the same length as simple_rel_array and holds
* pointers to the associated rangetable entries. This lets us avoid
* rt_fetch(), which can be a bit slow once large inheritance sets have
* been expanded.
*/
RangeTblEntry** simple_rte_array; /* rangetable as an array */
/*
* all_baserels is a Relids set of all base relids (but not "other"
* relids) in the query; that is, the Relids identifier of the final join
* we need to form.
*/
Relids all_baserels;
/*
* join_rel_list is a list of all join-relation RelOptInfos we have
* considered in this planning run. For small problems we just scan the
* list to do lookups, but when there are many join relations we build a
* hash table for faster lookups. The hash table is present and valid
* when join_rel_hash is not NULL. Note that we still maintain the list
* even when using the hash table for lookups; this simplifies life for
* GEQO.
*/
List* join_rel_list; /* list of join-relation RelOptInfos */
struct HTAB* join_rel_hash; /* optional hashtable for join relations */
/*
* When doing a dynamic-programming-style join search, join_rel_level[k]
* is a list of all join-relation RelOptInfos of level k, and
* join_cur_level is the current level. New join-relation RelOptInfos are
* automatically added to the join_rel_level[join_cur_level] list.
* join_rel_level is NULL if not in use.
*/
List** join_rel_level; /* lists of join-relation RelOptInfos */
int join_cur_level; /* index of list being extended */
List* init_plans; /* init SubPlans for query */
List* cte_plan_ids; /* per-CTE-item list of subplan IDs */
List* eq_classes; /* list of active EquivalenceClasses */
List* canon_pathkeys; /* list of "canonical" PathKeys */
List* left_join_clauses; /* list of RestrictInfos for
* mergejoinable outer join clauses
* w/nonnullable var on left */
List* right_join_clauses; /* list of RestrictInfos for
* mergejoinable outer join clauses
* w/nonnullable var on right */
List* full_join_clauses; /* list of RestrictInfos for
* mergejoinable full join clauses */
List* join_info_list; /* list of SpecialJoinInfos */
List* lateral_info_list; /* list of LateralJoinInfos */
List* append_rel_list; /* list of AppendRelInfos */
List* rowMarks; /* list of PlanRowMarks */
List* placeholder_list; /* list of PlaceHolderInfos */
List* query_pathkeys; /* desired pathkeys for query_planner(), and
* actual pathkeys afterwards */
List* group_pathkeys; /* groupClause pathkeys, if any */
List* window_pathkeys; /* pathkeys of bottom window, if any */
List* distinct_pathkeys; /* distinctClause pathkeys, if any */
List* sort_pathkeys; /* sortClause pathkeys, if any */
List* minmax_aggs; /* List of MinMaxAggInfos */
List* initial_rels; /* RelOptInfos we are now trying to join */
MemoryContext planner_cxt; /* context holding PlannerInfo */
double total_table_pages; /* # of pages in all tables of query */
double tuple_fraction; /* tuple_fraction passed to query_planner */
double limit_tuples; /* limit_tuples passed to query_planner */
bool hasInheritedTarget; /* true if parse->resultRelation is an
* inheritance child rel */
bool hasJoinRTEs; /* true if any RTEs are RTE_JOIN kind */
bool hasLateralRTEs; /* true if any RTEs are marked LATERAL */
bool hasHavingQual; /* true if havingQual was non-null */
bool hasPseudoConstantQuals; /* true if any RestrictInfo has
* pseudoconstant = true */
bool hasRecursion; /* true if planning a recursive WITH item */
/* Note: qualSecurityLevel is zero if there are no securityQuals */
Index qualSecurityLevel; /* minimum security_level for quals */
#ifdef PGXC
/* This field is used only when RemoteScan nodes are involved */
int rs_alias_index; /* used to build the alias reference */
/*
* In Postgres-XC Coordinators are supposed to skip the handling of
* row marks of type ROW_MARK_EXCLUSIVE & ROW_MARK_SHARE.
* In order to do that we simply remove such type
* of row marks from the list rowMarks. Instead they are saved
* in xc_rowMarks list that is then handeled to add
* FOR UPDATE/SHARE in the remote query
*/
List* xc_rowMarks; /* list of PlanRowMarks of type ROW_MARK_EXCLUSIVE & ROW_MARK_SHARE */
#endif
/* These fields are used only when hasRecursion is true: */
int wt_param_id; /* PARAM_EXEC ID for the work table */
struct Plan* non_recursive_plan; /* plan for non-recursive term */
/* These fields are workspace for createplan.c */
Relids curOuterRels; /* outer rels above current node */
List* curOuterParams; /* not-yet-assigned NestLoopParams */
Index curIteratorParamIndex;
bool isPartIteratorPlanning;
int curItrs;
List* subqueryRestrictInfo; /* Subquery RestrictInfo, which only be used in wondows agg. */
/* optional private data for join_search_hook, e.g., GEQO */
void* join_search_private;
/* Added post-release, will be in a saner place in 9.3: */
List* plan_params; /* list of PlannerParamItems, see below */
/* For count_distinct, save null info for group by clause */
List* join_null_info;
/* for GroupingFunc fixup in setrefs */
AttrNumber* grouping_map;
/* If current query level is correlated with upper level */
bool is_correlated;
/* data redistribution for DFS table.
* dataDestRelIndex is index into the range table. This variable
* will take effect on data redistribution state.
* The effective value must be greater than 0.
*/
Index dataDestRelIndex;
/* interesting keys of current query level */
ItstDisKey dis_keys;
/*
* indicate if the subquery planning root (PlannerInfo) is under or rooted from
* recursive-cte planning.
*/
bool is_under_recursive_cte;
/*
* indicate if the subquery planning root (PlannerInfo) is under recursive-cte's
* recursive branch
*/
bool is_under_recursive_tree;
bool has_recursive_correlated_rte; /* true if any RTE correlated with recursive cte */
int subquery_type;
Bitmapset *param_upper;
bool hasRownumQual;
} PlannerInfo;
- 预处理
创建计划树前,Planner对存储在PlannerInfo structure的查询树执行预处理。
这里介绍单表查询的主要预处理:
- 简化target lists,limit子句等
例:将'2+2'重写为'4' - 正常化Boolean表达式
例:'NOT (NOT a)' 重写为 'a' -
平整 AND/OR 表达式
AND/OR表达式在标准SQL里是二进制运算符,但在PG,它们是n-ary运行符,Planner始终假设所有嵌套的AND,OR表达式平整化。
如下例子展示将 '(id=1) OR (id=2) OR (id=3)' 平整化。
- 获取最廉价的访问路径
为了拿到最廉价的访问路径,Planner对所有可能的访问路径进行成本估算,并选择最廉价的访问路径。Planner执行了以下操作:
- 创建 RelOptInfo 结构来存储访问路径和对应的成本,RelOptInfo中的baserestrictinfo存储where条件相关信息,indexlist存储目标表存在的索引信息
typedef enum RelOptKind {
RELOPT_BASEREL,
RELOPT_JOINREL,
RELOPT_OTHER_MEMBER_REL,
RELOPT_DEADREL
} RelOptKind;
typedef struct RelOptInfo {
NodeTag type;
RelOptKind reloptkind;
/* all relations included in this RelOptInfo */
Relids relids; /* set of base relids (rangetable indexes) */
bool isPartitionedTable; /* is it a partitioned table? it is meaningless unless
it is a 'baserel' (reloptkind = RELOPT_BASEREL) */
PartitionFlag partflag;
/* size estimates generated by planner */
double rows; /* estimated number of global result tuples */
int width; /* estimated avg width of result tuples */
int encodedwidth; /* estimated avg width of encoded columns in result tuples */
AttrNumber encodednum; /* number of encoded column */
/* materialization information */
List* reltargetlist; /* Vars to be output by scan of relation */
List* distribute_keys; /* distribute key */
List* pathlist; /* Path structures */
List* ppilist; /* ParamPathInfos used in pathlist */
struct Path* cheapest_startup_path;
List* cheapest_total_path; /* contain all cheapest total paths from different distribute key */
struct Path* cheapest_unique_path;
List* cheapest_parameterized_paths;
/* information about a base rel (not set for join rels!) */
Index relid;
Oid reltablespace; /* containing tablespace */
RTEKind rtekind; /* RELATION, SUBQUERY, or FUNCTION */
AttrNumber min_attr; /* smallest attrno of rel (often <0) */
AttrNumber max_attr; /* largest attrno of rel */
Relids* attr_needed; /* array indexed [min_attr .. max_attr] */
int32* attr_widths; /* array indexed [min_attr .. max_attr] */
List* lateral_vars; /* LATERAL Vars and PHVs referenced by rel */
Relids lateral_relids; /* minimum parameterization of rel */
List* indexlist; /* list of IndexOptInfo */
RelPageType pages; /* local size estimates derived from pg_class */
double tuples; /* global size estimates derived from pg_class */
double multiple; /* how many dn skewed and biased be influenced by distinct. */
double allvisfrac;
struct PruningResult* pruning_result; /* pruning result for partitioned table with
baserestrictinfo,it is meaningless unless it
is a 'baserel' (reloptkind = RELOPT_BASEREL) */
int partItrs; /* the number of the partitions in pruning_result */
struct PruningResult* pruning_result_for_index_usable;
int partItrs_for_index_usable; /* the number of the partitions in pruning_result_for_seqscan */
struct PruningResult* pruning_result_for_index_unusable;
int partItrs_for_index_unusable; /* the number of the partitions in pruning_result_for_seqscan */
/* information about a partitioned table */
BucketInfo *bucketInfo;
/* use "struct Plan" to avoid including plannodes.h here */
struct Plan* subplan; /* if subquery */
PlannerInfo* subroot; /* if subquery */
List *subplan_params; /* if subquery */
/* use "struct FdwRoutine" to avoid including fdwapi.h here */
struct FdwRoutine* fdwroutine; /* if foreign table */
void* fdw_private; /* if foreign table */
/* used by various scans and joins: */
List* baserestrictinfo; /* RestrictInfo structures (if base
* rel) */
QualCost baserestrictcost; /* cost of evaluating the above */
Index baserestrict_min_security; /* min security_level found in
* baserestrictinfo */
List* joininfo; /* RestrictInfo structures for join clauses
* involving this rel */
bool has_eclass_joins; /* T means joininfo is incomplete */
RelOrientation orientation; /* the store type of base rel */
RelstoreType relStoreLocation; /* the relation store location. */
char locator_type; /* the location type of base rel */
Oid rangelistOid; /* oid of list/range distributed table, InvalidOid if not list/range table */
List* subplanrestrictinfo; /* table filter with correlated column involved */
ItstDisKey rel_dis_keys; /* interesting key info for current relation */
List* varratio; /* rel tuples ratio after join to different relation */
List* varEqRatio;
/*
* The alternative rel for cost-based query rewrite
*
* Note: Only base rel have valid pointer of this fields, set to NULL for alternative rel
*/
List* alternatives;
/*
* Rel opinter to base rel that in plannerinfo->simple_rel_array[x].
*
* Note: Only alternative rels has valid pointer of this field, set to NULL for the
* origin rel.
*/
RelOptInfo* base_rel;
unsigned int num_data_nodes = 0; //number of distributing data nodes
} RelOptInfo;
- 估算所有可能的访问路径的成本,将其加入到RelOptInfo 结构
i. 创建path,估算顺序扫描的成本,并将结果写到path,然后将path增加到RelOptInfo的pathlist里。
ii. 如果目标表存在索引,则逐个创建索引访问路径path,并估算索引扫描的成本保存到path,再将所有path增加到RelOptInfo的pathlist里。
iii. 如果支持位图扫描,创建位图扫描访问路径path,并估算位图扫描的成本保存到path,再将所有path增加到RelOptInfo的pathlist里。 - 从RelOptInfo的pathlist里获取最廉价的访问路径
- 如果需要,估算 LIMIT, ORDER BY, ARREGISFDD 的成本
以下以两个示例来说明Planner的执行。
- 示例1
首先创建一张表,不带索引,执行的查询语句带where条件和order by子句。
postgres=# \d+ tbl_1
Table "public.tbl_1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
id | integer | | | | plain | |
data | integer | | | | plain | |
postgres=# select * from tbl_1 where id < 300 order by data;
(1) 创建RelOptInfo,加入到PlannerInfo的simple_rel_array
(2) 将where条件存储到RelOptInfo的baserestrictinfo中,另外RelOptInfo的indexlist为空,因为目标表没有索引
(3) 由于语句中有order by子句,增加排序的pathkey到PlannerInfo的sort_pathkeys
(4) 创建顺序扫描的path,并已估算好成本,再增加到RelOptInfo的pathlist
由于该例中目标表没有索引,所以Planner只估算顺序扫描的成本,顺序扫描即为该查询最廉价的访问路径。
(5) 创建新的RelOptInfo来处理ORDER BY
该RelOptInfo不包含where条件,即baserestrictinfo为空
(6) 创建sort path,估算成本后,加入到新的RelOptInfo的pathlist里。然后,将上面的顺序扫描的path连接到sort path的subpath中。subpath连接最廉价的访问路径。
- 示例2
创建表带两个索引,执行的查询语句带where条件
postgres=# \d+ tbl_2
Table "public.tbl_2"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
id | integer | | not null | | plain | |
data | integer | | | | plain | |
Indexes:
"tbl_2_pkey" PRIMARY KEY, btree (id)
"tbl_2_data_idx" btree (data)
postgres=# select * from tbl_2 where id < 240;
(1) 创建RelOptInfo,加入到PlannerInfo的simple_rel_array
(2) 将where条件存储到RelOptInfo的baserestrictinfo中,将目标表的索引逐个加入到RelOptInfo的indexlist中
(3) 创建顺序扫描的path,并已估算好成本,再增加到RelOptInfo的pathlist
(4) 创建索引扫描的访问路径 IndexPath,估算成本后,加入到RelOptInfo的pathlist中
该例中有两个索引,tbl_2_pkey,tbl_2_data_idx,需按顺序处理。首先处理tbl_2_pkey。
这里先创建tbl_2_pkey的indexpath,估算成本后,加入到RelOptInfo的pathlist中。由于tbl_2_pkey关联字段 'id',所以将where条件存储到indexclauses中。
另外,在将访问路径加入到RelOptInfo的pathlist时,会按total成本排序。在本例中,由于索引扫描的成本小于顺序扫描,所以索引扫描的访问路径在顺序扫描前面。
(5) 创建另一个索引的访问路径 IndexPath,估算成本后,加入到RelOptInfo的pathlist中
(6) 创建新的RelOptInfo
(7) 将最廉价的访问路径加入到新的RelOptInfo的pathlist中
在该例中,最廉价的访问路径是使用tbl_2_pkey索引扫描的访问路径。
- 创建计划树
Planner使用最廉价的访问路径生成计划树。
计划树的根存储为PlannedStmt结构。
/* ----------------
* PlannedStmt node
*
* The output of the planner is a Plan tree headed by a PlannedStmt node.
* PlannedStmt holds the "one time" information needed by the executor.
* ----------------
*/
typedef struct PlannedStmt {
NodeTag type;
CmdType commandType; /* select|insert|update|delete */
uint64 queryId; /* query identifier, uniquely indicate this plan in Runtime (copied from Query) */
bool hasReturning; /* is it insert|update|delete RETURNING? */
bool hasModifyingCTE; /* has insert|update|delete in WITH? */
bool canSetTag; /* do I set the command result tag? */
bool transientPlan; /* redo plan when TransactionXmin changes? */
bool dependsOnRole; /* is plan specific to current role? */
Plan* planTree; /* tree of Plan nodes */
List* rtable; /* list of RangeTblEntry nodes */
/* rtable indexes of target relations for INSERT/UPDATE/DELETE */
List* resultRelations; /* integer list of RT indexes, or NIL */
Node* utilityStmt; /* non-null if this is DECLARE CURSOR */
List* subplans; /* Plan trees for SubPlan expressions */
Bitmapset* rewindPlanIDs; /* indices of subplans that require REWIND */
List* rowMarks; /* a list of PlanRowMark's */
/*
* Notice: be careful to use relationOids as it may contain non-table OID
* in some scenarios, e.g. assignment of relationOids in fix_expr_common.
*/
List* relationOids; /* contain OIDs of relations the plan depends on */
List* invalItems; /* other dependencies, as PlanInvalItems */
int nParamExec; /* number of PARAM_EXEC Params used */
int num_streams; /* number of stream exist in plan tree */
int max_push_sql_num; /* number of split sql want push DN execute */
int gather_count; /* gather_count in query */
int num_nodes; /* number of data nodes */
NodeDefinition* nodesDefinition; /* all data nodes' defination */
int instrument_option; /* used for collect instrument data */
int num_plannodes; /* how many plan node in this planstmt */
int query_mem[2]; /* how many memory the query can use , memory in kb */
int assigned_query_mem[2]; /* how many memory the query is assigned */
bool is_dynmaic_smp;
int dynsmp_max_cpu; /* max avaliable cpu for this dn */
int dynsmp_avail_cpu; /* max avaliable cpu for this dn */
int dynsmp_cpu_util;
int dynsmp_active_statement;
double dynsmp_query_estimate_cpu_usge;
int dynsmp_plan_optimal_dop; /* the final optimized dop for the plan */
int dynsmp_plan_original_dop;
int dynsmp_dop_mem_limit; /* memory will put a limit on dop */
int dynsmp_min_non_spill_dop; /* optimal dop cannot greater than this */
int num_bucketmaps; /* Num of special-bucketmap stored in plannedstmt */
uint2* bucketMap[MAX_SPECIAL_BUCKETMAP_NUM]; /* the map information need to be get */
char* query_string; /* convey the query string to backend/stream thread of DataNode for debug purpose */
List* subplan_ids; /* in which plan id subplan should be inited */
List* initPlan; /* initplan in top plan node */
/* data redistribution for DFS table.
* dataDestRelIndex is index into the range table. This variable
* will take effect on data redistribution state.
*/
Index dataDestRelIndex;
int MaxBloomFilterNum;
int query_dop; /* Dop of current query. */
double plannertime; /* planner execute time */
/* set true in do_query_for_planrouter() for PlannedStmt sent to
* the compute pool
*/
bool in_compute_pool;
/* true if there is/are ForeignScan node(s) of OBS foreign table
* in plantree.
*/
bool has_obsrel;
List* plan_hint_warning; /* hint warning during plan generation, only used in CN */
List* noanalyze_rellist; /* relations and attributes that have no statistics, only used in CN */
int ng_num; /* nodegroup number */
NodeGroupQueryMem* ng_queryMem; /* each nodegroup's query mem */
bool ng_use_planA; /* true means I am a planA, default false */
bool isRowTriggerShippable; /* true if all row triggers are shippable. */
bool is_stream_plan;
bool multi_node_hint;
uint64 uniqueSQLId;
} PlannedStmt;
commandType:存储操作类型 select, insert, update, delete
rtable:存储关系列表条目
relationOids:存储查询相关表的oid
plantree:存储由计划节点组成的计划树,每个节点有特定的操作,如顺序扫描,排序,索引扫描
计划节点的基础节点是Plan结构,其他计划结构始终包含它。
typedef struct Plan {
NodeTag type;
int plan_node_id; /* node id */
int parent_node_id; /* parent node id */
RemoteQueryExecType exec_type;
/*
* estimated execution costs for plan (see costsize.c for more info)
*/
Cost startup_cost; /* cost expended before fetching any tuples */
Cost total_cost; /* total cost (assuming all tuples fetched) */
/*
* planner's estimate of result size of this plan step
*/
double plan_rows; /* number of global rows plan is expected to emit */
double multiple;
int plan_width; /* average row width in bytes */
int dop; /* degree of parallelism of current plan */
/*
* machine learning model estimations
*/
double pred_rows;
double pred_startup_time;
double pred_total_time;
long pred_max_memory;
/*
* MPPDB Recursive-Union Support
*
* - @recursive_union_plan_nodeid
* Pointing to its belonging RecursiveUnion's plan node id to indate if we are
* under RecursiveUnion
*
* - @recursive_union_controller
* Indicate if current Plan node is controller node in recursive-union steps
*
* - @control_plan_nodeid
* Normally, set on the top-plan node of a producer thread, to indicate which
* control-plan we need syn-up with
*
* - @is_sync_planode
* Indicate the current producer thread is the sync-up thread in recursive union,
* normally set on produer's top plan node
*
* Please note the above four variables is meaningless if a plan node is not under
* recursive-union's recursive part
*/
/*
* plan node id of RecursiveUnion node where current plan node belongs to, 0 for
* not under recursive-union
*/
int recursive_union_plan_nodeid;
/* flag to indicate if it is controller plan node */
bool recursive_union_controller;
/* plan node id of Controller plan node, 0 for not in control */
int control_plan_nodeid;
/* flag indicate if the current plan node is the sync node (for multi-stream case) */
bool is_sync_plannode;
/*
* Common structural data for all Plan types.
*/
List* targetlist; /* target list to be computed at this node */
List* qual; /* implicitly-ANDed qual conditions */
struct Plan* lefttree; /* input plan tree(s) */
struct Plan* righttree;
bool ispwj; /* is it special for partitionwisejoin? */
int paramno; /* the partition'sn that it is scaning */
List* initPlan; /* Init Plan nodes (un-correlated expr
* subselects) */
List* distributed_keys; /* distributed on which key */
ExecNodes* exec_nodes; /* List of Datanodes where to execute this plan */
/*
* Information for management of parameter-change-driven rescanning
*
* extParam includes the paramIDs of all external PARAM_EXEC params
* affecting this plan node or its children. setParam params from the
* node's initPlans are not included, but their extParams are.
*
* allParam includes all the extParam paramIDs, plus the IDs of local
* params that affect the node (i.e., the setParams of its initplans).
* These are _all_ the PARAM_EXEC params that affect this node.
*/
Bitmapset* extParam;
Bitmapset* allParam;
// For vectorized engine, plan produce vector output
//
bool vec_output;
/*
* @hdfs
* Mark the foreign scan whether has unique results on one of its
* output columns.
*/
bool hasUniqueResults;
/*
* Mark the plan whether includes delta table or not.
*/
bool isDeltaTable;
/* used to replace work_mem, maxmem in [0], and minmem in [1] */
int operatorMemKB[2];
/* allowed max mem after spread */
int operatorMaxMem;
bool parallel_enabled; /* Is it run in parallel? */
bool hasHashFilter; /* true for this plan has a hashfilter */
List* var_list; /* Need bloom filter var list. */
List* filterIndexList; /* Need used bloomfilter array index. */
/* used to replace work_mem */
int** ng_operatorMemKBArray; /* for multiple logic cluster */
int ng_num;
double innerdistinct; /* join inner rel distinct estimation value */
double outerdistinct; /* join outer rel distinct estimation value */
} Plan;
startup_cost, total_cost:操作的成本估算
plan_rows:Planner估算会扫描的行数
targetlist:存储查询树中包含的目标对象列表
qual:存储qual条件的列表
lefttree, righttree:添加子节点的节点
-
示例1
这里介绍上一节示例1的计划树,上一节示例1最廉价的访问路径树是sort path和scan path的集合,根路径是sort path,子路径是scan path。而计划树则是几乎是从最廉价的路径中生成。由下图(a) 可以看到,SortNode被加入到PlannedStmt的plantree中,而SeqScanNode则添加到SortNode的lefttree中。
SortNode中,lefttree指向SeqScanNode。
SeqScanNode中,qual持有where条件。
使用EXPLAIN查看执行计划:
postgres=# explain select * from tbl_1 where id < 300 order by data;
QUERY PLAN
----------------------------------------------------------------
Sort (cost=1704.33..1705.03 rows=279 width=8)
Sort Key: data
-> Seq Scan on tbl_1 (cost=0.00..1693.00 rows=279 width=8)
Filter: (id < 300)
(4 rows)
- 示例2
这里介绍上一节示例2的计划树,示例2最廉价的访问路径树是index path。所以计划树中只有IndexScanNode。该示例中,where条件是访问谓词,存储在IndexScanNode的indexqual中。
访问谓词:Index Cond,表达叶节点遍历的开始和停止条件。
使用EXPLAIN查看执行计划:
postgres=# explain select * from tbl_2 where id < 240;
QUERY PLAN
---------------------------------------------------------------------------
Index Scan using tbl_2_pkey on tbl_2 (cost=0.29..13.25 rows=226 width=8)
Index Cond: (id < 240)
(2 rows)
- 执行器如何执行
在单表查询中,执行器以从计划树的末尾到根部的顺序使用计划节点,然后调用执行相应节点处理的函数。
每个计划节点具有用于执行相应操作的函数,它们位于src/backend/executor/目录中。
了解执行器如何执行的最佳方法是读取EXPLAIN命令的输出,因为PostgreSQL的EXPLAIN几乎展示了计划树的内容。 如示例1的EXPLAIN输出:
postgres=# explain select * from tbl_1 where id < 300 order by data;
QUERY PLAN
----------------------------------------------------------------
Sort (cost=1704.33..1705.03 rows=279 width=8)
Sort Key: data
-> Seq Scan on tbl_1 (cost=0.00..1693.00 rows=279 width=8)
Filter: (id < 300)
(4 rows)
从末尾往上读,首先读到SeqScan节点,执行器执行顺序扫描操作。然后到Sort节点,执行器对顺序扫描的结果执行排序操作。