查询处理 - 代价估计和计划树

代价估计和计划树

了解代价估计和计划树

简单查询的成本估算

Postgres的查询优化基于成本。成本是无量纲值，这些不是绝对的绩效指标，而是指比较操作相对绩效的指标。
主要有3种成本: start-up, run, total
start-up：获取到第一个目标tuple之前的成本
run：获取到所有tuple的成本
total：start-up和run的总和

postgres=# explain select * from a;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on a  (cost=0.00..32.60 rows=2260 width=8)
(1 row)

如上面的例子，cost=0.00..32.60，其中start-up为0.00，total为32.60。

顺序扫描
顺序扫描的成功由函数 cost_seqscan()来估算。
顺序扫描，由于是扫描所有数据页，不需要准备工作，start-up的成本为0，run的成本计算公式为：
'run cost' = 'cpu run cost' + 'disk run cost'
=(cpu_tuple_cost + cpu_operator_cost) × N_tuple + seq_page_cost × N_page
其中seq_page_cost，cpu_tuple_cost，cpu_operator_cost的值在配置文件postgresql.conf中设置，默认值为 1.0, 0.01, and 0.0025。N_tuple，N_page可在pg_class中查询出来。

postgres=# create table test(id int,col int);
CREATE TABLE
postgres=# insert into test select gene

postgres=# insert into test select generate_series(1,100000),generate_series(1,100000);
INSERT 0 100000
postgres=# analyze;
ANALYZE
postgres=# select relpages, reltuples FROM pg_class WHERE relname='test';
 relpages | reltuples
----------+-----------
      443 |    100000
(1 row)

这里以查询语句 select * from test where id<10000; 为例子来估算顺序扫描的成本，可估算出run cost=(0.01+0.0025) * 100000+1.0 * 443=1693。在命令行查看语句的执行计划：

postgres=# explain select * from test where id<10000;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on test  (cost=0.00..1693.00 rows=9885 width=8)
   Filter: (id < 10000)
(2 rows)

索引扫描
索引扫描的成本由函数cost_index()来估算。

// 创建如下表带索引
postgres=# \d+ test
                                   Table "public.test"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 id     | integer |           | not null |         | plain   |              |
 col    | integer |           |          |         | plain   |              |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)
    "test_idx" btree (col)

以查询语句 select * from test where col < 240; 来估算索引扫描成本。
由查询条件col<240，可走test_idx索引，查询pg_class可得N_index,page=276，N_index,tuple=100000。

postgres=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'test_idx';
 relpages | reltuples
----------+-----------
      276 |    100000
(1 row)

start-up成本
索引扫描start-up成本指读取索引页拿到目标表第一个tuple的成本，其估算公式为：
'start-up cost'={ceil(log₂(N_index,tuple)) + (H_index + 1) x 50} x cpu_operator_cost
其中H_index为索引树的高度。

// 用pageinspect查询索引高度，level为1，高度为1
postgres=# select * from bt_metap('test_idx');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |    3 |     1 |        3 |         1 |           0 |                      -1
(1 row)

cpu_operator_cost默认值为0.0025，则
start-up cost={ceil(log₂(100000)) + (1+1) * 50} * 0.0025=0.2925。

run成本
索引扫描的run成本指表和索引cpu成本、IO成本之和：
'run cost'=('index cpu cost' + 'table cpu cost') + ('index IO cost' + 'table IO cost')
注：仅索引扫描不需估算 'table cpu cost' 和 'table IO cost'。
其中三个成本估算公式如下：
'index cpu cost'=Selectivity x N_index,tuple x (cpu_index_tuple_cost+qual_op_cost),
'table cpu cost'=Selectivity x Ntuple x cpu_tuple_cost,
'index IO cost'=ceil(Selectivity x N_index,page) x random_page_cost
其中 cpu_index_tuple_cost, cpu_tuple_cost, random_page_cost 在配置文件postgresql.conf的默认值分别为 0.005, 0.01, 4.0。qual_op_cost，粗略的来讲，是评估索引的成本，这里为0.0025。Selectivity 指where子句的索引的搜索范围的比例，它是从0到1的浮点数，如 (Selectivity x N_tuple) 指读取表tuple的数量， (Selectivity x N_index,page) 指读取索引页的数量。

Selectivity
查询谓词的选择性使用 histogram_bounds 或 MCV(Most Common Value) 来估算，这两者可在pg_stats中查询出来。
表每个字段的MCV存储在pg_stats的most_common_vals 和 most_common_freqs字段，其中：
most_common_vals：指字段 MCV 列表
most_common_freqs：指MCV的频率列表
如查询语句是：select * from test where col < 240;
SELECT most_common_vals, most_common_freqs FROM pg_stats where tablename='test' and attname='col';
可查出表 test 字段 col 值 240 对应的频率，将该频率作为Selectivity值。
如果 MCV 没有查询出结果，则使用 histogram_bounds 来估算。
histogram_bounds：将字段的值分成近似相等的级别的值列表
查看表test字段col的histogram_bounds：

postgres=# SELECT histogram_bounds  FROM pg_stats where tablename='test' and attname='col';

                                                                                                           histogram_bounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {2,1030,2030,3016,3967,4950,5839,6887,7880,8930,10053,11095,12058,13050,14052,15031,16055,17013,18042,19101,20216,21246,22275,23219,24254,25285,26293,27321,28441,29448,30569,31608,32
659,33671,34663,35620,36559,37519,38460,39413,40411,41393,42358,43368,44296,45366,46374,47321,48384,49354,50361,51391,52450,53510,54503,55498,56503,57470,58418,59409,60492,61416,62346
,63312,64390,65286,66356,67311,68350,69363,70277,71322,72283,73306,74362,75403,76297,77265,78300,79236,80238,81261,82321,83317,84335,85287,86225,87188,88206,89199,90201,91152,92060,92
999,94019,94978,96002,97009,97935,99004,99999}
(1 row)

histogram_bounds默认分为100个桶，桶的编号从0开始，histogram_bounds的值是桶的边界值，如第0个桶的histogram_bounds为2，则存储在0号桶的tuple最小值为2，第1个桶的histogram_bounds为1030，则存储在1号桶的tuple最小值为1030，则桶0存储的数据为 2 <= value < 1030。
查询条件 col < 240，240 存储在第0个桶，本例中通过线性插值，可计算：
$Selectivity = \frac{0 + (240 - hb[0])/(hb[1] - hb[0])}{100} = \frac{0 + (240 - 2)/(1030 - 2)}{100} = \frac{0 + 238/1028}{100} = 0.002315175$

所以，
index cpu cost = 0.002315175* 100000 * (0.005 + 0.0025) = 1.73638125
table cpu cost = 0.002315175* 100000 * 0.01 = 2.315175
index IO cost = ceil(0.002315175* 276) * 4.0 = 4.0

'table IO cost'估算公式为：
'table IO cost'=max_IO_cost+indexCorrelation²x(min_IO_cost-max_IO_cost)

max_IO_cost是IO成本的最坏情况，即随机扫描所有表页的成本，公式如下：
max_IO_cost=N_page x random_page_cost
其中N_page=443，
max_IO_cost=443 * 4.0 = 1772

min_IO_cost是IO成本的最佳情况，即顺序扫描所选表页的成本，公式如下：
min_IO_cost=1 x random_page_cost + (ceil(Selectivity x N_page) - 1) x seq_page_cost
则：
min_IO_cost=1 * 4.0 + (ceil(0.002315175* 443) - 1) * 1.0 = 5.0
在本例中，indexCorrelation=1。

postgres=# SELECT tablename,attname, correlation FROM pg_stats WHERE tablename = 'test' and attname='col';
 tablename | attname | correlation
-----------+---------+-------------
 test      | col     |           1
(1 row)

所以，
table IO cost = 1772 + 1₂ * (5.0 - 1772) = 5.0
最后，
run cost = (1.73638125 + 2.315175) + (4.0 + 5.0) = 13.05155625

total成本
total cost = 0.2925 + 13.05155625 = 13.34405625
使用explain查询执行计划为：

postgres=# explain select * from test where col < 240;
                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using test_idx on test  (cost=0.29..13.35 rows=232 width=8)
   Index Cond: (col < 240)
(2 rows)

计划树

PostgreSQL Planner执行三个步骤，如下所示：

预处理
通过估算所有可能的访问路径的成本，获得最廉价的访问路径
从最廉价的路径创建计划树
访问路径是用于估计成本的处理单元，例如顺序扫描，索引扫描，排序和各种加入操作具有它们对应的路径。访问路径仅在Planner使用以创建计划树。访问路径最基本的数据结构是Path，它对应于顺序扫描，所有其他访问路径都基于它。
要处理上述步骤，Planner内部创建了一个PlannerInfo结构，并保存查询树，查询中包含的relations，访问路径等信息。

// Path 结构
typedef struct Path {
    NodeTag type;

    NodeTag pathtype; /* tag identifying scan/join method */

    RelOptInfo* parent;        /* the relation this path can build */
    ParamPathInfo* param_info; /* parameterization info, or NULL if none */

    /* estimated size/costs for path (see costsize.c for more info) */
    double rows; /* estimated number of global result tuples */
    double multiple;
    Cost startup_cost; /* cost expended before fetching any tuples */
    Cost total_cost;   /* total cost (assuming all tuples fetched) */
    Cost stream_cost;  /* cost of actions invoked by stream but can't be parallelled in this path */

    List* pathkeys;        /* sort ordering of path's output */
    List* distribute_keys; /* distribute key, Var list */
    char locator_type;
    Oid rangelistOid;
    int dop; /* degree of parallelism */
    /* pathkeys is a List of PathKey nodes; see above */
    Distribution distribution;
    int hint_value;       /* Mark this path if be hinted, and hint kind. */
    double innerdistinct; /* join inner rel distinct estimation value */
    double outerdistinct; /* join outer rel distinct estimation value */
} Path;

// PlannerInfo 结构
typedef struct PlannerInfo {
    NodeTag type;

    Query* parse; /* the Query being planned */

    PlannerGlobal* glob; /* global info for current planner run */

    Index query_level; /* 1 at the outermost Query */

    struct PlannerInfo* parent_root; /* NULL at outermost Query */

    /*
     * simple_rel_array holds pointers to "base rels" and "other rels" (see
     * comments for RelOptInfo for more info).  It is indexed by rangetable
     * index (so entry 0 is always wasted).  Entries can be NULL when an RTE
     * does not correspond to a base relation, such as a join RTE or an
     * unreferenced view RTE; or if the RelOptInfo hasn't been made yet.
     */
    struct RelOptInfo** simple_rel_array; /* All 1-rel RelOptInfos */
    int simple_rel_array_size;            /* allocated size of array */

    /*
     * List of changed var that mutated during cost-based rewrite optimization, the
     * element in the list is "struct RewriteVarMapping", for example:
     * - inlist2join
     * - pushjoin2union (will implemented)
     * _ ...
     *
     */
    List* var_mappings;
    Relids var_mapping_rels; /* all the relations that related to inlist2join */

    /*
     * simple_rte_array is the same length as simple_rel_array and holds
     * pointers to the associated rangetable entries.  This lets us avoid
     * rt_fetch(), which can be a bit slow once large inheritance sets have
     * been expanded.
     */
    RangeTblEntry** simple_rte_array; /* rangetable as an array */

    /*
     * all_baserels is a Relids set of all base relids (but not "other"
     * relids) in the query; that is, the Relids identifier of the final join
     * we need to form.
     */
    Relids all_baserels;

    /*
     * join_rel_list is a list of all join-relation RelOptInfos we have
     * considered in this planning run.  For small problems we just scan the
     * list to do lookups, but when there are many join relations we build a
     * hash table for faster lookups.  The hash table is present and valid
     * when join_rel_hash is not NULL.  Note that we still maintain the list
     * even when using the hash table for lookups; this simplifies life for
     * GEQO.
     */
    List* join_rel_list;        /* list of join-relation RelOptInfos */
    struct HTAB* join_rel_hash; /* optional hashtable for join relations */

    /*
     * When doing a dynamic-programming-style join search, join_rel_level[k]
     * is a list of all join-relation RelOptInfos of level k, and
     * join_cur_level is the current level.  New join-relation RelOptInfos are
     * automatically added to the join_rel_level[join_cur_level] list.
     * join_rel_level is NULL if not in use.
     */
    List** join_rel_level; /* lists of join-relation RelOptInfos */
    int join_cur_level;    /* index of list being extended */

    List* init_plans; /* init SubPlans for query */

    List* cte_plan_ids; /* per-CTE-item list of subplan IDs */

    List* eq_classes; /* list of active EquivalenceClasses */

    List* canon_pathkeys; /* list of "canonical" PathKeys */

    List* left_join_clauses; /* list of RestrictInfos for
                              * mergejoinable outer join clauses
                              * w/nonnullable var on left */

    List* right_join_clauses; /* list of RestrictInfos for
                               * mergejoinable outer join clauses
                               * w/nonnullable var on right */

    List* full_join_clauses; /* list of RestrictInfos for
                              * mergejoinable full join clauses */

    List* join_info_list; /* list of SpecialJoinInfos */

    List* lateral_info_list;  /* list of LateralJoinInfos */

    List* append_rel_list; /* list of AppendRelInfos */

    List* rowMarks; /* list of PlanRowMarks */

    List* placeholder_list; /* list of PlaceHolderInfos */

    List* query_pathkeys; /* desired pathkeys for query_planner(), and
                           * actual pathkeys afterwards */

    List* group_pathkeys;    /* groupClause pathkeys, if any */
    List* window_pathkeys;   /* pathkeys of bottom window, if any */
    List* distinct_pathkeys; /* distinctClause pathkeys, if any */
    List* sort_pathkeys;     /* sortClause pathkeys, if any */

    List* minmax_aggs; /* List of MinMaxAggInfos */

    List* initial_rels; /* RelOptInfos we are now trying to join */

    MemoryContext planner_cxt; /* context holding PlannerInfo */

    double total_table_pages; /* # of pages in all tables of query */

    double tuple_fraction; /* tuple_fraction passed to query_planner */
    double limit_tuples;   /* limit_tuples passed to query_planner */

    bool hasInheritedTarget;     /* true if parse->resultRelation is an
                                  * inheritance child rel */
    bool hasJoinRTEs;            /* true if any RTEs are RTE_JOIN kind */
    bool hasLateralRTEs;         /* true if any RTEs are marked LATERAL */
    bool hasHavingQual;          /* true if havingQual was non-null */
    bool hasPseudoConstantQuals; /* true if any RestrictInfo has
                                  * pseudoconstant = true */
    bool hasRecursion;           /* true if planning a recursive WITH item */

    /* Note: qualSecurityLevel is zero if there are no securityQuals */
    Index qualSecurityLevel; /* minimum security_level for quals */

#ifdef PGXC
    /* This field is used only when RemoteScan nodes are involved */
    int rs_alias_index; /* used to build the alias reference */

    /*
     * In Postgres-XC Coordinators are supposed to skip the handling of
     * row marks of type ROW_MARK_EXCLUSIVE & ROW_MARK_SHARE.
     * In order to do that we simply remove such type
     * of row marks from the list rowMarks. Instead they are saved
     * in xc_rowMarks list that is then handeled to add
     * FOR UPDATE/SHARE in the remote query
     */
    List* xc_rowMarks; /* list of PlanRowMarks of type ROW_MARK_EXCLUSIVE & ROW_MARK_SHARE */
#endif

    /* These fields are used only when hasRecursion is true: */
    int wt_param_id;                 /* PARAM_EXEC ID for the work table */
    struct Plan* non_recursive_plan; /* plan for non-recursive term */

    /* These fields are workspace for createplan.c */
    Relids curOuterRels;  /* outer rels above current node */
    List* curOuterParams; /* not-yet-assigned NestLoopParams */

    Index curIteratorParamIndex;
    bool isPartIteratorPlanning;
    int curItrs;
    List* subqueryRestrictInfo; /* Subquery RestrictInfo, which only be used in wondows agg. */

    /* optional private data for join_search_hook, e.g., GEQO */
    void* join_search_private;

    /* Added post-release, will be in a saner place in 9.3: */
    List* plan_params; /* list of PlannerParamItems, see below */

    /* For count_distinct, save null info for group by clause */
    List* join_null_info;
    /* for GroupingFunc fixup in setrefs */
    AttrNumber* grouping_map;

    /* If current query level is correlated with upper level */
    bool is_correlated;

    /* data redistribution for DFS table.
     * dataDestRelIndex is index into the range table. This variable
     * will take effect on data redistribution state.
     * The effective value must be greater than 0.
     */
    Index dataDestRelIndex;

    /* interesting keys of current query level */
    ItstDisKey dis_keys;

    /*
     * indicate if the subquery planning root (PlannerInfo) is under or rooted from
     * recursive-cte planning.
     */
    bool is_under_recursive_cte;

    /*
     * indicate if the subquery planning root (PlannerInfo) is under recursive-cte's
     * recursive branch
     */
    bool is_under_recursive_tree;
    bool has_recursive_correlated_rte; /* true if any RTE correlated with recursive cte */

    int subquery_type;
    Bitmapset *param_upper;
    
    bool hasRownumQual;
} PlannerInfo;

预处理
创建计划树前，Planner对存储在PlannerInfo structure的查询树执行预处理。
这里介绍单表查询的主要预处理：

简化target lists,limit子句等
例：将'2+2'重写为'4'
正常化Boolean表达式
例：'NOT (NOT a)' 重写为 'a'
平整 AND/OR 表达式
AND/OR表达式在标准SQL里是二进制运算符，但在PG，它们是n-ary运行符，Planner始终假设所有嵌套的AND,OR表达式平整化。
如下例子展示将 '(id=1) OR (id=2) OR (id=3)' 平整化。

flattened OR expression.png

获取最廉价的访问路径
为了拿到最廉价的访问路径，Planner对所有可能的访问路径进行成本估算，并选择最廉价的访问路径。Planner执行了以下操作：

创建 RelOptInfo 结构来存储访问路径和对应的成本，RelOptInfo中的baserestrictinfo存储where条件相关信息，indexlist存储目标表存在的索引信息

typedef enum RelOptKind { 
RELOPT_BASEREL, 
RELOPT_JOINREL, 
RELOPT_OTHER_MEMBER_REL, 
RELOPT_DEADREL 
} RelOptKind;

typedef struct RelOptInfo {
    NodeTag type;

    RelOptKind reloptkind;

    /* all relations included in this RelOptInfo */
    Relids relids; /* set of base relids (rangetable indexes) */

    bool isPartitionedTable; /* is it a partitioned table? it is meaningless unless
                         it is a 'baserel' (reloptkind = RELOPT_BASEREL) */
    PartitionFlag partflag;

    /* size estimates generated by planner */
    double rows;           /* estimated number of global result tuples */
    int width;             /* estimated avg width of result tuples */
    int encodedwidth;      /* estimated avg width of encoded columns in result tuples */
    AttrNumber encodednum; /* number of encoded column */

    /* materialization information */
    List* reltargetlist;   /* Vars to be output by scan of relation */
    List* distribute_keys; /* distribute key */
    List* pathlist;        /* Path structures */
    List* ppilist;         /* ParamPathInfos used in pathlist */
    struct Path* cheapest_startup_path;
    List* cheapest_total_path; /* contain all cheapest total paths from different distribute key */
    struct Path* cheapest_unique_path;
    List* cheapest_parameterized_paths;

    /* information about a base rel (not set for join rels!) */
    Index relid;
    Oid reltablespace;   /* containing tablespace */
    RTEKind rtekind;     /* RELATION, SUBQUERY, or FUNCTION */
    AttrNumber min_attr; /* smallest attrno of rel (often <0) */
    AttrNumber max_attr; /* largest attrno of rel */
    Relids* attr_needed; /* array indexed [min_attr .. max_attr] */
    int32* attr_widths;  /* array indexed [min_attr .. max_attr] */
    List* lateral_vars;  /* LATERAL Vars and PHVs referenced by rel */
    Relids lateral_relids; /* minimum parameterization of rel */
    List* indexlist;     /* list of IndexOptInfo */
    RelPageType pages;   /* local size estimates derived from pg_class */
    double tuples;       /* global size estimates derived from pg_class */
    double multiple;     /* how many dn skewed and biased be influenced by distinct. */
    double allvisfrac;

    struct PruningResult* pruning_result; /* pruning result for partitioned table with
                                    baserestrictinfo,it is meaningless unless it
                                    is a 'baserel' (reloptkind = RELOPT_BASEREL) */
    int partItrs;                         /* the number of the partitions in pruning_result */
    struct PruningResult* pruning_result_for_index_usable;
    int partItrs_for_index_usable; /* the number of the partitions in pruning_result_for_seqscan */
    struct PruningResult* pruning_result_for_index_unusable;
    int partItrs_for_index_unusable; /* the number of the partitions in pruning_result_for_seqscan */
    /* information about a partitioned table */
    BucketInfo *bucketInfo;

    /* use "struct Plan" to avoid including plannodes.h here */
    struct Plan* subplan; /* if subquery */
    PlannerInfo* subroot; /* if subquery */
    List *subplan_params; /* if subquery */
    /* use "struct FdwRoutine" to avoid including fdwapi.h here */
    struct FdwRoutine* fdwroutine; /* if foreign table */
    void* fdw_private;             /* if foreign table */

    /* used by various scans and joins: */
    List* baserestrictinfo;          /* RestrictInfo structures (if base
                                      * rel) */
    QualCost baserestrictcost;       /* cost of evaluating the above */
    Index baserestrict_min_security; /* min security_level found in
                                      * baserestrictinfo */
    List* joininfo;                  /* RestrictInfo structures for join clauses
                                      * involving this rel */
    bool has_eclass_joins;           /* T means joininfo is incomplete */
    RelOrientation orientation;      /* the store type of base rel */
    RelstoreType relStoreLocation;   /* the relation store location. */
    char locator_type;               /* the location type of base rel */
    Oid rangelistOid;                /* oid of list/range distributed table, InvalidOid if not list/range table */
    List* subplanrestrictinfo;       /* table filter with correlated column involved */
    ItstDisKey rel_dis_keys;         /* interesting key info for current relation */
    List* varratio;                  /* rel tuples ratio after join to different relation */
    List* varEqRatio;

    /*
     * The alternative rel for cost-based query rewrite
     *
     * Note: Only base rel have valid pointer of this fields, set to NULL for alternative rel
     */
    List* alternatives;

    /*
     * Rel opinter to base rel that in plannerinfo->simple_rel_array[x].
     *
     * Note: Only alternative rels has valid pointer of this field, set to NULL for the
     * origin rel.
     */
    RelOptInfo* base_rel;

    unsigned int num_data_nodes = 0; //number of distributing data nodes
} RelOptInfo;

估算所有可能的访问路径的成本，将其加入到RelOptInfo 结构
i. 创建path，估算顺序扫描的成本，并将结果写到path，然后将path增加到RelOptInfo的pathlist里。
ii. 如果目标表存在索引，则逐个创建索引访问路径path，并估算索引扫描的成本保存到path，再将所有path增加到RelOptInfo的pathlist里。
iii. 如果支持位图扫描，创建位图扫描访问路径path，并估算位图扫描的成本保存到path，再将所有path增加到RelOptInfo的pathlist里。
从RelOptInfo的pathlist里获取最廉价的访问路径
如果需要，估算 LIMIT, ORDER BY, ARREGISFDD 的成本

以下以两个示例来说明Planner的执行。

示例1
首先创建一张表，不带索引，执行的查询语句带where条件和order by子句。

postgres=# \d+ tbl_1
                                   Table "public.tbl_1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 id     | integer |           |          |         | plain   |              |
 data    | integer |           |          |         | plain   |              |

postgres=# select * from tbl_1 where id < 300 order by data;

示例1 - 1.png

(1) 创建RelOptInfo，加入到PlannerInfo的simple_rel_array
(2) 将where条件存储到RelOptInfo的baserestrictinfo中，另外RelOptInfo的indexlist为空，因为目标表没有索引
(3) 由于语句中有order by子句，增加排序的pathkey到PlannerInfo的sort_pathkeys
(4) 创建顺序扫描的path，并已估算好成本，再增加到RelOptInfo的pathlist
由于该例中目标表没有索引，所以Planner只估算顺序扫描的成本，顺序扫描即为该查询最廉价的访问路径。

示例1 - 2.png

(5) 创建新的RelOptInfo来处理ORDER BY
该RelOptInfo不包含where条件，即baserestrictinfo为空
(6) 创建sort path，估算成本后，加入到新的RelOptInfo的pathlist里。然后，将上面的顺序扫描的path连接到sort path的subpath中。subpath连接最廉价的访问路径。

示例2
创建表带两个索引，执行的查询语句带where条件

postgres=# \d+ tbl_2
                                   Table "public.tbl_2"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 id     | integer |           | not null |         | plain   |              |
 data   | integer |           |          |         | plain   |              |
Indexes:
    "tbl_2_pkey" PRIMARY KEY, btree (id)
    "tbl_2_data_idx" btree (data)

postgres=# select * from tbl_2 where id < 240;

示例2 - 1.png

(1) 创建RelOptInfo，加入到PlannerInfo的simple_rel_array
(2) 将where条件存储到RelOptInfo的baserestrictinfo中，将目标表的索引逐个加入到RelOptInfo的indexlist中
(3) 创建顺序扫描的path，并已估算好成本，再增加到RelOptInfo的pathlist

示例2 - 2.png

(4) 创建索引扫描的访问路径 IndexPath，估算成本后，加入到RelOptInfo的pathlist中
该例中有两个索引，tbl_2_pkey，tbl_2_data_idx，需按顺序处理。首先处理tbl_2_pkey。
这里先创建tbl_2_pkey的indexpath，估算成本后，加入到RelOptInfo的pathlist中。由于tbl_2_pkey关联字段 'id'，所以将where条件存储到indexclauses中。
另外，在将访问路径加入到RelOptInfo的pathlist时，会按total成本排序。在本例中，由于索引扫描的成本小于顺序扫描，所以索引扫描的访问路径在顺序扫描前面。
(5) 创建另一个索引的访问路径 IndexPath，估算成本后，加入到RelOptInfo的pathlist中

示例2 - 3.png

(6) 创建新的RelOptInfo
(7) 将最廉价的访问路径加入到新的RelOptInfo的pathlist中
在该例中，最廉价的访问路径是使用tbl_2_pkey索引扫描的访问路径。

创建计划树
Planner使用最廉价的访问路径生成计划树。
计划树的根存储为PlannedStmt结构。

/* ----------------
 *      PlannedStmt node
 *
 * The output of the planner is a Plan tree headed by a PlannedStmt node.
 * PlannedStmt holds the "one time" information needed by the executor.
 * ----------------
 */
typedef struct PlannedStmt {
    NodeTag type;

    CmdType commandType; /* select|insert|update|delete */

    uint64 queryId; /* query identifier,  uniquely indicate this plan in Runtime (copied from Query) */

    bool hasReturning; /* is it insert|update|delete RETURNING? */

    bool hasModifyingCTE; /* has insert|update|delete in WITH? */

    bool canSetTag; /* do I set the command result tag? */

    bool transientPlan; /* redo plan when TransactionXmin changes? */

    bool dependsOnRole; /* is plan specific to current role? */

    Plan* planTree; /* tree of Plan nodes */

    List* rtable; /* list of RangeTblEntry nodes */

    /* rtable indexes of target relations for INSERT/UPDATE/DELETE */
    List* resultRelations; /* integer list of RT indexes, or NIL */

    Node* utilityStmt; /* non-null if this is DECLARE CURSOR */

    List* subplans; /* Plan trees for SubPlan expressions */

    Bitmapset* rewindPlanIDs; /* indices of subplans that require REWIND */

    List* rowMarks; /* a list of PlanRowMark's */

    /*
     * Notice: be careful to use relationOids as it may contain non-table OID
     * in some scenarios, e.g. assignment of relationOids in fix_expr_common.
     */
    List* relationOids; /* contain OIDs of relations the plan depends on */

    List* invalItems; /* other dependencies, as PlanInvalItems */

    int nParamExec; /* number of PARAM_EXEC Params used */

    int num_streams; /* number of stream exist in plan tree */

    int max_push_sql_num; /* number of split sql want push DN execute */

    int gather_count; /* gather_count in query */

    int num_nodes; /* number of data nodes */

    NodeDefinition* nodesDefinition; /* all data nodes' defination */

    int instrument_option; /* used for collect instrument data */

    int num_plannodes; /* how many plan node in this planstmt */

    int query_mem[2]; /* how many memory the query can use ,  memory in kb  */

    int assigned_query_mem[2]; /* how many memory the query is assigned   */

    bool is_dynmaic_smp;

    int dynsmp_max_cpu; /* max avaliable cpu for this dn */

    int dynsmp_avail_cpu; /* max avaliable cpu for this dn */

    int dynsmp_cpu_util;

    int dynsmp_active_statement;

    double dynsmp_query_estimate_cpu_usge;

    int dynsmp_plan_optimal_dop; /* the final optimized dop for the plan */

    int dynsmp_plan_original_dop;

    int dynsmp_dop_mem_limit; /* memory will put a limit on dop */

    int dynsmp_min_non_spill_dop; /* optimal dop cannot greater than this */

    int num_bucketmaps; /* Num of special-bucketmap stored in plannedstmt */

    uint2* bucketMap[MAX_SPECIAL_BUCKETMAP_NUM]; /* the map information need to be get */

    char* query_string; /* convey the query string to backend/stream thread of DataNode for debug purpose */

    List* subplan_ids; /* in which plan id subplan should be inited */

    List* initPlan; /* initplan in top plan node */
    /* data redistribution for DFS table.
     * dataDestRelIndex is index into the range table. This variable
     * will take effect on data redistribution state.
     */
    Index dataDestRelIndex;

    int MaxBloomFilterNum;

    int query_dop; /* Dop of current query. */

    double plannertime; /* planner execute time */

    /* set true in do_query_for_planrouter() for PlannedStmt sent to
     * the compute pool
     */
    bool in_compute_pool;

    /* true if there is/are ForeignScan node(s) of OBS foreign table
     * in plantree.
     */
    bool has_obsrel;

    List* plan_hint_warning; /* hint warning during plan generation, only used in CN */

    List* noanalyze_rellist; /* relations and attributes that have no statistics, only used in CN */

    int ng_num;                     /* nodegroup number */
    NodeGroupQueryMem* ng_queryMem; /* each nodegroup's query mem */
    bool ng_use_planA;              /* true means I am a planA, default false */

    bool isRowTriggerShippable; /* true if all row triggers are shippable. */
    bool is_stream_plan;
    bool multi_node_hint;

    uint64 uniqueSQLId;
} PlannedStmt;

commandType：存储操作类型 select, insert, update, delete
rtable：存储关系列表条目
relationOids：存储查询相关表的oid
plantree：存储由计划节点组成的计划树，每个节点有特定的操作，如顺序扫描，排序，索引扫描

计划节点的基础节点是Plan结构，其他计划结构始终包含它。

typedef struct Plan {
    NodeTag type;

    int plan_node_id;   /* node id */
    int parent_node_id; /* parent node id */
    RemoteQueryExecType exec_type;

    /*
     * estimated execution costs for plan (see costsize.c for more info)
     */
    Cost startup_cost; /* cost expended before fetching any tuples */
    Cost total_cost;   /* total cost (assuming all tuples fetched) */

    /*
     * planner's estimate of result size of this plan step
     */
    double plan_rows; /* number of global rows plan is expected to emit */
    double multiple;
    int plan_width; /* average row width in bytes */
    int dop;        /* degree of parallelism of current plan */

    /*
     * machine learning model estimations
     */
    double pred_rows;
    double pred_startup_time;
    double pred_total_time;
    long pred_max_memory;
    /*
     * MPPDB Recursive-Union Support
     *
     * - @recursive_union_plan_nodeid
     *      Pointing to its belonging RecursiveUnion's plan node id to indate if we are
     *      under RecursiveUnion
     *
     * - @recursive_union_controller
     *      Indicate if current Plan node is controller node in recursive-union steps
     *
     * - @control_plan_nodeid
     *      Normally, set on the top-plan node of a producer thread, to indicate which
     *      control-plan we need syn-up with
     *
     * - @is_sync_planode
     *      Indicate the current producer thread is the sync-up thread in recursive union,
     *      normally set on produer's top plan node
     *
     * Please note the above four variables is meaningless if a plan node is not under
     * recursive-union's recursive part
     */
    /*
     * plan node id of RecursiveUnion node where current plan node belongs to, 0 for
     * not under recursive-union
     */
    int recursive_union_plan_nodeid;

    /* flag to indicate if it is controller plan node */
    bool recursive_union_controller;

    /* plan node id of Controller plan node, 0 for not in control */
    int control_plan_nodeid;

    /* flag indicate if the current plan node is the sync node (for multi-stream case) */
    bool is_sync_plannode;

    /*
     * Common structural data for all Plan types.
     */
    List* targetlist;      /* target list to be computed at this node */
    List* qual;            /* implicitly-ANDed qual conditions */
    struct Plan* lefttree; /* input plan tree(s) */
    struct Plan* righttree;

    bool ispwj;  /* is it special for partitionwisejoin? */
    int paramno; /* the partition'sn that it is scaning */

    List* initPlan;    /* Init Plan nodes (un-correlated expr
                        * subselects) */

    List* distributed_keys; /* distributed on which key */
    ExecNodes* exec_nodes;  /*  List of Datanodes where to execute this plan    */

    /*
     * Information for management of parameter-change-driven rescanning
     *
     * extParam includes the paramIDs of all external PARAM_EXEC params
     * affecting this plan node or its children.  setParam params from the
     * node's initPlans are not included, but their extParams are.
     *
     * allParam includes all the extParam paramIDs, plus the IDs of local
     * params that affect the node (i.e., the setParams of its initplans).
     * These are _all_ the PARAM_EXEC params that affect this node.
     */
    Bitmapset* extParam;
    Bitmapset* allParam;

    // For vectorized engine, plan produce vector output
    //
    bool vec_output;
    /*
     * @hdfs
     * Mark the foreign scan whether has unique results on one of its
     * output columns.
     */
    bool hasUniqueResults;
    /*
     * Mark the plan whether includes delta table or not.
     */
    bool isDeltaTable;

    /* used to replace work_mem, maxmem in [0], and minmem in [1] */
    int operatorMemKB[2];
    /* allowed max mem after spread */
    int operatorMaxMem;

    bool parallel_enabled; /* Is it run in parallel? */
    bool hasHashFilter;    /* true for this plan has a hashfilter */

    List* var_list;        /* Need bloom filter var list. */
    List* filterIndexList; /* Need used bloomfilter array index. */

    /* used to replace work_mem */
    int** ng_operatorMemKBArray; /* for multiple logic cluster */
    int ng_num;
    double innerdistinct; /* join inner rel distinct estimation value */
    double outerdistinct; /* join outer rel distinct estimation value */
} Plan;

startup_cost, total_cost：操作的成本估算
plan_rows：Planner估算会扫描的行数
targetlist：存储查询树中包含的目标对象列表
qual：存储qual条件的列表
lefttree, righttree：添加子节点的节点

示例1
这里介绍上一节示例1的计划树，上一节示例1最廉价的访问路径树是sort path和scan path的集合，根路径是sort path，子路径是scan path。而计划树则是几乎是从最廉价的路径中生成。由下图(a) 可以看到，SortNode被加入到PlannedStmt的plantree中，而SeqScanNode则添加到SortNode的lefttree中。
SortNode中，lefttree指向SeqScanNode。
SeqScanNode中，qual持有where条件。

计划树.png

使用EXPLAIN查看执行计划：

postgres=# explain select * from tbl_1 where id < 300 order by data;
                           QUERY PLAN
----------------------------------------------------------------
 Sort  (cost=1704.33..1705.03 rows=279 width=8)
   Sort Key: data
   ->  Seq Scan on tbl_1  (cost=0.00..1693.00 rows=279 width=8)
         Filter: (id < 300)
(4 rows)

示例2
这里介绍上一节示例2的计划树，示例2最廉价的访问路径树是index path。所以计划树中只有IndexScanNode。该示例中，where条件是访问谓词，存储在IndexScanNode的indexqual中。
访问谓词：Index Cond，表达叶节点遍历的开始和停止条件。
使用EXPLAIN查看执行计划：

postgres=# explain select * from tbl_2 where id < 240;
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using tbl_2_pkey on tbl_2  (cost=0.29..13.25 rows=226 width=8)
   Index Cond: (id < 240)
(2 rows)

执行器如何执行
在单表查询中，执行器以从计划树的末尾到根部的顺序使用计划节点，然后调用执行相应节点处理的函数。
每个计划节点具有用于执行相应操作的函数，它们位于src/backend/executor/目录中。
了解执行器如何执行的最佳方法是读取EXPLAIN命令的输出，因为PostgreSQL的EXPLAIN几乎展示了计划树的内容。如示例1的EXPLAIN输出：

postgres=# explain select * from tbl_1 where id < 300 order by data;
                           QUERY PLAN
----------------------------------------------------------------
 Sort  (cost=1704.33..1705.03 rows=279 width=8)
   Sort Key: data
   ->  Seq Scan on tbl_1  (cost=0.00..1693.00 rows=279 width=8)
         Filter: (id < 300)
(4 rows)

从末尾往上读，首先读到SeqScan节点，执行器执行顺序扫描操作。然后到Sort节点，执行器对顺序扫描的结果执行排序操作。

最后编辑于：2021.06.22 18:17:23

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 203,271评论 5赞 476
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 85,275评论 2赞 380
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 150,151评论 0赞 336
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,550评论 1赞 273
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,553评论 5赞 365
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,559评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,924评论 3赞 395
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,580评论 0赞 257
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,826评论 1赞 297
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,578评论 2赞 320
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,661评论 1赞 329
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,363评论 4赞 318
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,940评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,926评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,156评论 1赞 259
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 42,872评论 2赞 349
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,391评论 2赞 342

查询处理 - 代价估计和计划树

代价估计和计划树

简单查询的成本估算

计划树

推荐阅读更多精彩内容