1. References
Hive's four sort methods, plus partitioning and bucketing
On Hive's order by, sort by, distribute by and cluster by
group by & distribute by & partition by & cluster by & partitioned by & clustered by in Hive HQL
2. Getting started
The four ways to sort:
- order by
- sort by
- distribute by
- cluster by
Let's look at each in turn.
order by
Global sort: everything runs through a single reducer, regardless of the configured mapreduce.job.reduces, and all rows are sorted (ascending by default).
select * from hotel_checkin_log order by order_time;
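To see why the reducer count doesn't matter here, a minimal Python sketch (the helper name and sample values are made up for illustration): every mapper's output is funneled into one reducer, which sorts the entire dataset.

```python
# order by: ALL mapper outputs are merged into a single reducer,
# so the result is globally sorted no matter how many reducers are configured.
def order_by(mapper_outputs, key):
    all_rows = [row for part in mapper_outputs for row in part]
    return sorted(all_rows, key=key)

# Hypothetical order_time values, grouped by the mapper that emitted them.
mappers = [["12:00:00", "02:55:22"], ["09:20:00"], ["02:55:22"]]
print(order_by(mappers, key=str))
# -> ['02:55:22', '02:55:22', '09:20:00', '12:00:00']
```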
sort by
Per-reducer sort: the number of reducers follows the configured mapreduce.job.reduces, so from a global point of view the output is not sorted.
Before sorting, let's confirm the number of reducers:
hive (db_company)> set mapred.reduce.tasks;
mapred.reduce.tasks=4
That's 4 reducers. Now run sort by, ordering by order_time ascending:
select * from hotel_checkin_log sort by order_time;
Now look at the sorted data; the third column from the right (order_time) is the sort column.
Judging by order_time, this result is clearly not globally sorted, but from the data we can work out which rows each reducer produced (below I use a blank line to separate each reducer's output; the actual output has no blank lines).
hotel_checkin_log.code hotel_checkin_log.checkin_date hotel_checkin_log.checkin_time hotel_checkin_log.checkout_date hotel_checkin_log.checkout_time hotel_checkin_log.order_time hotel_checkin_log.uid hotel_checkin_log.order_date
10271563 2020-03-20 14:00 2020-03-21 12:00 02:55:22 1124759681900888144 2020-03-19
10271563 2020-04-02 14:00 2020-04-03 12:00 02:55:22 1123659685898059853 2020-04-01
02371493 2020-04-01 17:00 2020-04-02 12:00 09:20:00 1133659694081146884 2020-03-31
00277553 2020-03-22 12:00 2020-03-25 12:00 12:00:00 1213559689173811267 2020-03-20

02371493 2020-04-02 17:00 2020-04-03 12:00 09:20:00 1133659694081146884 2020-04-01
02371493 2020-03-20 17:00 2020-03-21 12:00 09:20:00 1236559684014817343 2020-03-19
00277553 2020-04-02 12:00 2020-04-03 12:00 12:00:00 1213559689173811267 2020-04-01
00277553 2020-03-19 12:00 2020-03-22 12:00 12:00:00 1124559675462631493 2020-03-18

10271563 2020-03-19 14:00 2020-03-20 12:00 02:55:22 1431259675035954008 2020-03-18
10271563 2020-04-01 14:00 2020-04-02 12:00 02:55:22 1123659685898059853 2020-03-31
02371493 2020-03-19 17:00 2020-03-19 12:00 09:20:00 1634659676964192299 2020-03-18
00277553 2020-04-01 12:00 2020-04-02 12:00 12:00:00 1213559689173811267 2020-03-31
00277553 2020-03-20 12:00 2020-03-22 12:00 12:00:00 1234759682320318549 2020-03-19

10271563 2020-03-21 14:00 2020-03-23 12:00 02:55:22 1123659685898059853 2020-03-20
02371493 2020-03-22 17:00 2020-03-23 12:00 09:20:00 1133659694081146884 2020-03-20
Time taken: 33.041 seconds, Fetched: 15 row(s)
To verify the per-reducer sort more easily, let's write the result to files. As we know, each reducer produces one output file:
insert overwrite local directory '/opt/envs/datas/sortby'
select * from hotel_checkin_log sort by order_time;
The file count confirms it:
[root@e2d21b78fdb0 sortby]# ll
total 16
-rw-r--r-- 1 root root 332 Jun 25 11:07 000000_0
-rw-r--r-- 1 root root 332 Jun 25 11:07 000001_0
-rw-r--r-- 1 root root 415 Jun 25 11:07 000002_0
-rw-r--r-- 1 root root 166 Jun 25 11:07 000003_0
I also checked the file contents, and they match the deduction above exactly. Here I'll only paste the content of 000002_0 (fields are joined by Hive's default \001 delimiter; spaces are added below for readability):
10271563 2020-03-19 14:00 2020-03-20 12:00 02:55:22 1431259675035954008 2020-03-18
10271563 2020-04-01 14:00 2020-04-02 12:00 02:55:22 1123659685898059853 2020-03-31
02371493 2020-03-19 17:00 2020-03-19 12:00 09:20:00 1634659676964192299 2020-03-18
00277553 2020-04-01 12:00 2020-04-02 12:00 12:00:00 1213559689173811267 2020-03-31
00277553 2020-03-20 12:00 2020-03-22 12:00 12:00:00 1234759682320318549 2020-03-19
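The behavior we just saw can be sketched in a few lines of Python (a toy model, not Hive's implementation): each reducer sorts only the rows it happened to receive, so each output file is ordered internally while the concatenation of the files is not.

```python
# sort by: each reducer sorts ONLY the rows it received; no global order.
def sort_by(rows, num_reducers, key):
    # Without DISTRIBUTE BY, Hive spreads rows across reducers; here we
    # deal them out round-robin purely for illustration.
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        buckets[i % num_reducers].append(row)
    return [sorted(b, key=key) for b in buckets]

times = ["12:00:00", "02:55:22", "09:20:00", "12:00:00", "02:55:22", "09:20:00"]
files = sort_by(times, num_reducers=4, key=str)
# Every "file" is sorted internally...
assert all(f == sorted(f) for f in files)
# ...but simply concatenating the files gives no global order.
flat = [t for f in files for t in f]
print(flat == sorted(flat))
# -> False
```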
distribute by
Partitioned sort, used together with sort by. Note that it has nothing whatsoever to do with whether the table is a partitioned table.
Rows are divided among the reducer output files according to the specified column(s).
By default this is hash partitioning: each row is routed by the hash of its partition column.
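That routing can be sketched as follows (the byte-sum hash is a toy stand-in, not Hive's actual hash function, so the reducer numbers it prints won't match Hive's file assignment):

```python
# distribute by: each row goes to the reducer numbered hash(key) % num_reducers,
# so all rows sharing a partition-column value land in the same output file.
def reducer_for(partition_key, num_reducers):
    # Toy deterministic hash (byte sum). Hive's real hash differs, and
    # Python's builtin hash() is process-salted, so neither reproduces
    # Hive's exact assignment -- this shows only the mechanism.
    return sum(partition_key.encode()) % num_reducers

for d in ["2020-03-18", "2020-03-19", "2020-03-20", "2020-03-31", "2020-04-01"]:
    print(d, "-> reducer", reducer_for(d, 3))
```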
Before sorting, let's again confirm the number of reducers:
hive (db_company)> set mapred.reduce.tasks;
mapred.reduce.tasks=3
That's 3 reducers. Now run distribute by, partitioning by order_date and sorting by order_time ascending:
select * from hotel_checkin_log distribute by order_date sort by order_time;
The result is below. It's not very intuitive, so let's write it to files as well.
hotel_checkin_log.code hotel_checkin_log.checkin_date hotel_checkin_log.checkin_time hotel_checkin_log.checkout_date hotel_checkin_log.checkout_time hotel_checkin_log.order_time hotel_checkin_log.uid hotel_checkin_log.order_date
10271563 2020-03-20 14:00 2020-03-21 12:00 02:55:22 1124759681900888144 2020-03-19
10271563 2020-04-01 14:00 2020-04-02 12:00 02:55:22 1123659685898059853 2020-03-31
02371493 2020-04-01 17:00 2020-04-02 12:00 09:20:00 1133659694081146884 2020-03-31
02371493 2020-03-20 17:00 2020-03-21 12:00 09:20:00 1236559684014817343 2020-03-19
00277553 2020-03-20 12:00 2020-03-22 12:00 12:00:00 1234759682320318549 2020-03-19
00277553 2020-04-01 12:00 2020-04-02 12:00 12:00:00 1213559689173811267 2020-03-31
10271563 2020-04-02 14:00 2020-04-03 12:00 02:55:22 1123659685898059853 2020-04-01
10271563 2020-03-21 14:00 2020-03-23 12:00 02:55:22 1123659685898059853 2020-03-20
02371493 2020-03-22 17:00 2020-03-23 12:00 09:20:00 1133659694081146884 2020-03-20
02371493 2020-04-02 17:00 2020-04-03 12:00 09:20:00 1133659694081146884 2020-04-01
00277553 2020-03-22 12:00 2020-03-25 12:00 12:00:00 1213559689173811267 2020-03-20
00277553 2020-04-02 12:00 2020-04-03 12:00 12:00:00 1213559689173811267 2020-04-01
10271563 2020-03-19 14:00 2020-03-20 12:00 02:55:22 1431259675035954008 2020-03-18
02371493 2020-03-19 17:00 2020-03-19 12:00 09:20:00 1634659676964192299 2020-03-18
00277553 2020-03-19 12:00 2020-03-22 12:00 12:00:00 1124559675462631493 2020-03-18
Save the data to files and take another look:
insert overwrite local directory '/opt/envs/datas/distributeby'
select * from hotel_checkin_log distribute by order_date sort by order_time;
First, the file count:
[root@e2d21b78fdb0 distributeby]# ll
total 12
-rw-r--r-- 1 root root 498 Jun 25 12:42 000000_0
-rw-r--r-- 1 root root 498 Jun 25 12:42 000001_0
-rw-r--r-- 1 root root 249 Jun 25 12:42 000002_0
Then the contents of each file (fields are joined by Hive's default \001 delimiter; spaces are added below for readability):
[root@e2d21b78fdb0 distributeby]# more 000000_0
10271563 2020-03-20 14:00 2020-03-21 12:00 02:55:22 1124759681900888144 2020-03-19
10271563 2020-04-01 14:00 2020-04-02 12:00 02:55:22 1123659685898059853 2020-03-31
02371493 2020-04-01 17:00 2020-04-02 12:00 09:20:00 1133659694081146884 2020-03-31
02371493 2020-03-20 17:00 2020-03-21 12:00 09:20:00 1236559684014817343 2020-03-19
00277553 2020-03-20 12:00 2020-03-22 12:00 12:00:00 1234759682320318549 2020-03-19
00277553 2020-04-01 12:00 2020-04-02 12:00 12:00:00 1213559689173811267 2020-03-31
[root@e2d21b78fdb0 distributeby]# more 000001_0
10271563 2020-04-02 14:00 2020-04-03 12:00 02:55:22 1123659685898059853 2020-04-01
10271563 2020-03-21 14:00 2020-03-23 12:00 02:55:22 1123659685898059853 2020-03-20
02371493 2020-03-22 17:00 2020-03-23 12:00 09:20:00 1133659694081146884 2020-03-20
02371493 2020-04-02 17:00 2020-04-03 12:00 09:20:00 1133659694081146884 2020-04-01
00277553 2020-03-22 12:00 2020-03-25 12:00 12:00:00 1213559689173811267 2020-03-20
00277553 2020-04-02 12:00 2020-04-03 12:00 12:00:00 1213559689173811267 2020-04-01
[root@e2d21b78fdb0 distributeby]# more 000002_0
10271563 2020-03-19 14:00 2020-03-20 12:00 02:55:22 1431259675035954008 2020-03-18
02371493 2020-03-19 17:00 2020-03-19 12:00 09:20:00 1634659676964192299 2020-03-18
00277553 2020-03-19 12:00 2020-03-22 12:00 12:00:00 1124559675462631493 2020-03-18
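Putting the two clauses together, the pipeline above is roughly: route each row by the hash of order_date, then have every reducer sort its own rows by order_time. A toy Python sketch (the byte-sum hash is a stand-in, not Hive's actual hash, so the row-to-file assignment differs from the listing above):

```python
# distribute by X sort by Y: bucket rows by X's hash, then sort each bucket by Y.
def distribute_sort(rows, dist_key, sort_key, num_reducers):
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # Toy stand-in for Hive's hash of the DISTRIBUTE BY column.
        buckets[sum(dist_key(row).encode()) % num_reducers].append(row)
    return [sorted(b, key=sort_key) for b in buckets]

# (order_date, order_time) pairs; equal order_date -> same output file.
rows = [("2020-03-18", "12:00:00"), ("2020-03-19", "02:55:22"),
        ("2020-03-18", "09:20:00"), ("2020-03-19", "12:00:00")]
files = distribute_sort(rows, dist_key=lambda r: r[0],
                        sort_key=lambda r: r[1], num_reducers=3)
for f in files:
    # Each order_date stays within one file, sorted there by order_time.
    print(f)
```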
cluster by
1> When the distribute by and sort by columns are the same, cluster by can be used as shorthand for both.
2> The sort can only be ascending; you cannot specify asc or desc.
select * from hotel_checkin_log cluster by order_date;
Specifying descending order produces the following error:
hive (db_company)> select * from hotel_checkin_log cluster by order_date desc;
FAILED: ParseException line 1:54 extraneous input 'desc' expecting EOF near '<EOF>'
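As noted in 1>, cluster by order_date behaves like distribute by order_date sort by order_date. A toy Python sketch of that behavior (the byte-sum hash is a stand-in for Hive's, and the fixed ascending sort mirrors the restriction in 2>):

```python
# cluster by X == distribute by X sort by X, ascending only.
def cluster_by(rows, key, num_reducers):
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # Toy stand-in for Hive's hash of the clustering column.
        buckets[sum(key(row).encode()) % num_reducers].append(row)
    # Each reducer sorts by the SAME column, always ascending; there is
    # no knob for descending order (hence the ParseException above).
    return [sorted(b, key=key) for b in buckets]

dates = ["2020-03-19", "2020-03-18", "2020-04-01", "2020-03-18"]
for f in cluster_by(dates, key=str, num_reducers=3):
    print(f)
```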