Scenario
Quite often, to protect write throughput, large business tables in production (tens of millions of rows and up) do not allow indexes to be added at will. And even when an index can be added, the table locking involved still affects the business.
We usually run statistics against an offline replica, but the production indexes rarely fit the statistical queries well. Without a suitable index, how can we aggregate a column efficiently?
The idea: range over the primary key, then aggregate each range.
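The gist of the trick, as a minimal sketch (the helper name is mine, not from the original post): split the id space into fixed-size primary-key ranges, aggregate each range with its own query, and merge the partial results afterwards. Generating the ranges is trivial:

```python
def id_chunks(min_id, max_id, step=1000000):
    # Yield [start, end) primary-key ranges covering min_id..max_id,
    # each at most `step` ids wide.
    start = min_id
    while start <= max_id:
        yield (start, min(start + step, max_id + 1))
        start += step

print(list(id_chunks(1, 2500000)))
```

Each (start, end) pair then becomes a `where id >= start and id < end` clause on a query that MySQL can satisfy with a primary-key range scan.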
Example
Table: tbl_pay_orders
Rows: 29,311,362
Engine: InnoDB
Columns:
+--------------+------------+------+-----+---------+----------------+
| Field        | Type       | Null | Key | Default | Extra          |
+--------------+------------+------+-----+---------+----------------+
| id           | bigint(20) | NO   | PRI | NULL    | auto_increment |
| amt          | bigint(20) | NO   |     | NULL    |                |
| created_time | int(11)    | NO   |     | NULL    |                |
+--------------+------------+------+-----+---------+----------------+
Requirement: sum the amounts in tbl_pay_orders per day.
Ignoring indexes, the SQL looks like this:
select left(from_unixtime(created_time),10) as day, sum(amt) from tbl_pay_orders group by day;
Since created_time has no index, EXPLAIN reports the following:
mysql> desc select left(from_unixtime(created_time),10) as day, sum(amt) from tbl_pay_orders group by day;
+----+-------------+----------------+------+---------------+------+---------+------+----------+---------------------------------+
| id | select_type | table          | type | possible_keys | key  | key_len | ref  | rows     | Extra                           |
+----+-------------+----------------+------+---------------+------+---------+------+----------+---------------------------------+
|  1 | SIMPLE      | tbl_pay_orders | ALL  | NULL          | NULL | NULL    | NULL | 29311362 | Using temporary; Using filesort |
+----+-------------+----------------+------+---------------+------+---------+------+----------+---------------------------------+
1 row in set (0.00 sec)
A full table scan: all 29,311,362 rows.
What if we lean on the primary key instead? The SQL might look like this:
select left(from_unixtime(created_time),10) as day, sum(amt) from tbl_pay_orders where id >= 18000000 and id < 19000000 group by day;
EXPLAIN now reports:
mysql> desc select left(from_unixtime(created_time),10) as day, sum(amt) from tbl_pay_orders where id >= 18000000 and id < 19000000 group by day;
+----+-------------+----------------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table          | type  | possible_keys | key     | key_len | ref  | rows    | Extra                                        |
+----+-------------+----------------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | tbl_pay_orders | range | PRIMARY       | PRIMARY | 8       | NULL | 1879550 | Using where; Using temporary; Using filesort |
+----+-------------+----------------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
1 row in set (0.01 sec)
This time the query does a range scan on the PRIMARY key (we added no new index; the primary key was there all along), and the rows examined drop to 1,879,550. At that scale MySQL can handle the aggregation comfortably.
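One detail the per-range GROUP BY leaves open: a 1M-id chunk can straddle a day boundary, and a busy day can span several chunks, so the partial sums still have to be merged client-side. A minimal sketch of that merge (the function name is mine):

```python
from collections import defaultdict

def merge_chunk_sums(chunks):
    # Each element of `chunks` is the result of one range-bounded
    # GROUP BY query: a list of (day, partial_sum) rows.
    totals = defaultdict(int)
    for rows in chunks:
        for day, partial in rows:
            totals[day] += partial
    return dict(totals)

chunks = [
    [("2016-03-01", 100), ("2016-03-02", 50)],  # chunk straddles midnight
    [("2016-03-02", 70), ("2016-03-03", 30)],   # next chunk continues the day
]
print(merge_chunk_sums(chunks))
```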
The question now: how do we know each day's id range?
The answer: build the index offline, day by day, producing a mapping from day to start_id.
# Build the day -> start_id mapping offline, appending boundaries to idx.txt.
# Seed idx.txt with a single line "<start_id>,<start_day>" before the first run.
# dao.select_sql is the author's DB helper, returning rows as dicts.
last_rid, last_day = open("idx.txt").readlines()[-1].strip().split(",")
with open("idx.txt", "a") as wf:
    while True:
        sql = ("select id, left(from_unixtime(created_time), 10) as day "
               "from tbl_pay_orders where id > %s and id < %s"
               % (last_rid, int(last_rid) + 1000000))
        items = dao.select_sql(sql)
        if not items:                       # ran past the end of the table
            break
        for item in items:
            day = item["day"].replace("-", "")
            if day != last_day:             # first row of a new day: record it
                wf.write("%s,%s\n" % (item["id"], day))
                last_day = day
        last_rid = items[-1]["id"]          # advance to the next 1M-id chunk
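Once idx.txt is built, each line marks the first id seen on a new day, so consecutive lines bound a day's id range. A small helper (hypothetical, not part of the original script) that turns those lines into per-day ranges ready to plug into the WHERE clause:

```python
def day_ranges(lines):
    # Each idx.txt line is "<first_id_of_day>,<day>", so a day's id range
    # ends where the next day's begins; the last day is open-ended (None).
    ranges = []
    prev_id = prev_day = None
    for line in lines:
        rid, day = line.strip().split(",")
        if prev_id is not None:
            ranges.append((prev_day, int(prev_id), int(rid)))
        prev_id, prev_day = rid, day
    if prev_id is not None:
        ranges.append((prev_day, int(prev_id), None))
    return ranges

lines = ["18000000,20160301", "18410000,20160302", "18820000,20160303"]
print(day_ranges(lines))
```

Each (day, start, end) triple maps to `where id >= start and id < end`, which MySQL serves with a primary-key range scan, exactly as in the EXPLAIN above.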