文章目录
背景:应项目需求,目前有一个业务需要每分钟向数据库中插入一批数据,然后另一个程序会从数据库中读取某一时间段内,某些特征的数据。随着时间的积累,数据量提升,查询时间爆发式增加,不得不优化数据库。本次优化只考虑了查询优化,没有考虑优化后插入效率,为了减轻影响批量插入时采用的是一条insert…values(),()…形式语句。以下记录查询优化过程。
一、准备
1.1、工具
先介绍一下,可能会使用到的mysql相关的工具。
1、清除查询缓存
reset query cache;
2、清理查询缓存碎片以提高内存使用性能:
FLUSH QUERY CACHE;
3、MySQL查询分析器
EXPLAIN或DESC
1.2、环境参数
1、测试服务器采用腾讯云服务器[ dmidecode |more ]
CPU:单核,最大 2000 MHz
内存: 1 GB
2、系统
Linux VM-122-135-ubuntu 3.13.0-36-generic
#63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
3、数据库版本
mysql Ver 14.14 Distrib 5.5.49, for debian-linux-gnu (x86_64) using readline 6.3
数据库引擎:Innodb
4、数据表格
总共:2801401条记录
二、简单测试
注意每次执行SQl前都会清空缓存
2.1、未加索引
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1469319304 and 1469405704 order by `time`;
2.2、建立time索引
ALTER TABLE `statistics_data` ADD INDEX ( `time` ) ;
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1469319304 and 1469405704 order by `time`;
我们发现SQL并没有使用索引,所以用 force index (time) 强制使用索引
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data force index (time) where statistics_id in (20,21,22,23,24,25) and `time` between 1469319304 and 1469405704 order by `time`;
这个结果基本能满足要求了。稍后会有新的问题的。
2.3、between与< >
网上有分析说between和< >性能会有差别,于是将between换成< >,最终发现结果并没有影响。
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` >1469319304 and `time` <1469405704 order by `time`;
2.4、建立statistics_id索引
ALTER TABLE `statistics_data` ADD INDEX (`statistics_id`) ;
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1469319304 and 1469405704 order by `time`;
结果很让人意外,使用索引的查询时间比不使用索引几乎慢很多。最后查找资料才知道问题原因:列的值唯一性太小(如性别,同值的数据超过表的百分之15),不适合建索引,否则可能会使查询效率变差。这time和statistics_id对比充分说明这个问题。
三、查询结果太大带来的问题
本来以为建立statistics_id索引就可以解决问题了,但是当把查询条件放宽,结果又让人意外。
3.1、不用索引
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1468979827 and 1469584627 order by `time`;
3.2、强制time索引
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/ total_count)*100 AS UNSIGNED) as rate FROM statistics_data force index (time) where statistics_id in (20,21,22,23,24,25) and `time` between 1468979827 and 1469584627 order by `time`;
问题又出现了,此时使用索引比不使用索引查询效率又低了一倍。为什么查询结果数据量大了,索引使用的时间会增加这么快?问题出现在哪儿呢?猜测可能的原因:
1、其实statistics_id和time一样,值唯一性也不大,只是在结果数据量小时,被问题掩盖了;
2、由于查询结果增加,根据数据表中数据分布情况,可能索引导致I/O上的开销变得更大了;
3.3、采用statistics_id、time联合索引
ALTER TABLE `statistics_data` ADD INDEX ( `time`, `statistics_id` ) ;
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/ total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1468979827 and 1469584627 order by `time`;
同样发现没有使用索引,于是强制使用联合索引:
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/ total_count)*100 AS UNSIGNED) as rate FROM statistics_data force index (time_2) where statistics_id in (20,21,22,23,24,25) and `time` between 1468979827 and 1469584627 order by `time`;
结果很意外,索引还是开销了更多时间,那现在索引带来时间开销,很可能是I/O带来的时间消耗。索引其实是单独的问题,引擎先去找索引,然后根据索引上的指针去找数据.如果索引效率本身不高,那就是额外的消耗了。
到这里又进入里困境,本来以为的建立time就可以解决问题了,现在发现当返回结果太大时,反而会降低效率,于是不得停下来换一个思路,重新分析数据。
四、数据分析
4.1、实验数据整理
4.2、分析表格数据
分析发现 298*9400>2801178,然后分析具体业务发现,同一分钟内statistics_id是唯一的,考虑(statistics_id、time)做联合主键,原因:
1、mysql的Innodb引擎是按主键来组织数据存储的(聚簇索引),所以比起联合索引I/O问题小;
2、自增ID在该表中意义不大,没有业务使用该ID;
3、目前该表格的SQL语句只有两种,一个是批量插入,另一个是上面的条件查询;
考虑到这些,于是开始实际测试联合主键的性能。
五、联合主键
5.1、删除主键
因为之前表格中存在索引和自增ID做主键,为了分析对比,在删除索引后还要删除主键:
ALTER TABLE `statistics_data` CHANGE `id` `id` INT( 10 ) UNSIGNED NOT NULL ; Alter table statistics_data drop primary key;
1、小数据量查询
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1469319304 and 1469405704 order by `time`;
2、大数据量查询
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/ total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1468979827 and 1469584627 order by `time`;
3、小数据量加time索引
ALTER TABLE `statistics_data` ADD INDEX ( `time` ) ;
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data force index (time) where statistics_id in (20,21,22,23,24,25) and `time` between 1469319304 and 1469405704 order by `time`;
4、大数据量加联合索引
ALTER TABLE `statistics_data` ADD INDEX ( `time`, `statistics_id` ) ;
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data force index (time) where statistics_id in (20,21,22,23,24,25) and `time` between 1468979827 and 1469584627 order by `time`;
我们发现,不使用主键,总体效率都要比使用主键的查询效率低,所以在建数据表时主键还是需要设置的。
5.2、使用联合主键
新建联合主键如下:
Alter table statistics_data add primary key(statistics_id,time);
1、小数据量无索引查询
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1469319304 and 1469405704 order by `time`;
结果已经出来了,速度提升非常明显。
2、大数据量无索引
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data where statistics_id in (20,21,22,23,24,25) and `time` between 1468979827 and 1469584627 order by `time`;
3、全数据,无索引
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data order by `time`;
这种情况下用时很多,一方面数据量大,另一方面查询没有使用聚合索引。但是在使用情况中不会出现这种sql查询,所以对结果没有影响。
4、小数据量time索引
ALTER TABLE `statistics_data` ADD INDEX ( `time` ) ;
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data force index (time) where statistics_id in (20,21,22,23,24,25) and `time` between 1469319304 and 1469405704 order by `time`;
5、大数据量联合索引
ALTER TABLE `statistics_data` ADD INDEX ( `time`, `statistics_id` ) ;
EXPLAIN SELECT id, connect_count, total_count, monitor_rate, `time`, CAST((connect_count/total_count)*100 AS UNSIGNED) as rate FROM statistics_data force index (time) where statistics_id in (20,21,22,23,24,25) and `time` between 1468979827 and 1469584627 order by `time`;
六、写在最后
6.1、所有测试数据
6.2、总结
1、这次优化主要围绕这几个问题:
A、社什么样的数据适合建索引?(T3)
B、查询结果和数据分布,对索引效率的影响?(T5)
C、联合主键为什么比联合索引效率高?(T12)
2、反思:
A、表格需要主键;
B、慎用自增ID;
C、索引不一定好用;
这次优化数据其实还有很多可以深入挖掘的地方,由于时间关系,就不多介绍了。最后决定使用联合主键来优化查询。
附件
记录文档 一次Mysql 查询优化过程.doc
测试数据 暂不能公开
参考
由于在处理过程中查阅的资料太多,没有记录,就不列举了。
转载标明出处:https://blog.evanxia.com/2016/07/877