# 1.记一次MySQL死锁的排障经历

在做系统异常日志巡检的时候,发现了如下异常堆栈:

2021-10-10 17:39:24.609 [TID:3812.30542.16338587645746445] [ERROR] 16257 [http-nio-8952-exec-215] csas-audit org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/].[dispatcherServlet]-182 Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.dao.DeadlockLoserDataAccessException: 
### Error updating database.  Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
### The error may involve defaultParameterMap
### The error occurred while setting parameters
### SQL: UPDATE interact_info_audit         SET audit_status = ?,         audit_name = ?,         audit_receive_time = now()         WHERE             aisle_id = ?             AND audit_status = ?         ORDER BY             audit_info_id             LIMIT ?
### Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
; Deadlock found when trying to get lock; try restarting transaction; nested exception is com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction] with root cause
com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
	at ...
1
2
3
4
5
6
7
8
9

一个典型的MySQL死锁问题,下面一起来分析下死锁形成的原因吧

# 信息收集

首先需要知道,在业务日志中异常堆栈里抛出来的,只是多个引起死锁的SQL中的其中一句

那么要知道其余的SQL,可以执行SHOW ENGINE INNODB STATUS;命令查看更多信息:

mysql> show engine innodb status \G
*************************** 1. row ***************************
  Type: InnoDB
  Name: 
Status: 
=====================================
2021-11-02 21:42:04 0x7eccc9d71700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 17 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 5207109 srv_active, 0 srv_shutdown, 0 srv_idle
srv_master_thread log flush and writes: 5207109
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 785810
OS WAIT ARRAY INFO: signal count 720048
RW-shared spins 0, rounds 2031015, OS waits 601608
RW-excl spins 0, rounds 351594, OS waits 2781
RW-sx spins 6786, rounds 66392, OS waits 719
Spin rounds per wait: 2031015.00 RW-shared, 351594.00 RW-excl, 9.78 RW-sx
------------------------
LATEST DETECTED DEADLOCK
------------------------
2021-11-02 15:27:31 0x7eccc96d7700
*** (1) TRANSACTION:
TRANSACTION 2873479758, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1136, 4 row lock(s)
MySQL thread id 1187198, OS thread handle 139418004776704, query id 3055533971 10.172.53.90 m_interact System lock
UPDATE interact_info_audit
        SET audit_status = 2,
        audit_name = 'zbs003tuyunhai',
        audit_receive_time = now()
        WHERE
            aisle_id = 1236275659921776642
            AND audit_status = 1
        ORDER BY
            audit_info_id
            LIMIT 19
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1304637 page no 166503 n bits 136 index PRIMARY of table `interactdb`.`interact_info_audit` trx id 2873479758 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 2873479759, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
5 lock struct(s), heap size 1136, 4 row lock(s), undo log entries 1
MySQL thread id 1187761, OS thread handle 139418017822464, query id 3055533974 10.172.53.89 m_interact updating
UPDATE interact_info_audit
        SET audit_status = 2,
        audit_name = 'zbs003shilongting',
        audit_receive_time = now()
        WHERE
            aisle_id = 1236275659921776642
            AND audit_status = 1
        ORDER BY
            audit_info_id
            LIMIT 18
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1304637 page no 166503 n bits 136 index PRIMARY of table `interactdb`.`interact_info_audit` trx id 2873479759 lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1304637 page no 134649 n bits 952 index idx_audit_status2 of table `interactdb`.`interact_info_audit` trx id 2873479759 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (2)

...

----------------------------
END OF INNODB MONITOR OUTPUT
============================

1 row in set (0.00 sec)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72

可以发现返回结果中记录了最近一次检测到的死锁信息

也就是需要重点关注的LATEST DETECTED DEADLOCK中的部分内容

现在知道了引起死锁的是同一句SQL,只是入参不同

同时也用SHOW VARIABLES LIKE '%tx_isolation%';命令查看一下数据库当前的事务隔离级别

image-20211110163055278

可以看到是MySQL的InnoDB存储引擎默认的RR级别(可以解决幻读问题,后文细说)

再用SHOW CREATE TABLE interact_info_audit;命令看下这张表的DDL:

 

























CREATE TABLE `interact_info_audit` (
  `audit_info_id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT '主键自增',
  `interact_info_id` bigint(20) NOT NULL COMMENT '互动信息编号',
  `aisle_id` bigint(20) DEFAULT '0' COMMENT '互动审核通道ID',
  `aisle_enter_time` datetime DEFAULT NULL COMMENT '入通道时间',
  `database_enter_time` datetime DEFAULT NULL COMMENT '入库时间',
  `audit_status` int(11) DEFAULT NULL COMMENT '审核状态 (1-待审核;2-审核中;3-已审核)',
  `audit_result` int(11) DEFAULT '0' COMMENT '审核结果 (1-机审通过;4-人工审核通过;2-人工审核不通过;3-屏蔽词不通过;5-黑库不通过;6-屏蔽;7-敏感词不通)',
  `is_sample` int(11) DEFAULT '0' COMMENT '黑样本标记 (1-是;2-否)',
  `audit_name` varchar(64) DEFAULT NULL COMMENT '审核人名称',
  `audit_receive_time` datetime DEFAULT NULL COMMENT '审核提取时间',
  `audit_finish_time` datetime DEFAULT NULL COMMENT '审核完成时间',
  `audit_version` int(11) NOT NULL DEFAULT '0' COMMENT '人工审核乐观锁,审核一次num+1',
  `batch_id` varchar(64) DEFAULT NULL COMMENT '批ID,UUID,一键清理通道数据使用',
  PRIMARY KEY (`audit_info_id`) USING BTREE,
  UNIQUE KEY `interact_info_id` (`interact_info_id`) USING BTREE,
  KEY `database_enter_time_idx` (`database_enter_time`) USING BTREE,
  KEY `audit_result_idx` (`audit_result`) USING BTREE,
  KEY `audit_finish_time_idx` (`audit_finish_time`) USING BTREE,
  KEY `aisle_id_idx` (`aisle_id`) USING BTREE,
  KEY `aisle_enter_time_idx` (`aisle_enter_time`) USING BTREE,
  KEY `idx_audit_status2` (`audit_status`,`audit_finish_time`),
  KEY `idx_aisle_id_audit_name` (`aisle_id`,`audit_name`),
  KEY `idx_audit_status_aisle_id` (`audit_status`,`aisle_id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1401196293340430338 DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC COMMENT='互动审核信息表 '
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

可以发现SQL中用于做条件查询的aisle_id字段是非UNIQUE

最后看下这句SQL的执行计划:

EXPLAIN UPDATE interact_info_audit
SET audit_status = 2,
audit_name = 'zbs003tuyunhai',
audit_receive_time = now()
WHERE
	aisle_id = 1236275659921776642
	AND audit_status = 1
ORDER BY
	audit_info_id
	LIMIT 19;
1
2
3
4
5
6
7
8
9
10

返回结果如下:

image-20211110163259006

image-20211110163330965

可以看到使用的是idx_audit_status2,也就是audit_status,audit_finish_time这条联合索引

至此,需要了解的信息已经收集完成了,接下来开始分析,为什么会产生死锁

# 故障复现

# 原因分析

开始分析之前,先解释一下这句出现死锁问题的SQL的业务意义:

它是一句用于提取通道内待审核数据的SQL语句,多个审核员(audit_name)会从同一个通道内(aisle_id)获取到一部分的数据,并将这部分数据按照创建时间(audit_info_id作为有索引的字段可替代创建时间字段来排序)正序排列,并限制条数(LIMIT 19),更新状态为审核中(audit_status = 1)

在系统正常运行过程中,某个通道内会包含80%的待审核内容(例如弹幕、聊天内容,它们会相对于评论或者其它业务场景下的数据有更高的存有量),而相对于全部状态的内容来说,只有极少一部分内容为审核中状态

那么,在where条件中使用到的字段只有aisle_idaudit_status,已有索引中,可供优化器的选择也就只有这以下四个:

  • KEY aisle_id_idx (aisle_id) USING BTREE
  • KEY idx_audit_status2 (audit_status,audit_finish_time)
  • KEY idx_aisle_id_audit_name (aisle_id,audit_name)
  • KEY idx_audit_status_aisle_id (audit_status,aisle_id) USING BTREE

按常识来看,可能大部分人会觉得优化器会选择一个最左字段为aisle_id的联合索引,但是在以上前提条件下,MySQL的查询优化器自动选择了一个最左字段为audit_status的联合索引:

KEY `idx_audit_status2` (`audit_status`,`audit_finish_time`)
1

其实是十分合理的,毕竟audit_status = 1这个条件相对于aisle_id = ?足以过滤掉绝更多数据行

SELECT
	count(
	IF
	( aisle_id = 1209391489069170689, 1, NULL )) / count(*) AS p 
FROM
	interact_info_audit;
1
2
3
4
5
6

image-20211128131108760

SELECT
	count(
	IF
	( audit_status = 1, 1, NULL )) / count(*) AS p 
FROM
	interact_info_audit;
1
2
3
4
5
6

image-20211128131140137

但这里需要注意的是,即便是索引能够对查询效率起到提升,仍然不建议在更新频繁且区分度低的字段上建立索引,因为在非主键字段建立的都是二级索引,会导致取数据的时候先从二级索引拿到主键,再用主键去主键索引回表,如果记录数过多,可能会导致回表IO次数远大于全表扫描造成的IO次数,这样从效率上来看就显得得不偿失了,不过从文中场景来看,还是利大于弊的

aisle_idaudit_status这两个字段的区分度如下:

SELECT
	count( DISTINCT aisle_id ) / count(*)
FROM
	interact_info_audit;
1
2
3
4

image-20211121125630479

SELECT
	count( DISTINCT audit_status ) / count(*)
FROM
	interact_info_audit;
1
2
3
4

image-20211121125645852

可以看到以上两个字段的区分度都非常低,可以理解为像ID那种不重复的字段,区分度就非常高;而像性别这样的字段,区分度就非常低,因为字段的取值集合内元素相对于整个表的记录数就非常少了

那么问题来了,为什么没有选择这条索引呢

KEY `idx_audit_status_aisle_id` (`audit_status`,`aisle_id`) USING BTREE
1

答案笔者也不清楚,如果有人能解答的话,麻烦联系下笔者~

至此,我们已经知道了这条SQL会走一条二级索引,并且要注意的是,是一条非唯一的二级索引,那么在update语句执行过程中,是怎么产生死锁的呢?

要搞明白死锁的产生场景,我们先要知道MySQL的锁

# 复盘总结

最近更新: 12/22/2021, 3:56:58 PM