一次gc current retry引发的数据库宕机问题分析

微信扫一扫,分享到朋友圈

一次gc current retry引发的数据库宕机问题分析

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址: 一次GC CURRENT RETRY引发的数据库宕机问题分析

今天一套RAC数据库宕机,比较麻烦的是没有oswatch数据,diag的日志也没有在宕机的时候做详细的systemdump输出。导致分析有点困难,但是我们还是通过DBA_HIST_ACTIVE_SESS_HISTORY发现了问题的一段端倪。大致经过时这样。

节点一:alert日志

2020-09-14T11:08:53.607751+08:00
minact-scn: useg scan erroring out with error e:12751
2020-09-14T11:08:59.629871+08:00
IPC Send timeout detected. Sender: ospid 50947 [oracle@db-billing-inst01]
Receiver: inst 2 binc 1101851292 ospid 389144
2020-09-14T11:08:59.633865+08:00
IPC Send timeout to 2.3 inc 4 for msg type 65521 from opid 5044
2020-09-14T11:08:59.637822+08:00
IPC Send timeout: Terminating pid 5044 osid 50947
2020-09-14T11:09:00.136296+08:00
IPC Send timeout detected. Sender: ospid 50945 [oracle@db-billing-inst01]
Receiver: inst 2 binc 1101851292 ospid 389144
2020-09-14T11:09:00.137402+08:00
IPC Send timeout to 2.3 inc 4 for msg type 65521 from opid 4946
2020-09-14T11:09:53.337493+08:00
CL00: delete dead process - failed
2020-09-14T11:10:03.397211+08:00
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
2020-09-14T11:10:03.585682+08:00
Dumping diagnostic data in directory=[cdmp_20200914111003], requested by (instance=2, osid=389146 (LMS3)), summary=[abnormal instance termination].
2020-09-14T11:10:06.171694+08:00

从11点08分开始出现IPC Send timeout detected. Sender。而在11点过10分,发生了节点驱逐。

节点二:alert日志

2020-09-14T11:10:03.585841+08:00
System state dump requested by (instance=2, osid=389146 (LMS3)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/inst/inst2/trace/inst2_diag_389116_20200914111003.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/inst/inst2/trace/inst2_lmon_389136.trc (incident=863530):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/inst/inst2/incident/incdir_863530/inst2_lmon_389136_i863530.trc
2020-09-14T11:10:12.370110+08:00
Instance terminated by USER, pid = 389146

从11点10分收到节点1驱逐的终止了实例。

节点1的问题期间AWR和ASH情况。

通过top 5等待事件,可以发现31.4%的DB time集中在“library cache lock”等待上面。

SQL> select to_char(SAMPLE_TIME,'YYYY-MM-DD HH24:MI'),SQL_ID,event,count(1) from  dba_hist_active_sess_history
2  where event='library cache lock'
3  and to_char(SAMPLE_TIME,'YYYY-MM-DD HH24')='2020-09-14 11'
group by to_char(SAMPLE_TIME,'YYYY-MM-DD HH24:MI'),SQL_ID,event order by 1;  4
TO_CHAR(SAMPLE_T SQL_ID        EVENT                                                              COUNT(1)
---------------- ------------- ---------------------------------------------------------------- ----------
2020-09-14 11:05               library cache lock                                                      237
2020-09-14 11:06               library cache lock                                                      939
2020-09-14 11:07               library cache lock                                                     1775
2020-09-14 11:08               library cache lock                                                     2465
2020-09-14 11:09               library cache lock                                                     2820
2020-09-14 11:10               library cache lock                                                      495

通过ASH报告,可以看到从11点05分开始,节点1上面出现大量的library cache lock等待,到11点10分左右恢复,和后台日志出故障的时间基本吻合。继续查看library cache lock的阻塞源头,发现如下,在11点05分都是被14504和16446会话阻塞。

SQL> select blocking_session,blocking_inst_id,count(1) from  dba_hist_active_sess_history
2  where event='library cache lock'
3  and to_char(SAMPLE_TIME,'YYYY-MM-DD HH24:MI')='2020-09-14 11:05'
4  group by blocking_session,blocking_inst_id order by 3 desc;
BLOCKING_SESSION BLOCKING_INST_ID   COUNT(1)
---------------- ---------------- ----------
14504                1        191
16446                1         46

通过查询16446会话,可以看到也是被节点1的14504会话所阻塞。

SQL> select to_char(SAMPLE_TIME,'YYYY-MM-DD HH24:MI:SS'),SQL_ID,event,blocking_session,blocking_inst_id from  dba_hist_active_sess_history
2  where
3  session_id=16446
4  and to_char(SAMPLE_TIME,'YYYY-MM-DD HH24')='2020-09-14 11'
5  order by to_char(SAMPLE_TIME,'YYYY-MM-DD HH24:MI:SS') asc;
TO_CHAR(SAMPLE_TIME SQL_ID        EVENT                                                            BLOCKING_SESSION BLOCKING_INST_ID
------------------- ------------- ---------------------------------------------------------------- ---------------- ----------------
2020-09-14 11:05:10 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:05:20 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:05:31 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:05:41 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:05:51 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:06:01 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:06:11 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:06:21 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:06:31 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:06:41 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:06:51 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:07:01 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:07:11 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:07:21 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:07:32 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:07:42 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:07:52 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:08:02 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:08:12 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:08:22 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:08:32 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:08:42 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:08:52 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:09:02 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:09:13 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:09:23 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:09:33 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:09:43 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:09:53 9zg9qd9bm4spu buffer busy waits                                                           14504                1
2020-09-14 11:10:03 9zg9qd9bm4spu buffer busy waits                                                           14504                1

而14504会话,在问题期间是被节点1的4400会话所堵塞。

TO_CHAR(SAMPLE_TIME SQL_ID        EVENT                                                            BLOCKING_SESSION BLOCKING_INST_ID
------------------- ------------- ---------------------------------------------------------------- ---------------- ----------------
2020-09-14 11:01:09
2020-09-14 11:04:50 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:05:00 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:05:10 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:05:20 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:05:31 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:05:41 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:05:51 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:06:01 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:06:11 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:06:21 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:06:31 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:06:41 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:06:51 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:07:01 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:07:11 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:07:21 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:07:32 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:07:42 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:07:52 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:08:02 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:08:12 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:08:22 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:08:32 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:08:42 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:08:52 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:09:02 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:09:13 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:09:23 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:09:33 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:09:43 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:09:53 9zg9qd9bm4spu buffer busy waits                                                            4400                1
2020-09-14 11:10:03 9zg9qd9bm4spu buffer busy waits                                                            4400                1

继续查看4400会话。发现在执行语句2xqkxxgbg7qf6的时候,一直处于“gc current retry”这种异常的等待事件。一直到11点10分之后才恢复。

2020-09-14 11:04:40 2vtubzta22rj3
2020-09-14 11:04:50 2xqkxxgbg7qf6 gc current request
2020-09-14 11:05:00 2xqkxxgbg7qf6 gc current request
2020-09-14 11:05:10 2xqkxxgbg7qf6 gc current request
2020-09-14 11:05:20 2xqkxxgbg7qf6 gc current request
2020-09-14 11:05:31 2xqkxxgbg7qf6 gc current request
2020-09-14 11:05:41 2xqkxxgbg7qf6 gc current request
2020-09-14 11:05:51 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:06:01 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:06:11 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:06:21 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:06:31 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:06:41 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:06:51 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:07:01 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:07:11 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:07:21 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:07:32 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:07:42 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:07:52 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:08:02 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:08:12 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:08:22 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:08:32 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:08:42 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:08:52 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:09:02 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:09:13 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:09:23 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:09:33 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:09:43 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:09:53 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:10:03 2xqkxxgbg7qf6 gc current retry
2020-09-14 11:10:16 7htgm5uapq51r gc buffer busy release

通过查询mos。可以发现“gc current retry”等待会导致数据库hang住。文章RAC DB hang on ‘gc current retry’ (Doc ID 2666025.1)和我们出现的问题非常吻合。需要安装补丁Bug 30866988 – PROCESS STUCK IN GC CURRENT REQUEST/RETRY LOOP。

微信扫一扫,分享到朋友圈

一次gc current retry引发的数据库宕机问题分析

C++ Templates (2.3 类模板的局部使用 Partial Usage of Class Templates)

上一篇

首款搭载Apple Silicon的Mac将于 11 月发布

下一篇

你也可能喜欢

一次gc current retry引发的数据库宕机问题分析

长按储存图像,分享给朋友