删除空子表约束导致主表事务row cache lock/Library cache lock/Cursor pin s wait on x

春节前夕,公司一套核心核心再核心的数据库出现严重性能问题(版本信息:11.2.0.4),数据库活动会话飙升到300+,幸好当时处置得当,人员判断与应急迅速,及时kill了相关blocker session从而使得影响没有扩大。20200114-1从应用负责人员获悉,此时间段应用人员正在执行上线操作,具体操作为删除一张num rows=0的空表的外键约束。

alter table TRAVELSKYDBA drop constraint FK_TRAVELSKYDBA;

ASH中活动会话基本都在等待Library Cache Lock,Cursor Pin s Wait On X,Row Cache Lock上,阻塞情况如下:

20200123-11

继续分析,发现了一个更加奇怪的现象:

20200123-13

SESSION_ID=1827的会话为应用上线alter table删除外键约束的语句会话,SESSION_ID=73,2119,4000均为对删除外键的主表的正常业务DML语句(INSERT,DELETE,UPDATE),从ASH中看到1827会话正在等待正常业务事务(SESSION_ID=73)的LIBRARY CACHE LOCK,但是SESSION_ID=73在等 ROW CACHE LOCK,被ALTER TABLE (SESSION_ID=1827)阻塞,行成了死锁,并且ROW CACHE LOCK 的P1=8(dc_object与grants,dc_object),随后的正常事务的DML SQL想去软解析时遭遇Cursor Pin S wait on X。

从数据库DIAG进程自动dump的system state dump与Hanganalyze trace中确认,这些应用的SQL因为Rolling Invalidate Window Exceeded触发了硬解析,alter table 子表 drop constraints时,这项操作是需要申请主表DML游标Object Handle的Library Cache Lock的,主表涉及SQL在硬解析,ALTER TABLE求LIBRARY CACHE LOCK而不得被阻塞,但是主表DML硬解析申请ROW CACHE LOCK,理论上非常快速内存操作为何长时间不释放?产生死锁?并且Oracle并没有处理这个死锁

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
 Oracle session identified by:
 {
 instance: 1 (#######)
 os id: 111552
 process id: 161, oracle@###-rac
 session id: 73
 session serial #: 181
 }
 is waiting for 'row cache lock' with wait info:
 {
 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 time in wait: 0.612302 sec
 heur. time in wait: 2 min 23 sec
 timeout after: 2.387698 sec
 wait id: 24235
 blocking: 286 sessions
 wait history:
 * time between current wait and wait #1: 0.000034 sec
 1. event: 'row cache lock'
 time waited: 3.000963 sec
 wait id: 24234 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 * time between wait #1 and #2: 0.000039 sec
 2. event: 'row cache lock'
 time waited: 3.000978 sec
 wait id: 24233 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 * time between wait #2 and #3: 0.000064 sec
 3. event: 'row cache lock'
 time waited: 3.000913 sec
 wait id: 24232 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 }
 and is blocked by
 => Oracle session identified by:
 {
 instance: 1 (*******)
 os id: 92902
 process id: 112, oracle@*****-rac
 session id: 1827
 session serial #: 19367
 }
 which is waiting for 'library cache lock' with wait info:
 {
 p1: 'handle address'=0xfeec0a948
 p2: 'lock address'=0xfe24600b8
 p3: '100*mode+namespace'=0x15dba00010003
 time in wait: 2 min 23 sec
 timeout after: 12 min 36 sec
 wait id: 1540
 blocking: 286 sessions
 wait history:
 * time between current wait and wait #1: 0.000150 sec
 1. event: 'gc current block 2-way'
 time waited: 0.000285 sec
 wait id: 1539 p1: ''=0x1
 p2: ''=0x131f3
 p3: ''=0x1
 * time between wait #1 and #2: 0.000131 sec
 2. event: 'gc current block 2-way'
 time waited: 0.000374 sec
 wait id: 1538 p1: ''=0x1
 p2: ''=0x131de
 p3: ''=0x1
 * time between wait #2 and #3: 0.002456 sec
 3. event: 'gc cr block 2-way'
 time waited: 0.000405 sec
 wait id: 1537 p1: ''=0x1
 p2: ''=0x131f3
 p3: ''=0x1
 }
 and is blocked by the session at the start of the chain.

Chain 1 Signature: 'library cache lock'<='row cache lock' (cycle)
Chain 1 Signature Hash: 0x52a8007d
----------------------------------------------------------------
并且 p3: '100*mode+namespace'=0x15dba00010003 --003 LOCK MODE=X

SQL> select to_number('15dba','xxxxxxx') from dual;
TO_NUMBER('15DBA','XXXXXXX')
---------------------------
 89530  --主表对象OBJECT_ID

20200123-16

SESSION_ID=73在sql net message from client中居然阻塞了1827的 alter table?

分析到这里,越来越感觉是Bug导致,从Mos上寻找相关案例发现一篇“现象相似”的文章,可惜版本无法对应上,不能确定是因为此Bug导致,虽然最后没有定位具体的Bug Number,但是此问题要告诫警示大家,不要再事务繁忙期间进行alter table DDL操作,虽然是一张空表,但是和其对象的相关OBJECT也存在受到牵连的可能性,从而产生跟大的影响。

20200123-17