On the eve of Chinese New Year, one of the company's most critical core databases (version 11.2.0.4) hit a severe performance problem: active sessions spiked to 300+. Fortunately the incident was handled well; staff diagnosed and responded quickly, killing the blocker sessions in time so that the impact did not spread. From the application owner we learned that a release was being deployed in exactly that window, and the specific operation was dropping a foreign key constraint on an empty table (num_rows = 0):
alter table TRAVELSKYDBA drop constraint FK_TRAVELSKYDBA;
In ASH, the active sessions were almost all waiting on library cache lock, cursor: pin S wait on X, and row cache lock. The blocking picture was as follows:
Digging further, we found an even stranger phenomenon:
SESSION_ID=1827 was the release session running the ALTER TABLE that dropped the foreign key constraint; SESSION_ID=73, 2119, and 4000 were normal business DML sessions (INSERT, DELETE, UPDATE) against the parent table of the dropped foreign key. ASH showed session 1827 waiting on a library cache lock held by the normal business transaction (SESSION_ID=73), while session 73 was in turn waiting on a row cache lock held by the ALTER TABLE session (SESSION_ID=1827): a deadlock had formed. The row cache lock's P1 was 8 (dc_objects, covering objects and grants), and the business DML SQL that subsequently tried to soft parse ran into cursor: pin S wait on X.
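A blocking chain like this can be reconstructed from ASH with a query along the following lines (a sketch using the standard `v$active_session_history` columns; the session IDs are the ones from this incident, and the 30-minute window is an assumed placeholder):

```sql
-- Sketch: walk the blocking chain among the sessions involved in the incident.
-- For 'row cache lock', P1 is the dictionary cache id (8 = dc_objects here).
select sample_time,
       session_id,
       event,
       p1,
       blocking_session
  from v$active_session_history
 where sample_time > sysdate - interval '30' minute   -- assumed incident window
   and session_id in (1827, 73, 2119, 4000)
 order by sample_time, session_id;
```

Reading `blocking_session` across samples is what exposes the cycle: 1827 points at 73 while 73 points back at 1827.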
The system state dump and hanganalyze trace automatically produced by the database's DIAG process confirmed the picture: these application SQL statements had been driven into hard parses because their rolling invalidation window had expired (Rolling Invalidate Window Exceeded). Dropping a constraint on the child table requires a library cache lock on the object handle of the parent table's DML cursors; with the parent-table SQL mid-hard-parse, the ALTER TABLE's library cache lock request was blocked, while the hard-parsing DML itself requested a row cache lock. In theory that is a very fast in-memory operation, so why was it held for so long without being released, producing a deadlock? And why did Oracle never resolve that deadlock?
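The cache id seen in P1 of the row cache lock waits can be mapped to a dictionary cache name through the standard `v$rowcache` view (on the system in question, cache# 8 was the objects cache):

```sql
-- Map the 'row cache lock' P1 value (cache id) to a dictionary cache name.
select cache#, parameter
  from v$rowcache
 where cache# = 8;
```

On this 11.2.0.4 system the result corresponded to dc_objects, matching the P1=8 observed in ASH and the hanganalyze trace.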
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (#######)
                   os id: 111552
              process id: 161, oracle@###-rac
              session id: 73
        session serial #: 181
    }
    is waiting for 'row cache lock' with wait info:
    {
                      p1: 'cache id'=0x8
                      p2: 'mode'=0x0
                      p3: 'request'=0x3
            time in wait: 0.612302 sec
      heur. time in wait: 2 min 23 sec
           timeout after: 2.387698 sec
                 wait id: 24235
                blocking: 286 sessions
            wait history:
              * time between current wait and wait #1: 0.000034 sec
              1.       event: 'row cache lock'
                 time waited: 3.000963 sec
                     wait id: 24234           p1: 'cache id'=0x8
                                              p2: 'mode'=0x0
                                              p3: 'request'=0x3
              * time between wait #1 and #2: 0.000039 sec
              2.       event: 'row cache lock'
                 time waited: 3.000978 sec
                     wait id: 24233           p1: 'cache id'=0x8
                                              p2: 'mode'=0x0
                                              p3: 'request'=0x3
              * time between wait #2 and #3: 0.000064 sec
              3.       event: 'row cache lock'
                 time waited: 3.000913 sec
                     wait id: 24232           p1: 'cache id'=0x8
                                              p2: 'mode'=0x0
                                              p3: 'request'=0x3
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (*******)
                   os id: 92902
              process id: 112, oracle@*****-rac
              session id: 1827
        session serial #: 19367
    }
    which is waiting for 'library cache lock' with wait info:
    {
                      p1: 'handle address'=0xfeec0a948
                      p2: 'lock address'=0xfe24600b8
                      p3: '100*mode+namespace'=0x15dba00010003
            time in wait: 2 min 23 sec
           timeout after: 12 min 36 sec
                 wait id: 1540
                blocking: 286 sessions
            wait history:
              * time between current wait and wait #1: 0.000150 sec
              1.       event: 'gc current block 2-way'
                 time waited: 0.000285 sec
                     wait id: 1539            p1: ''=0x1
                                              p2: ''=0x131f3
                                              p3: ''=0x1
              * time between wait #1 and #2: 0.000131 sec
              2.       event: 'gc current block 2-way'
                 time waited: 0.000374 sec
                     wait id: 1538            p1: ''=0x1
                                              p2: ''=0x131de
                                              p3: ''=0x1
              * time between wait #2 and #3: 0.002456 sec
              3.       event: 'gc cr block 2-way'
                 time waited: 0.000405 sec
                     wait id: 1537            p1: ''=0x1
                                              p2: ''=0x131f3
                                              p3: ''=0x1
    }
    and is blocked by the session at the start of the chain.

Chain 1 Signature: 'library cache lock'<='row cache lock' (cycle)
Chain 1 Signature Hash: 0x52a8007d
----------------------------------------------------------------

Note p3: '100*mode+namespace'=0x15dba00010003: the trailing 0003 is the requested lock mode (X), and the leading 0x15dba decodes to the parent table's OBJECT_ID:

SQL> select to_number('15dba','xxxxxxx') from dual;

TO_NUMBER('15DBA','XXXXXXX')
----------------------------
                       89530    -- OBJECT_ID of the parent table
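To confirm which object the library cache handle belongs to, the decoded OBJECT_ID can be resolved through `dba_objects` (89530 is the value decoded from this trace):

```sql
-- Resolve the OBJECT_ID decoded from '100*mode+namespace' to an object name.
select owner, object_name, object_type
  from dba_objects
 where object_id = 89530;   -- 0x15dba from the hanganalyze trace
```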
Stranger still, in some samples SESSION_ID=73 appeared to be idle on SQL*Net message from client, yet was somehow still blocking 1827's ALTER TABLE?
At this point it increasingly looked like a bug. Searching MOS turned up one article with similar symptoms, but the version did not match, so we could not confirm that this particular bug was the cause. Although we never pinned down a specific bug number, the warning stands: do not run ALTER TABLE DDL during busy transaction periods. Even though the table itself was empty, objects related to it (here, the parent table and its cursors) can be dragged in, producing a far larger impact.
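As a practical safeguard, before running such a release DDL one can check for in-flight DML on the related parent table and cap the lock wait with `ddl_lock_timeout` (available since 11g). A sketch, where `TRAVELSKYDBA_PARENT` is a hypothetical stand-in for the real parent-table name:

```sql
-- Check for active DML locks on the parent table before the release DDL.
select session_id, owner, name, mode_held
  from dba_dml_locks
 where name = 'TRAVELSKYDBA_PARENT';   -- hypothetical parent-table name

-- Fail the DDL after 10 seconds instead of queueing indefinitely
-- in front of hundreds of business sessions.
alter session set ddl_lock_timeout = 10;
alter table TRAVELSKYDBA drop constraint FK_TRAVELSKYDBA;
```

With `ddl_lock_timeout` set, the ALTER TABLE raises ORA-00054 after the timeout rather than holding its partial locks while waiting, which limits how long a chain like the one above can build up.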