生产数据库查询与插入hang案例

前不久,生产Oracle数据库上出现一个比较紧急的案例,前台业务查询与插入操作全部夯死,涉及这些SQL的业务全部中断。接到电话后立即查看系统状态

此时系统中出现了latch: row cache objects ,library cache lock这些本不该在生产系统中出现的事件:
SQL> select event,count(*) from v$session where wait_class <> 'Idle' group by event order by 2 desc;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
library cache lock                                                        2
SQL*Net message to client                                                 2
log file sync                                                             1
latch: row cache objects                                                  1



这些等待的Blocking_session均为空
SQL> select username,sid,serial#,sql_id,blocking_session,event from  v$session where wait_class <> 'Idle';

USERNAME                              SID    SERIAL# SQL_ID        BLOCKING_SESSION EVENT
------------------------------ ---------- ---------- ------------- ---------------- ----------------------------------------------------------------
Albert                               1015       3251 fdn44h4am80mb                  gc cr multi block request
Albert                               1184      54311 6sx33q2nq3c67                  library cache lock
Albert                               2738       4293 3ndnsw62wz2f1                  latch: row cache objects
Albert                               2957      17039                                reliable message
Albert                               3414      13501 7jj34snwajctj                  library cache lock
SYS                                  4192      29879 dh8qnqqr8jqyt                  SQL*Net message to client

此时需要找到Library cache lock等事件的阻塞源头,尽快恢复中断业务,可以利用如下代码进行确认追踪

SQL> select /*+ ordered */
  2  w1.sid waiting_session,
  3  h1.sid holding_session,
  4  w.kgllktype lock_or_pin,
  5  w.kgllkhdl address,
  6  decode(h.kgllkmod,
  7          0,
  8          'None',
  9          1,
 10          'Null',
 11          2,
 12          'Share',
 13          3,
 14          'Exclusive',
 15          'Unknown') mode_held,
 16  decode(w.kgllkreq,
 17          0,
 18          'None',
 19          1,
 20          'Null',
 21          2,
 22          'Share',
 23          3,
 24          'Exclusive',
 25          'Unknown') mode_requested
 26    from dba_kgllock w, dba_kgllock h, v$session w1, v$session h1
 27  where (((h.kgllkmod != 0) and (h.kgllkmod != 1) and
 28         ((h.kgllkreq = 0) or (h.kgllkreq = 1))) and
 29         (((w.kgllkmod = 0) or (w.kgllkmod = 1)) and
 30         ((w.kgllkreq != 0) and (w.kgllkreq != 1))))
 31     and w.kgllktype = h.kgllktype
 32     and w.kgllkhdl = h.kgllkhdl
 33     and w.kgllkuse = w1.saddr
 34     and h.kgllkuse = h1.saddr;

WAITING_SESSION HOLDING_SESSION LOCK ADDRESS          MODE_HELD MODE_REQU
--------------- --------------- ---- ---------------- --------- ---------
           3414            2738 Lock 0000001FDF10B328 Share     Exclusive
           1184            2738 Lock 0000001FDF10B328 Share     Share

可以从上述结果中看到,2738会话持有Hanle address “0000001FDF10B328”的独占锁:Exclusive,所以应急杀掉2738会话,KILL SESSION会话后,故障恢复。

SQL> r
  1* select username,sid,serial#,sql_id,blocking_session,event from  v$session where wait_class <> 'Idle'

USERNAME                              SID    SERIAL# SQL_ID        BLOCKING_SESSION EVENT
------------------------------ ---------- ---------- ------------- ---------------- ----------------------------------------------------------------
Albert                                   1015       3251 35gvvuy7b2p2s                  db file scattered read
Albert                                   1184      54311 6sx33q2nq3c67                  SQL*Net more data from client
Albert                                   2957      17041 6sx33q2nq3c67                  SQL*Net more data from client
Albert                                 4192      29961 dh8qnqqr8jqyt                  SQL*Net message to client

事后根因分析,为何在业务高峰期间会有library cache lock长时间占有不释放的情况?

确认ASH时间点后,VTERM\GS02是受害者,基本大部分会话都是被2738这个会话阻塞
SQL> select --to_char(sample_time, 'yyyymmdd hh24:mi:ss') sample_time,
  2   event,
  3   sql_id,
  4   program,
  5   machine,
  6   sql_opname,
  7   BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
  8   count(*)
  9    from v$active_session_history a
 10   where to_char(sample_time, 'yyyymmdd hh24:mi:ss') between
 11         '20200501 00:00:00' and '20200501 10:44:00'
 12     and event = 'library cache lock'
 13   group by --to_char(sample_time, 'yyyymmdd hh24:mi:ss'),
 14            event,
 15            sql_id,
 16            program,
 17            machine,
 18            BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
 19            sql_opname
 20   order by 7
 21  ;

EVENT                                                            SQL_ID        PROGRAM                                          MACHINE                                                          SQL_OPNAME                                                       BLOCKING_INST_ID||'_'||BLOCKING_SESSION                                            COUNT(*)
---------------------------------------------------------------- ------------- ------------------------------------------------ ---------------------------------------------------------------- ---------------------------------------------------------------- -------------------------------------------------------------------------------- ----------
library cache lock                                               0uujs8nqcm145 plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                         7  
library cache lock                                               49qykykfhwqtm plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                        22
library cache lock                                               973s831gy6jk4 plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                        38
library cache lock                                                             sqlldr@hostname (TNS V1-V3)                      hostname                                                                                                                          _                                                                                        58
library cache lock                                               g639trfsrp9m5 plsqldev.exe                                     VTERM\Travelskydba01                                             PL/SQL EXECUTE                                                   _                                                                                        62
library cache lock                                               3y3gt31q23uk1 plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                        65
library cache lock                                               6bwmj9n38kjp6 plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                       494
library cache lock                                               973s831gy6jk4 plsqldev.exe                                     VTERM\Travelskydba01                                             SELECT                                                           _                                                                                      1475
library cache lock                                               7jj34snwajctj detrreceipt@hostname2 (TNS V1-V3)                hostname2                                                        INSERT                                                           1_2738                                                                                 2419
library cache lock                                               6sx33q2nq3c67 sqlldr@hostname (TNS V1-V3)                      hostname                                                         INSERT                                                           _                                                                                      2419
library cache lock                                               7jj34snwajctj detrreceipt@hostname2 (TNS V1-V3)                hostname2                                                        INSERT                                                           _                                                                                      2515
library cache lock                                               6sx33q2nq3c67 sqlldr@hostname (TNS V1-V3)                      hostname                                                         INSERT                                                           1_2738                                                                                10760

12 rows selected

ASH中后续并没有记录2738这个会话再等latch: row cache objects,但是抓到了其他的进程信息,例如PROGRAM/MACHINE,猜测故障当时此会话已经不是ACTIVE

ASH中并没有记录2738这个会话再等latch: row cache objects
SQL> select --to_char(sample_time, 'yyyymmdd hh24:mi:ss') sample_time,
  2   event,
  3   sql_id,
  4   program,
  5   machine,
  6   sql_opname,
  7   BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
  8   count(*)
  9    from ash_20200501 a
 10   where to_char(sample_time, 'yyyymmdd hh24:mi:ss') between
 11         '20200430 00:00:00' and '20200501 10:44:00'
 12     and session_id=2738 and program ='plsqldev.exe'
 13   group by --to_char(sample_time, 'yyyymmdd hh24:mi:ss'),
 14            event,
 15            sql_id,
 16            program,
 17            machine,
 18            BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
 19            sql_opname
 20   order by 1
 21  ;

EVENT                                                            SQL_ID        PROGRAM                                          MACHINE                                                          SQL_OPNAME                                                       BLOCKING_INST_ID||'_'||BLOCKING_SESSION                                            COUNT(*)
---------------------------------------------------------------- ------------- ------------------------------------------------ ---------------------------------------------------------------- ---------------------------------------------------------------- -------------------------------------------------------------------------------- ----------
                                                                 3ndnsw62wz2f1 plsqldev.exe                                     VTERM\Travelskydba01                                                    SELECT                                                           _                                                                            58210
                                                                 3ndnsw62wz2f1 plsqldev.exe                                     VTERM\Travelskydba01                                                                                                                     _                                                                             6

SQL> 

2738这个会话最早在4月30日18点20开始出现,一直持续到故障发生时(故障发生时间:20200501 10:00左右)

738这个会话最早在4月30日18点20开始出现,一直持续到故障发生时:
SQL> select rownum, a.*
  2    from (select to_char(sample_time, 'yyyymmdd hh24:mi:ss') sample_time,
  3                 event,
  4                 sql_id,
  5                 program,
  6                 machine,
  7                 sql_opname,
  8                 BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
  9                 count(*)
 10            from ash_20200501 a
 11           where to_char(sample_time, 'yyyymmdd hh24:mi:ss') between
 12                 '20200430 00:00:00' and '20200501 10:44:00'
 13             and session_id = 2738
 14             and program = 'plsqldev.exe'
 15           group by to_char(sample_time, 'yyyymmdd hh24:mi:ss'),
 16                    event,
 17                    sql_id,
 18                    program,
 19                    machine,
 20                    BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
 21                    sql_opname
 22           order by 1) a where rownum <2
 23  ;

    ROWNUM SAMPLE_TIME       EVENT                                                            SQL_ID        PROGRAM                                          MACHINE                                                          SQL_OPNAME                                                       BLOCKING_INST_ID||'_'||BLOCKING_SESSION                                            COUNT(*)
---------- ----------------- ---------------------------------------------------------------- ------------- ------------------------------------------------ ---------------------------------------------------------------- ---------------------------------------------------------------- -------------------------------------------------------------------------------- ----------
         1 20200430 18:20:47                                                                  3ndnsw62wz2f1 plsqldev.exe                                     VTERM\Travelskydba01                                                                                                                       _                                                                                         1

从数据库diag进程做的DUMP Trace文件中继续分析:

SO: 0x1f82293c18, type: 4, owner: 0x1f8193f078, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x1f8193f078, name=session, file=ksu.h LINE:12729, pg=0
    (session) sid: 2738 ser: 4293 trans: (nil), creator: 0x1f8193f078
              flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x40009) -/-/INC
              DID: , short-term DID: 
              txn branch: (nil)
              edition#: 100              oct: 3, prv: 0, sql: 0x1a17a583d0, psql: 0x1fb8f402f0, user: 80/LK
    ksuxds FALSE at location: 0
    service name: lk
    client details:
      O/S info: user: Travelskydba, term: Travelskydba01, ospid: 3984:3476
      machine: VTERM\Travelskydba01 program: plsqldev.exe
      application name: PL/SQL Developer, hash value=1190136663
      action name: SQL Window, hash value=2699165734
    Current Wait Stack:
      Not in wait; last wait ended 105 min 27 sec ago             --啥也没在等,与现场事实不符。
    Wait State:
      fixed_waits=0 flags=0x20 boundary=(nil)/-1
    Session Wait History:
        elapsed time of 105 min 27 sec since last wait
     0: waited for 'latch: row cache objects'
        address=0x1f2cc595e8, number=0x135, tries=0x0
        wait_id=2702 seq_num=2703 snap_id=1
        wait times: snap=0.000028 sec, exc=0.000028 sec, total=0.000028 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 58 min 22 sec of elapsed time
     1: waited for 'latch: row cache objects'
        address=0x1f2cc595e8, number=0x135, tries=0x0
        wait_id=2701 seq_num=2702 snap_id=1
        wait times: snap=0.000033 sec, exc=0.000033 sec, total=0.000033 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 82 min 0 sec of elapsed time
     2: waited for 'latch: row cache objects'
        address=0x1f2cc595e8, number=0x135, tries=0x0
        wait_id=2700 seq_num=2701 snap_id=1
        wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 687 min 9 sec of elapsed time
     3: waited for 'SQL*Net more data from client'
        driver id=0x54435000, #bytes=0x2d, =0x0
        wait_id=2699 seq_num=2700 snap_id=1
        wait times: snap=0.000173 sec, exc=0.000173 sec, total=0.000173 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000012 sec of elapsed time
     4: waited for 'SQL*Net more data from client'
        driver id=0x54435000, #bytes=0x1, =0x0
        wait_id=2698 seq_num=2699 snap_id=1
        wait times: snap=0.000176 sec, exc=0.000176 sec, total=0.000176 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000009 sec of elapsed time
     5: waited for 'SQL*Net more data from client'
        driver id=0x54435000, #bytes=0x14, =0x0
        wait_id=2697 seq_num=2698 snap_id=1
        wait times: snap=0.000260 sec, exc=0.000260 sec, total=0.000260 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000009 sec of elapsed time
     6: waited for 'SQL*Net more data from client'
        driver id=0x54435000, #bytes=0x28, =0x0
        wait_id=2696 seq_num=2697 snap_id=1
        wait times: snap=0.000356 sec, exc=0.000356 sec, total=0.000356 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000009 sec of elapsed time
     7: waited for 'SQL*Net more data from client'
        driver id=0x54435000, #bytes=0x3c, =0x0
        wait_id=2695 seq_num=2696 snap_id=1
        wait times: snap=0.000165 sec, exc=0.000165 sec, total=0.000165 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000009 sec of elapsed time
     8: waited for 'SQL*Net more data from client'
        driver id=0x54435000, #bytes=0xf, =0x0
        wait_id=2694 seq_num=2695 snap_id=1
        wait times: snap=0.000178 sec, exc=0.000178 sec, total=0.000178 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000012 sec of elapsed time
     9: waited for 'SQL*Net more data from client'
        driver id=0x54435000, #bytes=0x23, =0x0
        wait_id=2693 seq_num=2694 snap_id=1
        wait times: snap=0.000484 sec, exc=0.000484 sec, total=0.000484 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000008 sec of elapsed time
只找到,这个个人终端会话18点20开始做已解析的时候 拿到的PINMODE=X
SO: 0x1fc3a1eb98, type: 78, owner: 0x1f82293c18, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
       proc=0x1f8193f078, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8751, pg=0

      LibraryObjectLock:  Address=0x1fc3a1eb98 Handle=0x1a19970a70 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0
ClusterLock=0x1fc3a287e8 Context=0x7f9bd2f720d0
User=0x1f82293c18 Session=0x1f82293c18 ReferenceCount=1
Flags=CBK/[0020] SavepointNum=0
      LibraryHandle:  Address=0x1a19970a70 Hash=0 LockMode=N PinMode=X LoadLockMode=0 Status=VALD
        Name:  Namespace=SQL AREA(00) Type=CURSOR(00)
        Statistics:  InvalidationCount=0 ExecutionCount=0 LoadCount=1 ActiveLocks=1 TotalLockCount=1 TotalPinCount=2
        Counters:  BrokenCount=1 RevocablePointer=1 KeepDependency=0 Version=0 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
        Concurrency:  DependencyMutex=0x1a19970b20(0, 0, 0, 0) Mutex=0x1a17a58510(0, 6112, 0, 0)
        Flags=RON/PIN/PN0/EXP/CHD/[10012111]
        WaitersLists:
          Lock=0x1a19970b00[0x1a19970b00,0x1a19970b00]
          Pin=0x1a19970ae0[0x1a19970ae0,0x1a19970ae0]
          LoadLock=0x1a19970b58[0x1a19970b58,0x1a19970b58]
        LibraryObject:  Address=0x198c54a5c0 HeapMask=0000-0001-0001-0000 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000]
          DataBlocks:
            Block:  #='0' name=KGLH0^85cf89c1 pins=0 Change=NONE
              Heap=0x1a192656e0 Pointer=0x198c54a660 Extent=0x198c54a540 Flags=I/-/P/A/-/-
              FreedLocation=0 Alloc=2.382812 Size=3.976562 LoadTime=45051112200
            Block:  #='6' name=SQLA^85cf89c1 pins=0 Change=NONE
              Heap=0x198c54bc98 Pointer=0x194ece6d00 Extent=0x194ece60c0 Flags=I/-/P/A/-/E
              FreedLocation=0 Alloc=0.000000 Size=0.000000 LoadTime=0
        NamespaceDump:
          Child Cursor:  Heap0=0x198c54a660 Heap6=0x194ece6d00 Heap0 Load Time=04-30-2020 18:20:46 Heap6 Load Time=04-30-2020 18:20:46

      ----------------------------------------
      KGX Atomic Operation Log 0x1fc3a287e8
       Mutex 0x198c54bbf0(2738, 0) idn 85cf89c1 oper LONG_EXCL(18)   --独占MUTEX,是在做硬解析
       Cursor Pin uid 2738 efd 0 whr 1 slp 0
       opr=3 pso=0x1fc3a1eb98 flg=0
       pcs=0x198c54bb58 nxt=(nil) flg=35 cld=0 hd=0x1a19970a70 par=0x198c54b660
       ct=0 hsh=0 unp=(nil) unn=0 hvl=8c54c508 nhv=1 ses=0x1f82293c18
       hep=0x198c54bbf0 flg=80 ld=1 ob=0x198c54a5c0 ptr=0x194ece6d00 fex=0x194ece60c0

VTERM\Travelskydba01这个人在 4月30日 18点20 执行了一个SQL,并且猜测是关闭了PLSQL客户端,随后会话夯死在里面,长时间持有LIBRARY CACHE LOCK (MODE=x)模式,随后了解到,执行的SQL是长达上千行的SQL Union拼接,没有关闭PLSQL,而是下班后直接关闭了虚拟桌面界面。猜测SQL一直在硬解析,但是为何对TABLE依赖对象长时间持有X,缺乏现场systemdump争取支撑。建议大家使用PLSQL时候不要直接关闭软件,cancel会话操作后再退出PLSQL,避免此问题的发生。