生产数据库查询与插入hang案例

前不久,生产Oracle数据库上出现一个比较紧急的案例,前台业务查询与插入操作全部夯死,涉及这些SQL的业务全部中断。接到电话后立即查看系统状态

此时系统中出现了latch: row cache objects ,library cache lock这些本不该在生产系统中出现的事件:
SQL> select event,count(*) from v$session where wait_class <> 'Idle' group by event order by 2 desc;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
library cache lock                                                        2
SQL*Net message to client                                                 2
log file sync                                                             1
latch: row cache objects                                                  1



这些等待的Blocking_session均为空
SQL> select username,sid,serial#,sql_id,blocking_session,event from  v$session where wait_class <> 'Idle';

USERNAME                              SID    SERIAL# SQL_ID        BLOCKING_SESSION EVENT
------------------------------ ---------- ---------- ------------- ---------------- ----------------------------------------------------------------
Albert                               1015       3251 fdn44h4am80mb                  gc cr multi block request
Albert                               1184      54311 6sx33q2nq3c67                  library cache lock
Albert                               2738       4293 3ndnsw62wz2f1                  latch: row cache objects
Albert                               2957      17039                                reliable message
Albert                               3414      13501 7jj34snwajctj                  library cache lock
SYS                                  4192      29879 dh8qnqqr8jqyt                  SQL*Net message to client

此时需要找到Library cache lock等事件的阻塞源头,尽快恢复中断业务,可以利用如下代码进行确认追踪

SQL> select /*+ ordered */
  2  w1.sid waiting_session,
  3  h1.sid holding_session,
  4  w.kgllktype lock_or_pin,
  5  w.kgllkhdl address,
  6  decode(h.kgllkmod,
  7          0,
  8          'None',
  9          1,
 10          'Null',
 11          2,
 12          'Share',
 13          3,
 14          'Exclusive',
 15          'Unknown') mode_held,
 16  decode(w.kgllkreq,
 17          0,
 18          'None',
 19          1,
 20          'Null',
 21          2,
 22          'Share',
 23          3,
 24          'Exclusive',
 25          'Unknown') mode_requested
 26    from dba_kgllock w, dba_kgllock h, v$session w1, v$session h1
 27  where (((h.kgllkmod != 0) and (h.kgllkmod != 1) and
 28         ((h.kgllkreq = 0) or (h.kgllkreq = 1))) and
 29         (((w.kgllkmod = 0) or (w.kgllkmod = 1)) and
 30         ((w.kgllkreq != 0) and (w.kgllkreq != 1))))
 31     and w.kgllktype = h.kgllktype
 32     and w.kgllkhdl = h.kgllkhdl
 33     and w.kgllkuse = w1.saddr
 34     and h.kgllkuse = h1.saddr;

WAITING_SESSION HOLDING_SESSION LOCK ADDRESS          MODE_HELD MODE_REQU
--------------- --------------- ---- ---------------- --------- ---------
           3414            2738 Lock 0000001FDF10B328 Share     Exclusive
           1184            2738 Lock 0000001FDF10B328 Share     Share

可以从上述结果中看到,2738会话持有Hanle address “0000001FDF10B328”的独占锁:Exclusive,所以应急杀掉2738会话,KILL SESSION会话后,故障恢复。

SQL> r
  1* select username,sid,serial#,sql_id,blocking_session,event from  v$session where wait_class <> 'Idle'

USERNAME                              SID    SERIAL# SQL_ID        BLOCKING_SESSION EVENT
------------------------------ ---------- ---------- ------------- ---------------- ----------------------------------------------------------------
Albert                                   1015       3251 35gvvuy7b2p2s                  db file scattered read
Albert                                   1184      54311 6sx33q2nq3c67                  SQL*Net more data from client
Albert                                   2957      17041 6sx33q2nq3c67                  SQL*Net more data from client
Albert                                 4192      29961 dh8qnqqr8jqyt                  SQL*Net message to client

事后根因分析,为何在业务高峰期间会有library cache lock长时间占有不释放的情况?

确认ASH时间点后,VTERM\GS02是受害者,基本大部分会话都是被2738这个会话阻塞
SQL> select --to_char(sample_time, 'yyyymmdd hh24:mi:ss') sample_time,
  2   event,
  3   sql_id,
  4   program,
  5   machine,
  6   sql_opname,
  7   BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
  8   count(*)
  9    from v$active_session_history a
 10   where to_char(sample_time, 'yyyymmdd hh24:mi:ss') between
 11         '20200501 00:00:00' and '20200501 10:44:00'
 12     and event = 'library cache lock'
 13   group by --to_char(sample_time, 'yyyymmdd hh24:mi:ss'),
 14            event,
 15            sql_id,
 16            program,
 17            machine,
 18            BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
 19            sql_opname
 20   order by 7
 21  ;

EVENT                                                            SQL_ID        PROGRAM                                          MACHINE                                                          SQL_OPNAME                                                       BLOCKING_INST_ID||'_'||BLOCKING_SESSION                                            COUNT(*)
---------------------------------------------------------------- ------------- ------------------------------------------------ ---------------------------------------------------------------- ---------------------------------------------------------------- -------------------------------------------------------------------------------- ----------
library cache lock                                               0uujs8nqcm145 plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                         7  
library cache lock                                               49qykykfhwqtm plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                        22
library cache lock                                               973s831gy6jk4 plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                        38
library cache lock                                                             sqlldr@hostname (TNS V1-V3)                      hostname                                                                                                                          _                                                                                        58
library cache lock                                               g639trfsrp9m5 plsqldev.exe                                     VTERM\Travelskydba01                                             PL/SQL EXECUTE                                                   _                                                                                        62
library cache lock                                               3y3gt31q23uk1 plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                        65
library cache lock                                               6bwmj9n38kjp6 plsqldev.exe                                     VTERM\GS02                                                       SELECT                                                           _                                                                                       494
library cache lock                                               973s831gy6jk4 plsqldev.exe                                     VTERM\Travelskydba01                                             SELECT                                                           _                                                                                      1475
library cache lock                                               7jj34snwajctj detrreceipt@hostname2 (TNS V1-V3)                hostname2                                                        INSERT                                                           1_2738                                                                                 2419
library cache lock                                               6sx33q2nq3c67 sqlldr@hostname (TNS V1-V3)                      hostname                                                         INSERT                                                           _                                                                                      2419
library cache lock                                               7jj34snwajctj detrreceipt@hostname2 (TNS V1-V3)                hostname2                                                        INSERT                                                           _                                                                                      2515
library cache lock                                               6sx33q2nq3c67 sqlldr@hostname (TNS V1-V3)                      hostname                                                         INSERT                                                           1_2738                                                                                10760

12 rows selected

ASH中后续并没有记录2738这个会话再等latch: row cache objects,但是抓到了其他的进程信息,例如PROGRAM/MACHINE,猜测故障当时此会话已经不是ACTIVE

ASH中并没有记录2738这个会话再等latch: row cache objects
SQL> select --to_char(sample_time, 'yyyymmdd hh24:mi:ss') sample_time,
  2   event,
  3   sql_id,
  4   program,
  5   machine,
  6   sql_opname,
  7   BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
  8   count(*)
  9    from ash_20200501 a
 10   where to_char(sample_time, 'yyyymmdd hh24:mi:ss') between
 11         '20200430 00:00:00' and '20200501 10:44:00'
 12     and session_id=2738 and program ='plsqldev.exe'
 13   group by --to_char(sample_time, 'yyyymmdd hh24:mi:ss'),
 14            event,
 15            sql_id,
 16            program,
 17            machine,
 18            BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
 19            sql_opname
 20   order by 1
 21  ;

EVENT                                                            SQL_ID        PROGRAM                                          MACHINE                                                          SQL_OPNAME                                                       BLOCKING_INST_ID||'_'||BLOCKING_SESSION                                            COUNT(*)
---------------------------------------------------------------- ------------- ------------------------------------------------ ---------------------------------------------------------------- ---------------------------------------------------------------- -------------------------------------------------------------------------------- ----------
                                                                 3ndnsw62wz2f1 plsqldev.exe                                     VTERM\Travelskydba01                                                    SELECT                                                           _                                                                            58210
                                                                 3ndnsw62wz2f1 plsqldev.exe                                     VTERM\Travelskydba01                                                                                                                     _                                                                             6

SQL> 

2738这个会话最早在4月30日18点20开始出现,一直持续到故障发生时(故障发生时间:20200501 10:00左右)

738这个会话最早在4月30日18点20开始出现,一直持续到故障发生时:
SQL> select rownum, a.*
  2    from (select to_char(sample_time, 'yyyymmdd hh24:mi:ss') sample_time,
  3                 event,
  4                 sql_id,
  5                 program,
  6                 machine,
  7                 sql_opname,
  8                 BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
  9                 count(*)
 10            from ash_20200501 a
 11           where to_char(sample_time, 'yyyymmdd hh24:mi:ss') between
 12                 '20200430 00:00:00' and '20200501 10:44:00'
 13             and session_id = 2738
 14             and program = 'plsqldev.exe'
 15           group by to_char(sample_time, 'yyyymmdd hh24:mi:ss'),
 16                    event,
 17                    sql_id,
 18                    program,
 19                    machine,
 20                    BLOCKING_INST_ID || '_' || BLOCKING_SESSION,
 21                    sql_opname
 22           order by 1) a where rownum <2
 23  ;

    ROWNUM SAMPLE_TIME       EVENT                                                            SQL_ID        PROGRAM                                          MACHINE                                                          SQL_OPNAME                                                       BLOCKING_INST_ID||'_'||BLOCKING_SESSION                                            COUNT(*)
---------- ----------------- ---------------------------------------------------------------- ------------- ------------------------------------------------ ---------------------------------------------------------------- ---------------------------------------------------------------- -------------------------------------------------------------------------------- ----------
         1 20200430 18:20:47                                                                  3ndnsw62wz2f1 plsqldev.exe                                     VTERM\Travelskydba01                                                                                                                       _                                                                                         1
继续阅读

RAC集群节点关闭长时间Waiting for ASM to shutdown处理过程与案例分析

5月1日晚,接到值班同事求助电话,RAC集群重启节点后出现监听无法注册SERVICE的情况,应用连接在数据库实例重启后无法在维护节点上连接。

放下电话登录到系统中查看CRS状态出现如下错误

CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors。

CRS此时已经处于非正常状态,listener/service都是以资源的方式注册在CRSD中,所以出现SERVICE的异常现象是符合情理的。随后我准备重启CRS的时候,在stop的过程中却出现了长时间hang的情况,等待Waiting for ASM to shutdown。

[root@travelskydba-rac grid]# cd bin
[root@travelskydba-rac bin]# ./crsctl stat res tt
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@travelskydba-rac bin]# ./crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@travelskydba-rac bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.crsd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.crsd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.crf' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.ctssd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.evmd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.asm' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.mdnsd' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.crf' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.evmd' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'travelskydba-rac' succeeded

CRS-2675: Stop of 'ora.asm' on 'travelskydba-rac' failed  ---卡在这里很长时间,大概有8分钟左右
CRS-2679: Attempting to clean 'ora.asm' on 'travelskydba-rac'
CRS-2681: Clean of 'ora.asm' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.cssd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.gipcd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.drivers.acfs' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'travelskydba-rac' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'travelskydba-rac' has completed
CRS-4133: Oracle High Availability Services has been stopped.
2020-05-01 22:37:28.633:
[ctssd(2029)]CRS-2405:The Cluster Time Synchronization Service on host travelskydba-rac is shutdown by user
[client(89699)]CRS-10001:01-May-20 22:37 ACFS-9290: Waiting for ASM to shutdown
.... --省略重复输出
[client(124967)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125047)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125174)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125491)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
[client(125548)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
[client(125614)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
2020-05-01 22:47:28.727:
[/opt/app/11.2.0/grid/bin/oraagent.bin(759)]CRS-5818:Aborted command 'stop' for resource 'ora.asm'. Details at (:CRSAGF00113:) {0:0:52684} in /opt/app/11.2.0/grid/log/travelskydba-rac/agent/ohasd/oraagent_grid//oraagent_grid.log.
2020-05-01 22:47:30.730:
[ohasd(196541)]CRS-2757:Command 'Stop' timed out waiting for response from the resource 'ora.asm'. Details at (:CRSPE00111:) {0:0:52684} in /opt/app/11.2.0/grid/log/travelskydba-rac/ohasd/ohasd.log.
2020-05-01 22:47:33.404:
[cssd(872)]CRS-1603:CSSD on node travelskydba-rac shutdown by user.
2020-05-01 22:47:33.513:
[ohasd(196541)]CRS-2767:Resource state recovery not attempted for 'ora.cssdmonitor' as its target state is OFFLINE
2020-05-01 22:47:33.609:
[cssd(872)]CRS-1660:The CSS daemon shutdown has completed
2020-05-01 22:47:35.756:
[gpnpd(790)]CRS-2329:GPNPD on node travelskydba-rac shutdown.

DB alert日志:
NOTE: ASMB terminating
Errors in file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_asmb_188518.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1828 Serial number: 22601
Errors in file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_asmb_188518.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1828 Serial number: 22601
ASMB (ospid: 188518): terminating the instance due to error 15064
Fri May 01 22:47:30 2020
System state dump requested by (instance=1, osid=188518 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_diag_188401_20200501224730.trc
Dumping diagnostic data in directory=[cdmp_20200501224730], requested by (instance=1, osid=188518 (ASMB)), summary=[abnormal instance termination].
Instance terminated by ASMB, pid = 188518

节点2:集群日志:
2020-05-01 22:47:33.505:
[cssd(133163)]CRS-1625:Node travelskydba-rac, number 1, was shut down
2020-05-01 22:47:33.517:
[cssd(133163)]CRS-1601:CSSD Reconfiguration complete. Active nodes are travelskydba2-rac travelskydba3-rac .


节点3集群日志:
2020-05-01 22:47:33.505:
[cssd(141351)]CRS-1625:Node travelskydba-rac, number 1, was shut down
2020-05-01 22:47:33.517:
[cssd(141351)]CRS-1601:CSSD Reconfiguration complete. Active nodes are tr730e67-rac travelskydba3-rac .

也就是说节点1集群完全停下来,耗时22:37-22:47 10分钟,hang期间,让值班员联系硬件重启服务器,但是在硬件重启操作前,CRS却停下来了。

保险起见,还是让硬件重启一下服务器(因21点左右,硬件团队同事刚刚对此节点集群的私网心跳线进行更换)。

继续阅读

ORA-15028: ASM file ‘+DG_ARC***’ not dropped; currently being accessed 案例

昨日,接到硬件团队同事电话,告知一个数据库出现内存耗尽,需要重启。查询了一下配置,确认此数据库database_role为 physical standby,随后硬件团队同事进行了服务器重启。

服务器重启过后,启动数据库,但是数据库登录依然缓慢,因为我们配置了broker,会自动启用实时应用:

alter database recover managed standby database using current logfile disconnect from session;

但是奇怪的是,数据库并没有应用日志,一直在处于waiting for log的状态,等待的日志文件已经在备库本地,无奈之下,手工注册日志后数据库进行了日志应用,但是很块MRP0进程继续处于 waiting for log的状态 ,再次注册日志却抛出了

ORA-00311: cannot read header from archiveed log
ORA-27072: File I/O error

看来这次故障并不是简单的内存耗尽,严重怀疑是存储链路层面出现了问题。向硬件团队反馈后,准备断掉主备库DataGuard关系,等存储问题恢复后进行重构。

将DataGuard断掉后,验证一下主库归档备份的状态,发现了更加奇怪的现象:

    RECID OPERATION                         STATUS                  START_TIME          END_TIME            OBJECT_TYPE
---------- --------------------------------- ----------------------- ------------------- ------------------- -------------	
     54558 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 23:30:37 2020-04-16 23:30:50 ARCHIVELOG
     54556 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 23:19:42 2020-04-16 23:19:57 ARCHIVELOG
     54554 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 23:15:10 2020-04-16 23:15:24 ARCHIVELOG
     54552 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 22:58:50 2020-04-16 22:59:12 ARCHIVELOG
     54550 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 21:58:51 2020-04-16 21:59:08 ARCHIVELOG
     54548 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 20:58:51 2020-04-16 20:59:04 ARCHIVELOG
     54546 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 19:58:51 2020-04-16 19:59:04 ARCHIVELOG


input archived log thread=1 sequence=54511 RECID=217898 STAMP=1037920783
input archived log thread=2 sequence=54482 RECID=217899 STAMP=1037920784
channel ch00: starting piece 1 at 16-APR-20
channel ch00: finished piece 1 at 16-APR-20
piece handle=arch_Travelskydba_std_27847_1_1037920788 tag=TAG20200416T231948 comment=API Version 2.0,MMS Version 5.0.0.0
channel ch00: backup set complete, elapsed time: 00:00:07
channel ch00: deleting archived log(s)
archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_1_seq_54501.352.1037918509 RECID=217870 STAMP=1037918508
RMAN-08118: WARNING: could not delete the following archived redo log
archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_1_seq_54501.352.1037918509 thread=1 sequence=54501
error from target database:
ORA-15028: ASM file '+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_1_seq_54501.352.1037918509' not dropped; currently being accessed

archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_1_seq_54502.355.1037918877 RECID=217873 STAMP=1037918877
RMAN-08118: WARNING: could not delete the following archived redo log
archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_1_seq_54502.355.1037918877 thread=1 sequence=54502
error from target database:
ORA-15028: ASM file '+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_1_seq_54502.355.1037918877' not dropped; currently being accessed

archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_2_seq_54474.308.1037918875 RECID=217872 STAMP=1037918875
RMAN-08118: WARNING: could not delete the following archived redo log
archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_2_seq_54474.308.1037918875 thread=2 sequence=54474
error from target database:
ORA-15028: ASM file '+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_2_seq_54474.308.1037918875' not dropped; currently being accessed

archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_2_seq_54481.388.1037920773 RECID=217897 STAMP=1037920772
archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_1_seq_54510.318.1037920771 RECID=217896 STAMP=1037920770
archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_1_seq_54511.332.1037920783 RECID=217898 STAMP=1037920783
archived log file name=+DG_ARC/Travelskydba_std/archivelog/2020_04_16/thread_2_seq_54482.321.1037920785 RECID=217899 STAMP=1037920784
Finished backup at 16-APR-20

released channel: ch00	

备份日志中出现了ORA-15028错误, error from target database:
ORA-15028: ASM file not dropped; currently being accessed , 可是主备库关系已经断开,备库理论上不需要主库的归档日志并且主库归档进程也是正常的。

再次登录备库,找到了异常的跟因:备库已经shutdown complete,但是操作系统中PMON与arch进程却没有终止(即使用操作系统kill -9也无果)。

随后将备库操作系统关机,再次执行主库的归档备份后, ORA-15028 错误不再出现,备份成功完成

     RECID OPERATION                         STATUS                  START_TIME          END_TIME            OBJECT_TYPE
---------- --------------------------------- ----------------------- ------------------- ------------------- -------------
     54560 BACKUP                            COMPLETED               2020-04-16 23:32:37 2020-04-16 23:32:50 ARCHIVELOG
     54558 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 23:30:37 2020-04-16 23:30:50 ARCHIVELOG
     54556 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 23:19:42 2020-04-16 23:19:57 ARCHIVELOG
     54554 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 23:15:10 2020-04-16 23:15:24 ARCHIVELOG
     54552 BACKUP                            COMPLETED WITH WARNINGS 2020-04-16 22:58:50 2020-04-16 22:59:12 ARCHIVELOG

随后在Oracle Support上搜寻了一下这个错误,文档 Doc ID 1466848.1 进行了非常详细的描述,对应本次的故障案例,备库ARCH进程夯死,传输接收时出现异常,从而主库没有及时的释放archive log的lock,RMAN备份archivelog后无法删除文件。

ORACLE Flashback+DataGuard进行数据恢复

上周,在公司中心范围内进行了一次技术分享,主要阐述一下Datagurd技术的使用范围,其中涉及一些Flashback技术的应用:我们主要利用Flashback闪回特性进行应用人员的误操作,进行恢复。

实际操作流程如下:

主库:Travelskydba            
备库:Travelskydba_std

主库:
SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON
--------- ------------------------------ -------------------- ---------------- --- ------------------
Travelskydba Travelskydba                   READ WRITE           PRIMARY          YES NO


备库:	
SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON
--------- ------------------------------ -------------------- ---------------- --- ------------------
Travelskydba Travelskydba_std                  MOUNTED              PHYSICAL STANDBY YES YES

SQL> show parameter db_flashback_retention_target   

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_flashback_retention_target        integer     1440

主库执行事务:
SQL> create user albert identified by albert;

User created.

SQL> grant connect,resource to albert;

Grant succeeded.

SQL> conn albert/albert
Connected.

SQL> create table Travelskydba (id1 number,id2 number,id3 number);

Table created.

SQL> insert into Travelskydba values(1,2,3);

1 row created.

SQL> insert into Travelskydba values(4,5,6);

1 row created.

SQL> commit;

Commit complete.

主库:
SQL>  select * from albert.Travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3
         4          5          6

SQL> conn / as sysdba
Connected.
SQL> select current_scn from v$database;

CURRENT_SCN
-----------
   20619438


--开启实时应用的备库已经完成redo应用,数据已经保持与主库同步了:
SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON
--------- ------------------------------ -------------------- ---------------- --- ------------------
Travelskydba Travelskydba_std                  READ ONLY WITH APPLY PHYSICAL STANDBY YES YES

SQL> select * from albert.travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         4          5          6

有时我们接收到应用人员的电话,数据误删除了,或者对数据的调研不足,导致实时应用的数据当做历史数据删除掉了,请求DBA需要进行数据恢复,此时我们可以利用DataGuard 备库已经开启的flashback特性进行数据恢复。

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON
--------- ------------------------------ -------------------- ---------------- --- ------------------
Travelskydba Travelskydba_std                  READ ONLY WITH APPLY PHYSICAL STANDBY YES YES   
SQL> select OLDEST_FLASHBACK_SCN,to_char(OLDEST_FLASHBACK_TIME,'yyyymmdd hh24:mi:ss') time,RETENTION_TARGET,FLASHBACK_SIZE,ESTIMATED_FLASHBACK_SIZE from v$flashback_database_log;

OLDEST_FLASHBACK_SCN TIME              RETENTION_TARGET FLASHBACK_SIZE ESTIMATED_FLASHBACK_SIZE
-------------------- ----------------- ---------------- -------------- ------------------------
            20611974 20200325 12:16:12             1440      629145600               1323294720


如果数据量较少,主库可以利用DBLINK 进行数据恢复:
SQL> show user
USER is "ALBERT"
SQL> insert into travelskydba select * from albert.travelskydba@TRAVELSKYDBA_LINK where id1=1;

1 row created.

SQL> commit;

Commit complete.

SQL> select * from travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3
         4          5          6


--数据恢复后,备库可以继续进行实时应用		 
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup;
ORACLE instance started.

Total System Global Area 4108759040 bytes
Fixed Size                  2259680 bytes
Variable Size            1543505184 bytes
Database Buffers         2550136832 bytes
Redo Buffers               12857344 bytes
Database mounted.
Database opened.		

--备库:
SQL> set linesize 200 
SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from gv$managed_standby where process in ('LNS','RFS','MRP0') and  THREAD# <> 0
  2  /

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
MRP0      APPLYING_LOG N/A               1       6942          1             5            5

SQL> /

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
MRP0      APPLYING_LOG N/A               1       6942          1             5            5
RFS       CLOSING      UNKNOWN           1       6944          1             0            0
RFS       OPENING      UNKNOWN           1       6943          0             0            0
RFS       OPENING      UNKNOWN           1       6942          0             0            0
RFS       OPENING      UNKNOWN           1       6946          0             0            0
RFS       OPENING      UNKNOWN           1       6948          0             0            0

6 rows selected.
 
 
--主库:
SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from gv$managed_standby where process in ('LNS','RFS','MRP0') and  THREAD# <> 0;

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
LNS       WRITING      LNS               1       6950          5             0            0

SQL> /

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
LNS       WRITING      LNS               1       6950          6             0            0

SQL> /
/
PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
LNS       WRITING      LNS               1       6950          7             0            0

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /oracle/archive_Travelskydba/Traveskydba
Oldest online log sequence     6948
Next log sequence to archive   6950
Current log sequence           6950

上述情形适用于误删除数据量较小的情况,如果遇到误删除的数据较多,利用DB_LINK的方式效率相对是比较低下的,我们需要将备库打开至read write mode,将数据导入导出,从而达到进行数据恢复的目的

--备库:
SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON
--------- ------------------------------ -------------------- ---------------- --- ------------------
Travelskydba  Travelskydba_std                  READ ONLY WITH APPLY PHYSICAL STANDBY YES YES



SQL> select * from albert.travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3

主库误删除用户:
SQL> drop user albert cascade;

User dropped.

备库:
SQL> select * from albert.travelskydba;
select * from albert.travelskydba
                     *
ERROR at line 1:
ORA-00942: table or view does not exist


SQL> alter database recover managed standby database cancel;

Database altered.

SQL> create restore point tra guarantee flashback database; --创建resotre point

Restore point created.

SQL>  flashback database to scn 20668180;

Flashback complete.


SQL>  ALTER DATABASE ACTIVATE STANDBY DATABASE;--将备库打开至可读可写状态

Database altered.

SQL> alter database open;

Database altered.


SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON
--------- ------------------------------ -------------------- ---------------- --- ------------------
Travelskydba Travelskydba_std                  READ WRITE           PRIMARY          YES YES


SQL> select * from albert.travelskydba; --数据恢复完成,将此用户导出,导入恢复至主库

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3
		 
SQL> create directory tra_dir as '/home/ora11g/';

Directory created.

SQL> 
SQL> grant read,write,execute on directory tra_dir to public;

Grant succeeded.


[ora11g@albert ]$ expdp schemas=ALBERT directory=tra_dir dumpfile=albert_dmp.20200325 logfile=expdp_20200325.log 

Export: Release 11.2.0.4.0 - Production on Wed Mar 25 09:59:48 2020

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Username: / as sysdba

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "SYS"."SYS_EXPORT_SCHEMA_01":  /******** AS SYSDBA schemas=ALBERT directory=tra_dir dumpfile=albert_dmp.20200325 logfile=expdp_20200325.log 
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 64 KB
Processing object type SCHEMA_EXPORT/USER
Processing object type SCHEMA_EXPORT/SYSTEM_GRANT
Processing object type SCHEMA_EXPORT/ROLE_GRANT
Processing object type SCHEMA_EXPORT/DEFAULT_ROLE
Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA
Processing object type SCHEMA_EXPORT/TABLE/TABLE
. . exported "ALBERT"."TRAVELSKYDBA"                     5.812 KB       1 rows
Master table "SYS"."SYS_EXPORT_SCHEMA_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYS.SYS_EXPORT_SCHEMA_01 is:
  /home/ora11g/albert_dmp.20200325
Job "SYS"."SYS_EXPORT_SCHEMA_01" successfully completed at Wed Mar 25 10:00:08 2020 elapsed 0 00:00:17

[ora11g@albert ~]$ scp albert_dmp.20200325 ******:/home/ora11g
The authenticity of host '*******' can't be established.
RSA key fingerprint is 45:b9:52:12:86:55:8d:e5:33:8b:b8:59:02:68:81:ff.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '******' (RSA) to the list of known hosts.
ora11g@******'s password: 
albert_dmp.20200325                                                                                                                                     100%  180KB 180.0KB/s   00:00 


--导入主库:

SQL> create directory tra_dir as '/home/ora11g/';

Directory created.

SQL> grant read,write,execute on directory tra_dir to public;

Grant succeeded.


[ora11g@vm-kvm17146-ora ~]$ impdp schemas=ALBERT directory=tra_dir dumpfile=albert_dmp.20200325 logfile=impdp_20200325.log 

Import: Release 11.2.0.4.0 - Production on Wed Mar 25 20:28:48 2020

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Username: / as sysdba

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Master table "SYS"."SYS_IMPORT_SCHEMA_01" successfully loaded/unloaded
Starting "SYS"."SYS_IMPORT_SCHEMA_01":  /******** AS SYSDBA schemas=ALBERT directory=tra_dir dumpfile=albert_dmp.20200325 logfile=impdp_20200325.log 
Processing object type SCHEMA_EXPORT/USER
Processing object type SCHEMA_EXPORT/SYSTEM_GRANT
Processing object type SCHEMA_EXPORT/ROLE_GRANT
Processing object type SCHEMA_EXPORT/DEFAULT_ROLE
Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA
Processing object type SCHEMA_EXPORT/TABLE/TABLE
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
. . imported "ALBERT"."TRAVELSKYDBA"                     5.812 KB       1 rows
Job "SYS"."SYS_IMPORT_SCHEMA_01" successfully completed at Wed Mar 25 20:28:59 2020 elapsed 0 00:00:04

主库数据恢复:
SQL> select * from albert.travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3

主库数据已经恢复,但是我们现在要继续处理备库,让备库继续进行实时应用

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> 
SQL> 
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount; 
ORACLE instance started.

Total System Global Area 4108759040 bytes
Fixed Size                  2259680 bytes
Variable Size            1577059616 bytes
Database Buffers         2516582400 bytes
Redo Buffers               12857344 bytes
Database mounted.
SQL> flashback database to restore point tra; --闪回至restore point

Flashback complete.


SQL> alter database convert to physical standby; 更新control_file,database_role为physical standby

Database altered.


SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON
--------- ------------------------------ -------------------- ---------------- --- ------------------
Travelskydba Travelskydba_std                  MOUNTED              PHYSICAL STANDBY YES YES



SQL> alter database recover managed standby database using current logfile disconnect from session;

Database altered.


主库:
SQL> set linesize 200 
SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from  gv$managed_standby where process in('LNS','RFS','MRP0') and thread# <>0;

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
LNS       WRITING      LNS               1       6972         90             0            0

SQL> alter system switch logfile;

System altered.

SQL> set linesize 200 
SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from  gv$managed_standby where process in('LNS','RFS','MRP0') and thread# <>0
  2  ;

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
LNS       WRITING      LNS               1       6973          4             0            0


备库:
SQL> set linesize 200 
SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from  gv$managed_standby where process in('LNS','RFS','MRP0') and thread# <>0;

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
RFS       IDLE         LGWR              1       6973          8             0            0
MRP0      APPLYING_LOG N/A               1       6973          8             5            5


SQL> select * from albert.travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3

我们还可以利用Oracle snapshot方式进行备库的read write模式打开

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba  Travelskydba                   READ WRITE           PRIMARY          YES NO                    20755536


SQL>  select * from albert.travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3



--备库:

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba Travelskydba_std                  READ ONLY WITH APPLY PHYSICAL STANDBY YES YES                   20755688

SQL> select * from albert.travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3


--主库误删除数据:


SQL> drop user albert cascade;

User dropped.

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba Travelskydba                   READ WRITE           PRIMARY          YES NO                    20755761

   

--备库:
SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba Travelskydba_std                  READ ONLY WITH APPLY PHYSICAL STANDBY YES YES                   20755770
   

SQL> select * from albert.travelskydba;
select * from albert.travelskydba
                     *
ERROR at line 1:
ORA-00942: table or view does not exist

准备恢复数据:

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 4108759040 bytes
Fixed Size                  2259680 bytes
Variable Size            1577059616 bytes
Database Buffers         2516582400 bytes
Redo Buffers               12857344 bytes
Database mounted.
SQL> show parameter broker

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
dg_broker_config_file1               string      /oracle/oradata_Travelskydba/Travelskydba/dataguardconfig/dg_config0
                                                 1.ora
dg_broker_config_file2               string      /oracle/oradata_Travelskydba/Travelskydba/dataguardconfig/dg_config0
                                                 2.ora
dg_broker_start                      boolean     TRUE
SQL> alter system set dg_broker_start=false;

System altered.

SQL> alter database recover managed standby database cancel;

Database altered.

SQL> flashback database to scn 20755688;

Flashback complete.

SQL> alter database open;

Database altered.

SQL>  select * from albert.travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 4108759040 bytes
Fixed Size                  2259680 bytes
Variable Size            1577059616 bytes
Database Buffers         2516582400 bytes
Redo Buffers               12857344 bytes
Database mounted.
SQL> 
SQL> alter database convert to snapshot standby;

Database altered.

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba Travelskydba_std                  MOUNTED              SNAPSHOT STANDBY YES YES                          0


SQL> alter database open; 

Database altered.

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba  Travelskydba_std                  READ WRITE           SNAPSHOT STANDBY YES YES                   20755867


SQL> select * from albert.travelskydba;

       ID1        ID2        ID3
---------- ---------- ----------
         1          2          3

[ora11g@albert ~]$  expdp schemas=ALBERT directory=tra_dir dumpfile=albert_dmp.20200326 logfile=expdp_20200326.log 

Export: Release 11.2.0.4.0 - Production on Thu Mar 26 01:12:23 2020

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Username: / as sysdba

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "SYS"."SYS_EXPORT_SCHEMA_01":  /******** AS SYSDBA schemas=ALBERT directory=tra_dir dumpfile=albert_dmp.20200326 logfile=expdp_20200326.log 
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 64 KB
Processing object type SCHEMA_EXPORT/USER
Processing object type SCHEMA_EXPORT/SYSTEM_GRANT
Processing object type SCHEMA_EXPORT/ROLE_GRANT
Processing object type SCHEMA_EXPORT/DEFAULT_ROLE
Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA
Processing object type SCHEMA_EXPORT/TABLE/TABLE
. . exported "ALBERT"."TRAVELSKYDBA"                     5.812 KB       1 rows
Master table "SYS"."SYS_EXPORT_SCHEMA_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYS.SYS_EXPORT_SCHEMA_01 is:
  /home/ora11g/albert_dmp.20200326
Job "SYS"."SYS_EXPORT_SCHEMA_01" successfully completed at Thu Mar 26 01:12:44 2020 elapsed 0 00:00:18

--数据导入过程省略--

--备库恢复至standby模式:

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 4108759040 bytes
Fixed Size                  2259680 bytes
Variable Size            1577059616 bytes
Database Buffers         2516582400 bytes
Redo Buffers               12857344 bytes
Database mounted.
SQL> alter database convert to physical standby;

Database altered.

SQL> shutdown immediate;
ORA-01507: database not mounted


ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 4108759040 bytes
Fixed Size                  2259680 bytes
Variable Size            1577059616 bytes
Database Buffers         2516582400 bytes
Redo Buffers               12857344 bytes
Database mounted.

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba Travelskydba_std                  MOUNTED              PHYSICAL STANDBY YES YES                   20755688  --之前闪回的SCN点,flashback database to scn 20755688;




SQL> alter database recover managed standby database using current logfile disconnect from session;

Database altered.

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba Travelskydba_std                  MOUNTED              PHYSICAL STANDBY YES YES                   20757332


SQL> /

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba Travelskydba_std                  MOUNTED              PHYSICAL STANDBY YES YES                   20757332


SQL> alter database open; 

Database altered.

SQL> select name,DB_UNIQUE_NAME,open_mode,database_role,force_logging,flashback_on,current_scn from v$database;

NAME      DB_UNIQUE_NAME                 OPEN_MODE            DATABASE_ROLE    FOR FLASHBACK_ON       CURRENT_SCN
--------- ------------------------------ -------------------- ---------------- --- ------------------ -----------
Travelskydba Travelskydba_std                  MOUNTED              PHYSICAL STANDBY YES YES                   20757498 

主库:
SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from  gv$managed_standby where process in('LNS','RFS','MRP0') and thread# <>0;

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
LNS       WRITING      LNS               1         19        206             0            0

SQL> 
SQL> alter system switch logfile;

System altered.

SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from  gv$managed_standby where process in('LNS','RFS','MRP0') and thread# <>0;

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
LNS       WRITING      LNS               1         20          6             0            0

备库:
SQL> set linesize 200 
SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from  gv$managed_standby where process in('LNS','RFS','MRP0') and thread# <>0;

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
MRP0      APPLYING_LOG N/A               1         19        142             5            5
RFS       IDLE         LGWR              1         19        142             0            0

SQL> /

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
MRP0      APPLYING_LOG N/A               1         20         55             5            5
RFS       IDLE         LGWR              1         20         55             0            0

ORA-01196/ORA-01110特殊恢复案例一则

前不久,我们一套生产数据库的dataguard的备库出现实时应用过慢,数据库服务器IOWAIT高的情况,为了避免主库归档日志出现大量挤压,所以值班同事临时应急将主备库关系配置断掉,后续再进行DataGuard重构。除此之外,我们准备将备库打开进行一系列测试,试图探寻数据库服务器IOWAIT高的原因并解决DataGuard延迟的问题。

第二天,同事尝试打开备库,但是报了ORA-01196与错误

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-10458: standby database requires recovery
ORA-01196: file 1 is inconsistent due to a failed media recovery session
ORA-01110: data file 1: '/oracle/oradata_Travelskydba1/Travelskydba_DG/system01.dbf'

此错误因为备库还未将redo log与archive恢复至一致状态,数据库此时处于不一致状态是无法打开的。但是,备库所有的归档日志已经被删除,此时数据库无法进行一个完全恢复

SQL> set line 1000 pagesize 200
SQL> select name,open_mode,CONTROLFILE_CHANGE#,CHECKPOINT_CHANGE#,CURRENT_SCN,DATABASE_ROLE from V$database;

NAME      OPEN_MODE            CONTROLFILE_CHANGE# CHECKPOINT_CHANGE# CURRENT_SCN DATABASE_ROLE
--------- -------------------- ------------------- ------------------ ----------- ----------------
Travelskydba  MOUNTED                          7669047             966955     7669046 PHYSICAL STANDBY

SQL>  ALTER DATABASE ACTIVATE STANDBY DATABASE;
 ALTER DATABASE ACTIVATE STANDBY DATABASE
*
ERROR at line 1:
ORA-01196: file 1 is inconsistent due to a failed media recovery session
ORA-01110: data file 1: '/oracle/oradata_Travelskydba1/Travelskydba_DG/system01.dbf'

此时,我准备重建控制文件,进行不完全恢复操作

QL> alter database backup controlfile to trace as '/home/ora11g/albert.ctl';

Database altered.

SQL> shutdown immediate;
ORA-01109: database not open


Database dismounted.
ORACLE instance shut down.
SQL> 

albert.ctl文件内容:
STARTUP NOMOUNT
CREATE CONTROLFILE REUSE DATABASE "ORA1459T" RESETLOGS  ARCHIVELOG
    MAXLOGFILES 16
    MAXLOGMEMBERS 3
    MAXDATAFILES 100
    MAXINSTANCES 8
    MAXLOGHISTORY 292
LOGFILE
  GROUP 1 (
    '/oracle/redo_Travelskydba1/Travelskydba_DG/redo11.log',
    '/oracle/redo_Travelskydba1/Travelskydba_DG/redo12.log'
  ) SIZE 300M BLOCKSIZE 512,
  GROUP 2 (
    '/oracle/redo_Travelskydba1/Travelskydba_DG/redo21.log',
    '/oracle/redo_Travelskydba1/Travelskydba_DG/redo22.log'
  ) SIZE 300M BLOCKSIZE 512,
  GROUP 3 (
    '/oracle/redo_Travelskydba1/Travelskydba_DG/redo31.log',
    '/oracle/redo_Travelskydba1/Travelskydba_DG/redo32.log'
  ) SIZE 300M BLOCKSIZE 512
-- STANDBY LOGFILE
--   GROUP 4 '/oracle/redo_Travelskydba1/Travelskydba_DG/standby_redo04.log'  SIZE 300M BLOCKSIZE 512,
--   GROUP 5 '/oracle/redo_Travelskydba1/Travelskydba_DG/standby_redo05.log'  SIZE 300M BLOCKSIZE 512
DATAFILE
  '/oracle/oradata_Travelskydba1/Travelskydba_DG/system01.dbf',
  '/oracle/oradata_Travelskydba1/Travelskydba_DG/sysaux01.dbf',
  '/oracle/oradata_Travelskydba1/Travelskydba_DG/undotbs01.dbf',
  '/oracle/oradata_Travelskydba1/Travelskydba_DG/users01.dbf',
  '/oracle/oradata_Travelskydba1/Travelskydba_DG/users02.dbf',
  '/oracle/oradata_Travelskydba1/Travelskydba_DG/users03.dbf',
  '/oracle/oradata_Travelskydba1/Travelskydba_DG/users04.dbf'
CHARACTER SET AL32UTF8
;

[ora11g@ALBERT ~]$ !sql
sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Sat Mar 28 12:36:00 2020

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> @albert.ctl
ORACLE instance started.

Total System Global Area 3290345472 bytes
Fixed Size                  2257720 bytes
Variable Size            1191185608 bytes
Database Buffers         2080374784 bytes
Redo Buffers               16527360 bytes

Control file created.  --控制文件重建完毕

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 7669047 generated at 11/04/2019 23:00:11 needed for thread 1
ORA-00289: suggestion :
/oracle/oradata_Travelskydba1/Travelskydba_DG/1_247_1017488189.dbf
ORA-00280: change 7669047 for thread 1 is in sequence #247


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
cancel
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01196: file 1 is inconsistent due to a failed media recovery session
ORA-01110: data file 1: '/oracle/oradata_Travelskydba1/Travelskydba_DG/system01.dbf'


ORA-01112: media recovery not started

SQL> alter database open resetlogs;  --仍然无法打开数据库
alter database open resetlogs
*
ERROR at line 1:
ORA-01196: file 1 is inconsistent due to a failed media recovery session
ORA-01110: data file 1: '/oracle/oradata_Travelskydba1/Travelskydba_DG/system01.dbf'

修改数据库alter system set “_allow_resetlogs_corruption” =true scope=spfile;参数,再次进行不完全恢复(此操作会有严重丢失数据风险,生产系统请慎用)

SQL> startup nomount;
ORACLE instance started.

Total System Global Area 3290345472 bytes
Fixed Size                  2257720 bytes
Variable Size            1191185608 bytes
Database Buffers         2080374784 bytes
Redo Buffers               16527360 bytes
SQL> alter system set "_allow_resetlogs_corruption" =true scope=spfile;

System altered.
SQL> shutdown abort;
ORACLE instance shut down.

SQL> @albert.ctl
ORACLE instance started.

Total System Global Area 3290345472 bytes
Fixed Size                  2257720 bytes
Variable Size            1191185608 bytes
Database Buffers         2080374784 bytes
Redo Buffers               16527360 bytes

Control file created.

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 7669047 generated at 11/04/2019 23:00:11 needed for thread 1
ORA-00289: suggestion :
/oracle/oradata_Travelskydba1/Travelskydba_DG/1_247_1017488189.dbf
ORA-00280: change 7669047 for thread 1 is in sequence #247


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
cancel
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01196: file 1 is inconsistent due to a failed media recovery session
ORA-01110: data file 1: '/oracle/oradata_Travelskydba1/Travelskydba_DG/system01.dbf'


ORA-01112: media recovery not started


SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [2662], [0], [7669055], [0],
[7996264], [4194545], [], [], [], [], [], []
Process ID: 12086
Session ID: 1705 Serial number: 3

此时出现ORA-00600 2262错误,此时距离成功已经不远,只需要推进SCN即可,
SQL> shutdown abort;
ORACLE instance shut down.
SQL> @albert.ctl
ORACLE instance started.

Total System Global Area 3290345472 bytes
Fixed Size                  2257720 bytes
Variable Size            1191185608 bytes
Database Buffers         2080374784 bytes
Redo Buffers               16527360 bytes

Control file created.

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 7779049 generated at 03/28/2020 12:54:12 needed for thread 1
ORA-00289: suggestion : /oracle/oradata_Travelskydba1/Travelskydba_DG/1_2_1036241636.dbf
ORA-00280: change 7779049 for thread 1 is in sequence #2


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
cancel
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/oracle/oradata_Travelskydba1/Travelskydba_DG/system01.dbf'


ORA-01112: media recovery not started


SQL> oradebug setmypid
Statement processed.
SQL> oradebug dumpvar sga kcsgscn_
kcslf kcsgscn_ [06001AE70, 06001AEA0) = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6001AB50 00000000
SQL>  oradebug poke 0x06001AE70 4 8279047 
BEFORE: [06001AE70, 06001AE74) = 00000000
AFTER:  [06001AE70, 06001AE74) = 007E5407

SQL> alter database open resetlogs;  --数据库成功打开

Database altered.


SQL> set line 1000 pagesize 200    
SQL> select name,open_mode,CONTROLFILE_CHANGE#,CHECKPOINT_CHANGE#,CURRENT_SCN,DATABASE_ROLE from V$database;

NAME      OPEN_MODE            CONTROLFILE_CHANGE# CHECKPOINT_CHANGE# CURRENT_SCN DATABASE_ROLE
--------- -------------------- ------------------- ------------------ ----------- ----------------
Travelskydba  READ WRITE                       8280898            8279049     8280903 PRIMARY

主库 drop tablespace后备库存储空间不释放案例一则

上周,应用人员误操作在平台上tablespace name 打错名字,从而想删除表空间进行重新建立。重建后此数据库备库触发了文件系统满的报警,备库文件系统使用率已经为100%,但是数据库值班人员观察到备库之前建立错误的数据文件已经不在,为何文件系统存储中文件却没有删掉?之前怎么没有出现如此的情况?

我随后做了一个实验给他,原因是在drop tablespace时,没有加上including contents and datafiles

--主库:
SQL> select username,count(*) from gv$session group by username order by 2 desc;

USERNAME                         COUNT(*)
------------------------------ ----------
                                      174
ZABBIX                                  6
SYS                                     5
DBSNMP                                  4
PUBLIC                                  1

SQL> 
SQL> 
SQL> 
SQL> 
SQL> 
SQL> 
SQL> create tablespace albert datafile '+DG_DATA' size 1g;

Tablespace created.


SQL> select name from v$tablespace; 

NAME
------------------------------
SYSTEM
SYSAUX
UNDOTBS1
TEMP
UNDOTBS2
USERS
TBS_TEST_1025_1
TBS_TEST_1025_2
ALBERT --刚刚创建的表空间

     FILE# NAME
---------- -----------------------------------------------------------------
         1 +DG_DATA/Travelskydba/datafile/system.259.1002299725
         2 +DG_DATA/Travelskydba/datafile/sysaux.260.1002299741
         3 +DG_DATA/Travelskydba/datafile/undotbs1.261.1002299753
         4 +DG_DATA/Travelskydba/datafile/undotbs2.263.1002299769
         5 +DG_DATA/Travelskydba/datafile/users.264.1002299781
         6 +DG_DATA/Travelskydba/datafile/tbs_test_1025_1.269.1022603323
         7 +DG_DATA/Travelskydba/datafile/tbs_test_1025_2.270.1022603429
         8 +DG_DATA/Travelskydba/datafile/albert.271.1035217783


--备库:
SQL> select name from v$tablespace;

NAME
------------------------------
SYSTEM
SYSAUX
UNDOTBS1
TEMP
UNDOTBS2
USERS
TBS_TEST_1025_1
TBS_TEST_1025_2
ALBERT


SQL> select file#,name from v$datafile;

     FILE# NAME
---------- -----------------------------------------------------------------
         1 /oracle/oradata_Travelskydba/Travelskydba/system.259.1002299725
         2 /oracle/oradata_Travelskydba/Travelskydba/sysaux.260.1002299741
         3 /oracle/oradata_Travelskydba/Travelskydba/undotbs1.261.1002299753
         4 /oracle/oradata_Travelskydba/Travelskydba/undotbs2.263.1002299769
         5 /oracle/oradata_Travelskydba/Travelskydba/users.264.1002299781
         6 /oracle/oradata_Travelskydba/Travelskydba/tbs_test_1025_1.269.1022603323
         7 /oracle/oradata_Travelskydba/Travelskydba/tbs_test_1025_2.270.1022603429
         8 /oracle/oradata_Travelskydba/Travelskydba/albert.271.1035217783

8 rows selected.

一、加入including contents and datafiles

--主库删除:
SQL> drop tablespace albert including contents and datafiles;

Tablespace dropped.



--备库确认,正常删除掉
SQL> select file#,name from v$datafile;

     FILE# NAME
---------- -----------------------------------------------------------------
         1 /oracle/oradata_Travelskydba/Travelskydba/system.259.1002299725
         2 /oracle/oradata_Travelskydba/Travelskydba/sysaux.260.1002299741
         3 /oracle/oradata_Travelskydba/Travelskydba/undotbs1.261.1002299753
         4 /oracle/oradata_Travelskydba/Travelskydba/undotbs2.263.1002299769
         5 /oracle/oradata_Travelskydba/Travelskydba/users.264.1002299781
         6 /oracle/oradata_Travelskydba/Travelskydba/tbs_test_1025_1.269.1022603323
         7 /oracle/oradata_Travelskydba/Travelskydba/tbs_test_1025_2.270.1022603429

7 rows selected.

SQL> select name from v$tablespace;

NAME
------------------------------
SYSTEM
SYSAUX
UNDOTBS1
TEMP
UNDOTBS2
USERS
TBS_TEST_1025_1
TBS_TEST_1025_2

8 rows selected.


total 68780320
drwxr-xr-x 2 ora11g dba        4096 Oct 14 11:51 dataguardconfig
-rw-r----- 1 ora11g dba    24526848 Oct 23 15:31 stdctl02.ctl.bak
-rw-r----- 1 ora11g dba    24526848 Oct 23 15:31 stdctl01.ctl.bak
-rw-r----- 1 ora11g dba 10485768192 Oct 23 15:34 temp.262.1002299765
-rw-r----- 1 ora11g dba  2097160192 Mar 16 15:56 users.264.1002299781
-rw-r----- 1 ora11g dba 21474844672 Mar 16 15:56 tbs_test_1025_2.270.1022603429
-rw-r----- 1 ora11g dba 21474844672 Mar 16 15:56 tbs_test_1025_1.269.1022603323
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:35 sysaux.260.1002299741
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:35 undotbs1.261.1002299753
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:35 undotbs2.263.1002299769
-rw-r----- 1 ora11g dba  7340040192 Mar 16 16:35 system.259.1002299725
-rw-r----- 1 ora11g dba    24526848 Mar 16 16:35 stdctl02.ctl
-rw-r----- 1 ora11g dba    24526848 Mar 16 16:35 stdctl01.ctl

此时文件已经正常删除掉。


二、主库不加including contents and datafiles;

--主库创建文件:
SQL> create tablespace albert_leung datafile '+DG_DATA' size 2g;

Tablespace created.

--备库确认:

     FILE# NAME
---------- -----------------------------------------------------------------
         1 /oracle/oradata_Travelskydba/Travelskydba/system.259.1002299725
         2 /oracle/oradata_Travelskydba/Travelskydba/sysaux.260.1002299741
         3 /oracle/oradata_Travelskydba/Travelskydba/undotbs1.261.1002299753
         4 /oracle/oradata_Travelskydba/Travelskydba/undotbs2.263.1002299769
         5 /oracle/oradata_Travelskydba/Travelskydba/users.264.1002299781
         6 /oracle/oradata_Travelskydba/Travelskydba/tbs_test_1025_1.269.1022603323
         7 /oracle/oradata_Travelskydba/Travelskydba/tbs_test_1025_2.270.1022603429
         8 /oracle/oradata_Travelskydba/Travelskydba/albert_leung.271.1035218283

8 rows selected.

SQL> select name from v$tablespace; 

NAME
------------------------------
SYSTEM
SYSAUX
UNDOTBS1
TEMP
UNDOTBS2
USERS
TBS_TEST_1025_1
TBS_TEST_1025_2
ALBERT_LEUNG


drwxr-xr-x 2 ora11g dba        4096 Oct 14 11:51 dataguardconfig
-rw-r----- 1 ora11g dba    24526848 Oct 23 15:31 stdctl02.ctl.bak
-rw-r----- 1 ora11g dba    24526848 Oct 23 15:31 stdctl01.ctl.bak
-rw-r----- 1 ora11g dba 10485768192 Oct 23 15:34 temp.262.1002299765
-rw-r----- 1 ora11g dba  2097160192 Mar 16 15:56 users.264.1002299781
-rw-r----- 1 ora11g dba 21474844672 Mar 16 15:56 tbs_test_1025_2.270.1022603429
-rw-r----- 1 ora11g dba 21474844672 Mar 16 15:56 tbs_test_1025_1.269.1022603323
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:35 sysaux.260.1002299741
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:38 undotbs1.261.1002299753
-rw-r----- 1 ora11g dba  2147491840 Mar 16 16:38 albert_leung.271.1035218283
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:38 undotbs2.263.1002299769
-rw-r----- 1 ora11g dba  7340040192 Mar 16 16:38 system.259.1002299725
-rw-r----- 1 ora11g dba    24526848 Mar 16 16:38 stdctl02.ctl
-rw-r----- 1 ora11g dba    24526848 Mar 16 16:38 stdctl01.ctl


--主库删除文件:
SQL> drop tablespace albert_leung;

Tablespace dropped.

--备库数据字典:

SQL> select name from v$tablespace; 

NAME
------------------------------
SYSTEM
SYSAUX
UNDOTBS1
TEMP
UNDOTBS2
USERS
TBS_TEST_1025_1
TBS_TEST_1025_2

8 rows selected

SQL> select FILE#,NAME from v$datafile;

     FILE# NAME
---------- -----------------------------------------------------------------
         2 /oracle/oradata_Travelskydba/Travelskydba/sysaux.260.1002299741
         3 /oracle/oradata_Travelskydba/Travelskydba/undotbs1.261.1002299753
         4 /oracle/oradata_Travelskydba/Travelskydba/undotbs2.263.1002299769
         5 /oracle/oradata_Travelskydba/Travelskydba/users.264.1002299781
         6 /oracle/oradata_Travelskydba/Travelskydba/tbs_test_1025_1.269.1022603323
         7 /oracle/oradata_Travelskydba/Travelskydba/tbs_test_1025_2.270.1022603429

--可以看到,备库中的数据字典中已经删除掉,也就是说controlfile中该记录已经被删除

--但是观察到,备库文件系统,物理的数据文件还在,并没有删除。
[ora11g@vm-vmw45661-ora Travelskydba]$ ls -l
total 68780320
-rw-r----- 1 ora11g dba  2147491840 Mar 16 16:38 albert_leung.271.1035218283   --主库CHECKPOINT后不更新
drwxr-xr-x 2 ora11g dba        4096 Oct 14 11:51 dataguardconfig
-rw-r----- 1 ora11g dba    24526848 Mar 16 16:43 stdctl01.ctl
-rw-r----- 1 ora11g dba    24526848 Oct 23 15:31 stdctl01.ctl.bak
-rw-r----- 1 ora11g dba    24526848 Mar 16 16:43 stdctl02.ctl
-rw-r----- 1 ora11g dba    24526848 Oct 23 15:31 stdctl02.ctl.bak
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:41 sysaux.260.1002299741
-rw-r----- 1 ora11g dba  7340040192 Mar 16 16:41 system.259.1002299725
-rw-r----- 1 ora11g dba 21474844672 Mar 16 15:56 tbs_test_1025_1.269.1022603323
-rw-r----- 1 ora11g dba 21474844672 Mar 16 15:56 tbs_test_1025_2.270.1022603429
-rw-r----- 1 ora11g dba 10485768192 Oct 23 15:34 temp.262.1002299765
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:41 undotbs1.261.1002299753
-rw-r----- 1 ora11g dba  5242888192 Mar 16 16:41 undotbs2.263.1002299769
-rw-r----- 1 ora11g dba  2097160192 Mar 16 15:56 users.264.1002299781
[ora11g@vm-vmw45661-ora Travelskydba]$ 

--此时,可以使用操作系统命令rm删除掉文件系统中命名错误的数据文件,因为controlfile中已经没有此记录,所以不会涉及数据库的实例恢复。

这个实验,没有加上including contents and datafiles 我只是回答了他第一个问题,但是第二个问题:“为什么之前我删除tablespace的时候,也没有加,怎么就没事呢?

根因在于Oracle Managed Files,这次出问题的此套备库没有使用OMF特性,也就是drop tablespace后,Oracle不会帮你自动的去清理物理文件,只会删除controlfile中的记录,但是之前同事所提到的没事的状况,是因为备库为RAC+ASM架构,默认采用了OMF管理数据文件,所以及时没有添加 including contents and datafiles 也会帮你自动清理文件。

官方原文如下:

1、 Unlike files that are not managed by the database, when an Oracle managed data file or temp file is dropped, the filename is removed from the control file and the file is automatically deleted from the file 

2、 When recovery of a standby database encounters redo for the deletion of a tablespace, it deletes any Oracle managed data files in the local file system. Note that this is independent of the INCLUDING DATAFILES option issued at the primary database.

数据库性能问题导致应用单次SQL效率10ms变为10s

昨日,同事发来一则微信反馈前一天晚应用SQL查询缓慢,原本10ms内能完成的查询SQL在一时间段内:19号22:10到22:45突然执行效率下降到10分钟,然而在22:45后自动恢复正常,想分析一下根因。数据库为一套2节点RAC,Linux -x86 数据库版本为11.2.0.4。

有了明确的时间段,登录到数据库内看看到底这个期间产生了什么性能问题?

数据库DB_TIME、活动会话数与应用反馈的时间基本吻合,从22点10分开始,数据库中的Active_Session数量开始飙升,并且数据库活动会话基本都被Library Cache Lock与Cursor Pin S wait on X等待事件所阻塞。

故障时间段内Top Event百分比

从ASH中看,活动会话最早被节点1的4327会话所阻塞,并且等待的是LIBRARY CACHE LOCK,随后,开始出现sql尝试软解析,遭遇Cursor pin s wait on x,且被节点1的2060会话所阻塞

我们看一下4237与2060在故障期间在干什么?

4327这个会话为oracle auto_task的sql auto tunning advisor job,并且他持有了Library Cache Lock Handle “0x1f9bc9bb98″导致了会话3877想以Mode=3(X)独占模式持有Library Cache Lock不成功,被阻塞。
2060为正常的应用会话,并且同样等待LIBRARY CACHE LOCK,没有阻塞者,这是一个奇怪的现象

但是2060会话在systemstate dump的trace文件中找到阻塞者,同样被4327会话阻塞

从相同时间段的systemstate dump trace文件中看到,2060会话在等待library cache lock,申请的同样是0x1f9bc9bb98的Library Cache Lock Handle (Mode=2(s) 共享模式),虽然没有Dump出2060的阻塞者,但是从申请的Handle上可以推断,都是被4327会话sql auto tunning advisor job阻塞。

分析结论: 此时间段确实为oracle autotask作业的时间窗口,但是job会话并不会长时间持有Library Cache Lock,这个会话当时在等待db file scattered read,数据库当时物理读非常高并且io链路上没有明显异常,异常点1:此JOB作业长时间持有LIBRARY CACHE LOCK不释放且JOB涉及的SQL造成物理读很高,异常点2:内存lock与Mutex争用又会导致cpu Idle低,从而进一步导致io无法调度完成,造成会话JOB工作效率下降,从而没有及时的释放LIBRARY CACHE LOCK的可能性会加大,这些都是非常不正常的行为,所以我决定disable sql tunning advisor功能,规避此问题再次发生。

ORA-00600: internal error code, arguments: [kcbzwfcro_2] 故障案例一则

上周末,一套RAC生产库实例1 alert日志中出现大量ORA-00600报错,触发了监控报警,当时值班人员重启数据库实例后恢复。实例重启应急操作过程中正常一致性关闭未果,而是kill lgwr进程得以让实例终止。

Sat Feb 08 02:34:03 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc  (incident=193747):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193747/Travelskydba1_p000_166150_i193747.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:34:05 2020
 Dumping diagnostic data in directory=[cdmp_20200208023405], requested by (instance=1, osid=166150 (P000)), summary=[incident=193747].
 Sat Feb 08 02:34:05 2020
 Sweep [inc][193747]: completed
 Sweep [inc2][193747]: completed
 Sat Feb 08 02:36:14 2020
 ALTER SYSTEM ARCHIVE LOG
 Sat Feb 08 02:36:14 2020
 Thread 1 advanced to log sequence 29734 (LGWR switch)
   Current log# 2 seq# 29734 mem# 0: +DG_REDO/Travelskydba/onlinelog/group_2.258.975692641
   Current log# 2 seq# 29734 mem# 1: +DG_REDO/Travelskydba/onlinelog/group_2.259.975692641
 Sat Feb 08 02:36:14 2020
 LNS: Standby redo logfile selected for thread 1 sequence 29734 for destination LOG_ARCHIVE_DEST_2
 Archived Log entry 179716 added for thread 1 sequence 29733 ID 0xd45ba67e dest 1:
 ALTER SYSTEM ARCHIVE LOG
 Sat Feb 08 02:36:25 2020
 Thread 1 advanced to log sequence 29735 (LGWR switch)
   Current log# 3 seq# 29735 mem# 0: +DG_REDO/Travelskydba/onlinelog/group_3.260.975692643
   Current log# 3 seq# 29735 mem# 1: +DG_REDO/Travelskydba/onlinelog/group_3.261.975692645
 Sat Feb 08 02:36:25 2020
 Archived Log entry 179724 added for thread 1 sequence 29734 ID 0xd45ba67e dest 1:
 Sat Feb 08 02:36:25 2020
 LNS: Standby redo logfile selected for thread 1 sequence 29735 for destination LOG_ARCHIVE_DEST_2
 Sat Feb 08 02:39:25 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p046_192910.trc  (incident=195547):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_195547/Travelskydba1_p046_192910_i195547.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Sat Feb 08 02:39:26 2020
 Dumping diagnostic data in directory=[cdmp_20200208023926], requested by (instance=1, osid=192910 (P046)), summary=[incident=195547].
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p046_192910.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p046_192910.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:39:26 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 02:39:27 2020
 Sweep [inc][195547]: completed
 Sweep [inc2][195547]: completed
 Sat Feb 08 02:44:39 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p002_190908.trc  (incident=193795):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193795/Travelskydba1_p002_190908_i193795.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p002_190908.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p002_190908.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:44:41 2020
 Dumping diagnostic data in directory=[cdmp_20200208024441], requested by (instance=1, osid=190908 (P002)), summary=[incident=193795].
 Sat Feb 08 02:44:43 2020
 Sweep [inc][193795]: completed
 Sweep [inc2][193795]: completed
 Sat Feb 08 02:50:02 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_53435.trc  (incident=193859):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193859/Travelskydba1_p005_53435_i193859.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_53435.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_53435.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:50:03 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 02:50:03 2020
 Dumping diagnostic data in directory=[cdmp_20200208025003], requested by (instance=1, osid=53435 (P005)), summary=[incident=193859].
 Sat Feb 08 02:50:04 2020
 Sweep [inc][193859]: completed
 Sweep [inc2][193859]: completed
 Sat Feb 08 02:55:15 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc  (incident=193748):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193748/Travelskydba1_p000_166150_i193748.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Sat Feb 08 02:55:16 2020
 Dumping diagnostic data in directory=[cdmp_20200208025516], requested by (instance=1, osid=166150 (P000)), summary=[incident=193748].
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:55:19 2020
 Sweep [inc][193748]: completed
 Sweep [inc2][193748]: completed
 Sat Feb 08 03:00:37 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_104212.trc  (incident=193796):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_104212.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_104212.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:00:37 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 03:05:49 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc  (incident=193749):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:11:10 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_155132.trc  (incident=193787):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_155132.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_155132.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:11:10 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 03:13:29 2020
 Stopping background process CJQ0
 Sat Feb 08 03:16:22 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc  (incident=193750):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:21:43 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_17817.trc  (incident=193531):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_17817.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_17817.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:21:43 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 03:26:55 2020
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:32:15 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_65986.trc  (incident=193491):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_65986.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_65986.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:32:15 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 03:36:11 2020
 ALTER SYSTEM ARCHIVE LOG
 Sat Feb 08 03:36:12 2020
 Thread 1 advanced to log sequence 29736 (LGWR switch)
   Current log# 4 seq# 29736 mem# 0: +DG_REDO/Travelskydba/onlinelog/group_4.262.975692647
   Current log# 4 seq# 29736 mem# 1: +DG_REDO/Travelskydba/onlinelog/group_4.263.975692649
 Sat Feb 08 03:36:12 2020
 LNS: Standby redo logfile selected for thread 1 sequence 29736 for destination LOG_ARCHIVE_DEST_2
 Sat Feb 08 03:36:13 2020
 Archived Log entry 179730 added for thread 1 sequence 29735 ID 0xd45ba67e dest 1:
 Sat Feb 08 03:36:24 2020
 ALTER SYSTEM ARCHIVE LOG
 Sat Feb 08 03:36:27 2020
 Thread 1 advanced to log sequence 29737 (LGWR switch)
   Current log# 5 seq# 29737 mem# 0: +DG_REDO/Travelskydba/onlinelog/group_5.264.975692651
   Current log# 5 seq# 29737 mem# 1: +DG_REDO/Travelskydba/onlinelog/group_5.265.975692651
 Sat Feb 08 03:36:27 2020
 LNS: Standby redo logfile selected for thread 1 sequence 29737 for destination LOG_ARCHIVE_DEST_2
 Archived Log entry 179738 added for thread 1 sequence 29736 ID 0xd45ba67e dest 1:
 Sat Feb 08 03:37:28 2020
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:39:16 2020
 Shutting down instance (immediate)
 Stopping background process SMCO
 Shutting down instance: further logons disabled
 Stopping background process QMNC
 Stopping background process MMNL
 Stopping background process MMON
 License high water mark = 375
 Sat Feb 08 03:42:48 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_114400.trc  (incident=193539):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193539/Travelskydba1_p005_114400_i193539.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_114400.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_114400.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:42:49 2020
 SMON: Restarting fast_start parallel rollback


从alert日志中我们发现,基本上ORA-00600与ORA-10388成对出现,并且SMON在回滚并行进程,ORA-00600的trace文件也是并行进程-Travelskydba1_p00产生。

查看报错trace文件:

*** 2020-02-08 02:55:15.334
*** SESSION ID:(3707.19965) 2020-02-08 02:55:15.334
*** SERVICE NAME:(SYS$BACKGROUND) 2020-02-08 02:55:15.334
 
Incident 193748 created, dump file: /opt/app/ora11g/diag/rdbms/ora624a/ora624a1/incident/incdir_193748/ora624a1_p000_166150_i193748.trc
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []

fatal error occurred while scanning redo
ORA-10388: parallel query server interrupt (failure)
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
Error 10388 occurred while dumping log
ORA-10388: parallel query server interrupt (failure)
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
Parallel Transaction recovery coordinatorcaught exception 10388

*** 2020-02-08 02:55:25.706
*** SESSION ID:(3707.19967) 2020-02-08 02:55:25.707
*** SERVICE NAME:(SYS$BACKGROUND) 2020-02-08 02:55:25.707
 
Parallel Transaction recovery coordinatorcaught exception 10388
DDE: Problem Key 'ORA 600 [kcbzwfcro_2]' was flood controlled (0x2) (incident: 193749)

*** 2020-02-08 03:05:49.294
*** SESSION ID:(3709.435) 2020-02-08 03:05:49.294
*** SERVICE NAME:(SYS$BACKGROUND) 2020-02-08 03:05:49.294
 
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
Potentially stale force-CR-override buffer found before OBJD MISMATCH check.
This issue should be investigated by both cache fusion and space layer.
BH (0x132f60d8d8) file#: 15 rdba: 0x03c01ae3 (15/6883) class: 1 ba: 0x132e6be000
  set: 75 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 0,3
  dbwrid: 2 obj: 136046 objn: 136046 tsn: 10 afn: 15 hint: f
  hash: [0x164ee2ecd0,0xe6f484db0] lru: [0xf1f515308,0xf1f51b968]
  ckptq: [NULL] fileq: [NULL] objq: [0x106f19fe10,0x12cec34f10] objaq: [0xf2f496400,0xf2f473b80]
  st: XCURRENT md: NULL fpin: 'kdgwh05: kdglfe' tch: 0 le: 0x3afe991a8
  flags: block_written_once redo_since_read remote_transfered
          force_cr_override
  LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [1]
GLOBAL CACHE ELEMENT DUMP (address: 0x3afe991a8):
  id1: 0x1ae3 id2: 0xf pkey: OBJ#136046 block: (15/6883)
  lock: X rls: 0x0 acq: 0x0 latch: 23
  flags: 0x20 fair: 0 recovery: 0 fpin: 'kdgwh05: kdglfe'
  bscn: 0x0.1c9bcdb5 bctx: (nil) write: 0 scan: 0x0 
  lcp: (nil) lnk: [NULL] lch: [0x132f60da10,0x132f60da10]
  seq: 72 hist: 113 238 180 113 238 180 113 238 180 113 238 180 113 238 180 113
    238 180 113 238
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x0a200000 sflg: 0x2000 state: XCURRENT tsn: 10 tsh: 0
      addr: 0x132f60d8d8 obj: 136046 cls: DATA bscn: 0x0.1c9bcdb5
  buffer tsn: 10 rdba: 0x03c01ae3 (15/6883)
  scn: 0x0000.1c9bcdb5 seq: 0x01 flg: 0x04 tail: 0xcdb50601
  frmt: 0x02 chkval: 0x84e4 type: 0x06=trans data
堆栈信息:
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
skdstdst()+41        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksedst1()+103        call     skdstdst()           000000000 ? 000000000 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksedst()+39          call     ksedst1()            000000000 ? 000000001 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbkedDefDump()+2746  call     ksedst()             000000000 ? 000000001 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksedmp()+41          call     dbkedDefDump()       000000003 ? 000000002 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksfdmp()+69          call     ksedmp()             000000003 ? 000000002 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgexPhaseII()+1764  call     ksfdmp()             000000003 ? 000000002 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgexExplicitEndInc  call     dbgexPhaseII()       7F96E5CAD730 ? 7F96E5CB4700 ?
()+755                                             7FFDF8B5FDA8 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgeEndDDEInvocatio  call     dbgexExplicitEndInc  7F96E5CAD730 ? 7F96E5CB4700 ?
nImpl()+769                   ()                   7FFDF8B5FDA8 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgeEndSpltInvokOnR  call     dbgeEndDDEInvocatio  7F96E5CAD730 ? 7F96E5CB4700 ?
ec()+270                      nImpl()              7FFDF8B5FDA8 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgePostErrorKGE()+  call     dbgeEndSpltInvokOnR  7F96E5CAD730 ? 7F96E5CB4700 ?
257                           ec()                 7FFDF8B5FDA8 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbkePostKGE_kgsf()+  call     dbgePostErrorKGE()   00C113DA0 ? 7F96E5CBCCC0 ?
71                                                 000002894 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
kgeselv()+276        call     dbkePostKGE_kgsf()   00C113DA0 ? 7F96E5CBCCC0 ?
                                                   000002894 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksesecl0()+162       call     kgeselv()            00C113DA0 ? 7F96E5CBCCC0 ?
                                                   000002894 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000000 ?
ksucin()+147         call     ksesecl0()           00C113DA0 ? 7F96E5CBCCC0 ?
                                                   000002894 ? 000000000 ?
                                                   0000000B0 ? 000000000 ?
krr_thread_read()+8  call     ksucin()             00C113DA0 ? 7F96E5CBCCC0 ?
067                                                000002894 ? 000000000 ?
                                                   0000000B0 ? 000000000 ?
krr_read_buffer()+2  call     krr_thread_read()    7F96E42A0888 ? 000000002 ?
8                                                  000000000 ? 000000000 ?
                                                   7FFDF8B615A0 ? 7FFDF8B615A4 ?
krr_parse_redo()+12  call     krr_read_buffer()    7F96E42A0888 ? 000000002 ?
512                                                000000000 ? 000000000 ?
                                                   7FFDF8B615A0 ? 7FFDF8B615A4 ?
kcra_scan_redo()+11  call     krr_parse_redo()     7F96E42A0888 ? 000000002 ?
867                                                000000000 ? 000000000 ?
                                                   000000000 ? 7FFDF8B615A4 ?
kcra_dump_redo()+21  call     kcra_scan_redo()     7FFDF8B64560 ? 000000001 ?
04                                                 7FFDF8B640CC ? 000000000 ?
                                                   000000000 ? 7FFDF8B615A4 ?
kcra_dump_redo_inte  call     kcra_dump_redo()     7FFDF8B64560 ? 000000001 ?
rnal()+3362                                        7FFDF8B640CC ? 000000000 ?
                                                   000000000 ? 7FFDF8B615A4 ?
kcbzwfcro()+997      call     kcra_dump_redo_inte  7FFDF8B64C40 ? 7FFDF8B64C44 ?
                              rnal()               7FFDF8B64C48 ? 000000001 ?
                                                   1300000000 ? 1F00000000 ?
kcbgcur()+10141      call     kcbzwfcro()          132F60D8D8 ? 7FFDF8B685B0 ?
                                                   1F3953EC68 ? 000000001 ?
                                                   1300000000 ? 1F00000000 ?
ktbgcur()+123        call     kcbgcur()            7FFDF8B685B0 ? 000000002 ?
                                                   0000003C5 ? 000000000 ?
                                                   1300000000 ? 1F00000000 ?
kturGetCurPinForUnd  call     ktbgcur()            7FFDF8B685A0 ? 000000002 ?
o()+264                                            0000003C5 ? 000000000 ?
                                                   1300000000 ? 1F00000000 ?
kdiulk()+16575       call     kturGetCurPinForUnd  7FFDF8B685A0 ? 000000002 ?
                              o()                  0000003C5 ? 000000000 ?
                                                   0063CDCD8 ? 1F00000000 ?
kcoubk()+366         call     kdiulk()             7F96E5CC0E10 ? 18A5AF8518 ?
                                                   000000004 ? 000000080 ?
                                                   000000000 ? 1F00000000 ?
ktprundo_intra()+10  call     kcoubk()             7FFDF8B6874C ? 7F96E5CC0E10 ?
73                                                 18A5AF8518 ? 000000004 ?
                                                   000000080 ? 000000000 ?
ktprits()+1218       call     ktprundo_intra()     18A5620EA0 ? 18AD9DA7D0 ?
                                                   18A5AF8518 ? 000000004 ?
                                                   000000080 ? 000000000 ?
ktprmai()+213        call     ktprits()            7FFDF8B68A40 ? 18AD9DA7D0 ?
                                                   18A5AF8518 ? 000000004 ?
                                                   000000080 ? 000000000 ?
kxfprdp()+1466       call     ktprmai()            7FFDF8B68A40 ? 18AD9DA7D0 ?
                                                   18A5AF8518 ? 000000004 ?
                                                   000000080 ? 000000000 ?
opirip()+913         call     kxfprdp()            7FFDF8B68A40 ? 18AD9DA7D0 ?
                                                   1F4D61F8A0 ? 000000004 ?
                                                   000000080 ? 000000000 ?
opidrv()+603         call     opirip()             000000032 ? 000000004 ?
                                                   7FFDF8B69FE8 ? 000000004 ?
                                                   000000080 ? 000000000 ?
sou2o()+103          call     opidrv()             000000032 ? 000000004 ?
                                                   7FFDF8B69FE8 ? 000000004 ?
                                                   000000080 ? 000000000 ?
opimai_real()+250    call     sou2o()              7FFDF8B69FC0 ? 000000032 ?
                                                   000000004 ? 7FFDF8B69FE8 ?
                                                   000000080 ? 000000000 ?
ssthrdmain()+265     call     opimai_real()        000000000 ? 7FFDF8B6A1B0 ?
                                                   000000004 ? 7FFDF8B69FE8 ?
                                                   000000080 ? 000000000 ?
main()+201           call     ssthrdmain()         000000003 ? 7FFDF8B6A1B0 ?
                                                   000000001 ? 000000000 ?
                                                   000000080 ? 000000000 ?
__libc_start_main()  call     main()               000000003 ? 7FFDF8B6A350 ?
+253                                               000000001 ? 000000000 ?
                                                   000000080 ? 000000000 ?
_start()+41          call     __libc_start_main()  000A2A344 ? 000000001 ?
                                                   7FFDF8B6A348 ? 000000000 ?
                                                   000000080 ? 000000000 ?

根据trace文件,似乎与RAC间CR Block传输覆盖版本冲突有关:Potentially stale force-CR-override buffer found before OBJD MISMATCH check.

带着疑惑搜寻Mos(Oracle Support)后,确认此问题为Bug导致(ORA-600:[kcbzwfcro_2] Reported in Alert.log (Doc ID 2085507.1)

故障现象,Call Stack信息基本一致,并且Oracle给予了Workround

_gc_override_force_cr参数简单解释:if TRUE, try to override force-cr requests,在Undo header中会标识这个CR块是否为force_cr_override的flag,但是这个代表何意没有找到具体的官方解释,猜测是在节点间传输时本地CR与远程CR校验时出现错误,本地与远程均认为自己是最新版本CR。本次故障找到原因,但是此参数的含义还有继续找寻资料确认。

一则Bug导致数据库All Database Sessions Hang

春节前夕,一套核心交易系统出现短时间夯死的状况出现(后台数据库版本信息:Oracle 11.2.0.4 RAC 2节点,Linux X86服务器),活动会话数又飙升至近300+

20200127-1如上图所示,数据库从1月19日 09:34分左右开始,活动会话逐渐攀升,数据库在当时存在大量等待事件log file switch (checkpoint not complete),排查了硬件IO链路、应用事务量均正常,还是要在数据库软件层面进行分析。

    可以看到当时数据库活动会话概览,这些在等待log file switch (checkpoint incomplete)的会话均被节点2的1369会话所阻塞。

20200127-2

那么2节点的1369会话这时在干些什么?从ASH中看到,1369会话(LGWR)正在读取控制文件,等待事件为control file sequential read,但是这个动作是很快的动作,理论上不应该造成长时间的阻塞。

20200127-3

当时的DIAG进程做的HANGANALYZE中我们可以看到如下的Chain关系,开始试图找寻是否是Bug引起,因为种种客观现象实在是有悖常理,无法从原理上解释

20200127-4

从Oracle Mos上找寻相关Bug描述,果然有一个刚刚发布不久的Bug:

Bug 29635990 – All Database Sessions Hang Due To Checkpoint Incomplete (Doc ID
29635990.8)
20200127-5

但是这个Bug并未标注影响版本范围,但是和本次故障现象基本一致,遗憾的是并没有相关的Workrond并且此Bug只能在12.2版本中有相关Patch进行修复。

为了规避此问题的再次爆发,我们在数据库中设置了一个Job进行规避,调度时间按照redo 所有group#均变为ACTIVE且准备进行覆盖的时间周期,本系统为5分钟。

declare
jobno number;
begin dbms_job.submit(
jobno,
'begin
 execute immediate ''alter system checkpoint'';
end;',
sysdate,
'sysdate + 5/1440');
commit;
end;
/

每5分钟进行强制全量检查点,避免数据库再次产生log file switch (checkpoint not complete)的现象,从而导致产生更大数据库夯死风险。

删除空子表约束导致主表事务row cache lock/Library cache lock/Cursor pin s wait on x

春节前夕,公司一套核心核心再核心的数据库出现严重性能问题(版本信息:11.2.0.4),数据库活动会话飙升到300+,幸好当时处置得当,人员判断与应急迅速,及时kill了相关blocker session从而使得影响没有扩大。20200114-1从应用负责人员获悉,此时间段应用人员正在执行上线操作,具体操作为删除一张num rows=0的空表的外键约束。

alter table TRAVELSKYDBA drop constraint FK_TRAVELSKYDBA;

ASH中活动会话基本都在等待Library Cache Lock,Cursor Pin s Wait On X,Row Cache Lock上,阻塞情况如下:

20200123-11

继续分析,发现了一个更加奇怪的现象:

20200123-13

SESSION_ID=1827的会话为应用上线alter table删除外键约束的语句会话,SESSION_ID=73,2119,4000均为对删除外键的主表的正常业务DML语句(INSERT,DELETE,UPDATE),从ASH中看到1827会话正在等待正常业务事务(SESSION_ID=73)的LIBRARY CACHE LOCK,但是SESSION_ID=73在等 ROW CACHE LOCK,被ALTER TABLE (SESSION_ID=1827)阻塞,行成了死锁,并且ROW CACHE LOCK 的P1=8(dc_object与grants,dc_object),随后的正常事务的DML SQL想去软解析时遭遇Cursor Pin S wait on X。

从数据库DIAG进程自动dump的system state dump与Hanganalyze trace中确认,这些应用的SQL因为Rolling Invalidate Window Exceeded触发了硬解析,alter table 子表 drop constraints时,这项操作是需要申请主表DML游标Object Handle的Library Cache Lock的,主表涉及SQL在硬解析,ALTER TABLE求LIBRARY CACHE LOCK而不得被阻塞,但是主表DML硬解析申请ROW CACHE LOCK,理论上非常快速内存操作为何长时间不释放?产生死锁?并且Oracle并没有处理这个死锁

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
 Oracle session identified by:
 {
 instance: 1 (#######)
 os id: 111552
 process id: 161, oracle@###-rac
 session id: 73
 session serial #: 181
 }
 is waiting for 'row cache lock' with wait info:
 {
 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 time in wait: 0.612302 sec
 heur. time in wait: 2 min 23 sec
 timeout after: 2.387698 sec
 wait id: 24235
 blocking: 286 sessions
 wait history:
 * time between current wait and wait #1: 0.000034 sec
 1. event: 'row cache lock'
 time waited: 3.000963 sec
 wait id: 24234 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 * time between wait #1 and #2: 0.000039 sec
 2. event: 'row cache lock'
 time waited: 3.000978 sec
 wait id: 24233 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 * time between wait #2 and #3: 0.000064 sec
 3. event: 'row cache lock'
 time waited: 3.000913 sec
 wait id: 24232 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 }
 and is blocked by
 => Oracle session identified by:
 {
 instance: 1 (*******)
 os id: 92902
 process id: 112, oracle@*****-rac
 session id: 1827
 session serial #: 19367
 }
 which is waiting for 'library cache lock' with wait info:
 {
 p1: 'handle address'=0xfeec0a948
 p2: 'lock address'=0xfe24600b8
 p3: '100*mode+namespace'=0x15dba00010003
 time in wait: 2 min 23 sec
 timeout after: 12 min 36 sec
 wait id: 1540
 blocking: 286 sessions
 wait history:
 * time between current wait and wait #1: 0.000150 sec
 1. event: 'gc current block 2-way'
 time waited: 0.000285 sec
 wait id: 1539 p1: ''=0x1
 p2: ''=0x131f3
 p3: ''=0x1
 * time between wait #1 and #2: 0.000131 sec
 2. event: 'gc current block 2-way'
 time waited: 0.000374 sec
 wait id: 1538 p1: ''=0x1
 p2: ''=0x131de
 p3: ''=0x1
 * time between wait #2 and #3: 0.002456 sec
 3. event: 'gc cr block 2-way'
 time waited: 0.000405 sec
 wait id: 1537 p1: ''=0x1
 p2: ''=0x131f3
 p3: ''=0x1
 }
 and is blocked by the session at the start of the chain.

Chain 1 Signature: 'library cache lock'<='row cache lock' (cycle)
Chain 1 Signature Hash: 0x52a8007d
----------------------------------------------------------------
并且 p3: '100*mode+namespace'=0x15dba00010003 --003 LOCK MODE=X

SQL> select to_number('15dba','xxxxxxx') from dual;
TO_NUMBER('15DBA','XXXXXXX')
---------------------------
 89530  --主表对象OBJECT_ID

20200123-16

SESSION_ID=73在sql net message from client中居然阻塞了1827的 alter table?

分析到这里,越来越感觉是Bug导致,从Mos上寻找相关案例发现一篇“现象相似”的文章,可惜版本无法对应上,不能确定是因为此Bug导致,虽然最后没有定位具体的Bug Number,但是此问题要告诫警示大家,不要再事务繁忙期间进行alter table DDL操作,虽然是一张空表,但是和其对象的相关OBJECT也存在受到牵连的可能性,从而产生跟大的影响。

20200123-17