DataGuard 切换流程梳理(三)- 命令行 Switchover

虽然我们生产目前DataGuard的切换主要依赖于 DataGuard Broker工具,但是日常工作中经常会发现Broker出现配置问题导致无法正常切换,准备好DG切换的标准SQL*PLUS的命令行方式才是万全之策。

1.切换前确认主备库是否具备切换条件,关注是否Error与delay_time
select thread#,
       DB_UNIQUE_NAME,
       DEST_NAME,
       TARGET,
       DATABASE_MODE,
       error,
       RECOVERY_MODE,
       GAP_STATUS,
       APPLIED_SEQ#,
       current_scn,
       applied_scn,
       to_char((scn_to_timestamp(current_scn)),'yyyymmdd hh24:mi:ss') current_scn_time,
       to_char((scn_to_timestamp(applied_scn)),'yyyymmdd hh24:mi:ss') applied_scn_time,
       scn_to_timestamp(current_scn)- scn_to_timestamp(current_scn) as delay_time
  from (SELECT al.thread#,
               ads.dest_id,
               ads.DEST_NAME,
               ads.TYPE || ' ' || ad.TARGET TARGET,
               ADS.DATABASE_MODE,
               ads.STATUS,
               ads.error,
               ads.RECOVERY_MODE,
               ads.DB_UNIQUE_NAME,
               ads.DESTINATION,
               ads.GAP_STATUS,
               (SELECT sequence#
                  FROM v$log na
                 WHERE na.thread# = al.thread#
                   and status = 'CURRENT') Current_Seq#,
               Last_Archived,
               al.APPLIED_SEQ#,
               ad.applied_scn,
               current_scn
          FROM (SELECT dest_id,
                       thread#,
                       d.current_scn,
                       MAX(sequence#) Last_Archived,
                       max(CASE
                             WHEN APPLIED = 'YES' THEN
                              sequence#
                           end) APPLIED_SEQ#
                  FROM v$archived_log V, v$database d
                 WHERE V.resetlogs_change# = d.RESETLOGS_CHANGE#
                 group by dest_id, thread#, d.current_scn) al,
               v$archive_dest_status ads,
               v$archive_dest AD
         WHERE al.dest_id(+) = ads.dest_id 
           AND ads.STATUS != 'INACTIVE'
           AND AD.DEST_ID = ADS.DEST_ID and ad.APPLIED_SCN <> 0
         ORDER BY al.thread#, ads.dest_id);

SQL> select name,database_role,switchover_status from v$database;
       --确认为 to stanby或SESSIONS ACTIVE状态
2.SERVICE处理:
备库添加:
SQL> show parameter service;

NAME                  TYPE        VALUE
------------------- ----------- ------------------------------
service_names       string      ORA1436B,dbaas

SQL> alter system set service_names='ORA1436B,dbaas,Albert';

System altered.

SQL> show parameter service;

NAME                  TYPE        VALUE
----------------    ----------- ------------------------------
service_names         string      ORA1436B,dbaas,Albert


主库修改:
SQL> show parameter service; 

NAME                  TYPE        VALUE
--------------     ----------- ------------------------------
service_names         string      ORA1436B,dbaas,Albert

SQL> alter system set service_names='ORA1436B,dbaas';

System altered.

SQL> show parameter service;

NAME                   TYPE        VALUE
-------------        ----------- ------------------------------
service_names        string      ORA1436B,dbaas
3.RAC主库切换前保留1个节点
登陆到主库其他节点
SQL>SHUTDOWN IMMEDIATE
4.主库SWITCHOVER
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY; 
或 
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY with session shutdown; 

注:如果有主库有活动的 session 可以使用此选项,否则转换会遇到 ORA-01093 错误或直接KILL活动会话
5.主库打开至Mount状态:
sqlplus / as sysdba 
SQL> startup mount 
6.备库SWITCH OVER操作
检查备库状态,是否可以切换
SQL> select name,database_role,switchover_status from v$database; 
--确认是否为to primary

原备库取消恢复进程:
SQL> alter database recover managed standby database cancel;

原备库(新主库)switchover 操作
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY; 

可能会抛出报ORA-16139: media recovery required,是是由于未应用日志引起,西药先执行执行 :
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT 
FROM SESSION;
7.原备库(新主库)重启
SQL> shutdown immediate 
SQL> startup
如不重启,新备库MRP进程则会一直处于WAIT_FOR_LOG的状态
8.原主库,新备库启用日志应用: 
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT; 
9.确认主备库数据库数据字典状态与数据库日志有无异常,报错
select name,db_unique_name,database_role,open_mode from v$database;

主库进行日志切换,查看备库的日志,看是否开始接收并应用
SQL>select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from gv$managed_standby where process in ('LNS','RFS','MRP0') and THREAD# <> 0

数据库性能问题导致应用单次SQL效率10ms变为10s

昨日,同事发来一则微信反馈前一天晚应用SQL查询缓慢,原本10ms内能完成的查询SQL在一时间段内:19号22:10到22:45突然执行效率下降到10分钟,然而在22:45后自动恢复正常,想分析一下根因。数据库为一套2节点RAC,Linux -x86 数据库版本为11.2.0.4。

有了明确的时间段,登录到数据库内看看到底这个期间产生了什么性能问题?

数据库DB_TIME、活动会话数与应用反馈的时间基本吻合,从22点10分开始,数据库中的Active_Session数量开始飙升,并且数据库活动会话基本都被Library Cache Lock与Cursor Pin S wait on X等待事件所阻塞。

故障时间段内Top Event百分比

从ASH中看,活动会话最早被节点1的4327会话所阻塞,并且等待的是LIBRARY CACHE LOCK,随后,开始出现sql尝试软解析,遭遇Cursor pin s wait on x,且被节点1的2060会话所阻塞

我们看一下4237与2060在故障期间在干什么?

4327这个会话为oracle auto_task的sql auto tunning advisor job,并且他持有了Library Cache Lock Handle “0x1f9bc9bb98″导致了会话3877想以Mode=3(X)独占模式持有Library Cache Lock不成功,被阻塞。
2060为正常的应用会话,并且同样等待LIBRARY CACHE LOCK,没有阻塞者,这是一个奇怪的现象

但是2060会话在systemstate dump的trace文件中找到阻塞者,同样被4327会话阻塞

从相同时间段的systemstate dump trace文件中看到,2060会话在等待library cache lock,申请的同样是0x1f9bc9bb98的Library Cache Lock Handle (Mode=2(s) 共享模式),虽然没有Dump出2060的阻塞者,但是从申请的Handle上可以推断,都是被4327会话sql auto tunning advisor job阻塞。

分析结论: 此时间段确实为oracle autotask作业的时间窗口,但是job会话并不会长时间持有Library Cache Lock,这个会话当时在等待db file scattered read,数据库当时物理读非常高并且io链路上没有明显异常,异常点1:此JOB作业长时间持有LIBRARY CACHE LOCK不释放且JOB涉及的SQL造成物理读很高,异常点2:内存lock与Mutex争用又会导致cpu Idle低,从而进一步导致io无法调度完成,造成会话JOB工作效率下降,从而没有及时的释放LIBRARY CACHE LOCK的可能性会加大,这些都是非常不正常的行为,所以我决定disable sql tunning advisor功能,规避此问题再次发生。

DataGuard 切换流程梳理(二)-Broker fiailover

DG Failover一般作为数据库主库故障的应急处理,不保证备库数据完整性,请谨慎选择。

Failover之后不存在主库、备库关系。如主库可启动到Mount阶段,可在主库执行alter system flush redo to ‘备库’;确保主库online redo log内容同步到备库节点,最大限度上保证Failover前后数据不丢失。

主库已发生故障,且不可恢复,启动failover应急手段处理:
[ora11g@]$ dgmgrl
DGMGRL for Linux: Version 11.2.0.4.0 - 64bit Production

Copyright (c) 2000, 2009, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys
Password:
Connected.
DGMGRL> show configuration

Configuration - DG_Config

  Protection Mode: MaxPerformance
  Databases:
    ora1436b    - Primary database
    ora1436bstd - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
ORA-01034: ORACLE not available
ORA-16625: cannot reach database "ora1436b"
DGM-17017: unable to determine configuration status

DGMGRL> show database ora1436bstd

Database - ora1436bstd

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   0 seconds (computed 105 seconds ago)
  Apply Lag:       0 seconds (computed 105 seconds ago)
  Apply Rate:      49.00 KByte/s
  Real Time Query: OFF
  Instance(s):
    ORA1436B

  Database Warning(s):
    ORA-16857: standby disconnected from redo source for longer than specified threshold

Database Status:
WARNING

--执行failover操作:
DGMGRL> failover to ora1436bstd
Performing failover NOW, please wait...
Failover succeeded, new primary is "ora1436bstd"

确认备库v$database视图 database_role是否为“primary ”并且观察alert日志是否出现异常报错信息。

ORA-00600: internal error code, arguments: [kcbzwfcro_2] 故障案例一则

上周末,一套RAC生产库实例1 alert日志中出现大量ORA-00600报错,触发了监控报警,当时值班人员重启数据库实例后恢复。实例重启应急操作过程中正常一致性关闭未果,而是kill lgwr进程得以让实例终止。

Sat Feb 08 02:34:03 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc  (incident=193747):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193747/Travelskydba1_p000_166150_i193747.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:34:05 2020
 Dumping diagnostic data in directory=[cdmp_20200208023405], requested by (instance=1, osid=166150 (P000)), summary=[incident=193747].
 Sat Feb 08 02:34:05 2020
 Sweep [inc][193747]: completed
 Sweep [inc2][193747]: completed
 Sat Feb 08 02:36:14 2020
 ALTER SYSTEM ARCHIVE LOG
 Sat Feb 08 02:36:14 2020
 Thread 1 advanced to log sequence 29734 (LGWR switch)
   Current log# 2 seq# 29734 mem# 0: +DG_REDO/Travelskydba/onlinelog/group_2.258.975692641
   Current log# 2 seq# 29734 mem# 1: +DG_REDO/Travelskydba/onlinelog/group_2.259.975692641
 Sat Feb 08 02:36:14 2020
 LNS: Standby redo logfile selected for thread 1 sequence 29734 for destination LOG_ARCHIVE_DEST_2
 Archived Log entry 179716 added for thread 1 sequence 29733 ID 0xd45ba67e dest 1:
 ALTER SYSTEM ARCHIVE LOG
 Sat Feb 08 02:36:25 2020
 Thread 1 advanced to log sequence 29735 (LGWR switch)
   Current log# 3 seq# 29735 mem# 0: +DG_REDO/Travelskydba/onlinelog/group_3.260.975692643
   Current log# 3 seq# 29735 mem# 1: +DG_REDO/Travelskydba/onlinelog/group_3.261.975692645
 Sat Feb 08 02:36:25 2020
 Archived Log entry 179724 added for thread 1 sequence 29734 ID 0xd45ba67e dest 1:
 Sat Feb 08 02:36:25 2020
 LNS: Standby redo logfile selected for thread 1 sequence 29735 for destination LOG_ARCHIVE_DEST_2
 Sat Feb 08 02:39:25 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p046_192910.trc  (incident=195547):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_195547/Travelskydba1_p046_192910_i195547.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Sat Feb 08 02:39:26 2020
 Dumping diagnostic data in directory=[cdmp_20200208023926], requested by (instance=1, osid=192910 (P046)), summary=[incident=195547].
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p046_192910.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p046_192910.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:39:26 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 02:39:27 2020
 Sweep [inc][195547]: completed
 Sweep [inc2][195547]: completed
 Sat Feb 08 02:44:39 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p002_190908.trc  (incident=193795):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193795/Travelskydba1_p002_190908_i193795.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p002_190908.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p002_190908.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:44:41 2020
 Dumping diagnostic data in directory=[cdmp_20200208024441], requested by (instance=1, osid=190908 (P002)), summary=[incident=193795].
 Sat Feb 08 02:44:43 2020
 Sweep [inc][193795]: completed
 Sweep [inc2][193795]: completed
 Sat Feb 08 02:50:02 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_53435.trc  (incident=193859):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193859/Travelskydba1_p005_53435_i193859.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_53435.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_53435.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:50:03 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 02:50:03 2020
 Dumping diagnostic data in directory=[cdmp_20200208025003], requested by (instance=1, osid=53435 (P005)), summary=[incident=193859].
 Sat Feb 08 02:50:04 2020
 Sweep [inc][193859]: completed
 Sweep [inc2][193859]: completed
 Sat Feb 08 02:55:15 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc  (incident=193748):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193748/Travelskydba1_p000_166150_i193748.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Sat Feb 08 02:55:16 2020
 Dumping diagnostic data in directory=[cdmp_20200208025516], requested by (instance=1, osid=166150 (P000)), summary=[incident=193748].
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 02:55:19 2020
 Sweep [inc][193748]: completed
 Sweep [inc2][193748]: completed
 Sat Feb 08 03:00:37 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_104212.trc  (incident=193796):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_104212.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_104212.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:00:37 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 03:05:49 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc  (incident=193749):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:11:10 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_155132.trc  (incident=193787):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_155132.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_155132.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:11:10 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 03:13:29 2020
 Stopping background process CJQ0
 Sat Feb 08 03:16:22 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc  (incident=193750):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:21:43 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_17817.trc  (incident=193531):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_17817.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_17817.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:21:43 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 03:26:55 2020
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:32:15 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_65986.trc  (incident=193491):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_65986.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_65986.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:32:15 2020
 SMON: Restarting fast_start parallel rollback
 Sat Feb 08 03:36:11 2020
 ALTER SYSTEM ARCHIVE LOG
 Sat Feb 08 03:36:12 2020
 Thread 1 advanced to log sequence 29736 (LGWR switch)
   Current log# 4 seq# 29736 mem# 0: +DG_REDO/Travelskydba/onlinelog/group_4.262.975692647
   Current log# 4 seq# 29736 mem# 1: +DG_REDO/Travelskydba/onlinelog/group_4.263.975692649
 Sat Feb 08 03:36:12 2020
 LNS: Standby redo logfile selected for thread 1 sequence 29736 for destination LOG_ARCHIVE_DEST_2
 Sat Feb 08 03:36:13 2020
 Archived Log entry 179730 added for thread 1 sequence 29735 ID 0xd45ba67e dest 1:
 Sat Feb 08 03:36:24 2020
 ALTER SYSTEM ARCHIVE LOG
 Sat Feb 08 03:36:27 2020
 Thread 1 advanced to log sequence 29737 (LGWR switch)
   Current log# 5 seq# 29737 mem# 0: +DG_REDO/Travelskydba/onlinelog/group_5.264.975692651
   Current log# 5 seq# 29737 mem# 1: +DG_REDO/Travelskydba/onlinelog/group_5.265.975692651
 Sat Feb 08 03:36:27 2020
 LNS: Standby redo logfile selected for thread 1 sequence 29737 for destination LOG_ARCHIVE_DEST_2
 Archived Log entry 179738 added for thread 1 sequence 29736 ID 0xd45ba67e dest 1:
 Sat Feb 08 03:37:28 2020
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p000_166150.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:39:16 2020
 Shutting down instance (immediate)
 Stopping background process SMCO
 Shutting down instance: further logons disabled
 Stopping background process QMNC
 Stopping background process MMNL
 Stopping background process MMON
 License high water mark = 375
 Sat Feb 08 03:42:48 2020
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_114400.trc  (incident=193539):
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/incident/incdir_193539/Travelskydba1_p005_114400_i193539.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_114400.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Errors in file /opt/app/ora11g/diag/rdbms/Travelskydba/Travelskydba1/trace/Travelskydba1_p005_114400.trc:
 ORA-10388: parallel query server interrupt (failure)
 ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
 Sat Feb 08 03:42:49 2020
 SMON: Restarting fast_start parallel rollback


从alert日志中我们发现,基本上ORA-00600与ORA-10388成对出现,并且SMON在回滚并行进程,ORA-00600的trace文件也是并行进程-Travelskydba1_p00产生。

查看报错trace文件:

*** 2020-02-08 02:55:15.334
*** SESSION ID:(3707.19965) 2020-02-08 02:55:15.334
*** SERVICE NAME:(SYS$BACKGROUND) 2020-02-08 02:55:15.334
 
Incident 193748 created, dump file: /opt/app/ora11g/diag/rdbms/ora624a/ora624a1/incident/incdir_193748/ora624a1_p000_166150_i193748.trc
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []

fatal error occurred while scanning redo
ORA-10388: parallel query server interrupt (failure)
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
Error 10388 occurred while dumping log
ORA-10388: parallel query server interrupt (failure)
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
Parallel Transaction recovery coordinatorcaught exception 10388

*** 2020-02-08 02:55:25.706
*** SESSION ID:(3707.19967) 2020-02-08 02:55:25.707
*** SERVICE NAME:(SYS$BACKGROUND) 2020-02-08 02:55:25.707
 
Parallel Transaction recovery coordinatorcaught exception 10388
DDE: Problem Key 'ORA 600 [kcbzwfcro_2]' was flood controlled (0x2) (incident: 193749)

*** 2020-02-08 03:05:49.294
*** SESSION ID:(3709.435) 2020-02-08 03:05:49.294
*** SERVICE NAME:(SYS$BACKGROUND) 2020-02-08 03:05:49.294
 
ORA-00600: internal error code, arguments: [kcbzwfcro_2], [136046], [1], [32768], [0], [], [], [], [], [], [], []
Potentially stale force-CR-override buffer found before OBJD MISMATCH check.
This issue should be investigated by both cache fusion and space layer.
BH (0x132f60d8d8) file#: 15 rdba: 0x03c01ae3 (15/6883) class: 1 ba: 0x132e6be000
  set: 75 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 0,3
  dbwrid: 2 obj: 136046 objn: 136046 tsn: 10 afn: 15 hint: f
  hash: [0x164ee2ecd0,0xe6f484db0] lru: [0xf1f515308,0xf1f51b968]
  ckptq: [NULL] fileq: [NULL] objq: [0x106f19fe10,0x12cec34f10] objaq: [0xf2f496400,0xf2f473b80]
  st: XCURRENT md: NULL fpin: 'kdgwh05: kdglfe' tch: 0 le: 0x3afe991a8
  flags: block_written_once redo_since_read remote_transfered
          force_cr_override
  LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [1]
GLOBAL CACHE ELEMENT DUMP (address: 0x3afe991a8):
  id1: 0x1ae3 id2: 0xf pkey: OBJ#136046 block: (15/6883)
  lock: X rls: 0x0 acq: 0x0 latch: 23
  flags: 0x20 fair: 0 recovery: 0 fpin: 'kdgwh05: kdglfe'
  bscn: 0x0.1c9bcdb5 bctx: (nil) write: 0 scan: 0x0 
  lcp: (nil) lnk: [NULL] lch: [0x132f60da10,0x132f60da10]
  seq: 72 hist: 113 238 180 113 238 180 113 238 180 113 238 180 113 238 180 113
    238 180 113 238
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x0a200000 sflg: 0x2000 state: XCURRENT tsn: 10 tsh: 0
      addr: 0x132f60d8d8 obj: 136046 cls: DATA bscn: 0x0.1c9bcdb5
  buffer tsn: 10 rdba: 0x03c01ae3 (15/6883)
  scn: 0x0000.1c9bcdb5 seq: 0x01 flg: 0x04 tail: 0xcdb50601
  frmt: 0x02 chkval: 0x84e4 type: 0x06=trans data
堆栈信息:
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
skdstdst()+41        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksedst1()+103        call     skdstdst()           000000000 ? 000000000 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksedst()+39          call     ksedst1()            000000000 ? 000000001 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbkedDefDump()+2746  call     ksedst()             000000000 ? 000000001 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksedmp()+41          call     dbkedDefDump()       000000003 ? 000000002 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksfdmp()+69          call     ksedmp()             000000003 ? 000000002 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgexPhaseII()+1764  call     ksfdmp()             000000003 ? 000000002 ?
                                                   7FFDF8B56F90 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgexExplicitEndInc  call     dbgexPhaseII()       7F96E5CAD730 ? 7F96E5CB4700 ?
()+755                                             7FFDF8B5FDA8 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgeEndDDEInvocatio  call     dbgexExplicitEndInc  7F96E5CAD730 ? 7F96E5CB4700 ?
nImpl()+769                   ()                   7FFDF8B5FDA8 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgeEndSpltInvokOnR  call     dbgeEndDDEInvocatio  7F96E5CAD730 ? 7F96E5CB4700 ?
ec()+270                      nImpl()              7FFDF8B5FDA8 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbgePostErrorKGE()+  call     dbgeEndSpltInvokOnR  7F96E5CAD730 ? 7F96E5CB4700 ?
257                           ec()                 7FFDF8B5FDA8 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
dbkePostKGE_kgsf()+  call     dbgePostErrorKGE()   00C113DA0 ? 7F96E5CBCCC0 ?
71                                                 000002894 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
kgeselv()+276        call     dbkePostKGE_kgsf()   00C113DA0 ? 7F96E5CBCCC0 ?
                                                   000002894 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000002 ?
ksesecl0()+162       call     kgeselv()            00C113DA0 ? 7F96E5CBCCC0 ?
                                                   000002894 ? 7FFDF8B57068 ?
                                                   7FFDF8B5BB10 ? 000000000 ?
ksucin()+147         call     ksesecl0()           00C113DA0 ? 7F96E5CBCCC0 ?
                                                   000002894 ? 000000000 ?
                                                   0000000B0 ? 000000000 ?
krr_thread_read()+8  call     ksucin()             00C113DA0 ? 7F96E5CBCCC0 ?
067                                                000002894 ? 000000000 ?
                                                   0000000B0 ? 000000000 ?
krr_read_buffer()+2  call     krr_thread_read()    7F96E42A0888 ? 000000002 ?
8                                                  000000000 ? 000000000 ?
                                                   7FFDF8B615A0 ? 7FFDF8B615A4 ?
krr_parse_redo()+12  call     krr_read_buffer()    7F96E42A0888 ? 000000002 ?
512                                                000000000 ? 000000000 ?
                                                   7FFDF8B615A0 ? 7FFDF8B615A4 ?
kcra_scan_redo()+11  call     krr_parse_redo()     7F96E42A0888 ? 000000002 ?
867                                                000000000 ? 000000000 ?
                                                   000000000 ? 7FFDF8B615A4 ?
kcra_dump_redo()+21  call     kcra_scan_redo()     7FFDF8B64560 ? 000000001 ?
04                                                 7FFDF8B640CC ? 000000000 ?
                                                   000000000 ? 7FFDF8B615A4 ?
kcra_dump_redo_inte  call     kcra_dump_redo()     7FFDF8B64560 ? 000000001 ?
rnal()+3362                                        7FFDF8B640CC ? 000000000 ?
                                                   000000000 ? 7FFDF8B615A4 ?
kcbzwfcro()+997      call     kcra_dump_redo_inte  7FFDF8B64C40 ? 7FFDF8B64C44 ?
                              rnal()               7FFDF8B64C48 ? 000000001 ?
                                                   1300000000 ? 1F00000000 ?
kcbgcur()+10141      call     kcbzwfcro()          132F60D8D8 ? 7FFDF8B685B0 ?
                                                   1F3953EC68 ? 000000001 ?
                                                   1300000000 ? 1F00000000 ?
ktbgcur()+123        call     kcbgcur()            7FFDF8B685B0 ? 000000002 ?
                                                   0000003C5 ? 000000000 ?
                                                   1300000000 ? 1F00000000 ?
kturGetCurPinForUnd  call     ktbgcur()            7FFDF8B685A0 ? 000000002 ?
o()+264                                            0000003C5 ? 000000000 ?
                                                   1300000000 ? 1F00000000 ?
kdiulk()+16575       call     kturGetCurPinForUnd  7FFDF8B685A0 ? 000000002 ?
                              o()                  0000003C5 ? 000000000 ?
                                                   0063CDCD8 ? 1F00000000 ?
kcoubk()+366         call     kdiulk()             7F96E5CC0E10 ? 18A5AF8518 ?
                                                   000000004 ? 000000080 ?
                                                   000000000 ? 1F00000000 ?
ktprundo_intra()+10  call     kcoubk()             7FFDF8B6874C ? 7F96E5CC0E10 ?
73                                                 18A5AF8518 ? 000000004 ?
                                                   000000080 ? 000000000 ?
ktprits()+1218       call     ktprundo_intra()     18A5620EA0 ? 18AD9DA7D0 ?
                                                   18A5AF8518 ? 000000004 ?
                                                   000000080 ? 000000000 ?
ktprmai()+213        call     ktprits()            7FFDF8B68A40 ? 18AD9DA7D0 ?
                                                   18A5AF8518 ? 000000004 ?
                                                   000000080 ? 000000000 ?
kxfprdp()+1466       call     ktprmai()            7FFDF8B68A40 ? 18AD9DA7D0 ?
                                                   18A5AF8518 ? 000000004 ?
                                                   000000080 ? 000000000 ?
opirip()+913         call     kxfprdp()            7FFDF8B68A40 ? 18AD9DA7D0 ?
                                                   1F4D61F8A0 ? 000000004 ?
                                                   000000080 ? 000000000 ?
opidrv()+603         call     opirip()             000000032 ? 000000004 ?
                                                   7FFDF8B69FE8 ? 000000004 ?
                                                   000000080 ? 000000000 ?
sou2o()+103          call     opidrv()             000000032 ? 000000004 ?
                                                   7FFDF8B69FE8 ? 000000004 ?
                                                   000000080 ? 000000000 ?
opimai_real()+250    call     sou2o()              7FFDF8B69FC0 ? 000000032 ?
                                                   000000004 ? 7FFDF8B69FE8 ?
                                                   000000080 ? 000000000 ?
ssthrdmain()+265     call     opimai_real()        000000000 ? 7FFDF8B6A1B0 ?
                                                   000000004 ? 7FFDF8B69FE8 ?
                                                   000000080 ? 000000000 ?
main()+201           call     ssthrdmain()         000000003 ? 7FFDF8B6A1B0 ?
                                                   000000001 ? 000000000 ?
                                                   000000080 ? 000000000 ?
__libc_start_main()  call     main()               000000003 ? 7FFDF8B6A350 ?
+253                                               000000001 ? 000000000 ?
                                                   000000080 ? 000000000 ?
_start()+41          call     __libc_start_main()  000A2A344 ? 000000001 ?
                                                   7FFDF8B6A348 ? 000000000 ?
                                                   000000080 ? 000000000 ?

根据trace文件,似乎与RAC间CR Block传输覆盖版本冲突有关:Potentially stale force-CR-override buffer found before OBJD MISMATCH check.

带着疑惑搜寻Mos(Oracle Support)后,确认此问题为Bug导致(ORA-600:[kcbzwfcro_2] Reported in Alert.log (Doc ID 2085507.1)

故障现象,Call Stack信息基本一致,并且Oracle给予了Workround

_gc_override_force_cr参数简单解释:if TRUE, try to override force-cr requests,在Undo header中会标识这个CR块是否为force_cr_override的flag,但是这个代表何意没有找到具体的官方解释,猜测是在节点间传输时本地CR与远程CR校验时出现错误,本地与远程均认为自己是最新版本CR。本次故障找到原因,但是此参数的含义还有继续找寻资料确认。

DataGuard 切换流程梳理(一)-Broker switchover

近期手头有 一项重要工作便是重新梳理Dataguard切换流程与注意事项,以便团队在数据库应急时更加熟练从容的进行容灾切换,主要分为四部分:

1.Broker Switcher

2.Broker Failover

3.命令行 switcher over

4.命令行Failover

5.命令行Failover怎样保证数据不丢失

一.操作前准备

操作前首先要确认主备库的ip、tnsnames、主备库的DB_UNIQUE_NAME、主备库目前DG配置状态,日志同步状态,主备库连接Process数,应用SERVICE处理、SERVICE在监听上注册情况等。检查点比较多,极有可能存在应急操作前没有足够的时间去确保各项检查点的正确性,所以这里罗列出来是为了提醒用途,DG切换成功与否需要以上检查内容的保证。

1.1 确认主备库的tnsnames与ip

主库
IP:181.208
机器名:liangce-ora
DB_UNQIUE_NAME:ORALIANGCE

备库
181.209
机器名:liangcestd-ora
DB_UNIQUE_NAME:ORALIANGCEstd

登录主备库确认$ORACLE_HOME/network/admin目录确认tnsnames.ora文件中的内容,确认ip对应的连接串是否如主库、备库配置如预期,并且tns别名中的配置信息进行DB_UNIQUE_NAME确认

1.2 确认主备库的process配置

为什么要确认此项配置?因为会存在主库多节点RAC+DG单机的情况,也就是说是否存在备库的单机process连接数足够承载RAC多节点的应用连接?

1.3 应用SERVICE处理

我们目前应用SERVICE只注册在主库上,并且应用目前使用的连接串是有DG配置的,如果在切换后主库不处理应用SERVICE,会导致ORA-O1033错误,详情请见早起文章:《中秋节快乐–中秋节前DataGuard切换演练产生问题》

备库:
SQL> show parameter service;

NAME                      TYPE        VALUE
---------------------- ----------- ------------------------------
service_names             string      ORALIANGCE,dbaas

SQL> alter system set service_names='ORALIANGCE,dbaas,Albert';

System altered.

SQL> show parameter service;

NAME                       TYPE        VALUE
----------------------- ----------- ------------------------------
service_names               string      ORALIANGCE,dbaas,Albert



主库:
SQL> show parameter service; 

NAME                        TYPE        VALUE
------------------------- ----------- ------------------------------
service_names               string      ORALIANGCE,dbaas,Albert

SQL> alter system set service_names='ORALIANGCE,dbaas';

System altered.

SQL> show parameter service;

NAME                          TYPE        VALUE
--------------------------- ----------- ------------------------------
service_names                  string      ORALIANGCE,dbaas

1.4 主备库配置,同步传输确认

主库上执行:

select thread#,
       DB_UNIQUE_NAME,
       DEST_NAME,
       TARGET,
       DATABASE_MODE,
       error,
       RECOVERY_MODE,
       GAP_STATUS,
       APPLIED_SEQ#,
       current_scn,
       applied_scn,
       to_char((scn_to_timestamp(current_scn)),'yyyymmdd hh24:mi:ss') current_scn_time,
       to_char((scn_to_timestamp(applied_scn)),'yyyymmdd hh24:mi:ss') applied_scn_time,
       scn_to_timestamp(current_scn)- scn_to_timestamp(current_scn) as delay_time
  from (SELECT al.thread#,
               ads.dest_id,
               ads.DEST_NAME,
               ads.TYPE || ' ' || ad.TARGET TARGET,
               ADS.DATABASE_MODE,
               ads.STATUS,
               ads.error,
               ads.RECOVERY_MODE,
               ads.DB_UNIQUE_NAME,
               ads.DESTINATION,
               ads.GAP_STATUS,
               (SELECT sequence#
                  FROM v$log na
                 WHERE na.thread# = al.thread#
                   and status = 'CURRENT') Current_Seq#,
               Last_Archived,
               al.APPLIED_SEQ#,
               ad.applied_scn,
               current_scn
          FROM (SELECT dest_id,
                       thread#,
                       d.current_scn,
                       MAX(sequence#) Last_Archived,
                       max(CASE
                             WHEN APPLIED = 'YES' THEN
                              sequence#
                           end) APPLIED_SEQ#
                  FROM v$archived_log V, v$database d
                 WHERE V.resetlogs_change# = d.RESETLOGS_CHANGE#
                 group by dest_id, thread#, d.current_scn) al,
               v$archive_dest_status ads,
               v$archive_dest AD
         WHERE al.dest_id(+) = ads.dest_id 
           AND ads.STATUS != 'INACTIVE'
           AND AD.DEST_ID = ADS.DEST_ID and ad.APPLIED_SCN <> 0
         ORDER BY al.thread#, ads.dest_id)
在主库上执行即可,可以确认主备库的一些基本配置与Current SCN,Applied SCN,Delay_time等

主库上执行:

[ora11g@~]$ dgmgrl
DGMGRL for Linux: Version 11.2.0.4.0 - 64bit Production

Copyright (c) 2000, 2009, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys
Password:
Connected.
DGMGRL> show configuration 

Configuration - DG_Config

  Protection Mode: MaxPerformance
  Databases:
    ORALIANGCE    - Primary database
    ORALIANGCEstd - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS


2、查看主备库数据传输是否异常
DGMGRL> 
DGMGRL> show database ORALIANGCEstd  

Database - ORALIANGCEstd

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   0 seconds (computed 1 second ago)
  Apply Lag:       0 seconds (computed 1 second ago)
  Apply Rate:      13.00 KByte/s
  Real Time Query: ON
  Instance(s):
    ORALIANGCE

Database Status:
SUCCESS

备库中的Transport Lag,Apply Lag是否正常

1.5 broker switcher切换

[ora11g@]$ dgmgrl
DGMGRL for Linux: Version 11.2.0.4.0 - 64bit Production

Copyright (c) 2000, 2009, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys
Password:
Connected.
DGMGRL> switchover to ORALIANGCEstd
Performing switchover NOW, please wait...
Operation requires a connection to instance "ORALIANGCE" on database "ORALIANGCEstd"
Connecting to instance "ORALIANGCE"...
Connected.
New primary database "ORALIANGCEstd" is opening...
Operation requires startup of instance "ORALIANGCE" on database "ORALIANGCE"
Starting instance "ORALIANGCE"...
ORACLE instance started.
Database mounted.
Database opened.
Switchover succeeded, new primary is "ORALIANGCEstd"
DGMGRL>

Broker会帮助你封装主备库实例启停,DG角色切换等命令

1.6 切换后检查

1.如RAC-RAC间切换,需要检查RAC各个节点是否启动

2.主备库数据字典检查
新主库:
SQL> select name,db_unique_name,database_role,open_mode from v$database;

NAME      DB_UNIQUE_NAME                 DATABASE_ROLE    OPEN_MODE
--------- ------------------------------ ---------------- --------------------
ORALIANGCE  ORALIANGCE                       PHYSICAL STANDBY READ ONLY WITH APPLY 

新备库:
SQL> select name,db_unique_name,database_role,open_mode from v$database;

SQL> set linesize 200 
SQL> select process,status,client_process,thread#,sequence#,block#,active_agents,known_agents from gv$managed_standby where process in ('LNS','RFS','MRP0') and  THREAD# <> 0;

PROCESS   STATUS       CLIENT_P    THREAD#  SEQUENCE#     BLOCK# ACTIVE_AGENTS KNOWN_AGENTS
--------- ------------ -------- ---------- ---------- ---------- ------------- ------------
RFS       IDLE         LGWR              1       6864        469             0            0
MRP0      APPLYING_LOG N/A               1       6864        469             5            5

3.alert日志检查:需要仔细检查主备库的alert日志中有无其他ORA报错,如使用HugePage,需要确认HugePage是否正确。

4.应用SERVICE监听注册情况

一则Bug导致数据库All Database Sessions Hang

春节前夕,一套核心交易系统出现短时间夯死的状况出现(后台数据库版本信息:Oracle 11.2.0.4 RAC 2节点,Linux X86服务器),活动会话数又飙升至近300+

20200127-1如上图所示,数据库从1月19日 09:34分左右开始,活动会话逐渐攀升,数据库在当时存在大量等待事件log file switch (checkpoint not complete),排查了硬件IO链路、应用事务量均正常,还是要在数据库软件层面进行分析。

    可以看到当时数据库活动会话概览,这些在等待log file switch (checkpoint incomplete)的会话均被节点2的1369会话所阻塞。

20200127-2

那么2节点的1369会话这时在干些什么?从ASH中看到,1369会话(LGWR)正在读取控制文件,等待事件为control file sequential read,但是这个动作是很快的动作,理论上不应该造成长时间的阻塞。

20200127-3

当时的DIAG进程做的HANGANALYZE中我们可以看到如下的Chain关系,开始试图找寻是否是Bug引起,因为种种客观现象实在是有悖常理,无法从原理上解释

20200127-4

从Oracle Mos上找寻相关Bug描述,果然有一个刚刚发布不久的Bug:

Bug 29635990 – All Database Sessions Hang Due To Checkpoint Incomplete (Doc ID
29635990.8)
20200127-5

但是这个Bug并未标注影响版本范围,但是和本次故障现象基本一致,遗憾的是并没有相关的Workrond并且此Bug只能在12.2版本中有相关Patch进行修复。

为了规避此问题的再次爆发,我们在数据库中设置了一个Job进行规避,调度时间按照redo 所有group#均变为ACTIVE且准备进行覆盖的时间周期,本系统为5分钟。

declare
jobno number;
begin dbms_job.submit(
jobno,
'begin
 execute immediate ''alter system checkpoint'';
end;',
sysdate,
'sysdate + 5/1440');
commit;
end;
/

每5分钟进行强制全量检查点,避免数据库再次产生log file switch (checkpoint not complete)的现象,从而导致产生更大数据库夯死风险。

删除空子表约束导致主表事务row cache lock/Library cache lock/Cursor pin s wait on x

春节前夕,公司一套核心核心再核心的数据库出现严重性能问题(版本信息:11.2.0.4),数据库活动会话飙升到300+,幸好当时处置得当,人员判断与应急迅速,及时kill了相关blocker session从而使得影响没有扩大。20200114-1从应用负责人员获悉,此时间段应用人员正在执行上线操作,具体操作为删除一张num rows=0的空表的外键约束。

alter table TRAVELSKYDBA drop constraint FK_TRAVELSKYDBA;

ASH中活动会话基本都在等待Library Cache Lock,Cursor Pin s Wait On X,Row Cache Lock上,阻塞情况如下:

20200123-11

继续分析,发现了一个更加奇怪的现象:

20200123-13

SESSION_ID=1827的会话为应用上线alter table删除外键约束的语句会话,SESSION_ID=73,2119,4000均为对删除外键的主表的正常业务DML语句(INSERT,DELETE,UPDATE),从ASH中看到1827会话正在等待正常业务事务(SESSION_ID=73)的LIBRARY CACHE LOCK,但是SESSION_ID=73在等 ROW CACHE LOCK,被ALTER TABLE (SESSION_ID=1827)阻塞,行成了死锁,并且ROW CACHE LOCK 的P1=8(dc_object与grants,dc_object),随后的正常事务的DML SQL想去软解析时遭遇Cursor Pin S wait on X。

从数据库DIAG进程自动dump的system state dump与Hanganalyze trace中确认,这些应用的SQL因为Rolling Invalidate Window Exceeded触发了硬解析,alter table 子表 drop constraints时,这项操作是需要申请主表DML游标Object Handle的Library Cache Lock的,主表涉及SQL在硬解析,ALTER TABLE求LIBRARY CACHE LOCK而不得被阻塞,但是主表DML硬解析申请ROW CACHE LOCK,理论上非常快速内存操作为何长时间不释放?产生死锁?并且Oracle并没有处理这个死锁

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
 Oracle session identified by:
 {
 instance: 1 (#######)
 os id: 111552
 process id: 161, oracle@###-rac
 session id: 73
 session serial #: 181
 }
 is waiting for 'row cache lock' with wait info:
 {
 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 time in wait: 0.612302 sec
 heur. time in wait: 2 min 23 sec
 timeout after: 2.387698 sec
 wait id: 24235
 blocking: 286 sessions
 wait history:
 * time between current wait and wait #1: 0.000034 sec
 1. event: 'row cache lock'
 time waited: 3.000963 sec
 wait id: 24234 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 * time between wait #1 and #2: 0.000039 sec
 2. event: 'row cache lock'
 time waited: 3.000978 sec
 wait id: 24233 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 * time between wait #2 and #3: 0.000064 sec
 3. event: 'row cache lock'
 time waited: 3.000913 sec
 wait id: 24232 p1: 'cache id'=0x8
 p2: 'mode'=0x0
 p3: 'request'=0x3
 }
 and is blocked by
 => Oracle session identified by:
 {
 instance: 1 (*******)
 os id: 92902
 process id: 112, oracle@*****-rac
 session id: 1827
 session serial #: 19367
 }
 which is waiting for 'library cache lock' with wait info:
 {
 p1: 'handle address'=0xfeec0a948
 p2: 'lock address'=0xfe24600b8
 p3: '100*mode+namespace'=0x15dba00010003
 time in wait: 2 min 23 sec
 timeout after: 12 min 36 sec
 wait id: 1540
 blocking: 286 sessions
 wait history:
 * time between current wait and wait #1: 0.000150 sec
 1. event: 'gc current block 2-way'
 time waited: 0.000285 sec
 wait id: 1539 p1: ''=0x1
 p2: ''=0x131f3
 p3: ''=0x1
 * time between wait #1 and #2: 0.000131 sec
 2. event: 'gc current block 2-way'
 time waited: 0.000374 sec
 wait id: 1538 p1: ''=0x1
 p2: ''=0x131de
 p3: ''=0x1
 * time between wait #2 and #3: 0.002456 sec
 3. event: 'gc cr block 2-way'
 time waited: 0.000405 sec
 wait id: 1537 p1: ''=0x1
 p2: ''=0x131f3
 p3: ''=0x1
 }
 and is blocked by the session at the start of the chain.

Chain 1 Signature: 'library cache lock'<='row cache lock' (cycle)
Chain 1 Signature Hash: 0x52a8007d
----------------------------------------------------------------
并且 p3: '100*mode+namespace'=0x15dba00010003 --003 LOCK MODE=X

SQL> select to_number('15dba','xxxxxxx') from dual;
TO_NUMBER('15DBA','XXXXXXX')
---------------------------
 89530  --主表对象OBJECT_ID

20200123-16

SESSION_ID=73在sql net message from client中居然阻塞了1827的 alter table?

分析到这里,越来越感觉是Bug导致,从Mos上寻找相关案例发现一篇“现象相似”的文章,可惜版本无法对应上,不能确定是因为此Bug导致,虽然最后没有定位具体的Bug Number,但是此问题要告诫警示大家,不要再事务繁忙期间进行alter table DDL操作,虽然是一张空表,但是和其对象的相关OBJECT也存在受到牵连的可能性,从而产生跟大的影响。

20200123-17

参数一致却导致ORA-01105: mount is incompatible with mounts by other instances

昨日,RAC数据库2节点因私网心跳超时导致CRS重启,异常的是,数据库实例并没有如愿启动,而是在CRS自动带起数据库实例时,NOMOUNT到MOUNT过程中抛出了如下报错:

ORA-01105: mount is incompatible with mounts by other instances 
ORA-19808: recovery destination parameter mismatch

此错误为RAC数据库各实例同一参数生效值存在differ的情况,但是针对本次案例,经过分析与测试,此问题定性为软件Bug,故障现象还原模拟如下:

20200121-1

与生产系统现象一致,3节点RAC中,db_recovery_file_dest 参数为空,db_recovery_file_dest_size为300gb,

20200123-1

我们在节点2上设置db_recovery_file_dest参数,设置为+DG_DATA

20200123-2

我完全还原生产环境中故障的现象,生产环境中,在时间节点1时,集群2节点设置了db_recovery_file_dest参数,并且该参数在1个月后左右,人为将节点1上这个参数设置成了空,并且两次参数范围生效均为scope=both sid=’*’

故障现象出现,节点2宕机,并且无法启动抛出了预期的错误,生成create pfile from memory与create pfile from spfile后进行对比,但是这个参数各个节点均为格式一致的空,理论上不存在参数不一致且冲突的设置。

20200123-3

是否这个参数的设置方式不规范导致了此问题的发生,询问了Oracle业界大神ACED杨廷琨老师,杨老师建议用reset方式再次进行测试,但是测试结果一致,只用alter system reset方式进行参数的重置,仍然会导致此问题的发生。

严谨的杨老师继续给予我第二个测试内容,

(1)是否只是设置为空会引发这个问题,如果修改为其他目录是不是在任意一个节点都可以?

(2)另外,如果置空需要在当时set的节点运行,那么如果当前崩溃尝试启动的恰好是这个节点呢?其他节点的reset都无法使当前节点的实例启动吗?

测试点(1)是否只是设置为空会引发这个问题,如果修改为其他目录是不是在任意一个节点都可以?

20200123-4

这时,模拟2节点宕机并重启后,没有发生ORA-01105与ORA-19808的错误,说明本节点设置值A,另外节点设置非空值B是OK的

测试点(2)如果置空需要在当时set的节点运行,那么如果当前崩溃尝试启动的恰好是这个节点呢?其他节点的reset都无法使当前节点的实例启动吗?

2020012305

20200123-6测试结论:因为这个参数设置导致的“冲突”,从而引发了实例重启的故障,但是,通过上述对生产环境的模拟发现,这个参数各个节点实际上内存生效值与实例SPFILE中的参数数值并不存在差异(包括格式),显然,这是Oracle软件的Bug行为,

1.节点set value,节点set null必须在同一节点上运行,如果不同,则会影响set value 节点重启,解决方案为在set null节点再次执行set null命令,其他节点设置其他VALUE是无影响的(只要别设置为空)

2.如果set null节点已宕机的情况下,之后set value节点也无法启动了,需要其他存活节点上再次执行set null命令后,set value,set null的异常节点均可成功启动

3.如果set null节点存活,则必须在set null节点 再次执行set null命令,若在其他存活节点上执行set null命令,set value节点同样无法启动。

 

RAC节点2遭遇连续驱逐

公司一套核心业务数据库RAC集群数据库2节点实例分别在2019年 11月13日,2020年1月2日,2020年1月11日发生3次被集群驱逐的情况发生,并且故障时,三次重启CRS均夯住,只能通过重启服务器的方式应急解决。

1.集群日志(2019.11.13)

20190112-1112.集群日志(2020.01.02)

20200112-2

3.集群日志(2020.01.11)

20200112-3

通过以上日志得知,三次实例驱逐均是因为集群间私网心跳超时导致,但是依照RAC网络心跳集群的驱逐的原则,2节点被驱逐不见得根因就是出在2节点上,该环境为2节点RAC,因心跳超时导致集群分为2个子集群,保留最小节点号的子集群会留下,所以1节点会有很大几率被保留,2节点会被驱逐。

分析Oswatcher中netstat采集的网络信息:

节点1,故障期间包重组失败统计:

20200112-04

节点1,物理丢包统计:

20200112-07

节点2,故障期间包重组失败统计:

20200112-05

节点2,物理丢包统计:

20200112-06

可以看到,节点1故障期间没有发生物理丢包与包重组失败,然而节点2则物理丢包与包重组失败都存在(数值一值在累计增加),所以基本推定节点2数据库服务器私网网卡存在异常现象。当讨论这个故障期间,有同事提出可以修改linux操作系统内核参数,避免大量包重组失败的问题:

net.ipv4.ipfrag_high_thresh = 16M     --default 4M
net.ipv4.ipfrag_low_thresh = 15M      --default 3M

这种方式是可行的,当无法排查定位出硬件设备问题时,可以尝试利用此方式来继续观察,但是上述现象除了包重组失败还存在物理丢包现象,也就是说很多Packet没有到包重组那步就已经出现ERROR于DROP的现象,物理丢包仍然会导致RAC集群心跳超时从而继续驱逐节点。

通过上述分析后,我们决定不修改内核参数,决定进行本日第二次的私网网卡切换操作,继续观察,硬件同事将eth14切换到eth08后,丢包现象消失,问题暂时解决,继续观察。

Cursor 触发了OPTIMIZER_MISMATCH=Y的Bug?

近日,一个数据服务器CPU基本每日触发报警,Idle已经接近于0,并且存在一定的周期性规律,CPU Idle如下图所示:

20200105-11

定位到具体进程后,发现CPU TIME主要被两个SQL因为创建合适的索引导致大量全表扫描耗尽。经过简单优化后,SQL效率提升,但却发现了一个奇怪的问题:

优化前:

20200105-12

优化后:

20200105-13

使得游标立即失效后,此SQL会立即进行硬解析,但是却硬解析了2个Child Cursor:

20200105-14

20200105-15可以看到,两个Cursor确认是因为客户端NLS_SORT与NLS_COMP参数不一致,导致了Cursor 2无法利用上索引,从而选择了全表扫描的执行计划。但是从V$SQL_SHARED_CURSOR中发现了异常现象,LANGUAGE_MISMATCH很好理解(因为NLS_SORT与NLS_COMP参数不一),但是OPTIMIZER_MISMATCH为何也是’Y’?

20200105-16

利用alter system set events ‘trace[rdbms.SQL_Optimizer.*][sql_id]’ ;查看Trace文件与查询v$sql_optimizer_env确认游标优化器参数时,并没有发现异常现象:

20200105-17

20200105-18如果优化器参数不一,那么cursor执行计划中的Outline Data CBO参数部分会有所区别,但是上图中两个Cursor 执行计划中显示的CBO参数是一致的。虽然Oracle Support中并未发现相关Bug记录,但是猜测很有可能是触发了Bug导致。