ORA-600/0RA-7445 导致Instance Crash无法启动恢复案例

前几日,一套开发人员邮件求助,数据库在执行批量删除delete时数据库异常崩溃,无法启动,启动后PMON进程会自动停止实例,数据库后台日志如下,数据库异常时出现ORA-00600与ORA-7445报错:

Wed Aug 21 15:53:29 2019
 Dumping diagnostic data in directory=[cdmp_20190821155329], requested by (instance=1, osid=29995), summary=[incident=437112].
 Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97ECFFC, kghalo()+570] [flags: 0x0, count: 1]
 Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_ora_29995.trc (incident=437114):
 ORA-07445: 出现异常错误: 核心转储 [kghalo()+570] [SIGSEGV] [ADDR:0x0] [PC:0x97ECFFC] [SI_KERNEL(general_protection)] []
 ORA-00600: 内部错误代码, 参数: [17182], [0x7F617B5DAE58], [], [], [], [], [], [], [], [], [], []
 ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_437114/Travelskydba_ora_29995_i437114.trc
 1562074,1 99%
 ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_437114/Travelskydba_ora_29995_i437114.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Wed Aug 21 15:53:31 2019
 Sweep [inc][437114]: completed
 Sweep [inc][437113]: completed
 Sweep [inc][437112]: completed
 Sweep [inc2][437113]: completed
 Sweep [inc2][437112]: completed
 Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97ECFFC, kghalo()+570] [flags: 0x0, count: 2]
 Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_ora_29995.trc (incident=437115):
 ORA-07445: 出现异常错误: 核心转储 [kghalo()+570] [SIGSEGV] [ADDR:0x0] [PC:0x97ECFFC] [SI_KERNEL(general_protection)] []
 ORA-07445: 出现异常错误: 核心转储 [kghalo()+570] [SIGSEGV] [ADDR:0x0] [PC:0x97ECFFC] [SI_KERNEL(general_protection)] []
 ORA-00600: 内部错误代码, 参数: [17182], [0x7F617B5DAE58], [], [], [], [], [], [], [], [], [], []
 ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_437115/Travelskydba_ora_29995_i437115.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x87F3BD8, dbgexDumpErrDesc()+82] [flags: 0x0, count: 3]
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Wed Aug 21 15:53:57 2019
 Block recovery from logseq 8936, block 446894 to scn 843928251
 Recovery of Online Redo Log: Thread 1 Group 3 Seq 8936 Reading mem 0
 Mem# 0: /oracle/redo_Travelskydba/Travelskydba/redo31.log
 Mem# 1: /oracle/redo_Travelskydba/Travelskydba/redo32.log
 Block recovery completed at rba 8936.446897.16, scn 0.843928254
 Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_pmon_51643.trc (incident=432032):
 ORA-00600: internal error code, arguments: [17182], [0x7FDB167BDBF8], [], [], [], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_432032/Travelskydba_pmon_51643_i432032.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97E6BEC, kghpmfal()+216] [flags: 0x0, count: 1]
 Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_pmon_51643.trc (incident=432033):
 ORA-07445: exception encountered: core dump [kghpmfal()+216] [SIGSEGV] [ADDR:0x0] [PC:0x97E6BEC] [SI_KERNEL(general_protection)] []
 ORA-00600: internal error code, arguments: [17182], [0x7FDB167BDBF8], [], [], [], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_432033/Travelskydba_pmon_51643_i432033.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Wed Aug 21 15:53:59 2019
 Dumping diagnostic data in directory=[cdmp_20190821155359], requested by (instance=1, osid=51643 (PMON)), summary=[incident=432032].
 1562114,1 99%
 ORA-07445: 出现异常错误: 核心转储 [kghalo()+570] [SIGSEGV] [ADDR:0x0] [PC:0x97ECFFC] [SI_KERNEL(general_protection)] []
 ORA-00600: 内部错误代码, 参数: [17182], [0x7F617B5DAE58], [], [], [], [], [], [], [], [], [], []
 ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_437115/Travelskydba_ora_29995_i437115.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x87F3BD8, dbgexDumpErrDesc()+82] [flags: 0x0, count: 3]
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Wed Aug 21 15:53:57 2019
 Block recovery from logseq 8936, block 446894 to scn 843928251
 Recovery of Online Redo Log: Thread 1 Group 3 Seq 8936 Reading mem 0
 Mem# 0: /oracle/redo_Travelskydba/Travelskydba/redo31.log
 Mem# 1: /oracle/redo_Travelskydba/Travelskydba/redo32.log
 Block recovery completed at rba 8936.446897.16, scn 0.843928254
 Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_pmon_51643.trc (incident=432032):
 ORA-00600: internal error code, arguments: [17182], [0x7FDB167BDBF8], [], [], [], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_432032/Travelskydba_pmon_51643_i432032.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97E6BEC, kghpmfal()+216] [flags: 0x0, count: 1]
 Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_pmon_51643.trc (incident=432033):
 ORA-07445: exception encountered: core dump [kghpmfal()+216] [SIGSEGV] [ADDR:0x0] [PC:0x97E6BEC] [SI_KERNEL(general_protection)] []
 ORA-00600: internal error code, arguments: [17182], [0x7FDB167BDBF8], [], [], [], [], [], [], [], [], [], []
 Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_432033/Travelskydba_pmon_51643_i432033.trc
 Use ADRCI or Support Workbench to package the incident.
 See Note 411.1 at My Oracle Support for error and packaging details.
 Wed Aug 21 15:53:59 2019
 Dumping diagnostic data in directory=[cdmp_20190821155359], requested by (instance=1, osid=51643 (PMON)), summary=[incident=432032].
 System state dump requested by (instance=1, osid=51706 (CJQ0)), summary=[abnormal instance termination].
 Wed Aug 21 15:54:03 2019
 CJQ0 (ospid: 51706): terminating the instance due to error 472

从日志中看到,数据库从21日15:53开始报错:ORA-00600与ORA-07445,54分时,数据库实例被CJQ0进程所终止。

数据库无法启动:

Wed Aug 21 17:03:19 2019
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = 64 KB
Total Shared Global Region in Large Pages = 0 KB (0%)
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
RECOMMENDATION:
 Total System Global Area size is 6386 MB. For optimal performance,
 prior to the next instance restart:
 1. Increase the number of unused large pages by
 at least 3193 (page size 2048 KB, total size 6386 MB) system wide to
 get 100% of the System Global Area allocated with large pages
 1562696,3 99%
 at least 3193 (page size 2048 KB, total size 6386 MB) system wide to
 get 100% of the System Global Area allocated with large pages
 2. Large pages are automatically locked into physical memory.
 Increase the per process memlock (soft) limit to at least 6394 MB to lock
 100% System Global Area s large pages into physical memory
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 8
Number of processor cores in the system is 8
Number of processor sockets in the system is 2
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
 NUMA status: non-NUMA system
 cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
 Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/app/ora11g/orabase/product/11.2.0/db_1/dbs/arch
Autotune of undo retention is turned on.
IMODE=BR
ILAT =249
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
ORACLE_HOME = /opt/app/ora11g/orabase/product/11.2.0/db_1
System name: Linux
Node name: vm-kvm6194-ora
Release: 2.6.32-431.75.1.el6.x86_64
Version: #1 SMP Mon Oct 24 16:39:06 EDT 2016
Machine: x86_64
Using parameter settings in server-side spfile /opt/app/ora11g/orabase/product/11.2.0/db_1/dbs/spfileTravelskydba.ora
System parameters with non-default values:
 processes = 1500
 sessions = 2272
 resource_limit = TRUE
 sga_max_size = 6384M
 shared_pool_size = 1600M
 large_pool_size = 640M
 filesystemio_options = "NONE"
 1562736,2 99%
 large_pool_size = 640M
 filesystemio_options = "NONE"
 sga_target = 6384M
 control_files = "/oracle/oradata_Travelskydba/Travelskydba/control01.ctl"
 control_files = "/oracle/oradata_Travelskydba/Travelskydba/control02.ctl"
 db_block_size = 8192
 db_cache_size = 3200M
 compatible = "11.2.0.4.0"
 db_files = 500
 undo_tablespace = "UNDOTBS1"
 remote_login_passwordfile= "EXCLUSIVE"
 db_domain = ""
 service_names = "Travelskydba,agr,imc,jobmanager"
 service_names = "invavs,utils,shared,cfd"
 service_names = "invjmg,readonly,dbaas,pssiclcxt"
 dispatchers = "(PROTOCOL=TCP) (SERVICE=TravelskydbaXDB)"
 local_listener = "Travelskydba"
 remote_listener = ""
 audit_file_dest = "/opt/app/ora11g/orabase/admin/Travelskydba/adump"
 audit_trail = "NONE"
 db_name = "Travelskydba"
 open_cursors = 10000
 pga_aggregate_target = 1632504K
 diagnostic_dest = "/opt/app/ora11g/orabase"
Wed Aug 21 17:03:19 2019
PMON started with pid=2, OS id=4602
Wed Aug 21 17:03:19 2019
PSP0 started with pid=3, OS id=4604
Wed Aug 21 17:03:20 2019
VKTM started with pid=4, OS id=4606 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Wed Aug 21 17:03:20 2019
GEN0 started with pid=5, OS id=4610
Wed Aug 21 17:03:20 2019
DIAG started with pid=6, OS id=4612
Wed Aug 21 17:03:20 2019
DBRM started with pid=7, OS id=4614
Wed Aug 21 17:03:20 2019
DIA0 started with pid=8, OS id=4616
Wed Aug 21 17:03:20 2019
MMAN started with pid=9, OS id=4618
Wed Aug 21 17:03:20 2019
 1562776,3 99%
MMAN started with pid=9, OS id=4618
Wed Aug 21 17:03:20 2019
DBW0 started with pid=10, OS id=4620
Wed Aug 21 17:03:20 2019
LGWR started with pid=11, OS id=4622
Wed Aug 21 17:03:20 2019
CKPT started with pid=12, OS id=4624
Wed Aug 21 17:03:20 2019
SMON started with pid=13, OS id=4626
Wed Aug 21 17:03:20 2019
RECO started with pid=14, OS id=4628
Wed Aug 21 17:03:20 2019
MMON started with pid=15, OS id=4630
Wed Aug 21 17:03:20 2019
MMNL started with pid=16, OS id=4632
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
starting up 1 shared server(s) ...
ORACLE_BASE from environment = /opt/app/ora11g/orabase
Wed Aug 21 17:03:21 2019
ALTER DATABASE MOUNT
Successful mount of redo thread 1, with mount id 3069252569
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE MOUNT
Wed Aug 21 17:03:25 2019
ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
 parallel recovery started with 7 processes
Started redo scan
Completed redo scan
 read 26 KB redo, 5 data blocks need recovery
Started redo application at
 Thread 1: logseq 8939, block 2, scn 843988459
Recovery of Online Redo Log: Thread 1 Group 3 Seq 8939 Reading mem 0
 Mem# 0: /oracle/redo_Travelskydba/Travelskydba/redo31.log
 Mem# 1: /oracle/redo_Travelskydba/Travelskydba/redo32.log
Completed redo application of 0.02MB
Completed crash recovery at
 Thread 1: logseq 8939, block 55, scn 844008501
 5 data blocks read, 5 data blocks written, 26 redo k-bytes read
Thread 1 advanced to log sequence 8940 (thread open)
Thread 1 opened at log sequence 8940
 1562816,1 99%
Thread 1 advanced to log sequence 8940 (thread open)
Thread 1 opened at log sequence 8940
 Current log# 1 seq# 8940 mem# 0: /oracle/redo_Travelskydba/Travelskydba/redo11.log
 Current log# 1 seq# 8940 mem# 1: /oracle/redo_Travelskydba/Travelskydba/redo12.log
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
[4806] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:876374 end:876434 diff:60 (0 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is AL32UTF8
No Resource Manager plan active
Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_smon_4626.trc (incident=480123):
ORA-00600: internal error code, arguments: [17182], [0x7F730578D7A0], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_480123/Travelskydba_smon_4626_i480123.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Wed Aug 21 17:03:28 2019
QMNC started with pid=28, OS id=4999
Dumping diagnostic data in directory=[cdmp_20190821170328], requested by (instance=1, osid=4626 (SMON)), summary=[incident=480123].
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97E6567, kghalf()+537] [flags: 0x0, count: 1]
Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_smon_4626.trc (incident=480124):
ORA-07445: exception encountered: core dump [kghalf()+537] [SIGSEGV] [ADDR:0x0] [PC:0x97E6567] [SI_KERNEL(general_protection)] []
ORA-00600: internal error code, arguments: [17182], [0x7F730578D7A0], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_480124/Travelskydba_smon_4626_i480124.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Aug 21 17:03:30 2019
Dumping diagnostic data in directory=[cdmp_20190821170330], requested by (instance=1, osid=4626 (SMON)), summary=[incident=480124].
Completed: ALTER DATABASE OPEN
Wed Aug 21 17:03:31 2019
PMON (ospid: 4602): terminating the instance due to error 474
System state dump requested by (instance=1, osid=4602 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_diag_4612_20190821170331.trc
Dumping diagnostic data in directory=[cdmp_20190821170331], requested by (instance=1, osid=4602 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 4602

数据库启动后,随之又被PMON进程terminated,经过分析PMON与SMON日志后,初步怀疑是数据库事务回滚时出现异常,导致无法将数据回滚到一致的状态,从而引起实例终止。

为了验证这个猜测,将数据库启动至MOUNT状态,设置10513 EVENT,阻止数据库SMON进行恢复:

10513

10513-2

设置10513 EVENT后,数据库成功OPEN,数据库启动后查询X$KTUXE发现了DEAD事务

数据库能够OPEN是因为10513事件,因影响了数据库SMON的事务恢复,但是数据库出现出现僵死事务,这个问题并没有彻底解决,所以需要将僵死事务清除。重新建立UNDOTBS2,并进行切换。

10513-4

10513-6 创建pfile,添加*._corrupted_rollback_segments参数指定出现DEAD事务的_SYSSMU8_517538920$回滚段,最后启动数据库。

10513-7

指定 *._corrupted_rollback_segments与回滚段后,DEAD事务已经消失,将EVENT Reset后重启数据库,数据库恢复正常。

出现此问题极有可能数据库中出现Corrupt block导致,将数据库文件进行DBV确认后,发现出现了5个坏块且正是故障前开发人员进行DELETE的表,告知应用尽快备份数据,避免该问题的重复发生。