ASM-VOTE磁盘损坏恢复实验: (1) 环境准备,实验环境中存在3块VOTE磁盘,分别进行模拟VOTE数据丢失状态: (2)确认OCR状态与OCR备份是否正常:
(3)dd vote磁盘,停止数据库集群
继续阅读
月度归档:2019年08月
同一时间不同Snapshot sql信息对比
继上一个新上线SQL引起数据库CPU飙高的案例后,受此启发,尝试动手写了一个SQL脚本,意义在于,用最新快照与24小时前快照对比,哪那些SQL没有出现在24小时之前的快照内却出现在当前快照中,结合SQL的性能数据,尝试识别出新上线或异常的SQL:
select to_char(b.END_INTERVAL_TIME, 'yyyymmddhh24') sampletime, a.* from (select * from (select sq.snap_id, sq.sql_id, nvl((round(sq.elap / 1000000, 3)), to_number(null)) elap_time, nvl((round(sq.CPUT / 1000000, 3)), to_number(null)) cpu_time, nvl((round(sq.IO_TIME / 1000000, 3)), to_number(null)) io_time, sq.exec, sq.buffer_get, sq.disk_read, to_char(decode(sq.exec, 0, to_number(null), nvl((round(sq.elap / sq.exec / 1000000, 3)), to_number(null)))) avag_time from (select * from (select st.snap_id, st.sql_id, max(module) module, sum(elapsed_time_delta) elap, sum(cpu_time_delta) cput, sum(executions_delta) exec, sum(IOWAIT_DELTA) io_time, sum(BUFFER_GETS_DELTA) buffer_get, sum(DISK_READS_DELTA) disk_read from dba_hist_sqlstat st, (select snap_id, dbid, instance_number from (select snap_id, dbid, instance_number, row_number() over(order by snap_id desc) sp from (select snap_id, dbid, so.instance_number from dba_hist_snapshot so, v$instance it where so.instance_number = it.instance_number)) where sp = 1) endsnap_id, (select snap_id, dbid, instance_number from (select snap_id, dbid, instance_number, row_number() over(order by snap_id desc) sp from (select snap_id, dbid, so.instance_number from dba_hist_snapshot so, v$instance it where so.instance_number = it.instance_number)) where sp = 2) beginsnap_id where st.dbid = endsnap_id.dbid and st.dbid = beginsnap_id.dbid and st.instance_number = endsnap_id.instance_number and st.instance_number = beginsnap_id.instance_number and beginsnap_id.snap_id < st.snap_id and st.snap_id <= endsnap_id.snap_id group by st.sql_id, st.snap_id, st.sql_id)) sq order by elap_time desc) differ where differ.sql_id not in (select sql_id from dba_hist_sqlstat where snap_id = (to_number(differ.snap_id) - 24))) a, dba_hist_snapshot b where a.snap_id = b.snap_id
执行效果如下,可以截取TOP 10语句进行分析,定位SQL具体的问题。
一则表连接方式错误的SQL优化案例
Library Cache中Latch与Mutex Event(5)—Cursor Pin S Wait on X
Cursor Pin S Wait on X
一.什么Cursor Pin S Wait on X:
官方解释:
Mutexes were introduced in Oracle 10.2. A session waits for “cursor: pin S wait on X” when it wants a mutex in S (share) mode on a specific cursor but some other session holds that same mutex in X (exclusive) mode. Hence the current session has to wait for the X mode holder to release the mutex.Mutexes are local to the current instance in RAC environments.
Versions 10.2.0.1 - 11.2.0.2 Parameter Description P1 idn P2 value P3 where (where|sleeps in 10.2)
• A session waits for “cursor: pin S wait on X” when it wants a mutex in S (share) mode on a specific cursor but some other session holds that same mutex in X (exclusive) mode.
Hence the current session has to wait for the X mode holder to release the mutex.
• Occurs when hard parse time for a cursor is too long for the session holding the X mutex and at the same time another session is trying to execute the cursor.
• For SQL with long parse times.
• For SQL with high version counts.
• Frequent hard parses
一个process为一个与pin相关的共享操作(such as executing a cursor)请求一个mutex时,该mutex不能被立即获取,因为该mutex正在被其他session以排他模式持有。
例如,大量并发执行的SQL+SQL 硬解析:举一个简单的例子,一个会话(SESSION_A)正在解析(硬解析)某一个sql语句(SQL_A),当另一个会话(SESSION_B)同时执行这条sql语句(SQL_A)时(执行前需要对该语句进行软解析),SESSION_B就会等待cursor:pin S wait on X 事件。(硬解析+执行次数多) 该等待事件的P1参数idn,实际上就是sql语句的hash_value,也就是说通过该等待事件的P1参数即可定位等待的实际对象。
二.模拟场景:
一个自定义函数引发的SQL优化需求
近期某业务应用人员询问,一个SQL是否还有优化的空间,SQL文本如下:
SELECT SPNRID, QUEUEID, NAME FROM (SELECT QUEUEITEMS.SPNRID, QUEUES.QUEUEID, QUEUES.NAME FROM TRAVELSKYDBA.QUEUEITEMS INNER JOIN QUEUES ON QUEUES.QUEUEID = QUEUEITEMS.QUEUEID WHERE QUEUES.NAME = :B1 AND QUEUES.ACTIVE = 1 AND ACCESSED < (OJ_TIMESTAMP() - INTERVAL '10' MINUTE) AND CREATED <= OJ_TIMESTAMP() ORDER BY URGENT DESC, CREATED ASC) WHERE ROWNUM = 1
我们从AWR报告来看一下这个SQL的执行效率如何:
ORA-600/0RA-7445 导致Instance Crash无法启动恢复案例
前几日,一套开发人员邮件求助,数据库在执行批量删除delete时数据库异常崩溃,无法启动,启动后PMON进程会自动停止实例,数据库后台日志如下,数据库异常时出现ORA-00600与ORA-7445报错:
Wed Aug 21 15:53:29 2019 Dumping diagnostic data in directory=[cdmp_20190821155329], requested by (instance=1, osid=29995), summary=[incident=437112]. Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97ECFFC, kghalo()+570] [flags: 0x0, count: 1] Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_ora_29995.trc (incident=437114): ORA-07445: 出现异常错误: 核心转储 [kghalo()+570] [SIGSEGV] [ADDR:0x0] [PC:0x97ECFFC] [SI_KERNEL(general_protection)] [] ORA-00600: 内部错误代码, 参数: [17182], [0x7F617B5DAE58], [], [], [], [], [], [], [], [], [], [] ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_437114/Travelskydba_ora_29995_i437114.trc 1562074,1 99% ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_437114/Travelskydba_ora_29995_i437114.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Aug 21 15:53:31 2019 Sweep [inc][437114]: completed Sweep [inc][437113]: completed Sweep [inc][437112]: completed Sweep [inc2][437113]: completed Sweep [inc2][437112]: completed Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97ECFFC, kghalo()+570] [flags: 0x0, count: 2] Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_ora_29995.trc (incident=437115): ORA-07445: 出现异常错误: 核心转储 [kghalo()+570] [SIGSEGV] [ADDR:0x0] [PC:0x97ECFFC] [SI_KERNEL(general_protection)] [] ORA-07445: 出现异常错误: 核心转储 [kghalo()+570] [SIGSEGV] [ADDR:0x0] [PC:0x97ECFFC] [SI_KERNEL(general_protection)] [] ORA-00600: 内部错误代码, 参数: [17182], [0x7F617B5DAE58], [], [], [], [], [], [], [], [], [], [] ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_437115/Travelskydba_ora_29995_i437115.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x87F3BD8, dbgexDumpErrDesc()+82] [flags: 0x0, count: 3] Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Aug 21 15:53:57 2019 Block recovery from logseq 8936, block 446894 to scn 843928251 Recovery of Online Redo Log: Thread 1 Group 3 Seq 8936 Reading mem 0 Mem# 0: /oracle/redo_Travelskydba/Travelskydba/redo31.log Mem# 1: /oracle/redo_Travelskydba/Travelskydba/redo32.log Block recovery completed at rba 8936.446897.16, scn 0.843928254 Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_pmon_51643.trc (incident=432032): ORA-00600: internal error code, arguments: [17182], [0x7FDB167BDBF8], [], [], [], [], [], [], [], [], [], [] Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_432032/Travelskydba_pmon_51643_i432032.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97E6BEC, kghpmfal()+216] [flags: 0x0, count: 1] Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_pmon_51643.trc (incident=432033): ORA-07445: exception encountered: core dump [kghpmfal()+216] [SIGSEGV] [ADDR:0x0] [PC:0x97E6BEC] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x7FDB167BDBF8], [], [], [], [], [], [], [], [], [], [] Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_432033/Travelskydba_pmon_51643_i432033.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Aug 21 15:53:59 2019 Dumping diagnostic data in directory=[cdmp_20190821155359], requested by (instance=1, osid=51643 (PMON)), summary=[incident=432032]. 1562114,1 99% ORA-07445: 出现异常错误: 核心转储 [kghalo()+570] [SIGSEGV] [ADDR:0x0] [PC:0x97ECFFC] [SI_KERNEL(general_protection)] [] ORA-00600: 内部错误代码, 参数: [17182], [0x7F617B5DAE58], [], [], [], [], [], [], [], [], [], [] ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_437115/Travelskydba_ora_29995_i437115.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x87F3BD8, dbgexDumpErrDesc()+82] [flags: 0x0, count: 3] Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Aug 21 15:53:57 2019 Block recovery from logseq 8936, block 446894 to scn 843928251 Recovery of Online Redo Log: Thread 1 Group 3 Seq 8936 Reading mem 0 Mem# 0: /oracle/redo_Travelskydba/Travelskydba/redo31.log Mem# 1: /oracle/redo_Travelskydba/Travelskydba/redo32.log Block recovery completed at rba 8936.446897.16, scn 0.843928254 Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_pmon_51643.trc (incident=432032): ORA-00600: internal error code, arguments: [17182], [0x7FDB167BDBF8], [], [], [], [], [], [], [], [], [], [] Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_432032/Travelskydba_pmon_51643_i432032.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x97E6BEC, kghpmfal()+216] [flags: 0x0, count: 1] Errors in file /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/trace/Travelskydba_pmon_51643.trc (incident=432033): ORA-07445: exception encountered: core dump [kghpmfal()+216] [SIGSEGV] [ADDR:0x0] [PC:0x97E6BEC] [SI_KERNEL(general_protection)] [] ORA-00600: internal error code, arguments: [17182], [0x7FDB167BDBF8], [], [], [], [], [], [], [], [], [], [] Incident details in: /opt/app/ora11g/orabase/diag/rdbms/Travelskydba/Travelskydba/incident/incdir_432033/Travelskydba_pmon_51643_i432033.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Aug 21 15:53:59 2019 Dumping diagnostic data in directory=[cdmp_20190821155359], requested by (instance=1, osid=51643 (PMON)), summary=[incident=432032]. System state dump requested by (instance=1, osid=51706 (CJQ0)), summary=[abnormal instance termination]. Wed Aug 21 15:54:03 2019 CJQ0 (ospid: 51706): terminating the instance due to error 472
从日志中看到,数据库从21日15:53开始报错:ORA-00600与ORA-07445,54分时,数据库实例被CJQ0进程所终止。
一个生产数据库HANG案例
之前公司一个运维的数据库每日凌晨3点左右均会HUNG住,使得应用无法正常处理业务作业,只能通过 kill session 进行释放。数据库的版本为 11.2.0.3,平台为 Linux x86-64 平台,数据量为2T 左右。 下面我们将对这一现象进行详细分析。
如上所示,故障期间数据库的等待事件主要为 cursor: mutex X。平均等待事件为 719s,
占了总 DB time 的 34%
什么是cursor: mutex X 等待事件: 等待的会话尝试独占( EXCL)方式持有 cursor 的 mutex,而当前 cursor 已经被其他进程以共享( SHRD)或独占方式持有,因此当前会话必须等待持有的 mutex 的进程释放,直到尝试持有 CURSOR 的 Ref count 等于 0,才能获取到游标。
核心系统业务交易超时,enq:SQ contention: 这个锅我不背。
本周三凌晨1点左右,公司某核心业务交易严重超时,触发了中间件层面的应急操作,因该业务系统过于重要,所以业务的IT全链路部门均开始隐患的排查工作,消除隐患,避免故障的二次发生。
该业务后台数据库 Linux X-86 Oracle 11.2.0.4 RAC 2节点架构
根据Cloud Control显示及历史信息查看,该数据库每日凌晨1点左右,均会产生一个激增的尖峰,等待Wait Class为 Concurrency。
关注一下ASH报告信息:
一个SQL选错索引的案例(New AP rejected (non-guess): favored index only range scan)
近日,在优化SQL测试时再次碰到了这个奇怪的现象,一个简单的DELETE SQL语句:DELETE FROM travelskyDBA WHERE pid=:b1 AND (ctype = :b2 OR ctype = :b3 OR ctype = :b4)执行缓慢,不巧,此类型的SQL语句在公司生产系统中出现过问题,导致某B2C网站交易严重超时,所以我再次做了一次模拟实验,分析并回顾一下这个SQL优化过程的来龙去脉:
DELETE FROM travelskyDBA WHERE pid=:b1 AND (ctype = :b2 OR ctype = :b3 OR ctype = :b4);
该SQL的索引情况与选择度概览,PID,CNAME,CTYPE存在组合索引,并且为主键,PID有单独存在的索引 TRAVELSKYDBA_IDX01,选择率如下图:
read by other session等待事件触发的SQL优化需求
昨日,在偶然登录到一个数据库查询资源配置后随手看了一下这个库的等待事件,果然发现了几个问题,其中一个便是:该数据库一致在经历read by other session的等待,并且观察其数据库的历史等待情况,该问题已为常态化出现,可能应用的敏感度要求没有很高,所以没有影响业务的正常使用,但是既然发现了这个问题,那就作为隐患来进行分析处理。