RAC集群节点关闭长时间Waiting for ASM to shutdown处理过程与案例分析

5月1日晚,接到值班同事求助电话,RAC集群重启节点后出现监听无法注册SERVICE的情况,应用连接在数据库实例重启后无法在维护节点上连接。

放下电话登录到系统中查看CRS状态出现如下错误

CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors。

CRS此时已经处于非正常状态,listener/service都是以资源的方式注册在CRSD中,所以出现SERVICE的异常现象是符合情理的。随后我准备重启CRS的时候,在stop的过程中却出现了长时间hang的情况,等待Waiting for ASM to shutdown。

[root@travelskydba-rac grid]# cd bin
[root@travelskydba-rac bin]# ./crsctl stat res tt
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@travelskydba-rac bin]# ./crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@travelskydba-rac bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.crsd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.crsd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.crf' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.ctssd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.evmd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.asm' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.mdnsd' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.crf' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.evmd' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'travelskydba-rac' succeeded

CRS-2675: Stop of 'ora.asm' on 'travelskydba-rac' failed  ---卡在这里很长时间,大概有8分钟左右
CRS-2679: Attempting to clean 'ora.asm' on 'travelskydba-rac'
CRS-2681: Clean of 'ora.asm' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.cssd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.gipcd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.drivers.acfs' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'travelskydba-rac' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'travelskydba-rac' has completed
CRS-4133: Oracle High Availability Services has been stopped.
2020-05-01 22:37:28.633:
[ctssd(2029)]CRS-2405:The Cluster Time Synchronization Service on host travelskydba-rac is shutdown by user
[client(89699)]CRS-10001:01-May-20 22:37 ACFS-9290: Waiting for ASM to shutdown
.... --省略重复输出
[client(124967)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125047)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125174)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125491)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
[client(125548)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
[client(125614)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
2020-05-01 22:47:28.727:
[/opt/app/11.2.0/grid/bin/oraagent.bin(759)]CRS-5818:Aborted command 'stop' for resource 'ora.asm'. Details at (:CRSAGF00113:) {0:0:52684} in /opt/app/11.2.0/grid/log/travelskydba-rac/agent/ohasd/oraagent_grid//oraagent_grid.log.
2020-05-01 22:47:30.730:
[ohasd(196541)]CRS-2757:Command 'Stop' timed out waiting for response from the resource 'ora.asm'. Details at (:CRSPE00111:) {0:0:52684} in /opt/app/11.2.0/grid/log/travelskydba-rac/ohasd/ohasd.log.
2020-05-01 22:47:33.404:
[cssd(872)]CRS-1603:CSSD on node travelskydba-rac shutdown by user.
2020-05-01 22:47:33.513:
[ohasd(196541)]CRS-2767:Resource state recovery not attempted for 'ora.cssdmonitor' as its target state is OFFLINE
2020-05-01 22:47:33.609:
[cssd(872)]CRS-1660:The CSS daemon shutdown has completed
2020-05-01 22:47:35.756:
[gpnpd(790)]CRS-2329:GPNPD on node travelskydba-rac shutdown.

DB alert日志:
NOTE: ASMB terminating
Errors in file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_asmb_188518.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1828 Serial number: 22601
Errors in file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_asmb_188518.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1828 Serial number: 22601
ASMB (ospid: 188518): terminating the instance due to error 15064
Fri May 01 22:47:30 2020
System state dump requested by (instance=1, osid=188518 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_diag_188401_20200501224730.trc
Dumping diagnostic data in directory=[cdmp_20200501224730], requested by (instance=1, osid=188518 (ASMB)), summary=[abnormal instance termination].
Instance terminated by ASMB, pid = 188518

节点2:集群日志:
2020-05-01 22:47:33.505:
[cssd(133163)]CRS-1625:Node travelskydba-rac, number 1, was shut down
2020-05-01 22:47:33.517:
[cssd(133163)]CRS-1601:CSSD Reconfiguration complete. Active nodes are travelskydba2-rac travelskydba3-rac .


节点3集群日志:
2020-05-01 22:47:33.505:
[cssd(141351)]CRS-1625:Node travelskydba-rac, number 1, was shut down
2020-05-01 22:47:33.517:
[cssd(141351)]CRS-1601:CSSD Reconfiguration complete. Active nodes are tr730e67-rac travelskydba3-rac .

也就是说节点1集群完全停下来,耗时22:37-22:47 10分钟,hang期间,让值班员联系硬件重启服务器,但是在硬件重启操作前,CRS却停下来了。

保险起见,还是让硬件重启一下服务器(因21点左右,硬件团队同事刚刚对此节点集群的私网心跳线进行更换)。

继续阅读