Handling and Case Analysis of a RAC Cluster Node Shutdown Hanging on "Waiting for ASM to shutdown"

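On the evening of May 1, I got a call for help from the on-duty colleague: after a node of the RAC cluster was restarted, the listener could not register any SERVICE, and once the database instance had been restarted, applications could not connect on the node under maintenance.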

After hanging up, I logged in to the system and checked the CRS status, which returned the following errors:

CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

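CRS was already in an abnormal state at this point. The listener and the services are all registered as resources in CRSD, so the SERVICE anomaly was to be expected. I then went to restart CRS, but the stop hung for a long time, waiting on "Waiting for ASM to shutdown".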

[root@travelskydba-rac grid]# cd bin
[root@travelskydba-rac bin]# ./crsctl stat res tt
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@travelskydba-rac bin]# ./crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@travelskydba-rac bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.crsd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.crsd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.crf' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.ctssd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.evmd' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.asm' on 'travelskydba-rac'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.mdnsd' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.crf' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.evmd' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'travelskydba-rac' succeeded

CRS-2675: Stop of 'ora.asm' on 'travelskydba-rac' failed  --- hung here for a long time, roughly 8 minutes
CRS-2679: Attempting to clean 'ora.asm' on 'travelskydba-rac'
CRS-2681: Clean of 'ora.asm' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.cssd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.gipcd' on 'travelskydba-rac' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'travelskydba-rac'
CRS-2677: Stop of 'ora.drivers.acfs' on 'travelskydba-rac' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'travelskydba-rac' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'travelskydba-rac' has completed
CRS-4133: Oracle High Availability Services has been stopped.
2020-05-01 22:37:28.633:
[ctssd(2029)]CRS-2405:The Cluster Time Synchronization Service on host travelskydba-rac is shutdown by user
[client(89699)]CRS-10001:01-May-20 22:37 ACFS-9290: Waiting for ASM to shutdown
.... -- repeated output omitted
[client(124967)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125047)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125174)]CRS-10001:01-May-20 22:46 ACFS-9290: Waiting for ASM to shutdown.
[client(125491)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
[client(125548)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
[client(125614)]CRS-10001:01-May-20 22:47 ACFS-9290: Waiting for ASM to shutdown.
2020-05-01 22:47:28.727:
[/opt/app/11.2.0/grid/bin/oraagent.bin(759)]CRS-5818:Aborted command 'stop' for resource 'ora.asm'. Details at (:CRSAGF00113:) {0:0:52684} in /opt/app/11.2.0/grid/log/travelskydba-rac/agent/ohasd/oraagent_grid//oraagent_grid.log.
2020-05-01 22:47:30.730:
[ohasd(196541)]CRS-2757:Command 'Stop' timed out waiting for response from the resource 'ora.asm'. Details at (:CRSPE00111:) {0:0:52684} in /opt/app/11.2.0/grid/log/travelskydba-rac/ohasd/ohasd.log.
2020-05-01 22:47:33.404:
[cssd(872)]CRS-1603:CSSD on node travelskydba-rac shutdown by user.
2020-05-01 22:47:33.513:
[ohasd(196541)]CRS-2767:Resource state recovery not attempted for 'ora.cssdmonitor' as its target state is OFFLINE
2020-05-01 22:47:33.609:
[cssd(872)]CRS-1660:The CSS daemon shutdown has completed
2020-05-01 22:47:35.756:
[gpnpd(790)]CRS-2329:GPNPD on node travelskydba-rac shutdown.

DB alert log:
NOTE: ASMB terminating
Errors in file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_asmb_188518.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1828 Serial number: 22601
Errors in file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_asmb_188518.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1828 Serial number: 22601
ASMB (ospid: 188518): terminating the instance due to error 15064
Fri May 01 22:47:30 2020
System state dump requested by (instance=1, osid=188518 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /opt/app/ora11g/diag/rdbms/Albert/Albert1/trace/Albert1_diag_188401_20200501224730.trc
Dumping diagnostic data in directory=[cdmp_20200501224730], requested by (instance=1, osid=188518 (ASMB)), summary=[abnormal instance termination].
Instance terminated by ASMB, pid = 188518

Node 2 cluster log:
2020-05-01 22:47:33.505:
[cssd(133163)]CRS-1625:Node travelskydba-rac, number 1, was shut down
2020-05-01 22:47:33.517:
[cssd(133163)]CRS-1601:CSSD Reconfiguration complete. Active nodes are travelskydba2-rac travelskydba3-rac .


Node 3 cluster log:
2020-05-01 22:47:33.505:
[cssd(141351)]CRS-1625:Node travelskydba-rac, number 1, was shut down
2020-05-01 22:47:33.517:
[cssd(141351)]CRS-1601:CSSD Reconfiguration complete. Active nodes are tr730e67-rac travelskydba3-rac .

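In other words, the clusterware on node 1 took 10 minutes (22:37 to 22:47) to stop completely. During the hang I asked the on-duty operator to contact the hardware team to reboot the server, but CRS finished stopping before the reboot was carried out.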
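
The 10-minute figure can be read straight off the alert log timestamps. Below is a minimal Python sketch, assuming the log layout shown above (a timestamp line followed by one or more message lines). The helper names `parse_entries` and `elapsed_between` are made up for illustration, and the sample messages are abbreviated copies of the entries above.

```python
from datetime import datetime
import re

# Timestamp lines in the Grid alert log look like "2020-05-01 22:37:28.633:"
TS_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):?$")

def parse_entries(text):
    """Return (timestamp, message) pairs; each message line is attributed
    to the most recent timestamp line seen before it."""
    entries = []
    current_ts = None
    for raw in text.splitlines():
        line = raw.strip()
        m = TS_LINE.match(line)
        if m:
            current_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        elif line and current_ts is not None:
            entries.append((current_ts, line))
    return entries

def elapsed_between(entries, start_code, end_code):
    """Time between the first entry containing start_code and the first
    entry containing end_code."""
    start = next(ts for ts, msg in entries if start_code in msg)
    end = next(ts for ts, msg in entries if end_code in msg)
    return end - start

# Abbreviated entries taken from the node 1 alert log above.
SAMPLE = """\
2020-05-01 22:37:28.633:
[ctssd(2029)]CRS-2405:The Cluster Time Synchronization Service on host travelskydba-rac is shutdown by user
2020-05-01 22:47:30.730:
[ohasd(196541)]CRS-2757:Command 'Stop' timed out waiting for response from the resource 'ora.asm'.
2020-05-01 22:47:35.756:
[gpnpd(790)]CRS-2329:GPNPD on node travelskydba-rac shutdown.
"""

if __name__ == "__main__":
    entries = parse_entries(SAMPLE)
    # Time until the ASM stop command timed out
    print(elapsed_between(entries, "CRS-2405", "CRS-2757"))
    # Time until the last daemon (gpnpd) was down
    print(elapsed_between(entries, "CRS-2405", "CRS-2329"))
```

Note that lines without their own timestamp (such as the ACFS-9290 messages) inherit the previous timestamp in this sketch, so they are not suitable as start or end markers.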

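To be safe, I still had the hardware team reboot the server (around 21:00 they had just replaced the private interconnect heartbeat cable for this node).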

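After the reboot, the clusterware on node 1 came up automatically with a normal status. I then manually performed one more full CRS restart to verify, and the cluster was back to normal.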

Node 1 cluster log:
[ohasd(8862)]CRS-2112:The OLR service started on node travelskydba-rac.
2020-05-01 23:01:20.135:
[ohasd(8862)]CRS-1301:Oracle High Availability Service started on node travelskydba-rac.
.....
2020-05-01 23:02:12.366:
[cssd(9466)]CRS-1601:CSSD Reconfiguration complete. Active nodes are travelskydba-rac tr730e67-rac travelskydba3-rac 
[crsd(17117)]CRS-1012:The OCR service started on node travelskydba-rac.
2020-05-01 23:02:43.714:
[evmd(16409)]CRS-1401:EVMD started on node travelskydba-rac.
2020-05-01 23:02:45.990:
[crsd(17117)]CRS-1201:CRSD started on node travelskydba-rac.
2020-05-01 23:02:47.755:

Node 2 cluster log:
2020-05-01 23:02:12.505:
[cssd(133163)]CRS-1601:CSSD Reconfiguration complete. Active nodes are travelskydba-rac tr730e67-rac travelskydba3-rac .
2020-05-01 23:02:49.882:
[crsd(133970)]CRS-2772:Server 'travelskydba-rac' has been assigned to pool 'Generic'.
2020-05-01 23:02:49.883:
[crsd(133970)]CRS-2772:Server 'travelskydba-rac' has been assigned to pool 'ora.Albert'.

Node 3 cluster log:
2020-05-01 23:02:04.654:
[cssd(141351)]CRS-1601:CSSD Reconfiguration complete. Active nodes are travelskydba-rac tr730e67-rac travelskydba3-rac .

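At this point the symptoms my colleague had reported were gone and the cluster status was back to normal. But the root cause had not been found yet: why did crsctl stop crs hang for so long on node 1, and why was the cluster in an abnormal state, throwing CRS-4535: Cannot communicate with Cluster Ready Services and CRS-4000: Command Status failed, or completed with errors?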

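As early as April 29, node 1 was already unhealthy. The cluster went through a reconfiguration once, but no CRS restart was triggered. (My guess is that inter-process communication was broken, so CRS never completed a full restart and monitoring did not fire either, which left the cluster in an inconsistent state.)
Node 1: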
2020-04-29 15:12:58.939:
[ohasd(196541)]CRS-2765:Resource 'ora.crsd' has failed on server 'travelskydba-rac'.
2020-04-29 15:13:06.598:
[crsd(128096)]CRS-1012:The OCR service started on node travelskydba-rac.
2020-04-29 15:18:07.772:
[crsd(128096)]CRS-1201:CRSD started on node travelskydba-rac.
2020-04-29 18:02:44.324:

The CRS errors started on April 29, and at the same moment node 2 began evicting node 1.
Node 2:
2020-04-29 15:12:58.725:
[cssd(133163)]CRS-1663:Member kill issued by PID 133970 for 1 members, group ocr_Albert. Details at (:CSSGM00044:) in /opt/app/11.2.0/grid/log/tr730e67-rac/cssd/ocssd.log.
2020-04-29 15:12:58.726:
[crsd(133970)]CRS-1015:A request to terminate the Cluster Ready Service on node travelskydba-rac completed successfully. Details at (:PROCR00001:) in /opt/app/11.2.0/grid/log/tr730e67-rac/crsd/crsd.log.


Node 3: no abnormal output
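
To see which side acted first, it helps to merge the entries from the different nodes into one timeline. The sketch below does that in Python. It assumes the node clocks are synchronized closely enough to compare millisecond timestamps, and the function name `timeline` and the shortened message texts are illustrative only.

```python
from datetime import datetime

def timeline(events_by_node):
    """Merge per-node (timestamp, message) lists into one list of
    (datetime, node, message) tuples sorted by time."""
    merged = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"), node, msg)
        for node, events in events_by_node.items()
        for ts, msg in events
    ]
    return sorted(merged, key=lambda e: e[0])

# Timestamps and CRS codes copied from the logs above; message texts shortened.
EVENTS = {
    "node1": [
        ("2020-04-29 15:12:58.939", "CRS-2765: Resource 'ora.crsd' has failed"),
        ("2020-04-29 15:13:06.598", "CRS-1012: The OCR service started"),
        ("2020-04-29 15:18:07.772", "CRS-1201: CRSD started"),
    ],
    "node2": [
        ("2020-04-29 15:12:58.725", "CRS-1663: Member kill issued"),
        ("2020-04-29 15:12:58.726", "CRS-1015: request to terminate CRS on node 1 completed"),
    ],
}

if __name__ == "__main__":
    for ts, node, msg in timeline(EVENTS):
        print(ts.strftime("%H:%M:%S.%f")[:-3], node, msg)
```

With these entries, the member kill on node 2 sorts ahead of the ora.crsd failure reported on node 1, which matches the reading that node 2 terminated CRSD on node 1.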

Why was node 1 evicted? I found a clue in the CRSD log on node 2: the crsd process hit a communication problem, s_update_remote_cache_int: FAILED TO RCV ACK FROM node 1 retcode 203

2020-04-29 15:12:58.892: [  OCRMAS][1559652096]rcfg_con:2: Member [1] left. Inc [38].
2020-04-29 15:12:58.892: [  OCRMSG][1310127872]prom_recv: Failed to receieve [3]
2020-04-29 15:12:58.892: [  OCRMSG][1310127872]GIPC error [3] msg [gipcretInvalidObject]
2020-04-29 15:12:58.892: [  OCRSRV][1310127872]s_update_remote_cache_int: FAILED TO RCV ACK FROM node 1 retcode 203
2020-04-29 15:12:58.893: [  OCRMAS][1559652096]th_master: Received event that a member was successfully killed. Should receive a recconfig event

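GIPC was throwing errors, so inter-process communication was failing. I checked the network state and found that, while these errors were being thrown, the operating system was reporting a large number of packet reassembly failures (packet reassembles failed).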

Node 1 was losing packets on the network, which is what led to its eviction:
    4134046 packet reassembles failed
    4134378 packet reassembles failed
    4134591 packet reassembles failed
    4134910 packet reassembles failed
    4135231 packet reassembles failed
    4135593 packet reassembles failed
    4136036 packet reassembles failed
    4136453 packet reassembles failed
    4136871 packet reassembles failed
    4137192 packet reassembles failed
    4137430 packet reassembles failed
    4137737 packet reassembles failed
    4138013 packet reassembles failed
    4138400 packet reassembles failed
    4138792 packet reassembles failed
    4139331 packet reassembles failed
    4139789 packet reassembles failed
    4140265 packet reassembles failed
    4140678 packet reassembles failed
    4140965 packet reassembles failed
    4141273 packet reassembles failed
    4141643 packet reassembles failed
    4141933 packet reassembles failed
    4142315 packet reassembles failed
    4142716 packet reassembles failed
    4143265 packet reassembles failed
    4143571 packet reassembles failed
    4143840 packet reassembles failed
    4144207 packet reassembles failed
    4144502 packet reassembles failed
    4144746 packet reassembles failed
    4145072 packet reassembles failed
    4145519 packet reassembles failed
    4145878 packet reassembles failed
    4146279 packet reassembles failed
    4146795 packet reassembles failed
    4147020 packet reassembles failed
    4147308 packet reassembles failed
    4147632 packet reassembles failed
    4147968 packet reassembles failed
    4148363 packet reassembles failed
    4148660 packet reassembles failed
    4149031 packet reassembles failed
    4149364 packet reassembles failed
    4149742 packet reassembles failed
    4149993 packet reassembles failed
    4150255 packet reassembles failed
    4150546 packet reassembles failed
    4150982 packet reassembles failed
    4151384 packet reassembles failed
    4151763 packet reassembles failed
    4152094 packet reassembles failed
    4152382 packet reassembles failed
    4152664 packet reassembles failed
    4152870 packet reassembles failed
    4153173 packet reassembles failed
    4153330 packet reassembles failed
    4153505 packet reassembles failed
    4153647 packet reassembles failed
    4153844 packet reassembles failed
    4153958 packet reassembles failed
    4154070 packet reassembles failed
    4154213 packet reassembles failed
    4154390 packet reassembles failed
    4154492 packet reassembles failed
    4154696 packet reassembles failed
    4154918 packet reassembles failed
    4155126 packet reassembles failed
    4155311 packet reassembles failed
    4155562 packet reassembles failed
    4155666 packet reassembles failed
    4155813 packet reassembles failed
    4155967 packet reassembles failed
    4156006 packet reassembles failed
    4156177 packet reassembles failed
    4156286 packet reassembles failed
    4156424 packet reassembles failed
    4156498 packet reassembles failed
    4156620 packet reassembles failed
    4156816 packet reassembles failed
    4156894 packet reassembles failed
    4157034 packet reassembles failed
    4157174 packet reassembles failed
    4157309 packet reassembles failed
    4157480 packet reassembles failed
    4157652 packet reassembles failed
    4157888 packet reassembles failed
    4158048 packet reassembles failed
    4158246 packet reassembles failed
    4158465 packet reassembles failed
    4158668 packet reassembles failed
    4158864 packet reassembles failed
    4159088 packet reassembles failed
    4159293 packet reassembles failed
    4159473 packet reassembles failed
    4159556 packet reassembles failed
    4159728 packet reassembles failed
    4159905 packet reassembles failed
    4160071 packet reassembles failed
    4160288 packet reassembles failed
    4160446 packet reassembles failed
    4160600 packet reassembles failed
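
The values above are a cumulative counter, so what matters is how fast it grows between samples. Here is a small Python sketch that turns such samples into per-sample increases. The sampling interval is not stated here, so the deltas are per sample rather than per second, and the function name `reassembly_deltas` is made up for illustration.

```python
import re

# Matches lines such as "    4134046 packet reassembles failed"
COUNTER = re.compile(r"^\s*(\d+) packet reassembles failed\s*$")

def reassembly_deltas(lines):
    """Return the increase of the cumulative counter between
    consecutive samples, ignoring lines that do not match."""
    values = [int(m.group(1)) for m in map(COUNTER.match, lines) if m]
    return [later - earlier for earlier, later in zip(values, values[1:])]

# First four samples from the listing above.
SAMPLE = [
    "    4134046 packet reassembles failed",
    "    4134378 packet reassembles failed",
    "    4134591 packet reassembles failed",
    "    4134910 packet reassembles failed",
]

if __name__ == "__main__":
    print(reassembly_deltas(SAMPLE))  # [332, 213, 319]
```

A counter that keeps climbing by hundreds per sample, as it does here, points to an ongoing problem rather than a one-off burst.
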
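At 15:13 on April 29, the cluster reconfigured and node 1 rejoined, but the packet loss was still unresolved. The clusterware on node 1 was left half alive, and its state was very likely inconsistent.
GIPC errors had in fact been appearing since April 8, so it is hardly surprising that resource failover on node 1 misbehaved and that the stack would not stop. (The gipc process is the one responsible for setting up and monitoring the private network, and the ohasd agent processes need that communication as well.)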

2020-04-08 20:20:18.324: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-10 18:50:43.545: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-11 00:05:48.471: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-13 18:21:23.809: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-14 12:51:36.146: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-14 17:21:44.814: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-14 19:21:43.746: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-14 22:36:41.579: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 00:36:43.714: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 01:06:44.995: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 02:36:44.841: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 02:51:41.970: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 04:51:44.411: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 05:06:46.199: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 10:06:54.039: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 12:21:59.293: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 14:51:48.673: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 16:21:50.515: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 19:22:00.221: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 19:51:59.519: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 20:06:52.651: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-15 23:51:53.748: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 02:21:57.161: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 02:36:56.302: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 06:22:05.408: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 08:06:59.361: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 09:07:04.920: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 10:07:02.465: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 10:22:00.606: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 11:37:01.295: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 12:07:04.573: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 12:22:01.714: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 13:07:06.120: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 13:22:14.278: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 14:52:03.119: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 15:22:05.402: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 15:52:11.703: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 16:37:04.112: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 18:52:16.387: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-16 23:37:09.056: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-17 18:07:21.441: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-19 13:07:41.700: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-19 13:22:43.852: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-19 14:22:43.407: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-19 16:07:51.402: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-20 19:53:08.961: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-21 08:08:09.768: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-21 08:38:06.336: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-21 09:53:07.728: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-21 10:08:06.874: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-21 11:08:10.430: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 01:38:20.626: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 02:23:26.055: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 04:23:18.220: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 07:38:27.022: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 08:08:28.314: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 08:23:23.455: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 08:53:28.733: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 10:08:23.424: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 10:53:27.868: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 11:53:22.415: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 12:08:21.552: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 14:38:24.940: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 15:53:30.645: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 16:08:25.776: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 17:53:30.752: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 19:08:27.468: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-22 20:23:29.520: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 00:53:28.692: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 01:23:31.973: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 01:38:35.121: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 01:53:32.253: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 03:53:35.372: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 04:08:32.511: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 05:08:46.394: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 05:38:47.676: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 06:38:36.960: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 07:08:39.214: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 07:23:35.344: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 08:23:36.902: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 08:38:42.130: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 08:53:34.405: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 09:53:34.742: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 11:08:35.442: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-23 20:23:43.597: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 02:38:43.132: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 07:08:48.692: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 07:23:45.821: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 10:08:50.367: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 11:38:57.231: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 11:54:06.788: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 12:08:50.423: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 13:09:19.498: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 13:24:02.219: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 13:54:03.883: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 14:09:04.020: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 14:23:49.770: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 14:38:52.906: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 15:23:51.335: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 15:38:52.479: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 15:53:50.608: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 16:08:53.149: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 16:54:02.198: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 17:38:59.301: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 18:08:53.879: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 18:53:55.303: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 19:08:56.445: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 19:23:56.580: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 19:53:58.872: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-24 20:08:55.426: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-25 03:38:58.284: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-25 20:09:17.522: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 05:24:15.302: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 05:39:16.927: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 07:39:22.064: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 07:54:13.197: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 08:09:34.855: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 08:24:13.470: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 08:54:14.744: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 11:39:28.846: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 11:54:21.463: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 16:39:21.140: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 21:09:46.186: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 21:24:24.823: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 21:54:28.110: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 22:54:25.673: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 23:39:28.020: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-26 23:54:25.260: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 00:54:22.820: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 01:09:24.968: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 01:24:27.179: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 03:09:33.102: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 04:24:26.393: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 04:39:35.934: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 04:54:30.074: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 05:39:33.034: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 06:09:51.311: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 07:24:30.514: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 07:54:35.793: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 08:54:35.349: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 09:09:27.982: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 09:39:27.747: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 09:54:42.386: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 10:09:39.049: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 10:24:30.170: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 11:09:30.595: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 11:39:28.882: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-27 20:54:41.082: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 12:24:43.726: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 14:39:50.172: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 15:55:02.316: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 16:39:45.095: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 16:54:50.298: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 17:24:45.525: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 17:39:54.335: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 17:55:03.425: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 18:55:16.971: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 20:39:49.367: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 21:39:49.923: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 22:24:56.377: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 22:40:03.123: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 23:09:54.766: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 23:24:57.927: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-28 23:54:49.192: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 00:09:57.737: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 00:24:56.049: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 00:54:55.739: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 01:09:55.887: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 01:39:58.735: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 01:54:51.287: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 02:09:51.363: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 02:24:55.687: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 03:54:53.413: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 04:09:53.549: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 07:39:55.511: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 10:40:05.897: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 11:24:56.583: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 11:54:56.877: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 12:10:14.708: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 12:25:19.848: [  OCRMSG][2940180224]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 13:10:16.283: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 13:40:12.559: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 13:55:00.008: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 14:10:24.315: [  OCRMSG][2944382720]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 14:25:01.326: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 14:55:23.253: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 15:42:29.731: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 15:57:11.832: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 16:12:08.539: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 16:42:06.799: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 17:27:25.701: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 17:42:16.828: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 18:27:14.247: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 18:42:07.930: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 19:12:15.676: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 20:12:20.805: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 22:12:19.379: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 22:42:19.646: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-29 22:57:11.319: [  OCRMSG][8554240]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 01:12:19.059: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 01:57:11.997: [  OCRMSG][8554240]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 03:42:26.492: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 04:42:16.550: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 05:42:16.103: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 07:27:16.085: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 07:57:16.375: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 08:42:15.795: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 08:57:22.359: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 09:27:21.213: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 09:57:21.478: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 10:42:21.905: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 11:12:25.668: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 12:12:17.720: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 17:27:27.171: [  OCRMSG][8554240]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 17:42:42.323: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 17:57:33.453: [  OCRMSG][16959232]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 18:12:24.096: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 18:28:26.762: [  OCRMSG][8554240]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 18:42:54.905: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 19:12:21.650: [  OCRMSG][12756736]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 19:57:24.061: [  OCRMSG][8554240]GIPC error [12] msg [gipcretConnectionLost]
2020-04-30 20:42:23.478: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]
2020-05-01 21:12:41.178: [  OCRMSG][4351744]GIPC error [12] msg [gipcretConnectionLost]  -- after the hardware team replaced the cable, things returned to normal and no further GIPC errors appeared
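
A long listing like this is easier to judge as a count per day. The Python sketch below groups the gipcretConnectionLost lines by date, assuming the line format shown above. The function name `gipc_errors_per_day` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as
# "2020-04-08 20:20:18.324: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]"
GIPC_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) [\d:.]+: \[\s*OCRMSG\]\[\d+\]"
    r"GIPC error \[(\d+)\] msg \[(\w+)\]"
)

def gipc_errors_per_day(lines, msg="gipcretConnectionLost"):
    """Count GIPC error lines carrying the given message name, per date."""
    counts = Counter()
    for line in lines:
        m = GIPC_LINE.match(line)
        if m and m.group(3) == msg:
            counts[m.group(1)] += 1
    return counts

# A few lines copied from the logs above; the last one has a different
# message name and is therefore not counted.
SAMPLE = [
    "2020-04-08 20:20:18.324: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]",
    "2020-04-14 12:51:36.146: [  OCRMSG][2948585216]GIPC error [12] msg [gipcretConnectionLost]",
    "2020-04-14 17:21:44.814: [  OCRMSG][3158234880]GIPC error [12] msg [gipcretConnectionLost]",
    "2020-04-29 15:12:58.892: [  OCRMSG][1310127872]GIPC error [3] msg [gipcretInvalidObject]",
]

if __name__ == "__main__":
    for day, count in sorted(gipc_errors_per_day(SAMPLE).items()):
        print(day, count)
```

Run against the full log, a daily count that drops to zero after the cable replacement would back up the observation in the last line of the listing.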

To summarize:

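1. Starting April 8, database node 1 logged a large number of GIPC gipcretConnectionLost errors, which shows that private interconnect communication was already failing at that time, with a large number of packet reassembly failures.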

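2. On April 29 (per the log timestamps above), node 2 began evicting node 1 because of the CRSD inter-process communication problem. Only the CRSD process on node 1 was restarted; the clusterware itself was not.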

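3. On the evening of May 1, the operator carrying out the change did not check CRS before stopping the DB instance. A check at that point would have revealed in advance that the CRS cluster state was unhealthy. In addition, for a heartbeat cable replacement, the whole clusterware stack should have been stopped, not just the DB instance.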

4. All symptoms cleared once the hardware team had replaced the heartbeat cable and the clusterware had been restarted.
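
The pre-check mentioned in point 3 can be as simple as running the same crsctl stat res -t command used earlier and looking for the two error codes seen in this incident. Below is a minimal Python sketch of that idea. The crsctl path is inferred from the Grid home that appears in the log paths above and is an assumption, the function names are hypothetical, and a real check would likely also look at resource states.

```python
import subprocess

# Error codes seen in this incident when CRSD was unreachable.
FAILURE_CODES = ("CRS-4535", "CRS-4000")

def crs_unhealthy(output):
    """Return True if the command output contains one of the error codes."""
    return any(code in output for code in FAILURE_CODES)

def check_crs(crsctl="/opt/app/11.2.0/grid/bin/crsctl"):
    """Run 'crsctl stat res -t' and return True if no known error code
    appears. The default path is an assumption based on the log paths."""
    result = subprocess.run(
        [crsctl, "stat", "res", "-t"],
        capture_output=True,
        text=True,
    )
    return not crs_unhealthy(result.stdout + result.stderr)

if __name__ == "__main__":
    sample = (
        "CRS-4535: Cannot communicate with Cluster Ready Services\n"
        "CRS-4000: Command Status failed, or completed with errors.\n"
    )
    print(crs_unhealthy(sample))  # True
```

Keeping the text check separate from the subprocess call makes it possible to test the detection logic against saved output without access to a cluster.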