注册
主备集群的切换流程探究
专栏/技术分享/ 文章详情 /

主备集群的切换流程探究

BruceCD 2025/10/31 290 0 0
摘要

简述

DM的主备集群部署模式下,备集群可以作为热备提供只读服务,在主集群发生故障的情况下可以切换为新主,降低服务不可用时间,以下内容为主备集群的主动切换和故障情况下的被动切换过程分析。

集群版本信息

SQL> select * from v$instance;

行号     NAME       INSTANCE_NAME INSTANCE_NUMBER HOST_NAME      SVR_VERSION                DB_VERSION          START_TIME          STATUS$ MODE$   OGUID       DSC_SEQNO   DSC_ROLE
---------- ---------- ------------- --------------- -------------- -------------------------- ------------------- ------------------- ------- ------- ----------- ----------- --------
           BUILD_VERSION                      BUILD_TIME
           ---------------------------------- --------------------
1          GRP1_RT_01 GRP1_RT_01    1               kylin10-57-102 DM Database Server x64 V8  DB Version: 0x7000c 2025-10-11 12:06:10 OPEN    PRIMARY 453331      0           NULL
           1-4-6-2024.12.25-255012-20119-ENT  Dec 27 2024 06:02:42

主动切换

主动切换一般是为了对主库做一些维护操作,为了保证服务可用,就需要把备库切换为新主库保证服务不会中断。

主动切换可以使用普通监视器在前台运行时执行命令行指令,主要有两个步骤:

  1. 登录监视器,这一步监视器会验证用户权限,只有鉴权通过才能执行后续的switchover指令
  2. 执行监视器的switchover命令,需要指定一个备作为新主进行主备切换

监视器日志

以下使用普通监视器对主备集群进行手工切换的输出日志:

login
用户名:SYSDBA
密码:
[monitor]         2025-10-11 16:01:59: 登录监视器成功!
switchover GRP1.GRP1_RT_02
此操作需谨慎, 将会导致主库发生切换, 是否继续使用GRP1.GRP1_RT_02执行SWITCHOVER操作(YES/NO/Y/N)?
y
[monitor]         2025-10-11 16:02:32: 开始切换实例GRP1_RT_02
[monitor]         2025-10-11 16:02:32: 通知守护进程GRP1_RT_01切换SWITCHOVER状态
[monitor]         2025-10-11 16:02:32: 守护进程(GRP1_RT_01)状态切换 [OPEN-->SWITCHOVER]
[monitor]         2025-10-11 16:02:32: 切换守护进程GRP1_RT_01为SWITCHOVER状态成功
[monitor]         2025-10-11 16:02:32: 通知守护进程GRP1_RT_02切换SWITCHOVER状态
[monitor]         2025-10-11 16:02:32: 守护进程(GRP1_RT_02)状态切换 [OPEN-->SWITCHOVER]
[monitor]         2025-10-11 16:02:32: 切换守护进程GRP1_RT_02为SWITCHOVER状态成功
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_01开始执行SP_SET_GLOBAL_DW_STATUS(0, 6)语句
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_01执行SP_SET_GLOBAL_DW_STATUS(0, 6)语句成功
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_02开始执行SP_SET_GLOBAL_DW_STATUS(0, 6)语句
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_02执行SP_SET_GLOBAL_DW_STATUS(0, 6)语句成功
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_01开始执行ALTER DATABASE MOUNT语句
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_01执行ALTER DATABASE MOUNT语句成功
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_02开始执行SP_APPLY_KEEP_PKG()语句
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_02执行SP_APPLY_KEEP_PKG()语句成功
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_02开始执行ALTER DATABASE MOUNT语句
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_02执行ALTER DATABASE MOUNT语句成功
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_01开始执行ALTER DATABASE STANDBY语句
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_01执行ALTER DATABASE STANDBY语句成功
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_02开始执行ALTER DATABASE PRIMARY语句
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_02执行ALTER DATABASE PRIMARY语句成功
[monitor]         2025-10-11 16:02:32: 通知实例GRP1_RT_02修改所有归档状态无效
[monitor]         2025-10-11 16:02:32: 修改所有实例归档为无效状态成功
[monitor]         2025-10-11 16:02:32: 实例GRP1_RT_01开始执行ALTER DATABASE OPEN FORCE语句
[monitor]         2025-10-11 16:02:33: 实例GRP1_RT_01执行ALTER DATABASE OPEN FORCE语句成功
[monitor]         2025-10-11 16:02:33: 实例GRP1_RT_02开始执行ALTER DATABASE OPEN FORCE语句
[monitor]         2025-10-11 16:02:33: 实例GRP1_RT_02执行ALTER DATABASE OPEN FORCE语句成功
[monitor]         2025-10-11 16:02:33: 实例GRP1_RT_01开始执行SP_SET_GLOBAL_DW_STATUS(6, 0)语句
[monitor]         2025-10-11 16:02:33: 实例GRP1_RT_01执行SP_SET_GLOBAL_DW_STATUS(6, 0)语句成功
[monitor]         2025-10-11 16:02:33: 实例GRP1_RT_02开始执行SP_SET_GLOBAL_DW_STATUS(6, 0)语句
[monitor]         2025-10-11 16:02:33: 实例GRP1_RT_02执行SP_SET_GLOBAL_DW_STATUS(6, 0)语句成功
[monitor]         2025-10-11 16:02:33: 通知守护进程GRP1_RT_01切换OPEN状态
[monitor]         2025-10-11 16:02:33: 守护进程(GRP1_RT_01)状态切换 [SWITCHOVER-->OPEN]
[monitor]         2025-10-11 16:02:33: 切换守护进程GRP1_RT_01为OPEN状态成功
[monitor]         2025-10-11 16:02:33: 通知守护进程GRP1_RT_02切换OPEN状态
[monitor]         2025-10-11 16:02:33: 守护进程(GRP1_RT_02)状态切换 [SWITCHOVER-->OPEN]
[monitor]         2025-10-11 16:02:34: 切换守护进程GRP1_RT_02为OPEN状态成功
[monitor]         2025-10-11 16:02:34: 通知组(GRP1)的守护进程执行清理操作
[monitor]         2025-10-11 16:02:34: 清理守护进程(GRP1_RT_01)请求成功

2025-10-11 16:02:34
#================================================================================#
GROUP            OGUID       MON_CONFIRM     MODE            MPP_FLAG
GRP1             453331      FALSE           AUTO            FALSE


<<DATABASE GLOBAL INFO:>>
DW_IP           MAL_DW_PORT  WTIME                WTYPE     WCTLSTAT  WSTATUS        INAME            INST_OK   N_EP  N_OK  ISTATUS     IMODE     DSC_STATUS     RTYPE     RSTAT    DETACHED
192.168.57.103  52142        2025-10-11 16:02:34  GLOBAL    VALID     OPEN           GRP1_RT_02       OK        1     1     OPEN        PRIMARY   DSC_OPEN       REALTIME  VALID    FALSE

EP INFO:
INST_IP         INST_PORT  INST_OK   INAME            ISTATUS     IMODE     DSC_SEQNO  DSC_CTL_NODE RTYPE     RSTAT       FSEQ            FLSN            CSEQ            CLSN            DW_STAT_FLAG
192.168.57.103  5236       OK        GRP1_RT_02       OPEN        PRIMARY   0          0            REALTIME  VALID       4988            68955           4988            68955           NONE

<<DATABASE GLOBAL INFO:>>
DW_IP           MAL_DW_PORT  WTIME                WTYPE     WCTLSTAT  WSTATUS        INAME            INST_OK   N_EP  N_OK  ISTATUS     IMODE     DSC_STATUS     RTYPE     RSTAT    DETACHED
192.168.57.102  52141        2025-10-11 16:02:34  GLOBAL    VALID     OPEN           GRP1_RT_01       OK        1     1     OPEN        STANDBY   DSC_OPEN       REALTIME  INVALID  FALSE

EP INFO:
INST_IP         INST_PORT  INST_OK   INAME            ISTATUS     IMODE     DSC_SEQNO  DSC_CTL_NODE RTYPE     RSTAT       FSEQ            FLSN            CSEQ            CLSN            DW_STAT_FLAG
192.168.57.102  5236       OK        GRP1_RT_01       OPEN        STANDBY   0          0            REALTIME  INVALID     4986            68864           4986            68864           NONE

DATABASE(GRP1_RT_01) APPLY INFO FROM (GRP1_RT_02), REDOS_PARALLEL_NUM (1), WAIT_APPLY[FALSE]:
DSC_SEQNO[0], (RSEQ, SSEQ, KSEQ)[4986, 4986, 4986], (RLSN, SLSN, KLSN)[68864, 68864, 68864], N_TSK[0], TSK_MEM_USE[0]
REDO_LSN_ARR: (68864)


#================================================================================#

[monitor]         2025-10-11 16:02:34: 清理守护进程(GRP1_RT_02)请求成功
[monitor]         2025-10-11 16:02:34: 实例GRP1_RT_02切换成功

[monitor]         2025-10-11 16:02:36: 守护进程(GRP1_RT_02)状态切换 [OPEN-->RECOVERY]
                  WTIME                WSTATUS        INST_OK   INAME            ISTATUS     IMODE     RSTAT    N_OPEN   FLSN            CLSN
                  2025-10-11 16:02:36  RECOVERY       OK        GRP1_RT_02       OPEN        PRIMARY   VALID    19       68955           68955

[monitor]         2025-10-11 16:02:41: 守护进程(GRP1_RT_02)状态切换 [RECOVERY-->OPEN]
                  WTIME                WSTATUS        INST_OK   INAME            ISTATUS     IMODE     RSTAT    N_OPEN   FLSN            CLSN
                  2025-10-11 16:02:41  OPEN           OK        GRP1_RT_02       OPEN        PRIMARY   VALID    19       68955           68955

从以上日志中可以清晰地看到,将GRP1_RT_02切换为新主时,监视器主要做了这些操作:

  1. 将GRP1_RT_01和GRP1_RT_02的守护进程状态从OPEN切换为SWITCHOVER状态
  2. 主库GRP1_RT_01执行ALTER DATABASE MOUNT语句将实例转换为配置状态
  3. 备库GRP1_RT_02执行SP_APPLY_KEEP_PKG()语句,应用KEEP_PKG数据
  4. 备库GRP1_RT_02执行ALTER DATABASE MOUNT语句将实例转换为配置状态
  5. 主库GRP1_RT_01执行ALTER DATABASE STANDBY变成新的备库
  6. 备库GRP1_RT_02执行ALTER DATABASE PRIMARY变成新的主库
  7. 新主库GRP1_RT_02修改归档状态为INVALID状态
  8. 新备库GRP1_RT_01和新主库GRP1_RT_02分别执行ALTER DATABASE OPEN FORCE语句,将状态从MOUNT转为OPEN状态
  9. 新备库GRP1_RT_01和新主库GRP1_RT_02分别执行SP_SET_GLOBAL_DW_STATUS(6, 0)
  10. 守护进程从SWITCHOVER切换为OPEN状态
    1. 新备库GRP1_RT_01的守护进程从SWITCHOVER切换为OPEN状态
    2. 新主库GRP1_RT_02的守护进程从SWITCHOVER切换为OPEN状态
  11. 请求清理守护进程
    1. 新备库GRP1_RT_01守护进程清理
    2. 新主库GRP1_RT_02守护进程清理
  12. 新主库GRP1_RT_02的守护进程状态切换到RECOVERY再切换到OPEN
  13. 主备切换完成,共耗时9秒

从以上过程可以看到,主动切换主备的过程中,监视器会主动向主备库和主备守护进程发送一系列命令,先将所有实例状态转换为配置状态,再做归档配置,并且会对守护进程进行切换为切换状态,经过后续的一系列操作后最后会清理主备库守护进程的清理,主备才算切换完成。

原主库守护进程切换日志

切换过程中,原主库GRP1_RT_01的守护进程切换日志如下:

2025-10-11 16:02:31.013 [INFO] dmwatcher P0000025576 T0000000000000025615  dw2_group_set_mon_info org info, mid:210076262, cmd:0, dseq:0, code:100, status:0, mon_ip:, rarch_status:0, processed:0, sql:
2025-10-11 16:02:31.013 [INFO] dmwatcher P0000025576 T0000000000000025615  dw2_group_set_mon_info new info, mid:210076262, cmd:0, dseq:2357559919, code:0, status:1, mon_ip:::ffff:192.168.57.103, rarch_status:0, processed:0, sql:
2025-10-11 16:02:31.013 [INFO] dmwatcher P0000025576 T0000000000000025615  dw2_group_process_set_mid, old mid:210076262, new mid:210076262, dseq:2357559919
2025-10-11 16:02:31.127 [INFO] dmwatcher P0000025576 T0000000000000025615  dw2_group_set_mon_info org info, mid:210076262, cmd:0, dseq:2357559919, code:0, status:1, mon_ip:::ffff:192.168.57.103, rarch_status:0, processed:0, sql:
......(省略n行)
2025-10-11 16:02:36.769 [INFO] dmwatcher P0000025576 T0000000000000025580  Local(GRP1_RT_01) ohis is included in remote(GRP1_RT_02) ohis, and apply info matched, l_m_s_i(6), return act(ACT_KEEP)!
2025-10-11 16:02:37.266 [INFO] dmwatcher P0000025576 T0000000000000025580  Local apply info, p_db_magic:431410808, n_apply_ep:1, pkg_seq_arr:[4986, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], apply_lsn_arr:[68864, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]!
2025-10-11 16:02:37.266 [INFO] dmwatcher P0000025576 T0000000000000025580  Remote apply info in ctl, ctl_guid:GRP1_RT_02_19, p_db_magic:431410808, n_apply_ep:1, pkg_seq_arr:[4986, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], apply_lsn_arr:[68864, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]!
2025-10-11 16:02:37.266 [INFO] dmwatcher P0000025576 T0000000000000025580  Local(GRP1_RT_01) ohis is included in remote(GRP1_RT_02) ohis, and apply info matched, l_m_s_i(6), return act(ACT_KEEP)!
2025-10-11 16:02:39.775 [INFO] dmwatcher P0000025576 T0000000000000025591  Local db apply info changed, old info[p_db_magic:431410808, n_apply_ep:1], new info to set[p_db_magic:1810686012, n_apply_ep:1]!
2025-10-11 16:02:39.775 [INFO] dmwatcher P0000025576 T0000000000000025591  dw2_process_ohis_info, n_ohis changed from 18 to 19, from_ep:GRP1_RT_01, ep_seqno:0, hpc_master:0
2025-10-11 16:02:39.775 [INFO] dmwatcher P0000025576 T0000000000000025591  dw2_process_ohis_info local apply info, p_db_magic:1810686012, n_apply_ep:1, n_fill_ep:1 pkg_seq_arr:[4988, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], apply_lsn_arr:[68955, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]!
2025-10-11 16:02:41.095 [INFO] dmwatcher P0000025576 T0000000000000025747  远程实例的模式、状态或者归档状态发生变化,原状态是:
2025-10-11 16:02:41.095 [INFO] dmwatcher P0000025576 T0000000000000025747  Instance: 守护进程状态(RECOVERY) 实例状态(OK) 实例名(GRP1_RT_02) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(19) FLSN(68955) CLSN(68955) SLSN(68955) SSLSN(68955)
2025-10-11 16:02:41.152 [INFO] dmwatcher P0000025576 T0000000000000025747  远程实例的模式、状态或者归档状态发生变化,新状态是:
2025-10-11 16:02:41.215 [INFO] dmwatcher P0000025576 T0000000000000025747  Instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(GRP1_RT_02) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(19) FLSN(68955) CLSN(68955) SLSN(68955) SSLSN(68955)

主要步骤如下:

  1. 接收监视器命令,将自己的状态转换位SWITCHOVER状态
  2. 发送命令给GRP1_RT_01执行ALTER DATABASE MOUNT,再执行ALTER DATABASE STANDBY变成新的备库
  3. GRP1_RT_01执行ALTER DATABASE OPEN FORCE
  4. 将自己的状态从SWITCHOVER切换为OPEN状态

新主库守护进程切换日志

新主库GRP1_RT_02的守护进程切换日志如下:

2025-10-11 16:02:31.008 [INFO] dmwatcher P0000010386 T0000000000000010471  dw2_group_set_mon_info org info, mid:210076262, cmd:0, dseq:0, code:100, status:0, mon_ip:, rarch_status:0, processed:0, sql:
2025-10-11 16:02:31.008 [INFO] dmwatcher P0000010386 T0000000000000010471  dw2_group_set_mon_info new info, mid:210076262, cmd:0, dseq:2357559919, code:0, status:1, mon_ip:::ffff:192.168.57.103, rarch_status:0, processed:0, sql:
2025-10-11 16:02:31.008 [INFO] dmwatcher P0000010386 T0000000000000010471  dw2_group_process_set_mid, old mid:210076262, new mid:210076262, dseq:2357559919
2025-10-11 16:02:31.123 [INFO] dmwatcher P0000010386 T0000000000000010556  远程实例的模式、状态或者归档状态发生变化,原状态是:
2025-10-11 16:02:31.123 [INFO] dmwatcher P0000010386 T0000000000000010556  Instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(GRP1_RT_01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(18) FLSN(68864) CLSN(68864) SLSN(68864) SSLSN(68864)
2025-10-11 16:02:31.184 [INFO] dmwatcher P0000010386 T0000000000000010556  远程实例的模式、状态或者归档状态发生变化,新状态是:
2025-10-11 16:02:31.246 [INFO] dmwatcher P0000010386 T0000000000000010556  Instance: 守护进程状态(LOGIN CHECK) 实例状态(OK) 实例名(GRP1_RT_01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(18) FLSN(68864) CLSN(68864) SLSN(68864) SSLSN(68864)
......(省略n行)
2025-10-11 16:02:41.089 [INFO] dmwatcher P0000010386 T0000000000000010390  dw2_clear_ep_cmd_info_low, clear ep(GRP1_RT_02) cmd info, and reset curr_ep to NULL.
2025-10-11 16:02:41.089 [INFO] dmwatcher P0000010386 T0000000000000010390  notify ep(GRP1_RT_02) set dw_stat to NONE success!
2025-10-11 16:02:41.089 [INFO] dmwatcher P0000010386 T0000000000000010390  dw2_clear_ep_cmd_info_low, clear ep(GRP1_RT_02) cmd info.
2025-10-11 16:02:41.089 [INFO] dmwatcher P0000010386 T0000000000000010390  Clear all ep g_dw_status finished, Recovery finished!
2025-10-11 16:02:41.089 [INFO] dmwatcher P0000010386 T0000000000000010390  switch sub_state to sub_stat_start!
2025-10-11 16:02:41.089 [INFO] dmwatcher P0000010386 T0000000000000010390  设置GRP1守护进程为OPEN(SUB:STARTUP)状态

原主库切换日志

原主库GRP1_RT_01的切换日志如下:

2025-10-11 16:02:31.189 [INFO] database P0000025410 T0000000000000025590  utsk_cmd_add, cmd info: cmd=217, dseq=1760155593, name_in=, begin_lsn=-1!
2025-10-11 16:02:31.190 [INFO] database P0000025410 T0000000000000025590  utsk_set_global_dw_stat, begin, msg_dseq:1760155593
2025-10-11 16:02:31.190 [INFO] database P0000025410 T0000000000000025590  set g_dw_stat from NONE to MON_LOGIN_CHECK success, g_dw_recover_stop is 0
2025-10-11 16:02:31.190 [INFO] database P0000025410 T0000000000000025590  utsk_set_global_dw_stat, finished, msg_dseq:1760155593, set code:0
2025-10-11 16:02:31.316 [INFO] database P0000025410 T0000000000000025590  utsk_cmd_add, cmd info: cmd=13, dseq=1760155594, name_in=, begin_lsn=-1!
2025-10-11 16:02:31.316 [INFO] database P0000025410 T0000000000000025442  utsk_cmd_exec, cmd:13, sys_status:OPEN, dseq:1760155594
2025-10-11 16:02:32.136 [INFO] database P0000025410 T0000000000000025590  utsk_cmd_add, cmd info: cmd=217, dseq=1760155595, name_in=, begin_lsn=-1!
......(省略n行)
2025-10-11 16:02:40.056 [INFO] database P0000025410 T0000000000000025451  mtsk_process_get_lsn_arr_info, mal_post to site(1) (msg_len:326, msg_send code:0), apply lsn filled code:0
2025-10-11 16:02:40.071 [INFO] database P0000025410 T0000000000000025440  rapply_rlog_consecutive_check_for_async, EP[0]'s rpkg->max_lsn[68955] == apply_lsn[68955], ignore this pkg!
2025-10-11 16:02:40.075 [INFO] database P0000025410 T0000000000000025488  Receive mal cmd from ep(GRP1_RT_02) to set local dw_status flag
2025-10-11 16:02:40.075 [INFO] database P0000025410 T0000000000000025488  set g_dw_stat from DW_RECOVERY to NONE success, g_dw_recover_stop is 0
2025-10-11 16:02:41.093 [INFO] database P0000025410 T0000000000000025488  Receive mal cmd from ep(GRP1_RT_02) to set local dw_status flag
2025-10-11 16:02:41.093 [INFO] database P0000025410 T0000000000000025488  set g_dw_stat from NONE to NONE success, g_dw_recover_stop is 0

新主库切换日志

新主库GRP1_RT_02的切换日志如下:

2025-10-11 16:02:32.429 [INFO] database P0000010219 T0000000000000010401  utsk_cmd_add, received sql exec cmd:1, dseq:1760155593, sql:SP_SET_GLOBAL_DW_STATUS(0, 6)
2025-10-11 16:02:32.429 [INFO] database P0000010219 T0000000000000010401  utsk_cmd_add, cmd info: cmd=1, dseq=1760155593, name_in=, begin_lsn=-1!
2025-10-11 16:02:32.429 [INFO] database P0000010219 T0000000000000010401  set g_dw_stat from NONE to MON_SWITCHOVER success, g_dw_recover_stop is 0
2025-10-11 16:02:32.429 [INFO] database P0000010219 T0000000000000010401  utsk_dw_sql exec SP_SET_GLOBAL_DW_STATUS(0, 6), code=0
2025-10-11 16:02:32.549 [INFO] database P0000010219 T0000000000000010401  utsk_cmd_add, received sql exec cmd:1, dseq:1760155594, sql:SP_APPLY_KEEP_PKG()
2025-10-11 16:02:32.549 [INFO] database P0000010219 T0000000000000010401  utsk_cmd_add, cmd info: cmd=1, dseq=1760155594, name_in=, begin_lsn=-1!
2025-10-11 16:02:32.549 [INFO] database P0000010219 T0000000000000010401  utsk_dw_sql, before enqueue, len:0
2025-10-11 16:02:32.549 [INFO] database P0000010219 T0000000000000010401  utsk_dw_sql, after enqueue, len:1
......(省略n行)
2025-10-11 16:02:33.063 [INFO] database P0000010219 T0000000000000010259  hpc_dw_apply_info_check_make, p_db_magic(431410808), n_apply_ep(1), apply_pkg_seq([4986, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), apply_lsn_arr([68864, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), db_magic(1810686012), n_ep(1), apply_info_lsn(68864), pkg_seq_arr([4986, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), file_lsn([68864, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), startup_flag(1)
2025-10-11 16:02:33.066 [INFO] database P0000010219 T0000000000000010259  ohis_ctl_add success, ctl info: rguid:GRP1_RT_02_19, sys_mode:PRIMARY, p_iname:GRP1_RT_01, c_iname:GRP1_RT_02, p_db_magic:431410808, c_db_magic:1810686012, n_ep:1, pkg_seqno:[4986], lsn_arr:[68864]
......(省略n行)
2025-10-11 16:18:38.606 [INFO] database P0000010219 T0000000000000010253  mtsk_apply_lsn_info_fill, n_rapply=1, apply_lsn_arr[15]=0!
2025-10-11 16:18:38.606 [INFO] database P0000010219 T0000000000000010253  mtsk_process_get_lsn_arr_info, mal_post to site(0) (msg_len:326, msg_send code:0), apply lsn filled code:0
2025-10-11 16:18:38.619 [INFO] database P0000010219 T0000000000000010251  rapply_rlog_consecutive_check_for_async, EP[0]'s rpkg->max_lsn[69034] == apply_lsn[69034], ignore this pkg!
2025-10-11 16:18:38.622 [INFO] database P0000010219 T0000000000000010295  Receive mal cmd from ep(GRP1_RT_01) to set local dw_status flag
2025-10-11 16:18:38.622 [INFO] database P0000010219 T0000000000000010295  set g_dw_stat from DW_RECOVERY to NONE success, g_dw_recover_stop is 0
2025-10-11 16:18:39.627 [INFO] database P0000010219 T0000000000000010295  Receive mal cmd from ep(GRP1_RT_01) to set local dw_status flag
2025-10-11 16:18:39.628 [INFO] database P0000010219 T0000000000000010295  set g_dw_stat from NONE to NONE success, g_dw_recover_stop is 0

被动切换

被动切换可以有多种方式模拟主库故障,例如宕机、磁盘故障、网络故障等。
以下测试过程用网卡故障模拟主库的网络故障。

测试步骤

  1. 查看集群是否正常状态
  2. 对主库所在机器的网卡直接禁用
  3. 等待片刻查看原备的状态

监视器日志

[GRP1_RT_02]       2025-10-28 22:35:04.980 [INFO] dmwatcher P0000001623 T0000000000000001630  没有收到远程守护进程(GRP1_RT_01)消息,原状态为(OPEN),距上次接收消息时间间隔(11)s, 设置远程守护进程为ERROR状态
[GRP1_RT_02]       2025-10-28 22:35:04.980 [INFO] dmwatcher P0000001623 T0000000000000001630  Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(GRP1_RT_02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) PO
CNT(30) FLSN(73347) CLSN(73347) SLSN(73347) SSLSN(73347)
[GRP1_RT_02]       2025-10-28 22:35:05.030 [INFO] dmwatcher P0000001623 T0000000000000001630  Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(GRP1_RT_01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(3
0) FLSN(73347) CLSN(73347) SLSN(73347) SSLSN(73347)
[monitor]         2025-10-28 22:35:05: dwmon_check_watch_error_low, receive dmwatcher(GRP1_RT_01) msg timeout(far_err_time:10), close port(ip:192.168.57.102, port:52141, vio:5, n_fixed:1)!
[monitor]         2025-10-28 22:35:05: dwmon tcp port vio(5) close, inst_name:GRP1_RT_01, ip:192.168.57.102, port:52141, n_fixed:1.
[monitor]         2025-10-28 22:35:05: <RECEIVE TIMEOUT GRP1_RT_01>
[monitor]         2025-10-28 22:35:05: 接收守护进程(GRP1_RT_01)消息超时
                  WTIME                WSTATUS        INST_OK   INAME            ISTATUS     IMODE     RSTAT    N_OPEN   FLSN            CLSN
                  2025-10-28 22:34:54  ERROR          OK        GRP1_RT_01       OPEN        PRIMARY   VALID    30       73347           73347

[monitor]         2025-10-28 22:35:05: </RECEIVE TIMEOUT GRP1_RT_01>

[monitor]         2025-10-28 22:35:05: dwmon_tcp_recv failed, close port, vio:0, mid:1784258418, errno:0, code:-6007
[monitor]         2025-10-28 22:35:05: dwmon tcp port vio(0) close, inst_name:, ip:192.168.57.102, port:52141, n_fixed:1.
[monitor]         2025-10-28 22:35:05: dwmon_inst_can_takeover_by_prim_info, check_pstat = 1, n_pinfo = 1, but ok_cnt = 0, return FALSE!
[monitor]         2025-10-28 22:35:05: [!!! 实例GRP1_RT_01[PRIMARY, OPEN, ISTAT_SAME:TRUE]仍然处于活动状态,或者模式/状态或LSN值不符合被其他实例自动接管条件 !!!]

[monitor]         2025-10-28 22:35:05: [!!! Check primary[GRP1_RT_01] auto_failover is TRUE in watcher error, but find no standby instance can do takeover !!!]
[monitor]         2025-10-28 22:35:07: dwmon_inst_can_takeover_by_prim_info, primary name: GRP1_RT_01, instance name: GRP1_RT_02, n_pinfo = 1, ok_cnt = 1, return TRUE!
[monitor]         2025-10-28 22:35:07: [!!! 实例GRP1_RT_01[PRIMARY, OPEN, ISTAT_SAME:TRUE]故障,实例GRP1_RT_02[STANDBY, OPEN, ISTAT_SAME:TRUE]符合自动接管条件 !!!]

[monitor]         2025-10-28 22:35:07: dwmon_grp_auto_takeover_info_get, get tkov info of GRP1 success, n_inst = 1.
[monitor]         2025-10-28 22:35:07: 检测到PRIMARY实例故障,开始对组(GRP1)执行自动接管

[monitor]         2025-10-28 22:35:07: <AUTO TAKEOVER GRP1_RT_02>
[monitor]         2025-10-28 22:35:07: 通知组(GRP1)当前活动的守护进程设置MID
[monitor]         2025-10-28 22:35:07: Begin to wait site(GRP1_RT_02) complete...
[monitor]         2025-10-28 22:35:07: [!!! dwmon_tcp_msg_send, get master tcp port failed, send cmd msg(cmd:5, name_sendto:GRP1) to dmwatcher() !!!]
[monitor]         2025-10-28 22:35:07: Wait site(GRP1_RT_02) finished, code=0!
[monitor]         2025-10-28 22:35:07: 通知组(GRP1)当前活动的守护进程设置MID成功
[monitor]         2025-10-28 22:35:07: 开始使用实例GRP1_RT_02接管
[monitor]         2025-10-28 22:35:07: 通知守护进程GRP1_RT_02切换TAKEOVER状态
[monitor]         2025-10-28 22:35:07: 守护进程(GRP1_RT_02)状态切换 [OPEN-->TAKEOVER]
[monitor]         2025-10-28 22:35:07: 切换守护进程GRP1_RT_02为TAKEOVER状态成功
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02开始执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句成功
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02开始执行SP_APPLY_KEEP_PKG()语句
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02执行SP_APPLY_KEEP_PKG()语句成功
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02开始执行ALTER DATABASE MOUNT语句
[monitor]         2025-10-28 22:35:07: Begin to wait site(GRP1_RT_02) complete...
[monitor]         2025-10-28 22:35:07: Wait site(GRP1_RT_02) finished, code=0!
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02执行ALTER DATABASE MOUNT语句成功
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02开始执行ALTER DATABASE PRIMARY语句
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02执行ALTER DATABASE PRIMARY语句成功
[monitor]         2025-10-28 22:35:07: 通知实例GRP1_RT_02修改所有归档状态无效
[monitor]         2025-10-28 22:35:07: 修改所有实例归档为无效状态成功
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02开始执行ALTER DATABASE OPEN FORCE语句
[monitor]         2025-10-28 22:35:07: ohis_inst_info_copy_low, inst(GRP1_RT_02) apply info changed, old info[p_db_magic:431410808, n_apply_ep:1], new info to set[p_db_magic:1810686012, n_apply_ep:0]!
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02执行ALTER DATABASE OPEN FORCE语句成功
[monitor]         2025-10-28 22:35:07: 实例GRP1_RT_02开始执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句
[monitor]         2025-10-28 22:35:08: 实例GRP1_RT_02执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句成功
[monitor]         2025-10-28 22:35:08: 通知守护进程GRP1_RT_02切换OPEN状态
[monitor]         2025-10-28 22:35:08: 守护进程(GRP1_RT_02)状态切换 [TAKEOVER-->OPEN]
[monitor]         2025-10-28 22:35:08: 切换守护进程GRP1_RT_02为OPEN状态成功
[monitor]         2025-10-28 22:35:08: 通知组(GRP1)的守护进程执行清理操作
[monitor]         2025-10-28 22:35:08: Notify instance(GRP1_RT_02) to clear monitor info and wait complete!
[monitor]         2025-10-28 22:35:08: Begin to wait site(GRP1_RT_02) complete...
[monitor]         2025-10-28 22:35:08: dwmon_cmd_msg_send_low failed, get tcp_port failed(inst_name:, ip:192.168.57.102, port:52141)!
[monitor]         2025-10-28 22:35:08: [!!! dwmon_tcp_msg_send to master tcp_port failed, code:-6010, (inst_name:, ip:192.168.57.102, port:52141, vio:0) !!!]
[monitor]         2025-10-28 22:35:08: 清理守护进程(GRP1_RT_02)请求成功
[monitor]         2025-10-28 22:35:08: Wait site(GRP1_RT_02) finished, code=0!
[monitor]         2025-10-28 22:35:08: 使用实例GRP1_RT_02接管成功

[monitor]         2025-10-28 22:35:08: </AUTO TAKEOVER GRP1_RT_02>

[monitor]         2025-10-28 22:35:08: 组(GRP1)使用实例GRP1_RT_02自动接管成功

监视器主要执行过程如下:

  1. 检测到主库实例GRP1_RT_01故障,备库实例GRP1_RT_02符合自动接管条件
  2. 通知组(GRP1)当前活动的守护进程设置MID
  3. 通知实例GRP1_RT_02切换TAKEOVER状态
  4. 实例GRP1_RT_02执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句
  5. 实例GRP1_RT_02执行SP_APPLY_KEEP_PKG()语句
  6. 实例GRP1_RT_02执行ALTER DATABASE MOUNT语句,再执行ALTER DATABASE PRIMARY语句
  7. 实例GRP1_RT_02修改所有归档状态无效
  8. GRP1_RT_02执行ALTER DATABASE OPEN FORCE语句
  9. GRP1_RT_02执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句
  10. 守护进程GRP1_RT_02切换OPEN状态
  11. 通知组(GRP1)的守护进程执行清理操作
  12. 组(GRP1)使用实例GRP1_RT_02自动接管成功

整个切换过程耗时4秒左右

原主库守护进程切换日志

2025-10-28 22:35:05.318 [INFO] dmwatcher P0000001621 T0000000000000001626  没有收到远程守护进程(GRP1_RT_02)消息,原状态为(OPEN),距上次接收消息时间间隔(11)s, 设置远程守护进程为ERROR状态
2025-10-28 22:35:05.318 [INFO] dmwatcher P0000001621 T0000000000000001626  Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(GRP1_RT_01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(30) FLSN(73347) CLSN(73347) SLSN(73347) SSLSN(73347)
2025-10-28 22:35:05.369 [INFO] dmwatcher P0000001621 T0000000000000001626  Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(GRP1_RT_02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(30) FLSN(73347) CLSN(73347) SLSN(73347) SSLSN(73347)
2025-10-28 22:35:05.419 [WARNING] dmwatcher P0000001621 T0000000000000001626  Tcp port from dmwatcher vio(6) close, ip:192.168.57.103, port:52142, mid(-1), is confirm(FALSE), from name:GRP1_RT_02
2025-10-28 22:35:05.420 [INFO] dmwatcher P0000001621 T0000000000000001626  dw2_tcp_port_close_low, reset report_err to TRUE, ip:192.168.57.103, port:52142!
2025-10-28 22:35:05.420 [INFO] dmwatcher P0000001621 T0000000000000002005  dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:0, mid:-1, from_flag:0, from_name:GRP1_RT_02, dw_closed:1, ip:192.168.57.103, port:52142, errno:0, error:Failure occurs in data_recv_inet_once, code(0) len(8128, 0), code:-6007
2025-10-28 22:35:05.420 [INFO] dmwatcher P0000001621 T0000000000000002005  dw2_tcp_port_close_low, reset report_err to TRUE, ip:192.168.57.103, port:52142!
2025-10-28 22:35:05.420 [INFO] dmwatcher P0000001621 T0000000000000002004  dw2_tcp_port_close_low, reset report_err to TRUE, ip:192.168.57.103, port:52142!
2025-10-28 22:35:11.212 [ERROR] dmwatcher P0000001621 T0000000000000001628  Can't connect to DM server on '192.168.57.103' port(52142) errno(115)
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000001626  Tcp msg send over (10)s, close tcp port(from_flag: 1, vio: 7)
2025-10-28 22:35:35.338 [WARNING] dmwatcher P0000001621 T0000000000000001626  Tcp port from dmmonitor vio(7) close, ip:::ffff:192.168.57.103, port:50950, mid(1784258418), is confirm(TRUE), from name:dmmonitor
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000001626  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000002007  dw2_group_send_tcp_msg failed, from flag:1, from name:dmmonitor, close tcp port
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000002007  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000002007  dw2_group_info_send, send to port(from flag:1, from name:dmmonitor, ip:::ffff:192.168.57.103) over 5s, code:-6007!
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000002007  dw2_group_tcp_send_thread, code:-6007, from flag:1, from name:dmmonitor, ip:::ffff:192.168.57.103
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000002007  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000002008  dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:0, mid:-1, from_flag:1, from_name:dmmonitor, dw_closed:1, ip:::ffff:192.168.57.103, port:50950, errno:0, error:Failure occurs in data_recv_inet_once, code(0) len(8128, 0), code:-6007
2025-10-28 22:35:35.338 [INFO] dmwatcher P0000001621 T0000000000000002008  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!

原主库守护进程将自己的状态置为ERROR状态,并不断尝试连接备库实例

新主库守护进程切换日志

2025-10-28 22:35:04.980 [INFO] dmwatcher P0000001623 T0000000000000001630  没有收到远程守护进程(GRP1_RT_01)消息,原状态为(OPEN),距上次接收消息时间间隔(11)s, 设置远程守护进程为ERROR状态
2025-10-28 22:35:04.980 [INFO] dmwatcher P0000001623 T0000000000000001630  Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(GRP1_RT_02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(30) FLSN(73347)
 CLSN(73347) SLSN(73347) SSLSN(73347)
2025-10-28 22:35:05.030 [INFO] dmwatcher P0000001623 T0000000000000001630  Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(GRP1_RT_01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(30) FLSN(73347) CLSN
(73347) SLSN(73347) SSLSN(73347)
2025-10-28 22:35:05.081 [WARNING] dmwatcher P0000001623 T0000000000000001630  Tcp port from dmwatcher vio(4) close, ip:::ffff:192.168.57.102, port:55472, mid(0), is confirm(FALSE), from name:GRP1_RT_01
2025-10-28 22:35:05.081 [INFO] dmwatcher P0000001623 T0000000000000001630  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!
2025-10-28 22:35:05.081 [INFO] dmwatcher P0000001623 T0000000000000001643  dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:0, mid:-1, from_flag:0, from_name:GRP1_RT_01, dw_closed:1, ip:::
......(省略n行)
dseq:3931742076, code:100, status:1, mon_ip:::ffff:192.168.57.103, rarch_status:0, processed:0, sql:
2025-10-28 22:35:08.018 [INFO] dmwatcher P0000001623 T0000000000000001630  收到监视器通知守护进程切换OPEN状态命令
2025-10-28 22:35:08.018 [INFO] dmwatcher P0000001623 T0000000000000001630  After monitor execute SWITCHOVER/TAKEOVER/OPEN INSTANCE command, reset all standby instance recover info
2025-10-28 22:35:08.018 [INFO] dmwatcher P0000001623 T0000000000000001630  switch sub_state to sub_stat_start!
2025-10-28 22:35:08.018 [INFO] dmwatcher P0000001623 T0000000000000001630  设置GRP1守护进程为OPEN(SUB:STARTUP)状态
2025-10-28 22:35:08.069 [INFO] dmwatcher P0000001623 T0000000000000001630  dw2_set_mon_info_code_low, set code:0, current dseq:3931742076, last_code:100, mid:1784258418!
2025-10-28 22:35:08.127 [INFO] dmwatcher P0000001623 T0000000000000001891  dw2_group_reset_mon_info, in dw2 process mon clear, from mid:1784258418, old mid:1784258418, new mid:0
2025-10-28 22:35:08.127 [INFO] dmwatcher P0000001623 T0000000000000001891  收到监视器CLEAR命令
2025-10-28 22:35:08.127 [INFO] dmwatcher P0000001623 T0000000000000001891  监视器CLEAR命令执行成功

主要步骤如下:

  1. 检测主库GRP1_RT_01通信超时
  2. 向GRP1_RT_02报告本地实例状态
  3. 关闭到主库实例和主库守护进程的TCP连接
  4. 接收监视器设置MID
  5. 将自己切换到TAKEOVER状态
  6. 向GRP1_RT_02执行SP_SET_GLOBAL_DW_STATUS(0, 7)
  7. 向GRP1_RT_02执行SP_APPLY_KEEP_PKG()执行ALTER DATABASE MOUNT后再执行ALTER DATABASE PRIMARY
  8. 修改GRP1_RT_02的归档状态为无效状态
  9. 向GRP1_RT_02执行ALTER DATABASE OPEN FORCE
  10. 向GRP1_RT_02执行SP_SET_GLOBAL_DW_STATUS(7, 0)
  11. 将自己切换回OPEN状态
  12. 收到监视器CLEAR命令执行清理操作

原主库切换日志

2025-10-28 22:35:09.941 [INFO] database P0000001613 T0000000000000001948  mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:10
2025-10-28 22:35:09.941 [WARNING] database P0000001613 T0000000000000001948  [!!!DSC INFO!!!]Network connect to IP address 192.168.57.103 has been disconnected.
2025-10-28 22:35:09.941 [WARNING] database P0000001613 T0000000000000001948  [!!!DSC INFO!!!]Site(0) or site(1) interface with IP address no longer running or dsc_ep service on mal_site[192.168.57.103:61142] has broken.It cause network disconnection.
2025-10-28 22:35:10.941 [INFO] database P0000001613 T0000000000000001948  mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:1
2025-10-28 22:35:10.941 [WARNING] database P0000001613 T0000000000000001948  [!!!DSC INFO!!!]Network connect to IP address 192.168.57.103 has been disconnected.
2025-10-28 22:35:10.941 [WARNING] database P0000001613 T0000000000000001948  [!!!DSC INFO!!!]Site(0) or site(1) interface with IP address no longer running or dsc_ep service on mal_site[192.168.57.103:61142] has broken.It cause network disconnection.
2025-10-28 22:35:10.941 [INFO] database P0000001613 T0000000000000001948  mal_site_link_check from mal_site(0) to mal_site(1) failed, call mal_site_port_close!
2025-10-28 22:35:10.941 [WARNING] database P0000001613 T0000000000000001948  MAL mgr thrd site(0) lost connect to site(1), ctl_handle(53), data_handle(54), dsc_handle(0)
2025-10-28 22:35:10.941 [WARNING] database P0000001613 T0000000000000001942  mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again
2025-10-28 22:35:10.941 [INFO] database P0000001613 T0000000000000001942  send  CMD_MAL_LINK_CHECK(350): (mal_id:12044914, stmt_id:65676, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1)
2025-10-28 22:35:10.942 [WARNING] database P0000001613 T0000000000000001943  mal_site_letter_recv code=-6007, errno=0, site(0) recv from site(1) failed, socket handle = 0
2025-10-28 22:35:10.942 [WARNING] database P0000001613 T0000000000000001943  MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2025-10-28 22:35:10.942 [WARNING] database P0000001613 T0000000000000001944  mal_site_letter_recv code=-6007, errno=0, site(0) recv from site(1) failed, socket handle = 0
2025-10-28 22:35:10.942 [WARNING] database P0000001613 T0000000000000001943  site(0) ctl_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2025-10-28 22:35:10.942 [WARNING] database P0000001613 T0000000000000001944  MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2025-10-28 22:35:10.942 [WARNING] database P0000001613 T0000000000000001944  site(0) data_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2025-10-28 22:35:10.942 [WARNING] database P0000001613 T0000000000000001941  mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again
2025-10-28 22:35:10.942 [INFO] database P0000001613 T0000000000000001941  send  CMD_MAL_LINK_CHECK(350): (mal_id:12044914, stmt_id:65677, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:175)
2025-10-28 22:35:10.942 [INFO] database P0000001613 T0000000000000001941  send  CMD_MAL_LINK_CHECK(350): (mal_id:14205394, stmt_id:65679, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1)
2025-10-28 22:35:10.942 [INFO] database P0000001613 T0000000000000001941  send  CMD_MAL_LINK_CHECK(350): (mal_id:12044914, stmt_id:65680, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:138)
2025-10-28 22:35:26.059 [ERROR] database P0000001613 T0000000000000001948  Can't connect to DM server on '192.168.57.103' port(61142) errno(115)

主要步骤如下:

  1. 检测备库GRP1_RT_01通信超时
  2. 关闭MAL连接,并不断尝试重连
  3. 检测到监视器连接超时,关闭监视器的连接

新主库切换日志

2025-10-28 22:35:05.955 [INFO] database P0000001619 T0000000000000001959  mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:10
2025-10-28 22:35:05.955 [WARNING] database P0000001619 T0000000000000001959  [!!!DSC INFO!!!]Network connect to IP address 192.168.57.102 has been disconnected.
2025-10-28 22:35:05.955 [WARNING] database P0000001619 T0000000000000001959  [!!!DSC INFO!!!]Site(1) or site(0) interface with IP address no longer running or dsc_ep service on mal_site[192.168.57.102:61141] ha
s broken.It cause network disconnection.
2025-10-28 22:35:06.955 [INFO] database P0000001619 T0000000000000001959  mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:1
2025-10-28 22:35:06.955 [WARNING] database P0000001619 T0000000000000001959  [!!!DSC INFO!!!]Network connect to IP address 192.168.57.102 has been disconnected.
2025-10-28 22:35:06.956 [WARNING] database P0000001619 T0000000000000001959  [!!!DSC INFO!!!]Site(1) or site(0) interface with IP address no longer running or dsc_ep service on mal_site[192.168.57.102:61141] ha
s broken.It cause network disconnection.
2025-10-28 22:35:06.956 [INFO] database P0000001619 T0000000000000001959  mal_site_link_check from mal_site(1) to mal_site(0) failed, call mal_site_port_close!
2025-10-28 22:35:06.956 [WARNING] database P0000001619 T0000000000000001959  MAL mgr thrd site(1) lost connect to site(0), ctl_handle(8), data_handle(9), dsc_handle(0)
......(省略n行)
2025-10-28 22:35:07.921 [INFO] database P0000001619 T0000000000000001693  altdb open pseg recv end, code = 0!
2025-10-28 22:35:07.921 [INFO] database P0000001619 T0000000000000001693  nsvr_process_before_open begin.
2025-10-28 22:35:07.937 [INFO] database P0000001619 T0000000000000001693  nsvr_process_before_open success.
2025-10-28 22:35:07.937 [INFO] database P0000001619 T0000000000000001693  altdb open rlog_flush_notify_ex start!
2025-10-28 22:35:07.937 [INFO] database P0000001619 T0000000000000001693  altdb open rlog_flush_notify_ex end!
2025-10-28 22:35:07.937 [INFO] database P0000001619 T0000000000000001693  alter database open success!
2025-10-28 22:35:07.937 [INFO] database P0000001619 T0000000000000001693  utsk_process_dw_sql_exec, prepare sql:ALTER DATABASE OPEN FORCE, code:0, dseq:1761661753
2025-10-28 22:35:07.937 [INFO] database P0000001619 T0000000000000001693  hpc_clear_active_trx adjust n_crash_active_trx from 0 to 0.
2025-10-28 22:35:07.941 [INFO] database P0000001619 T0000000000000001977  purge over, total 0 trx, 0 pages purged
2025-10-28 22:35:07.941 [INFO] database P0000001619 T0000000000000001977  n_roll_trx:0, n_purged_trx:0, n_active_trx:0, n_cmt_trx:0, n_pre_cmt_trx:0, n_to_release_trx:0
2025-10-28 22:35:07.966 [INFO] database P0000001619 T0000000000000002011  utsk_cmd_add, received sql exec cmd:1, dseq:1761661754, sql:SP_SET_GLOBAL_DW_STATUS(7, 0)
2025-10-28 22:35:07.966 [INFO] database P0000001619 T0000000000000002011  utsk_cmd_add, cmd info: cmd=1, dseq=1761661754, name_in=, begin_lsn=-1!
2025-10-28 22:35:07.966 [INFO] database P0000001619 T0000000000000002011  set g_dw_stat from MON_TAKEOVER to NONE success, g_dw_recover_stop is 0
2025-10-28 22:35:07.966 [INFO] database P0000001619 T0000000000000002011  utsk_dw_sql exec SP_SET_GLOBAL_DW_STATUS(7, 0), code=0
2025-10-28 22:35:08.118 [INFO] database P0000001619 T0000000000000001689  trx4_min_tid_collect set min_active_id_opt, min_active_id: 158450, first_tid: 158445

主要步骤如下:

  1. 检测MAL链路故障,尝试连接失败后关闭MAL连接
  2. 执行守护进程发送过来的SP_SET_GLOBAL_DW_STATUS(0, 7)语句
  3. 执行守护进程发送过来的SP_APPLY_KEEP_PKG()语句
  4. 执行守护进程发送过来的ALTER DATABASE MOUNT语句,事务回滚和日志刷新
  5. 执行守护进程发送过来的ALTER DATABASE PRIMARY语句, 重做日志初始化
  6. 修改原主库GRP1_RT_01的归档状态为INVALID
  7. 执行守护进程发送过来的ALTER DATABASE OPEN FORCE
  8. 执行守护进程发送过来的SP_SET_GLOBAL_DW_STATUS(7, 0)语句
  9. 切换过程完成

总结

主备模式下的监视器还有很多高可用功能需要深入去了解,例如探测主库备库是否为在线状态机制、网络不稳定状态下的主备切换过程、如何防止脑裂问题等,需要在实际使用过程中去探究深层次的原理进行后续实践问题解决。

评论
后发表回复

作者

文章

阅读量

获赞

扫一扫
联系客服