
Primary database instance went down

醉酒方知浓 2025/08/14 81 1 Resolved

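To speed things up, please provide the following information when asking; clearly described problems get priority.
[DM version]: 03134284336-20250319-265200-20132 Pack9
[Operating system]: Kylin V10
[CPU]: Kunpeng 920
[Problem description]*: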
2025-08-14 19:22:50.268 [INFO] database P0000013116 T0000000000000013284 ckpt2_log_adjust: full_status: 160, ptx_reserved: 0
2025-08-14 19:22:50.268 [INFO] database P0000013116 T0000000000000013284 ckpt2_log_adjust: ckpt_lsn(1401472), ckpt_fil(0), ckpt_off(667496448), cur_lsn(1401476), l_next_seq(147701), g_next_seq(147701), cur_free(667504640), total_space(8589926400), used_space(8192), free_space(8589918208), n_ep(1), db_open_id(3)
2025-08-14 19:22:50.268 [INFO] database P0000013116 T0000000000000013284 checkpoint end, 3 pages flushed, used_space[8192], free_space[8589918208].
2025-08-14 19:23:40.085 [INFO] database P0000013116 T0000000000000013271 socket_err_should_retry errno:110
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013271 mal_site_letter_recv code=-6007, errno=110, site(0) recv from site(1) failed, socket handle = 10
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013271 MAL receive site(0) lost connect to site(1), ctl_handle(10), data_handle(11), dsc_handle(0)
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013270 mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again
2025-08-14 19:23:40.086 [INFO] database P0000013116 T0000000000000013270 send CMD_MAL_LINK_CHECK(350): (mal_id:4224841870, stmt_id:429862, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:1, dest_site:1, build_time:-1)
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013272 mal_site_letter_recv code=-6007, errno=0, site(0) recv from site(1) failed, socket handle = 0
2025-08-14 19:23:40.086 [ERROR] database P0000013116 T0000000000000013285 self_site(0) to dest_site(1) port_closed, return EC_MAL_LINK_LOST
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013272 MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2025-08-14 19:23:40.086 [ERROR] database P0000013116 T0000000000000013285 [mal recv for arch] mal receive from site(DM2) failed, begin lsn:1401533, end lsn:1401545, code:-6021
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013272 site(0) data_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2025-08-14 19:23:40.086 [ERROR] database P0000013116 T0000000000000013285 send realtime archive to instance[DM2] failed, code = -6021, begin_lsn = 1401533, end_lsn = 1401545!
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013269 mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013271 site(0) ctl_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2025-08-14 19:23:40.086 [INFO] database P0000013116 T0000000000000013285 rlog4_process_arch_failed, need_suspend:1
2025-08-14 19:23:40.086 [INFO] database P0000013116 T0000000000000013285 rlog4_process_arch_failed, reset req_ep_arr and res_ep_arr.
2025-08-14 19:23:40.086 [ERROR] database P0000013116 T0000000000000013137 self_site(0) to dest_site(1) port_closed, return EC_MAL_LINK_LOST
2025-08-14 19:23:40.093 [INFO] database P0000013116 T0000000000000013285 Send archive log to remote instance failed, switch all ep to SUSPEND status success!
2025-08-14 19:24:17.045 [ERROR] database P0000013116 T0000000000000013276 Can't connect to DM server on '10.10.1.62' port(5336) errno(115)
2025-08-14 19:24:22.248 [INFO] database P0000013116 T0000000000000013276 mal_site_ctl_link_create startup from mal_site(0) to mal_site(1)!
2025-08-14 19:24:22.248 [INFO] database P0000013116 T0000000000000013276 mal_site_magic_gen site_magic[42577], src_site:0, dst_site:1
2025-08-14 19:24:22.248 [INFO] database P0000013116 T0000000000000013271 mal_site_port_get site_magic:42577, src_site:0, dst_site:1
2025-08-14 19:24:22.248 [INFO] database P0000013116 T0000000000000013270 mal_site_port_get site_magic:42577, src_site:0, dst_site:1
2025-08-14 19:24:22.249 [INFO] database P0000013116 T0000000000000013276 site[0] mal_site_ctl_port_set to site[1, IP: 10.10.1.62, port_num: 5336], socket handle = 10, site_magic = 42577, link_seq = 3
2025-08-14 19:24:22.250 [INFO] database P0000013116 T0000000000000013271 mal site[1]: mal_msg_ver is 128, mal_msg_ver_sub is 131071
2025-08-14 19:24:22.252 [INFO] database P0000013116 T0000000000000013275 mal_site_process_startup received link from ::ffff:10.10.1.62
2025-08-14 19:24:22.252 [INFO] database P0000013116 T0000000000000013275 site[0] mal_site_data_port_set from site[1, IP: 10.10.1.62, port_num: 5336], socket handle = 11, site_magic = 42577, link_seq = 3
2025-08-14 19:24:22.252 [INFO] database P0000013116 T0000000000000013272 mal_site_port_get site_magic:42577, src_site:0, dst_site:1
2025-08-14 19:24:22.252 [INFO] database P0000013116 T0000000000000013269 mal_site_port_get site_magic:42577, src_site:0, dst_site:1
2025-08-14 19:24:24.204 [INFO] database P0000013116 T0000000000000013372 utsk_dw_sql_exec, exec sql SHUTDOWN ABORT!
2025-08-14 19:24:24.204 [FATAL] database P0000013116 T0000000000000013372 [for dem]SYSTEM SHUTDOWN ABORT.
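
To read the sequence of events more easily, the log above can be reduced to its non-INFO lines plus the time between the first warning and the FATAL line. This is only a minimal sketch in Python. It assumes every line follows the `date time [LEVEL] database P<pid> T<tid> message` layout seen in the excerpt. The names `parse_dm_log` and `seconds_to_fatal` are made up for illustration and are not part of any DM tool.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt:
# "<date> <time> [LEVEL] database P<pid> T<tid> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>[A-Z]+)\] database P\d+ T(?P<tid>\d+) (?P<msg>.*)$"
)

def parse_dm_log(text):
    """Return one dict per line that matches the assumed layout."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip blank or differently shaped lines
        events.append({
            "ts": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f"),
            "level": m["level"],
            "tid": int(m["tid"]),
            "msg": m["msg"],
        })
    return events

def seconds_to_fatal(events):
    """Seconds from the first WARNING/ERROR line to the first FATAL line, or None."""
    problem = next((e for e in events if e["level"] in ("WARNING", "ERROR")), None)
    fatal = next((e for e in events if e["level"] == "FATAL"), None)
    if problem is None or fatal is None:
        return None
    return (fatal["ts"] - problem["ts"]).total_seconds()

# Four lines copied from the log above.
SAMPLE = """\
2025-08-14 19:23:40.085 [INFO] database P0000013116 T0000000000000013271 socket_err_should_retry errno:110
2025-08-14 19:23:40.086 [WARNING] database P0000013116 T0000000000000013271 mal_site_letter_recv code=-6007, errno=110, site(0) recv from site(1) failed, socket handle = 10
2025-08-14 19:24:24.204 [INFO] database P0000013116 T0000000000000013372 utsk_dw_sql_exec, exec sql SHUTDOWN ABORT!
2025-08-14 19:24:24.204 [FATAL] database P0000013116 T0000000000000013372 [for dem]SYSTEM SHUTDOWN ABORT.
"""

events = parse_dm_log(SAMPLE)
for e in events:
    if e["level"] != "INFO":
        print(e["ts"].time(), e["level"], e["msg"])
print("seconds from first warning to FATAL:", seconds_to_fatal(events))
```

On these four lines the gap comes out at roughly 44 seconds. The sketch only orders what the log already says: the MAL link to site(1) fails with errno 110 (ETIMEDOUT on Linux), and the `SHUTDOWN ABORT` is logged by `utsk_dw_sql_exec` immediately before the FATAL line. It does not by itself establish the cause.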

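This is a primary/standby cluster. The primary instance went down, the watcher (dmwatcher) on the primary did not restart the primary instance, and a primary/standby switchover took place.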

This is dmwatcher.ini:

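DW_TYPE                  = GLOBAL  #global watch type
DW_MODE                  = AUTO  #MANUAL: manual switchover on failure; AUTO: automatic switchover on failure
DW_ERROR_TIME            = 60  #time after which a remote watcher is considered failed
INST_ERROR_TIME          = 30  #time after which the local instance is considered failed
INST_RECOVER_TIME        = 60  #interval at which the primary's watcher starts recovery
INST_OGUID               = 453331  #unique OGUID value of the watch system
INST_INI                 = /dmdata/DAMENG/dm.ini  #path to dm.ini
INST_AUTO_RESTART        = 1  #enable automatic restart of the instance
INST_STARTUP_CMD         = /home/dmdba/dmdbms/bin/DmServiceDM1  #start from the command line
RLOG_SEND_THRESHOLD      = 0  #time threshold for the primary sending logs to the standby, off by default
RLOG_APPLY_THRESHOLD     = 0  #time threshold for the standby replaying logs, off by default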
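
When looking into why the watcher did not restart the instance, one basic check is whether the restart-related parameters are what they appear to be and whether the file named in `INST_STARTUP_CMD` exists and is executable on the primary host. Below is a rough sketch in Python. It assumes the file consists of `KEY = VALUE  #comment` lines as shown above. `read_watcher_params` and `startup_cmd_is_executable` are hypothetical helpers, not DM utilities.

```python
import os

def read_watcher_params(text):
    """Parse 'KEY = VALUE  #comment' lines into a dict of strings."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line or line.startswith("["):
            continue  # skip blank lines and section headers, if any
        key, sep, value = line.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params

def startup_cmd_is_executable(params):
    """True if the first word of INST_STARTUP_CMD is an executable file."""
    cmd = params.get("INST_STARTUP_CMD", "")
    if not cmd:
        return False
    program = cmd.split()[0]
    return os.path.isfile(program) and os.access(program, os.X_OK)

# Three lines taken from the dmwatcher.ini above.
SAMPLE = """\
INST_ERROR_TIME          = 30  #local instance failure time
INST_AUTO_RESTART        = 1  #automatic instance restart
INST_STARTUP_CMD         = /home/dmdba/dmdbms/bin/DmServiceDM1  #startup command
"""

params = read_watcher_params(SAMPLE)
for name in ("INST_AUTO_RESTART", "INST_STARTUP_CMD", "INST_ERROR_TIME"):
    print(name, "=", params.get(name))
print("startup command executable here:", startup_cmd_is_executable(params))
```

The last line is only meaningful when run on the primary host as the user that runs dmwatcher; elsewhere the path will usually not exist.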

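Question 1: What caused the instance to go down?
Question 2: After the primary went down, why did the watcher not restart it automatically?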

Answers 0
No answers yet