
Simulating and Repairing ASM Disk Header Corruption

DM_271235 2024/03/27

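1. Overview
In a shared-storage cluster, a corrupted ASM disk header with no backup of the header information leaves the database unable to start. The only way out is to rebuild the database and restore it from a backup, which takes a long time for a production database holding a large amount of data. Automatic backup of ASM disk headers is therefore important, because it greatly reduces the impact of header corruption. Newer DM 8 releases back up ASM disk metadata automatically, and a corrupted disk header can be repaired from that automatic backup. This article simulates ASM disk header corruption and repairs it from the automatically backed-up metadata. At present, repair is only guaranteed when the damage is confined to the disk header, that is, the first 1024 bytes of AU 0.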
2. DSC cluster test environment
[Figure: DSC cluster test environment; image not included]
3. Using the disk header tool (dmasmmgt)

[dmdba@dmdsc01 bin]$ ./dmasmmgt
dmasmmgt V8
version: 03134284132-20231226-213242-20081
ASM>help
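Purpose: Specify a disk path and show the disk information
Syntax: show path (op=[DISK/ATTR/GROUP/ALL])
Example: show /dev_data/asmdisks/disk0.asm
Example: show /dev_data/asmdisks/disk0.asm op=disk
Purpose: Manually back up the disk header and disk group header information to a local file
Syntax: backup disk_path to local_path
Example: backup /dev_data/asmdisks/disk0.asm to /localdir/disk0.txt
Purpose: Back up the binary contents of the disk (bytes 0 to size) to a local file; size is in bytes
Syntax: backup disk_path bdata to local_path size=k
Example: backup /dev_data/asmdisks/disk0.asm bdata to /localdir/disk0.bin size=1048576
Purpose: Restore the 1 KB of disk header metadata from a local txt backup file
Syntax: repair disk_path from local_path
Example: repair /dev_data/asmdisks/disk0.asm from /localdir/disk0.txt
Purpose: Restore j bytes of disk header metadata, starting at offset i, from a local metadata file
Syntax: repair disk_path from local_path start=i size=j
Example: repair /dev_data/asmdisks/disk0.asm from /localdir/disk0.bin start=0 size=512
Purpose: Restore j bytes of disk header metadata, starting at offset i, from the disk's automatic backup
Syntax: repair disk_path start=i size=j
Example: repair /dev_data/asmdisks/disk0.asm start=0 size=512
Purpose: Restore j bytes of the disk header backup, starting at offset i, from the disk header information
Syntax: repair disk_path mirror start=i size=j
Example: repair /dev_data/asmdisks/disk0.asm mirror start=0 size=512
Purpose: Check whether the disk header metadata is corrupted
Syntax: check disk_path
Example: check /dev_data/asmdisks/disk0.asm
Purpose: Check whether the disk header mirror is corrupted
Syntax: check mirror disk_path
Example: check mirror /dev_data/asmdisks/disk0.asm
Purpose: Check whether a local txt backup file of the disk is corrupted
Syntax: check local disk_path
Example: check local /dev_data/asmdisks/disk0.txt
Purpose: Check whether a local binary backup file of the disk is corrupted
Syntax: check bin disk_path
Example: check bin /localdir/disk0.bin
Note: Type exit to quit the tool
Note: Every file path above may be either a single file path or a search path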

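4. Preparing the test scenario
4.1. Back up the database before simulating header corruption

Take a full backup of the database first, so that it can still be restored if the ASM disk header corruption leaves it unusable.
[dmdba@dmdsc01 bin]$ ./disql SYSDBA/<password>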
SQL> BACKUP DATABASE TO FULL_BAK BACKUPSET  '/dm/dmback/BAKFULL_20240321'
DEVICE TYPE DISK BACKUPINFO 'db_fullbak' COMPRESSED LEVEL 2 PARALLEL 4;
Verify the backup set:
[dmdba@dmdsc01 bin]$ ./dmrman 
dmrman V8
RMAN> CHECK BACKUPSET '/dm/dmback/BAKFULL_20240321';
CHECK BACKUPSET '/dm/dmback/BAKFULL_20240321';
[Percent:100.00%][Speed:0.00M/s][Cost:00:00:00][Remaining:00:00:00]                                 
check backupset successfully.
time used: 182.911(ms)

4.2. Verify that the ASM disk headers and their automatically backed-up mirrors are usable

ASM>check mirror /dev/asm/
file path:[/dev/asm/asm-dmdata01], the content of file head is correct!
file path:[/dev/asm/asm-dmdata01], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmredo], the content of file head is correct!
file path:[/dev/asm/asm-dmredo], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmdata02], the content of file head is correct!
file path:[/dev/asm/asm-dmdata02], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmvote], the content of file head is correct!
file path:[/dev/asm/asm-dmvote], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmdcr], the content of file head is correct!
file path:[/dev/asm/asm-dmdcr], the content of file head mirror is correct!
Used time: 64.075(ms).
The ASMMGT executed "check" end
The check confirms that the ASM disk headers and their automatically backed-up mirrors are intact.

4.3. Simulate continuous writes from the application

CREATE TABLE TEST(C1 INT,C2 VARCHAR2(50),C3 VARCHAR2(1000));
DECLARE
V_SQL VARCHAR2(2000);
BEGIN
V_SQL :='INSERT INTO TEST SELECT LEVEL,DBMS_RANDOM.STRING (''U'',50),DBMS_RANDOM.STRING (''U'',500) FROM DUAL CONNECT BY LEVEL <1000;';
FOR I IN 1..100000
LOOP
EXECUTE IMMEDIATE V_SQL;
COMMIT;
END LOOP;
END;
/
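
As a rough sizing check of the block above, the arithmetic below estimates how much data the loop would write if it ran to completion. It is only a sketch: the one-byte-per-character figure is an assumption, and row and page overhead are ignored. Set against the 3.0G data disks shown later in the multipath output, the result suggests the loop is meant as ongoing background load rather than a job that finishes.

```python
# CONNECT BY LEVEL < 1000 yields LEVEL = 1..999, i.e. 999 rows per INSERT.
rows_per_insert = len(range(1, 1000))
iterations = 100_000                      # FOR I IN 1..100000
total_rows = rows_per_insert * iterations

# Assumption: one byte per character for the two random uppercase strings
# (50 + 500 characters); row and page overhead are ignored.
approx_payload_bytes = total_rows * (50 + 500)

print(rows_per_insert)        # 999
print(total_rows)             # 99900000
print(approx_payload_bytes)   # 54945000000
```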

4.4. Simulate ASM disk header corruption

Use dd to overwrite the first 1024 bytes of every ASM disk with random data.
dd if=/dev/urandom of=/dev/asm/asm-dmdcr bs=1024 count=1 conv=notrunc
dd if=/dev/urandom of=/dev/asm/asm-dmvote bs=1024 count=1 conv=notrunc
dd if=/dev/urandom of=/dev/asm/asm-dmredo bs=1024 count=1 conv=notrunc
dd if=/dev/urandom of=/dev/asm/asm-dmdata01 bs=1024 count=1 conv=notrunc
dd if=/dev/urandom of=/dev/asm/asm-dmdata02 bs=1024 count=1 conv=notrunc
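
To see what `bs=1024 count=1 conv=notrunc` does without touching a real device, the following sketch performs the same in-place overwrite on a throwaway file. The helper name `corrupt_header` and the 4 KiB scratch file are illustrative assumptions, not part of the original procedure.

```python
import os
import tempfile

HEADER_SIZE = 1024  # bytes overwritten by `bs=1024 count=1`


def corrupt_header(path: str, size: int = HEADER_SIZE) -> None:
    """Overwrite the first `size` bytes in place, like dd with conv=notrunc."""
    # "r+b" opens for update without truncating, so the rest of the file survives.
    with open(path, "r+b") as f:
        f.seek(0)
        f.write(os.urandom(size))


# Demonstrate on a throwaway 4 KiB file of zeros, never on a real device.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 4096)
    scratch = tmp.name

corrupt_header(scratch)
with open(scratch, "rb") as f:
    data = f.read()
os.remove(scratch)

print(len(data))                                             # 4096
print(data[HEADER_SIZE:] == b"\x00" * (4096 - HEADER_SIZE))  # True
```

Only the first 1024 bytes change; the file length and everything after the header are left as they were, which is why the damage in this test stays within the range the tool can repair.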

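4.5. Simulate a maintenance job that requires restarting the database cluster
A maintenance job requires the database to be restarted. The steps below simulate restarting the database cluster for that job.
4.5.1. Check the database status before the job
1) Confirm that the database is running normally
[Figure: database running status; image not included]
2) Confirm that reads and writes work before the restart

Create a test table:
SQL>CREATE TABLE TEST2(C1 INT,C2 VARCHAR2(50),C3 VARCHAR2(1000));
Insert test data:
SQL>INSERT INTO TEST2 SELECT LEVEL,DBMS_RANDOM.STRING ('U',50),DBMS_RANDOM.STRING ('U',500) FROM DUAL CONNECT BY LEVEL <1000;
affect rows 999
SQL> commit;
Update a row:
SQL> update test2 set c2='aaccdd' where c1=2;
SQL> commit;
Query the row:
SQL> select c1,c2 from  test2 where c1=2;
LINEID     C1          C2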
---------- ----------- ------
1          2           aaccdd

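4.5.2. Simulate restarting the database cluster for the maintenance job
1) Shut down the database cluster

The maintenance job begins, so shut down the database cluster.
Run a checkpoint before shutting down the database:
[dmdba@dmdsc01 bin]$ ./disql SYSDBA/<password>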
SQL>select checkpoint (100);
$cd /dm/dmdbms/dm8/bin
$./dmcssm ini_path=dmcssm.ini
# Stop the DMSERVER service
ep stop GRP_DSC
# Stop the ASM service
ep stop GRP_ASM
# Stop the DMCSS service (run on both nodes)
./DmCSSServiceCSS stop

2) Start the database cluster after the job completes

# Start the DMCSS service
Node 1:
[dmdba@dmdsc01 bin]$ ./DmCSSServiceCSS start
Starting DmCSSServiceCSS:                                  [ OK ]
Node 2:
[dmdba@dmdsc02 bin]$ ./DmCSSServiceCSS start
Starting DmCSSServiceCSS:                                  [ OK ]
# Start the DMASM service
Node 1:
[dmdba@dmdsc01 bin]$ ./DmASMSvrServiceASM start
Starting DmASMSvrServiceASM:                               [ OK ]
Node 2:
[dmdba@dmdsc02 bin]$ ./DmASMSvrServiceASM start
Starting DmASMSvrServiceASM:                               [ OK ]
# Start the DM database service
Node 1:
[dmdba@dmdsc01 bin]$ ./DmServiceDSC start
Starting DmServiceDSC: failed to connnect dmasmtool(dmasmtoolm), please wait 5 seconds...
failed to connnect dmasmtool(dmasmtoolm), please wait 5 seconds...
failed to connnect dmasmtool(dmasmtoolm).
[ FAILED ]
Starting the database service on node 1 fails with an error that it cannot connect to dmasmtool.
Node 2:
[dmdba@dmdsc02 bin]$ ./DmServiceDSC start
Starting DmServiceDSC: failed to connnect dmasmtool(dmasmtoolm), please wait 5 seconds...
failed to connnect dmasmtool(dmasmtoolm), please wait 5 seconds...
failed to connnect dmasmtool(dmasmtoolm).
[ FAILED ]
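Starting the database service on node 2 fails with the same error: it cannot connect to dmasmtool.
Checks confirm that DMASM and DMSERVER failed to start, while DMCSS started successfully.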

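5. Troubleshooting the database startup failure
Symptom: after the cluster restart for the maintenance job, the database service failed to start on both nodes with the following error.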
failed to connnect dmasmtool(dmasmtoolm), please wait 5 seconds...
5.1. Check the DMSERVER logs

Node 1 DMSERVER log:
2024-03-21 18:07:05.715 [INFO] database P0000003995 T0000000000000003995  INI parameter DPC_2PC changed, the original value 1, new value 0
2024-03-21 18:07:05.756 [INFO] database P0000003995 T0000000000000003995  info get from dcr disk, asm_host = localhost, asm_port = 5436, global_dcr_ep_host = , global_dcr_ep_port = 5236, global_dcr_css_host = 10.10.10.104, global_dcr_css_port = 122810688
2024-03-21 18:07:05.765 [ERROR] database P0000003995 T0000000000000003995  Can't connect to DM server on 'localhost' port(5436) errno(111) -- 5436 is node 1's asm_port (see the previous log line); the connection to localhost:5436 was refused
2024-03-21 18:07:05.765 [ERROR] database P0000003995 T0000000000000003995  os_asm_env_init->g_os_asm_func.os_asm_connect(localhost,5436): [CODE:-11041] ASM connection exception
2024-03-21 18:07:05.766 [FATAL] database P0000003995 T0000000000000003995  dmasm api init failed, [code: -11041]ASM connection exception
The node 2 DMSERVER log shows the same errors:
[dmdba@dmdsc02 log]$ tail -f /dm/dmdbms/dm8/log/dm_DSC1_202403.log
2024-03-21 18:09:53.308 [INFO] database P0000004053 T0000000000000004053  INI parameter DPC_2PC changed, the original value 1, new value 0
2024-03-21 18:09:53.367 [INFO] database P0000004053 T0000000000000004053  info get from dcr disk, asm_host = localhost, asm_port = 5437, global_dcr_ep_host = , global_dcr_ep_port = 5236, global_dcr_css_host = 10.10.10.104, global_dcr_css_port = 122810688
2024-03-21 18:09:53.385 [ERROR] database P0000004053 T0000000000000004053  Can't connect to DM server on 'localhost' port(5437) errno(111) -- 5437 is node 2's asm_port (see the previous log line); the connection to localhost:5437 was refused
2024-03-21 18:09:53.385 [ERROR] database P0000004053 T0000000000000004053  os_asm_env_init->g_os_asm_func.os_asm_connect(localhost,5437): [CODE:-11041] ASM connection exception
2024-03-21 18:09:53.386 [FATAL] database P0000004053 T0000000000000004053  dmasm api init failed, [code: -11041]ASM connection exception
The DMSERVER logs on both nodes report an ASM connection exception, so the next step is to check the ASM logs for errors.
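
On Linux, errno 111 is ECONNREFUSED: nothing is listening on the target port. A quick way to confirm that from the node itself is a plain TCP probe. The sketch below is illustrative only; `port_is_listening` is a hypothetical helper, and the demo uses a temporary local listener rather than the real ASM ports.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux) and timeouts.
        return False


# Self-contained demo: open a local listener, probe it, close it, probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))       # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

up = port_is_listening("127.0.0.1", demo_port)
listener.close()
down = port_is_listening("127.0.0.1", demo_port)
print(up, down)
```

In this environment the equivalent calls would be `port_is_listening("localhost", 5436)` on node 1 and `port_is_listening("localhost", 5437)` on node 2; a `False` result matches the refused connections in the logs and points at the ASM service rather than at DMSERVER.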

5.2. Check the DMASM logs

Node 1 DMASM log:
[dmdba@dmdsc01 log]$ tail -f /dm/dmdbms/dm8/log/dm_ASM0_202403.log
2024-03-21 18:05:08.328 [INFO] dmasmsvr P0000003702 T0000000000000003707  mal_site_port_get site_magic:33202, src_site:0, dst_site:0
2024-03-21 18:05:08.328 [INFO] dmasmsvr P0000003702 T0000000000000003713  site[0] mal_site_data_port_set from site[0, IP: 10.10.10.104, port_num: 5636], socket handle = 10, site_magic = 33202, link_seq = 1
2024-03-21 18:05:08.329 [INFO] dmasmsvr P0000003702 T0000000000000003708  mal_site_port_get site_magic:33202, src_site:0, dst_site:0
2024-03-21 18:05:08.330 [ERROR] dmasmsvr P0000003702 T0000000000000003714  Can't connect to DM server on '10.10.10.105' port(5637) errno(111) -- cannot connect to ASM MAL listening port 5637 on node 2 (10.10.10.105), because the ASM service on node 2 has not started yet
2024-03-21 18:05:28.373 [INFO] dmasmsvr P0000003702 T0000000000000003714  mal_site_ctl_link_create startup from mal_site(0) to mal_site(1)!
2024-03-21 18:05:28.373 [INFO] dmasmsvr P0000003702 T0000000000000003714  mal_site_magic_gen site_magic[46433], src_site:0, dst_site:1
2024-03-21 18:05:28.374 [INFO] dmasmsvr P0000003702 T0000000000000003714  site[0] mal_site_ctl_port_set to site[1, IP: 10.10.10.105, port_num: 5637], socket handle = 11, site_magic = 46433, link_seq = 1
2024-03-21 18:05:28.374 [INFO] dmasmsvr P0000003702 T0000000000000003710  mal_site_port_get site_magic:46433, src_site:0, dst_site:1
2024-03-21 18:05:28.374 [INFO] dmasmsvr P0000003702 T0000000000000003711  mal_site_port_get site_magic:46433, src_site:0, dst_site:1
2024-03-21 18:05:28.381 [INFO] dmasmsvr P0000003702 T0000000000000003713  site[0] mal_site_data_port_set from site[1, IP: 10.10.10.105, port_num: 5637], socket handle = 12, site_magic = 46433, link_seq = 1
2024-03-21 18:05:28.381 [INFO] dmasmsvr P0000003702 T0000000000000003712  mal_site_port_get site_magic:46433, src_site:0, dst_site:1
2024-03-21 18:05:28.381 [INFO] dmasmsvr P0000003702 T0000000000000003709  mal_site_port_get site_magic:46433, src_site:0, dst_site:1
2024-03-21 18:05:29.379 [INFO] dmasmsvr P0000003702 T0000000000000003702  dmshm2_create: shm created success, shm id 98304
2024-03-21 18:05:29.379 [INFO] dmasmsvr P0000003702 T0000000000000003702  dmshm2_attach, success, shm id 98304
2024-03-21 18:05:29.417 [INFO] dmasmsvr P0000003702 T0000000000000003702  os_sema2_create_low_ex shm_open:DMSHM40633069, size:96 success!
2024-03-21 18:05:29.422 [INFO] dmasmsvr P0000003702 T0000000000000003702  DMASMSVR SYSTEM IS READY.
2024-03-21 18:05:29.422 [INFO] dmasmsvr P0000003702 T0000000000000003702  [for dem]SYSTEM IS READY.
2024-03-21 18:05:30.425 [INFO] dmasmsvr P0000003702 T0000000000000003725  check css cmd: START NOTIFY, cmd_seq: 2
2024-03-21 18:05:31.433 [INFO] dmasmsvr P0000003702 T0000000000000003725  check css cmd: EP START, cmd_seq: 3
2024-03-21 18:05:31.447 [FATAL] dmasmsvr P0000003702 T0000000000000003725  Load dcr disk or vote disk in [/dev/asm] failed, please check and try again
2024-03-21 18:05:31.447 [FATAL] dmasmsvr P0000003702 T0000000000000003725  [for dem]SYSTEM SHUTDOWN ABORT. -- dmasmsvr aborts and shuts down
The node 2 DMASM log shows the same errors:
[dmdba@dmdsc02 log]$ tail -f /dm/dmdbms/dm8/log/dm_ASM1_202403.log
2024-03-21 18:05:30.913 [WARNING] dmasmsvr P0000003743 T0000000000000003750  MAL receive site(1) lost connect to site(0), ctl_handle(9), data_handle(10), dsc_handle(0)
2024-03-21 18:05:30.913 [WARNING] dmasmsvr P0000003743 T0000000000000003750  site(1) ctl_link mal_site_letter_recv from site(0) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2024-03-21 18:05:30.913 [WARNING] dmasmsvr P0000003743 T0000000000000003751  mal_site_letter_recv code=-6007, errno=104, site(1) recv from site(0) failed, socket handle = 0
2024-03-21 18:05:30.913 [WARNING] dmasmsvr P0000003743 T0000000000000003751  MAL receive site(1) lost connect to site(0), ctl_handle(0), data_handle(0), dsc_handle(0)
2024-03-21 18:05:30.914 [WARNING] dmasmsvr P0000003743 T0000000000000003751  site(1) data_link mal_site_letter_recv from site(0) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2024-03-21 18:05:30.914 [WARNING] dmasmsvr P0000003743 T0000000000000003749  mal_site_tsk_check site(1) connect lost to site(0), socket handle = 0, mal sys status = 0, try get port again
2024-03-21 18:05:30.914 [INFO] dmasmsvr P0000003743 T0000000000000003749  send  CMD_MAL_LINK_CHECK(350): (mal_id:0, stmt_id:0, mppexec_id:0, pln_op_id:0, org_site :0, src_site:0, dest_site:0, build_time:-1)
2024-03-21 18:05:30.914 [INFO] dmasmsvr P0000003743 T0000000000000003749  send  CMD_MAL_LINK_CHECK(350): (mal_id:0, stmt_id:0, mppexec_id:0, pln_op_id:0, org_site :0, src_site:0, dest_site:0, build_time:0)
2024-03-21 18:05:30.914 [WARNING] dmasmsvr P0000003743 T0000000000000003748  mal_site_tsk_check site(1) connect lost to site(0), socket handle = 0, mal sys status = 0, try get port again
2024-03-21 18:05:30.914 [INFO] dmasmsvr P0000003743 T0000000000000003748  send  CMD_MAL_LINK_CHECK(350): (mal_id:0, stmt_id:0, mppexec_id:0, pln_op_id:0, org_site :0, src_site:0, dest_site:0, build_time:0)
2024-03-21 18:05:30.914 [INFO] dmasmsvr P0000003743 T0000000000000003748  send  CMD_MAL_LINK_CHECK(350): (mal_id:0, stmt_id:0, mppexec_id:0, pln_op_id:0, org_site :0, src_site:0, dest_site:0, build_time:0)
2024-03-21 18:08:03.621 [INFO] dmasmsvr P0000003743 T0000000000000003759  check css cmd: START NOTIFY, cmd_seq: 6
2024-03-21 18:08:04.652 [INFO] dmasmsvr P0000003743 T0000000000000003759  check css cmd: EP START, cmd_seq: 7
2024-03-21 18:08:04.698 [FATAL] dmasmsvr P0000003743 T0000000000000003759  Load dcr disk or vote disk in [/dev/asm] failed, please check and try again
2024-03-21 18:08:04.698 [FATAL] dmasmsvr P0000003743 T0000000000000003759  [for dem]SYSTEM SHUTDOWN ABORT. -- dmasmsvr aborts and shuts down
The DMASM logs on both nodes report "Load dcr disk or vote disk in [/dev/asm] failed", which suggests that the DCR or vote disk is bound incorrectly or is in an abnormal state. Next, check the DMCSS logs and then the DCR and vote disks.

5.3. Check the DMCSS logs

Node 1 DMCSS log:
[dmdba@dmdsc01 log]$tail -f /dm/dmdbms/dm8/log/dm_CSS0_202403.log
2024-03-21 18:05:07.613 [INFO] dmcss P0000003622 T0000000000000003626  css set ASM [ASM0] guid [1502677]
2024-03-21 18:05:27.671 [INFO] dmcss P0000003622 T0000000000000003626  css set ASM [ASM1] guid [1502716]
2024-03-21 18:05:29.679 [INFO] dmcss P0000003622 T0000000000000003626  [ASM]: 设置EP ASM0[0]为控制节点 -- sets EP ASM0[0] as the control node
2024-03-21 18:05:29.682 [INFO] dmcss P0000003622 T0000000000000003626  [ASM]: status change from (OPEN, STARTUP) to (Control Node STARTUP, WAIT_STARTUP)
2024-03-21 18:05:29.682 [INFO] dmcss P0000003622 T0000000000000003626  [ASM]: Control Node[0], break eps:(NULL), recover ep[255], ok eps:[0, 1]
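2024-03-21 18:05:29.682 [INFO] dmcss P0000003622 T0000000000000003626  [ASM]: 设置命令[START NOTIFY], 目标站点 ASM0[0], 命令序号[2] -- sets command [START NOTIFY], target site ASM0[0], command sequence number [2]
2024-03-21 18:05:30.687 [INFO] dmcss P0000003622 T0000000000000003626  [ASM]: 设置命令[EP START], 目标站点 ASM0[0], 命令序号[3] -- sets command [EP START], target site ASM0[0], command sequence number [3]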
2024-03-21 18:05:41.052 [INFO] dmcss P0000003622 T0000000000000003626  Instance ASM [ASM0] has not been detected for about 10 seconds, CSS may probably exclude the instance from the cluster after 50 seconds 
2024-03-21 18:05:51.378 [INFO] dmcss P0000003622 T0000000000000003626  Instance ASM [ASM0] has not been detected for about 20 seconds, CSS may probably exclude the instance from the cluster after 40 seconds
2024-03-21 18:06:01.759 [INFO] dmcss P0000003622 T0000000000000003626  Instance ASM [ASM0] has not been detected for about 30 seconds, CSS may probably exclude the instance from the cluster after 30 seconds
2024-03-21 18:06:12.154 [INFO] dmcss P0000003622 T0000000000000003626  Instance ASM [ASM0] has not been detected for about 40 seconds, CSS may probably exclude the instance from the cluster after 20 seconds
2024-03-21 18:06:22.525 [INFO] dmcss P0000003622 T0000000000000003626  Instance ASM [ASM0] has not been detected for about 50 seconds, CSS may probably exclude the instance from the cluster after 10 seconds
2024-03-21 18:06:32.859 [ERROR] dmcss P0000003622 T0000000000000003626  [CSS]: detect ASM [ASM0] broken, need to force halt the same-site DB instance [DSC0].
2024-03-21 18:06:32.859 [INFO] dmcss P0000003622 T0000000000000003626  The timestamp of instance ASM [ASM0] has not changed for about 60 seconds, Last timestamp:1502700, css sta:Control Node STARTUP. CSS will exclude the instance from the cluster and launch crash recovery process -- ASM0 is excluded from the cluster
The node 2 DMCSS log shows the same errors:
[dmdba@dmdsc02 log]$ tail -f /dm/dmdbms/dm8/log/dm_CSS1_202403.log
2024-03-21 18:05:07.465 [INFO] dmcss P0000003566 T0000000000000003570  css set ASM [ASM0] guid [1502677]
2024-03-21 18:05:27.546 [INFO] dmcss P0000003566 T0000000000000003570  css set ASM [ASM1] guid [1502716]
2024-03-21 18:05:41.595 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM0] has not been detected for about 10 seconds, CSS may probably exclude the instance from the cluster after 50 seconds
2024-03-21 18:05:51.619 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM0] has not been detected for about 20 seconds, CSS may probably exclude the instance from the cluster after 40 seconds
2024-03-21 18:06:01.659 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM0] has not been detected for about 30 seconds, CSS may probably exclude the instance from the cluster after 30 seconds
2024-03-21 18:06:11.697 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM0] has not been detected for about 40 seconds, CSS may probably exclude the instance from the cluster after 20 seconds
2024-03-21 18:06:21.732 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM0] has not been detected for about 50 seconds, CSS may probably exclude the instance from the cluster after 10 seconds
2024-03-21 18:06:31.758 [ERROR] dmcss P0000003566 T0000000000000003570  [CSS]: detect ASM [ASM0] broken, need to force halt the same-site DB instance [DSC0].
2024-03-21 18:06:31.758 [INFO] dmcss P0000003566 T0000000000000003570  The timestamp of instance ASM [ASM0] has not changed for about 60 seconds, Last timestamp:1502700, css sta:OPEN. CSS will exclude the instance from the cluster and launch crash recovery process -- ASM0 is excluded from the cluster
2024-03-21 18:08:15.247 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM1] has not been detected for about 10 seconds, CSS may probably exclude the instance from the cluster after 50 seconds
2024-03-21 18:08:25.284 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM1] has not been detected for about 20 seconds, CSS may probably exclude the instance from the cluster after 40 seconds
2024-03-21 18:08:35.313 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM1] has not been detected for about 30 seconds, CSS may probably exclude the instance from the cluster after 30 seconds
2024-03-21 18:08:45.348 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM1] has not been detected for about 40 seconds, CSS may probably exclude the instance from the cluster after 20 seconds
2024-03-21 18:08:55.387 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM1] has not been detected for about 50 seconds, CSS may probably exclude the instance from the cluster after 10 seconds
2024-03-21 18:09:05.416 [ERROR] dmcss P0000003566 T0000000000000003570  [CSS]: detect ASM [ASM1] broken, need to force halt the same-site DB instance [DSC1].
2024-03-21 18:09:05.416 [INFO] dmcss P0000003566 T0000000000000003570  The timestamp of instance ASM [ASM1] has not changed for about 60 seconds, Last timestamp:1502872, css sta:OPEN. CSS will exclude the instance from the cluster and launch crash recovery process -- ASM1 is excluded from the cluster
The DMCSS logs show that the timestamps of ASM instances ASM0 and ASM1 did not change for about 60 seconds, so DMCSS excluded both instances from the cluster.
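
When the CSS log is long, the countdown and exclusion messages can be condensed per instance. The sketch below assumes the message wording shown in the logs above; the function name `summarize_css_log` and the result layout are illustrative choices, and the sample lines are copied from the node 2 log.

```python
import re

UNDETECTED = re.compile(
    r"Instance ASM \[(\w+)\] has not been detected for about (\d+) seconds"
)
BROKEN = re.compile(r"detect ASM \[(\w+)\] broken")


def summarize_css_log(text: str) -> dict:
    """Per ASM instance: longest undetected interval and whether CSS marked it broken."""
    summary = {}
    for line in text.splitlines():
        hit = UNDETECTED.search(line)
        if hit:
            entry = summary.setdefault(hit.group(1), {"max_undetected_s": 0, "broken": False})
            entry["max_undetected_s"] = max(entry["max_undetected_s"], int(hit.group(2)))
            continue
        hit = BROKEN.search(line)
        if hit:
            entry = summary.setdefault(hit.group(1), {"max_undetected_s": 0, "broken": False})
            entry["broken"] = True
    return summary


sample = """\
2024-03-21 18:06:21.732 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM0] has not been detected for about 50 seconds, CSS may probably exclude the instance from the cluster after 10 seconds
2024-03-21 18:06:31.758 [ERROR] dmcss P0000003566 T0000000000000003570  [CSS]: detect ASM [ASM0] broken, need to force halt the same-site DB instance [DSC0].
2024-03-21 18:08:55.387 [INFO] dmcss P0000003566 T0000000000000003570  Instance ASM [ASM1] has not been detected for about 50 seconds, CSS may probably exclude the instance from the cluster after 10 seconds
2024-03-21 18:09:05.416 [ERROR] dmcss P0000003566 T0000000000000003570  [CSS]: detect ASM [ASM1] broken, need to force halt the same-site DB instance [DSC1].
"""

result = summarize_css_log(sample)
for name in sorted(result):
    print(name, result[name])
```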

5.4. Check the operating system logs

Node 1:
[root@dmdsc01 log]# tail -f /var/log/messages
Mar 21 18:00:01 dmdsc01 systemd: Started Session 11 of user root.
Mar 21 18:00:01 dmdsc01 systemd: Starting Session 11 of user root.
Mar 21 18:01:01 dmdsc01 systemd: Started Session 12 of user root.
Mar 21 18:01:01 dmdsc01 systemd: Starting Session 12 of user root.
Mar 21 18:10:01 dmdsc01 systemd: Started Session 13 of user root.
Mar 21 18:10:01 dmdsc01 systemd: Starting Session 13 of user root.
Mar 21 18:20:01 dmdsc01 systemd: Started Session 14 of user root.
Mar 21 18:20:01 dmdsc01 systemd: Starting Session 14 of user root.
Mar 21 18:25:01 dmdsc01 systemd: Created slice User Slice of pcp.
Mar 21 18:25:01 dmdsc01 systemd: Starting User Slice of pcp.
Mar 21 18:25:01 dmdsc01 systemd: Started Session 15 of user pcp.
Mar 21 18:25:01 dmdsc01 systemd: Starting Session 15 of user pcp.
Mar 21 18:25:01 dmdsc01 systemd: Removed slice User Slice of pcp.
Mar 21 18:25:01 dmdsc01 systemd: Stopping User Slice of pcp.
Mar 21 18:27:01 dmdsc01 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1297" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Node 2:
[root@dmdsc02 log]# tail -f /var/log/messages
Mar 21 18:08:02 dmdsc02 systemd: Created slice User Slice of pcp.
Mar 21 18:08:02 dmdsc02 systemd: Starting User Slice of pcp.
Mar 21 18:08:02 dmdsc02 systemd: Started Session 12 of user pcp.
Mar 21 18:08:02 dmdsc02 systemd: Starting Session 12 of user pcp.
Mar 21 18:08:02 dmdsc02 systemd: Removed slice User Slice of pcp.
Mar 21 18:08:02 dmdsc02 systemd: Stopping User Slice of pcp.
Mar 21 18:08:25 dmdsc02 systemd: Created slice User Slice of root.
Mar 21 18:08:25 dmdsc02 systemd: Starting User Slice of root.
Mar 21 18:08:25 dmdsc02 systemd: Started Session 13 of user root.
Mar 21 18:08:25 dmdsc02 systemd-logind: New session 13 of user root.
Mar 21 18:08:25 dmdsc02 systemd: Starting Session 13 of user root.
The operating system logs look normal, with no disk-related errors.

5.5. Check the DCR and vote disks

Node 1:
1) Check whether the UDEV bindings for the DCR and vote disks are in effect
[dmdba@dmdsc01 log]$ ls -rlt /dev/asm/
total 0
lrwxrwxrwx. 1 root root 7 Mar 21 21:14 asm-dmvote -> ../dm-2
lrwxrwxrwx. 1 root root 7 Mar 21 21:14 asm-dmredo -> ../dm-5
lrwxrwxrwx. 1 root root 7 Mar 21 21:14 asm-dmdata02 -> ../dm-6
lrwxrwxrwx. 1 root root 7 Mar 21 21:14 asm-dmdata01 -> ../dm-4
lrwxrwxrwx. 1 root root 7 Mar 21 21:27 asm-dmdcr -> ../dm-3
2) Check that the UDEV binding configuration for the DCR and vote disks is correct
[root@dmdsc01 ~]# cat /etc/udev/rules.d/88-dm-asmdevices.rules 
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB1c21e9b6-d310e9f1",SYMLINK+="asm/asm-dmdcr",OWNER="dmdba",GROUP="dinstall",MODE="0660"
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB15e5500e-599af8bc",SYMLINK+="asm/asm-dmvote",OWNER="dmdba",GROUP="dinstall",MODE="0660"
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB4b2e91c8-edf2d843",SYMLINK+="asm/asm-dmredo",OWNER="dmdba",GROUP="dinstall",MODE="0660"
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB8c0ec026-2d6fa987",SYMLINK+="asm/asm-dmdata01",OWNER="dmdba",GROUP="dinstall",MODE="0660"
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB10c96c68-f8b266f8",SYMLINK+="asm/asm-dmdata02",OWNER="dmdba",GROUP="dinstall",MODE="0660"
3) Check that the multipath shared DCR and vote disks are healthy
[root@dmdsc01 ~]# multipath -ll
asm-dmdcr (VBOX_HARDDISK_VB1c21e9b6-d310e9f1) dm-3 ATA     ,VBOX HARDDISK   
size=2.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 4:0:0:0 sdc 8:32 active ready running
asm-dmvote (VBOX_HARDDISK_VB15e5500e-599af8bc) dm-2 ATA     ,VBOX HARDDISK   
size=2.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 3:0:0:0 sdb 8:16 active ready running
asm-dmdata02 (VBOX_HARDDISK_VB10c96c68-f8b266f8) dm-6 ATA     ,VBOX HARDDISK   
size=3.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 5:0:0:0 sdd 8:48 active ready running
asm-dmdata01 (VBOX_HARDDISK_VB8c0ec026-2d6fa987) dm-4 ATA     ,VBOX HARDDISK   
size=3.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 7:0:0:0 sdf 8:80 active ready running
asm-dmredo (VBOX_HARDDISK_VB4b2e91c8-edf2d843) dm-5 ATA     ,VBOX HARDDISK   
size=3.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 6:0:0:0 sde 8:64 active ready running
Node 2:
1) Check whether the UDEV bindings for the DCR and vote disks are in effect
[dmdba@dmdsc02 bin]$ ls -rlt /dev/asm/
total 0
lrwxrwxrwx 1 root root 7 Mar 21 20:51 asm-dmvote -> ../dm-2
lrwxrwxrwx 1 root root 7 Mar 21 20:51 asm-dmdcr -> ../dm-3
lrwxrwxrwx 1 root root 7 Mar 21 20:51 asm-dmdata02 -> ../dm-4
lrwxrwxrwx 1 root root 7 Mar 21 20:51 asm-dmredo -> ../dm-6
lrwxrwxrwx 1 root root 7 Mar 21 20:51 asm-dmdata01 -> ../dm-5
2) Check that the UDEV binding configuration for the DCR and vote disks is correct
[root@dmdsc02 ~]# cat /etc/udev/rules.d/88-dm-asmdevices.rules 
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB8c0ec026-2d6fa987",SYMLINK+="asm/asm-dmdata01",OWNER="dmdba",GROUP="dinstall",MODE="0660"
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB10c96c68-f8b266f8",SYMLINK+="asm/asm-dmdata02",OWNER="dmdba",GROUP="dinstall",MODE="0660"
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB1c21e9b6-d310e9f1",SYMLINK+="asm/asm-dmdcr",OWNER="dmdba",GROUP="dinstall",MODE="0660"
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB4b2e91c8-edf2d843",SYMLINK+="asm/asm-dmredo",OWNER="dmdba",GROUP="dinstall",MODE="0660"
KERNEL=="dm-*",ENV{DM_UUID}=="mpath-VBOX_HARDDISK_VB15e5500e-599af8bc",SYMLINK+="asm/asm-dmvote",OWNER="dmdba",GROUP="dinstall",MODE="0660"
3) Check that the multipath shared DCR and vote disks are healthy
[root@dmdsc02 ~]#  multipath -ll
asm-dmdcr (VBOX_HARDDISK_VB1c21e9b6-d310e9f1) dm-3 ATA     ,VBOX HARDDISK   
size=2.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 4:0:0:0 sdc 8:32 active ready running
asm-dmvote (VBOX_HARDDISK_VB15e5500e-599af8bc) dm-2 ATA     ,VBOX HARDDISK   
size=2.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 3:0:0:0 sdb 8:16 active ready running
asm-dmdata02 (VBOX_HARDDISK_VB10c96c68-f8b266f8) dm-4 ATA     ,VBOX HARDDISK   
size=3.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 5:0:0:0 sdd 8:48 active ready running
asm-dmdata01 (VBOX_HARDDISK_VB8c0ec026-2d6fa987) dm-5 ATA     ,VBOX HARDDISK   
size=3.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 7:0:0:0 sdf 8:80 active ready running
asm-dmredo (VBOX_HARDDISK_VB4b2e91c8-edf2d843) dm-6 ATA     ,VBOX HARDDISK   
size=3.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 6:0:0:0 sde 8:64 active ready running
4) Check the ASM disk attributes:
ASM>listdisks '/dev/asm/'
[/dev/asm//asm-dmdcr]: Used ASM disk, name:[DMASMdcr], size:[2048M], group_id:[126], disk_id:[0]
[/dev/asm//asm-dmvote]: Raw device
[/dev/asm//asm-dmdata01]: Raw device
[/dev/asm//asm-dmredo]: Raw device
[/dev/asm//asm-dmdata02]: Raw device
Used time: 40.953(ms).
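The disk attributes show that every ASM disk except the DCR disk is now reported as a raw device, which suggests the other ASM disks are damaged. Next, use the disk header tool dmasmmgt to verify the disk header metadata.
5) Verify the ASM disk header metadata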
[dmdba@dmdsc01 bin]$ ./dmasmmgt
dmasmmgt V8
version: 03134284132-20231226-213242-20081
ASM>check /dev/asm
file path:[/dev/asm/asm-dmdcr], the content of file head is correct!
file path:[/dev/asm/asm-dmvote], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata01], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmredo], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata02], the content of file head has been destroyed !
Used time: 37.136(ms).
The ASMMGT executed "check" end
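The disk header tool (dmasmmgt) shows that the header metadata of every ASM disk except dmdcr is corrupted. The damaged vote disk is what prevents the DMASM service from starting.

In summary, the header metadata of every ASM disk except dmdcr is corrupted. The damaged vote disk causes the DMASM service to fail at startup, and DMSERVER in turn fails because it cannot connect to DMASM.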
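
The two findings in section 5.5 can be cross-checked mechanically: every disk that `listdisks` reports as a raw device should be one whose header `check` reports as destroyed. The sketch below assumes the output formats shown above; the regular expressions and the `parse` helper are illustrative, not part of the DM tools.

```python
import posixpath
import re

LISTDISKS = re.compile(r"^\[(?P<path>[^\]]+)\]:\s*(?P<state>Used ASM disk|Raw device)")
CHECK = re.compile(
    r"^file path:\[(?P<path>[^\]]+)\], the content of file head "
    r"(?P<state>is correct|has been destroyed)"
)


def parse(pattern, text):
    """Map each normalized disk path to the state reported for it."""
    states = {}
    for line in text.splitlines():
        m = pattern.match(line.strip())
        if m:
            # normpath collapses the doubled slash in /dev/asm//asm-dmvote.
            states[posixpath.normpath(m.group("path"))] = m.group("state")
    return states


listdisks_output = """\
[/dev/asm//asm-dmdcr]: Used ASM disk, name:[DMASMdcr], size:[2048M], group_id:[126], disk_id:[0]
[/dev/asm//asm-dmvote]: Raw device
[/dev/asm//asm-dmdata01]: Raw device
[/dev/asm//asm-dmredo]: Raw device
[/dev/asm//asm-dmdata02]: Raw device
Used time: 40.953(ms).
"""
check_output = """\
file path:[/dev/asm/asm-dmdcr], the content of file head is correct!
file path:[/dev/asm/asm-dmvote], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata01], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmredo], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata02], the content of file head has been destroyed !
"""

raw = {p for p, s in parse(LISTDISKS, listdisks_output).items() if s == "Raw device"}
destroyed = {p for p, s in parse(CHECK, check_output).items() if s == "has been destroyed"}
print(sorted(raw))
print(raw == destroyed)   # True for the output captured in this article
```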
6. Repairing ASM disk headers from the automatically backed-up header mirror
6.1. Verify the automatically backed-up disk header mirrors

[dmdba@dmdsc01 bin]$ ./dmasmmgt
dmasmmgt V8
version: 03134284132-20231226-213242-20081
ASM>check mirror /dev/asm/
file path:[/dev/asm/asm-dmdata02], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata02], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmredo], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmredo], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmdata01], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata01], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmdcr], the content of file head is correct!
file path:[/dev/asm/asm-dmdcr], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmvote], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmvote], the content of file head mirror is correct!
Used time: 35.555(ms).
The ASMMGT executed "check" end
All ASM disk header mirrors pass verification.
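
A disk is a candidate for repair from the automatic backup when its header is destroyed but its mirror is still correct. The sketch below derives that list from `check mirror` output and prints the matching `repair disk_path start=i size=j` commands from the tool's help text. It only builds command strings for review and runs nothing. The function name `repair_plan` is hypothetical, and the wording for a damaged mirror is assumed to follow the same "has been destroyed" pattern as for a header.

```python
import re

LINE = re.compile(
    r"^file path:\[(?P<path>[^\]]+)\], the content of file head "
    r"(?P<mirror>mirror )?(?P<state>is correct|has been destroyed)"
)


def repair_plan(check_output: str, size: int = 1024) -> list:
    """Return repair commands for disks with a destroyed head and a correct mirror."""
    status = {}
    for line in check_output.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        part = "mirror" if m.group("mirror") else "head"
        status.setdefault(m.group("path"), {})[part] = m.group("state")
    return [
        f"repair {path} start=0 size={size}"
        for path, s in sorted(status.items())
        if s.get("head") == "has been destroyed" and s.get("mirror") == "is correct"
    ]


sample = """\
file path:[/dev/asm/asm-dmdata01], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata01], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmdcr], the content of file head is correct!
file path:[/dev/asm/asm-dmdcr], the content of file head mirror is correct!
file path:[/dev/asm/asm-dmvote], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmvote], the content of file head mirror is correct!
"""

plan = repair_plan(sample)
for command in plan:
    print(command)
```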

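6.2. Repair the vote disk header metadata

First restore 1024 bytes of vote disk header metadata, starting at offset 0, from the automatically backed-up ASM disk header metadata: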
[dmdba@dmdsc01 bin]$ ./dmasmmgt
dmasmmgt V8
version: 03134284132-20231226-213242-20081
ASM>repair /dev/asm/asm-dmvote start=0 size=1024
file path:[/dev/asm/asm-dmvote], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmvote], the content of file head mirror is correct!
do you want to continue?(y/n) input y to continue,input n to quit
y
asm file(/dev/asm/asm-dmvote),repair successfully!
Used time: 00:00:01.997.
The ASMMGT executed "repair" end
Re-checking the vote disk shows that its header metadata is correct again, so the header has been repaired:
ASM>check /dev/asm/asm-dmvote
file path:[/dev/asm/asm-dmvote], the content of file head is correct!  -- vote disk header metadata passes verification
Used time: 2.762(ms).
The ASMMGT executed "check" end

6.3. Database service still fails to start after the vote disk repair
1) Try to start the DMCSS, DMASM, and DMSERVER services

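$cd /dm/dmdbms/dm8/bin
# Run on both nodes
[dmdba@dmdscXX bin]$ ./DmCSSServiceCSS start
Wait a moment for CSS to bring up ASM and the database automatically, then confirm that the cluster status is normal.
Note: if automatic startup is not configured, start the DMASM and DMSERVER services manually with the following commands.
# Run on both nodes
[dmdba@dmdscXX bin]$ ./DmASMSvrServiceASM start
[dmdba@dmdscXX bin]$ ./DmServiceDSC start
The database fails to start on node 1: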
[dmdba@dmdsc01 bin]$ ./DmServiceDSC start
Starting DmServiceDSC: connnect dmasmtool(dmasmtoolm) successfully.
[ FAILED ]
The database fails to start on node 2:
[dmdba@dmdsc02 bin]$ ./DmServiceDSC start
Starting DmServiceDSC: connnect dmasmtool(dmasmtoolm) successfully.
[ FAILED ]

2) Check the DMSERVER logs

Node 1 DMSERVER log errors:
2024-03-21 22:31:22.429 [ERROR] database P0000004613 T0000000000000004613  os_file_open_low->os_asm_file_open: [path: +DMDATA/data/dsc/dm.ctl]: [CODE:-2405] File or Directory [+DMDATA/data/dsc/dm.ctl] does not exist
2024-03-21 22:31:22.431 [ERROR] database P0000004613 T0000000000000004613  os_file_open_low->os_asm_file_open: [path: +DMDATA/data/dsc/dm.ctl]: [CODE:-2405] File or Directory [+DMDATA/data/dsc/dm.ctl] does not exist
2024-03-21 22:31:22.431 [ERROR] database P0000004613 T0000000000000004613  ctl file:[+DMDATA/data/dsc/dm.ctl] is on asm file system, please check the dcr_ini parameter is correct or dmasmsvr is active, then try again.
2024-03-21 22:31:22.431 [FATAL] database P0000004613 T0000000000000004613  ctl file info get failed -- failed to read the control file information
The node 2 DMSERVER log shows the same errors as node 1:
2024-03-21 22:40:35.688 [ERROR] database P0000004662 T0000000000000004662  os_file_open_low->os_asm_file_open: [path: +DMDATA/data/dsc/dm.ctl]: [CODE:-2405] File or Directory [+DMDATA/data/dsc/dm.ctl] does not exist
2024-03-21 22:40:35.699 [ERROR] database P0000004662 T0000000000000004662  os_file_open_low->os_asm_file_open: [path: +DMDATA/data/dsc/dm.ctl]: [CODE:-2405] File or Directory [+DMDATA/data/dsc/dm.ctl] does not exist
2024-03-21 22:40:35.700 [ERROR] database P0000004662 T0000000000000004662  ctl file:[+DMDATA/data/dsc/dm.ctl] is on asm file system, please check the dcr_ini parameter is correct or dmasmsvr is active, then try again.
2024-03-21 22:40:35.700 [FATAL] database P0000004662 T0000000000000004662  ctl file info get failed -- failed to read the control file information
(1) Check whether the file or directory +DMDATA/data/dsc/dm.ctl exists.
[dmdba@dmdsc01 bin]$ /dm/dmdbms/dm8/bin/dmasmtool DCR_INI=/dm/dsc/config/dmdcr.ini
DMASMTOOL V8
ASM>ls +DMDATA/data/dsc/dm.ctl
ls +DMDATA/data/dsc/dm.ctl failed:[code : -2405] File or Directory [+DMDATA/data/dsc/dm.ctl] does not exist
Used time: 4.275(ms).
ASM>cd +DMDATA
[code : -2405] File or Directory [+DMDATA] does not exist
ASM>lsdg
total 2 groups......
1 disk_group:
         name: VOTE
         id: 125
         au_size: 1.00 MB
         extent_size: 4
         total_size: 2.00 GB
         free_size: 1.98 GB
         total_file_num: 2
2 disk_group:
         name: DCR
         id: 126
         au_size: 1.00 MB
         extent_size: 4
         total_size: 2.00 GB
         free_size: 1.98 GB
         total_file_num: 2
Used time: 13.677(ms).
Result: the DMDATA ASM disk group is not listed; only VOTE and DCR are visible.
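The same comparison can be made against saved `lsdg` output rather than by eye. This is a minimal sketch, assuming the output has been copied into a text file and that each group appears as an indented `name: <GROUP>` line as shown above; the function name `missing_disk_groups` is hypothetical.

```shell
# Print every expected disk group name that is absent from saved lsdg output.
# usage: missing_disk_groups LSDG_OUTPUT_FILE GROUP...
missing_disk_groups() {
    lsdg_file=$1
    shift
    for group in "$@"; do
        # lsdg prints one "name: <GROUP>" line per disk group
        if ! awk -v g="$group" \
            '$1 == "name:" && $2 == g { found = 1 } END { exit !found }' \
            "$lsdg_file"; then
            printf '%s\n' "$group"
        fi
    done
}
```

Called with the four groups of this test cluster (DMLOG, DMDATA, VOTE, DCR) against the output above, it would report DMLOG and DMDATA as missing.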
2. Check that the dmdcr.ini configuration is correct.
Node 1:
[dmdba@dmdsc01 log]$ cat /dm/dsc/config/dmdcr.ini
DMDCR_PATH                 = /dev/asm/asm-dmdcr
DMDCR_MAL_PATH             = /dm/dsc/config/dmasvrmal.ini
DMDCR_SEQNO                = 0
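DMDCR_AUTO_OPEN_CHECK      = 90   # If the node instance has not started within this time, DMCSS automatically removes the node from the cluster; unit: seconds
#DMDCR_ASM_RESTART_INTERVAL = 30  # Interval after which CSS considers ASM failed and restarts it
#DMDCR_ASM_STARTUP_CMD      = /dm/dmdbms/dm8/bin/dmasmsvr dcr_ini=/dm/dsc/config/dmdcr.ini
#DMDCR_DB_RESTART_INTERVAL  = 60  # Interval after which CSS considers DSC failed and restarts it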
#DMDCR_DB_STARTUP_CMD       = /dm/dmdbms/dm8/bin/dmserver path=/dm/dsc/config/dsc0_config/dm.ini dcr_ini=/dm/dsc/config/dmdcr.ini
Node 2:
[dmdba@dmdsc02 bin]$ cat /dm/dsc/config/dmdcr.ini
DMDCR_PATH                 = /dev/asm/asm-dmdcr
DMDCR_MAL_PATH             = /dm/dsc/config/dmasvrmal.ini
DMDCR_SEQNO                = 1
DMDCR_AUTO_OPEN_CHECK      = 90
#DMDCR_ASM_RESTART_INTERVAL = 30
#DMDCR_ASM_STARTUP_CMD      = /dm/dmdbms/dm8/bin/dmasmsvr dcr_ini=/dm/dsc/config/dmdcr.ini
#DMDCR_DB_RESTART_INTERVAL  = 60
#DMDCR_DB_STARTUP_CMD       = /dm/dmdbms/dm8/bin/dmserver path=/dm/dsc/config/dsc1_config/dm.ini dcr_ini=/dm/dsc/config/dmdcr.ini
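Comparing the two files by eye is error-prone, so the active settings can be normalized and compared mechanically. The sketch below assumes that `#` always starts a comment (a value containing `#` would be truncated) and that DMDCR_SEQNO is the only active setting expected to differ between nodes, as in the two listings above. The function names are hypothetical, and the second file is assumed to have been copied over from the other node.

```shell
# Print the active (uncommented) settings of a dmdcr.ini-style file as
# KEY=VALUE lines, dropping comments and the per-node DMDCR_SEQNO.
normalize_dcr_ini() {
    sed -e 's/#.*$//' \
        -e 's/[[:space:]]*=[[:space:]]*/=/' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' "$1" |
        grep -v -e '^$' -e '^DMDCR_SEQNO='
}

# Succeed when two files agree on every active setting except DMDCR_SEQNO.
same_dcr_settings() {
    [ "$(normalize_dcr_ini "$1")" = "$(normalize_dcr_ini "$2")" ]
}
```

If DMDCR_DB_STARTUP_CMD were uncommented, it would legitimately differ per node (dsc0_config versus dsc1_config) and would need to be excluded in the same way as DMDCR_SEQNO.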

3. Check that the dmasmsvr service is running normally.
image.png
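After the VOTE disk was repaired, the control file still could not be accessed because DMDATA and the other ASM disks had not yet been repaired. The dmdcr.ini configuration and the dmasmsvr service both checked out, so the next step is to repair the remaining ASM disk heads.
6.4. Check and repair the other damaged ASM disk heads
1) Check for other damaged ASM disks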

[dmdba@dmdsc01 bin]$ ./dmasmmgt
dmasmmgt V8
version: 03134284132-20231226-213242-20081
ASM>check /dev/asm
file path:[/dev/asm/asm-dmdata02], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmredo], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata01], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdcr], the content of file head is correct!
file path:[/dev/asm/asm-dmvote], the content of file head is correct!
Used time: 28.717(ms).
The ASMMGT executed "check" end
The check confirms that the disk head metadata of asm-dmdata01, asm-dmdata02, and asm-dmredo is damaged.
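With many disks under the search path, the damaged ones can be picked out of saved `check` output with a small filter. This is a sketch that relies only on the message format shown above (`file path:[...], the content of file head has been destroyed !`); the function name `list_damaged_heads` is hypothetical.

```shell
# Read dmasmmgt "check" output on stdin and print the path of every disk
# whose head is reported as destroyed, one path per line.
list_damaged_heads() {
    sed -n 's/^file path:\[\(.*\)\], the content of file head has been destroyed.*/\1/p'
}
```

Applied to the output above, it would print the three paths asm-dmdata02, asm-dmredo, and asm-dmdata01 under /dev/asm, in that order.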

2) Repair the metadata of the other ASM disks

[dmdba@dmdsc01 bin]$ ./dmasmmgt
dmasmmgt V8
version: 03134284132-20231226-213242-20081
ASM>repair /dev/asm/asm-dmdata01 start=0 size=1024
file path:[/dev/asm/asm-dmdata01], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata01], the content of file head mirror is correct!
do you want to continue?(y/n) input y to continue,input n to quit
y
asm file(/dev/asm/asm-dmdata01),repair successfully!
Used time: 00:00:01.292.
The ASMMGT executed "repair" end
ASM>repair /dev/asm/asm-dmdata02 start=0 size=1024   
file path:[/dev/asm/asm-dmdata02], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmdata02], the content of file head mirror is correct!
do you want to continue?(y/n) input y to continue,input n to quit
y
asm file(/dev/asm/asm-dmdata02),repair successfully!
Used time: 00:00:01.304.
The ASMMGT executed "repair" end
ASM>repair /dev/asm/asm-dmredo start=0 size=1024
file path:[/dev/asm/asm-dmredo], the content of file head has been destroyed !
file path:[/dev/asm/asm-dmredo], the content of file head mirror is correct!
do you want to continue?(y/n) input y to continue,input n to quit
y
asm file(/dev/asm/asm-dmredo),repair successfully!
Used time: 00:00:01.211.
The ASMMGT executed "repair" end
ASM>check /dev/asm
file path:[/dev/asm/asm-dmdata02], the content of file head is correct!
file path:[/dev/asm/asm-dmredo], the content of file head is correct!
file path:[/dev/asm/asm-dmdata01], the content of file head is correct!
file path:[/dev/asm/asm-dmdcr], the content of file head is correct!
file path:[/dev/asm/asm-dmvote], the content of file head is correct!
Used time: 31.046(ms).
The ASMMGT executed "check" end
After the repair, the disk head metadata of asm-dmdata01, asm-dmdata02, and asm-dmredo passes the check.
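The three repairs above use the same command form and differ only in the disk path. As a sketch, the command lines can be generated from a list of damaged paths and then reviewed before being typed into the interactive `ASM>` prompt. The function name `print_repair_commands` is hypothetical. The function only prints text; the source does not show whether dmasmmgt accepts commands on standard input, and each repair still asks for a y/n confirmation.

```shell
# Read damaged disk paths on stdin (one per line) and print the matching
# dmasmmgt repair command for the first 1024 bytes of each disk head.
# Nothing is executed; the output is meant to be reviewed by the operator.
print_repair_commands() {
    while IFS= read -r disk; do
        if [ -n "$disk" ]; then
            printf 'repair %s start=0 size=1024\n' "$disk"
        fi
    done
}
```

The `start=0 size=1024` range matches the commands used in this test, which covers the first 1024 bytes of AU 0 only.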

6.5. Try to start the database cluster services after repairing all ASM disks
1) Stop the DMASM service

# Run on both nodes
cd /dm/dmdbms/dm8/bin
./DmASMSvrServiceASM stop

2) Restart the DMASM and DMSERVER services

# Start the DMASM service (run on both nodes)
cd /dm/dmdbms/dm8/bin
./DmASMSvrServiceASM start
Once the DMASM service is up, all disk groups and the control file are visible again.
[dmdba@dmdsc01 bin]$ /dm/dmdbms/dm8/bin/dmasmtool DCR_INI=/dm/dsc/config/dmdcr.ini
DMASMTOOL V8
ASM>lsdg
total 4 groups......
1 disk_group:
         name: DMLOG
         id: 0
         au_size: 1.00 MB
         extent_size: 4
         total_size: 3.00 GB
         free_size: 1012.00 MB
         total_file_num: 6
2 disk_group:
         name: DMDATA
         id: 1
         au_size: 1.00 MB
         extent_size: 4
         total_size: 6.00 GB
         free_size: 1.42 GB
         total_file_num: 45
3 disk_group:
         name: VOTE
         id: 125
         au_size: 1.00 MB
         extent_size: 4
         total_size: 2.00 GB
         free_size: 1.98 GB
         total_file_num: 2
4 disk_group:
         name: DCR
         id: 126
         au_size: 1.00 MB
         extent_size: 4
         total_size: 2.00 GB
         free_size: 1.98 GB
         total_file_num: 2
Used time: 6.741(ms).
The control file path is now listed normally:
ASM>ls +DMDATA/data/dsc/dm.ctl
        file : dm.ctl
Used time: 3.621(ms).
ASM>
# Start the DMSERVER service (run on both nodes)
./DmServiceDSC start

6.6. Verification after the ASM disk head repair
1) Verify the database cluster status

$cd /dm/dmdbms/dm8/bin
$./dmcssm ini_path=dmcssm.ini
show

image.png
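After the ASM disk head repair, the DMCSS, DMASM, and DMSERVER cluster services are running normally.
2) Verify by querying the test data and writing new data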

Query the test data written after the ASM disk heads were damaged:
[dmdba@dmdsc01 bin]$ ./disql SYSDBA/<password>
SQL> select c1,c2 from  test2 where c1=2;
行号     C1          C2    
---------- ----------- ------
1          2           aaccdd
Test inserting and updating data:
SQL> insert into test2 values(1000,'test1000','abcd');
SQL> commit;
SQL> select * from test2 where c1=1000;
行号     C1          C2       C3  
---------- ----------- -------- ----
1          1000        test1000 abcd
SQL> update test2 set c2='test2' where  c1=2;
SQL> commit;
SQL> select c1,c2 from  test2 where c1=2;

行号     C1          C2   
---------- ----------- -----
1          2           test2
SQL> 
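After the ASM disk head repair, the database cluster returns to normal operation, and the test data written after the disk heads were damaged can still be queried; no data was lost.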

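7. Conclusion
1. After the first 1024 bytes of an ASM disk head are damaged, database reads and writes are unaffected as long as the cluster is not restarted. Once the cluster is restarted, however, the DMASM and DMSERVER services fail to start. Before any planned cluster restart, use the disk head tool to check that the ASM disk head metadata is intact, so that a damaged disk head does not leave the database unable to start.
2. While the cluster is running, if the first 1024 bytes of metadata at the head of the DCR disk are damaged, the damage is repaired automatically, because DMCSS periodically writes cluster configuration information to the DCR disk head.
3. In DM Database 64 V8 version 03134284132-20231226-213242-20081, the automatically backed-up ASM disk metadata successfully repaired the first 1024 bytes of AU 0 of the ASM disk head. After the repair, the database cluster resumed normal operation and the data written after the disk heads were damaged was not lost.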
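The pre-restart check recommended in the first conclusion can be turned into a simple gate. The following is a minimal sketch, assuming the output of `check /dev/asm` has been saved or piped in and uses the message format shown in section 6.4; the function name `all_heads_correct` is hypothetical.

```shell
# Succeed only if the dmasmmgt "check" output on stdin contains at least one
# "file path:" line and every such line reports a correct file head.
all_heads_correct() {
    awk '
        /^file path:/ {
            total++
            if ($0 !~ /the content of file head is correct!/) bad++
        }
        END { exit !(total > 0 && bad == 0) }
    '
}
```

A maintenance procedure could refuse to proceed with a restart when the function returns a non-zero status. Empty input is treated as a failure on purpose, so that a check that did not run is not mistaken for a clean result.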
