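To help us respond efficiently, please provide the following information when asking a question. Clearly described problems are answered first.
[DM version]: V8 1-1-190-21.03.12-136419-SEC
[Operating system]: Kylin V10
[CPU]: ARM architecture
[Problem description]*: Standalone instance. The database crashed twice today (logs are attached; the database log shows no errors).
A core file was generated, but it is truncated.
Operating system log:
The operating system killed the database process, and it appears to be memory-related. Can the specific problem be identified from the log?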
Aug 2 14:26:08 oadb kernel: [18654779.419773] {360}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
Aug 2 14:26:08 oadb kernel: [18654779.428884] {360}[Hardware Error]: event severity: recoverable
Aug 2 14:26:08 oadb kernel: [18654779.435388] {360}[Hardware Error]: Error 0, type: recoverable
Aug 2 14:26:08 oadb kernel: [18654779.441889] {360}[Hardware Error]: section_type: ARM processor error
Aug 2 14:26:08 oadb kernel: [18654779.449081] {360}[Hardware Error]: MIDR: 0x00000000481fd010
Aug 2 14:26:08 oadb kernel: [18654779.455497] {360}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x00000000811d0300
Aug 2 14:26:08 oadb kernel: [18654779.465023] {360}[Hardware Error]: error affinity level: 0
Aug 2 14:26:08 oadb kernel: [18654779.471351] {360}[Hardware Error]: running state: 0x1
Aug 2 14:26:08 oadb kernel: [18654779.477249] {360}[Hardware Error]: Power State Coordination Interface state: 0
Aug 2 14:26:08 oadb kernel: [18654779.485305] {360}[Hardware Error]: Error info structure 0:
Aug 2 14:26:08 oadb kernel: [18654779.491632] {360}[Hardware Error]: num errors: 1
Aug 2 14:26:08 oadb kernel: [18654779.497097] {360}[Hardware Error]: error_type: 0, cache error
Aug 2 14:26:08 oadb kernel: [18654779.503770] {360}[Hardware Error]: error_info: 0x0000000024400014
Aug 2 14:26:08 oadb kernel: [18654779.510788] {360}[Hardware Error]: cache level: 1
Aug 2 14:26:08 oadb kernel: [18654779.516514] {360}[Hardware Error]: the error has been corrected
Aug 2 14:26:08 oadb kernel: [18654779.523445] {360}[Hardware Error]: virtual fault address: 0x0000000000000000
Aug 2 14:26:08 oadb kernel: [18654779.531416] {360}[Hardware Error]: physical fault address: 0x00000024738712a0
Aug 2 14:26:08 oadb kernel: [18654779.539474] {360}[Hardware Error]: Vendor specific error info has 16 bytes:
Aug 2 14:26:08 oadb kernel: [18654779.547274] {360}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
Aug 2 14:26:08 oadb kernel: [18654779.557155] Uncorrected hardware memory error in user-access at 0000000055108590
Aug 2 14:26:08 oadb kernel: [18654779.703431] {361}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Aug 2 14:26:08 oadb kernel: [18654779.703456] Memory failure: 0x247387: Killing dmserver:1977295 due to hardware memory corruption
Aug 2 14:26:08 oadb kernel: [18654779.712532] {361}[Hardware Error]: event severity: recoverable
Aug 2 14:26:08 oadb kernel: [18654779.712533] {361}[Hardware Error]: Error 0, type: recoverable
Aug 2 14:26:08 oadb kernel: [18654779.712534] {361}[Hardware Error]: section_type: memory error
Aug 2 14:26:08 oadb kernel: [18654779.712535] {361}[Hardware Error]: error_status: 0x0000000000000000
Aug 2 14:26:08 oadb kernel: [18654779.712535] {361}[Hardware Error]: physical_address: 0x0000002473871280
Aug 2 14:26:08 oadb kernel: [18654779.712536] {361}[Hardware Error]: physical_address_mask: 0x0000000000000000
Aug 2 14:26:08 oadb kernel: [18654779.712541] {361}[Hardware Error]: node: 0 card: 3 module: 0 rank: 0 bank: 5 device: 1 row: 16320 column: 560 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
Aug 2 14:26:08 oadb kernel: [18654779.721993] Memory failure: 0x247387: recovery action for dirty LRU page: Recovered
Aug 2 14:26:08 oadb kernel: [18654779.728493] {361}[Hardware Error]: error_type: 17, unknown
Aug 2 14:26:08 oadb audit[1977295]: ANOM_ABEND auid=0 uid=12345 gid=12349 ses=40806 pid=1977295 comm="dm_sql_thd" exe="/dmdbms/bin/dmserver" sig=7 res=1
Aug 2 14:26:08 oadb systemd[1]: Started Process Core Dump (PID 1981819/UID 0).
Aug 2 14:26:08 oadb audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@5-1981819-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 2 14:26:08 oadb kernel: [18654779.796548] EDAC MC0: 1 UE reserved error (17) on unknown label (node:0 card:3 module:0 rank:0 bank:5 row:16320 col:560 bit_pos:0 page:0x247387 offset:0x1280 grain:-1 - status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
Aug 2 14:26:08 oadb kernel: [18654779.823539] Memory failure: 0x247387: already hardware poisoned
Aug 2 14:26:09 oadb systemd-coredump[1981820]: Core file was truncated to 2147483648 bytes.
Aug 2 14:26:09 oadb kernel: [18654780.956532] EDAC MC0: 1 UE reserved error (17) on unknown label (node:0 card:6 module:0 rank:0 bank:11 row:27654 col:536 bit_pos:0 page:0x20d80d offset:0x3c0 grain:-1 - status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
Aug 2 14:26:09 oadb kernel: [18654780.983525] Memory failure: 0x20d80d: already hardware poisoned
Aug 2 14:26:11 oadb systemd-coredump[1981820]: Process 1977295 (dmserver) of user 12345 dumped core.#012#012Stack trace of thread 1981237:#012#0 0x00000000005081b8 bdta3_cpy_str (/dmdbms/bin/dmserver)#012#1 0x00000000005081b4 bdta3_cpy_str (/dmdbms/bin/dmserver)
Aug 2 14:26:11 oadb systemd[1]: systemd-coredump@5-1981819-0.service: Succeeded.
Aug 2 14:26:11 oadb audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@5-1981819-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 2 14:26:46 oadb esfdaemon[998483]: 0
Database configuration file dm.ini:
dm.ini
Two database log segments covering the crashes, plus the two operating system log segments for the same time periods:
dm_sys_log.txt
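The log points to a fairly clear hardware error: the kernel reports an uncorrected memory error, marks the affected pages as hardware poisoned, and kills dmserver with signal 7 (SIGBUS). A memory module (DIMM) is most likely faulty.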
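
To narrow down which module to inspect, the EDAC lines in the log can be summarized by location. The following is only a minimal sketch, not a vendor tool: the names `SAMPLE_LOG`, `EDAC_UE` and `summarize_ue` are made up for illustration, the two sample lines are copied from the log above, and the 64 KiB page size (`PAGE_SHIFT = 16`) is an assumption that happens to be consistent with the addresses in this log. How the `card` field maps to a physical slot is platform-specific and should be confirmed with the hardware vendor.

```python
import re
from collections import Counter

# Two EDAC lines copied from the operating system log above.
SAMPLE_LOG = """\
Aug 2 14:26:08 oadb kernel: [18654779.796548] EDAC MC0: 1 UE reserved error (17) on unknown label (node:0 card:3 module:0 rank:0 bank:5 row:16320 col:560 bit_pos:0 page:0x247387 offset:0x1280 grain:-1 - status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
Aug 2 14:26:09 oadb kernel: [18654780.956532] EDAC MC0: 1 UE reserved error (17) on unknown label (node:0 card:6 module:0 rank:0 bank:11 row:27654 col:536 bit_pos:0 page:0x20d80d offset:0x3c0 grain:-1 - status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
"""

# Matches the uncorrected-error (UE) lines in the format seen in this log.
EDAC_UE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) UE .*?"
    r"\(node:(?P<node>\d+) card:(?P<card>\d+) module:(?P<module>\d+) "
    r"rank:(?P<rank>\d+) bank:(?P<bank>\d+) row:(?P<row>\d+) col:(?P<col>\d+) "
    r"bit_pos:\d+ page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)

# Assumption: the kernel uses 64 KiB pages, so page number -> address is << 16.
PAGE_SHIFT = 16


def summarize_ue(log_text, page_shift=PAGE_SHIFT):
    """Count UE events per (node, card, module) and rebuild physical addresses."""
    per_location = Counter()
    addresses = []
    for line in log_text.splitlines():
        m = EDAC_UE.search(line)
        if not m:
            continue
        location = (int(m["node"]), int(m["card"]), int(m["module"]))
        per_location[location] += int(m["count"])
        # Physical address = page frame number * page size + offset in page.
        addresses.append((int(m["page"], 16) << page_shift) + int(m["offset"], 16))
    return per_location, addresses


if __name__ == "__main__":
    per_location, addresses = summarize_ue(SAMPLE_LOG)
    for (node, card, module), count in sorted(per_location.items()):
        print(f"node={node} card={card} module={module}: {count} UE event(s)")
    for addr in addresses:
        print(f"physical address: {addr:#x}")
```

Under the 64 KiB assumption, the first rebuilt address equals the `physical_address` 0x2473871280 reported in section {361}, and the cache-error fault address 0x24738712a0 from section {360} falls in the same page (0x247387). The two UE lines name different `card` values (3 and 6), so it may be worth checking more than one module with the server vendor's hardware diagnostics.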