Jun's Pocket Plane: Setting up mcelog to work with systemd

Sunday, June 14, 2015

If you regularly look at system logs such as /var/log/messages, dmesg, or journalctl (systemd), you will eventually encounter a Machine Check Event (MCE), which warns you that some kind of hardware error has occurred. A common cause of an MCE is a bit flipping in RAM. For servers with ECC memory this is less of a problem, because single-bit errors can be corrected automatically. On servers in the field with uptimes greater than 365 days, it is not hard to find MCEs logged here and there. On the RHEL 5/6 machines I encounter in the field, MCEs are logged to the file /var/log/mcelog and the mcelog service runs by default.

Recently I noticed that every few days the kernel ring buffer (dmesg) on my work laptop shows the following message:

[Jun11 23:49] mce: [Hardware Error]: Machine check events logged
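To check whether your own machine has logged similar events, you can filter the kernel log for MCE messages. The following is only a minimal sketch: the function name filter_mce and the search pattern are illustrative assumptions based on the message shown above, not something mcelog itself provides.

```shell
#!/bin/sh
# Filter kernel log text (read from stdin) down to lines that look like
# Machine Check Event messages. The pattern is an assumption derived from
# the sample message above; adjust it if your kernel words things differently.
filter_mce() {
    grep -i -E 'mce:|machine check'
}

# Typical invocations on a live system (may require root):
#   dmesg | filter_mce
#   journalctl -k | filter_mce
```

On a systemd machine, journalctl -k shows kernel messages from the journal, so the second invocation also covers earlier parts of the current boot that may have scrolled out of the ring buffer.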

However, when I navigated to /var/log/ I could not find any file named mcelog. Some old posts floating around the Internet recommend redirecting mcelog's output to a file, e.g. /usr/sbin/mcelog > mcelog.out, but this didn't work for me. First, make sure you have the mcelog package (as it is called in Arch) installed. To enable the daemon at boot under systemd, run systemctl enable mcelog. On a Linux machine where logging is handled by systemd's journal rather than a traditional syslog daemon, you also need to make some changes to /etc/mcelog/mcelog.conf.

What led me astray was the ArchWiki page on MCE handling, which recommends uncommenting the line:

syslog = yes

If you are running systemd you do NOT want the above setting! The problem is that systemd handles system logging through its own journal (journald, which you read with journalctl). You can follow the other suggestions in the ArchWiki to run mcelog as a daemon (daemon = yes), but make sure the syslog lines are commented out. You also need to specify an output log file for MCEs by uncommenting the following in /etc/mcelog/mcelog.conf:

logfile = /var/log/mcelog

Also uncomment the following in /etc/mcelog/mcelog.conf:

run-credentials-user = root
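The configuration edits described above can also be applied with a small script instead of by hand. This is a hedged sketch, not an official tool: configure_mcelog is a hypothetical helper, and it assumes the options already exist in the file as commented-out lines such as #logfile = ... and #run-credentials-user = ... (if a line is missing entirely, nothing is added). Try it on a copy of the file first.

```shell
#!/bin/sh
# Apply the settings discussed above to an mcelog config file.
# Usage: configure_mcelog /path/to/mcelog.conf
# Assumption: the options are present as commented-out lines.
configure_mcelog() {
    conf=$1
    tmp=$(mktemp) || return 1
    # 1. Comment out any active "syslog = ..." or "syslog-error = ..." line.
    # 2. Replace a commented-out logfile line with logfile = /var/log/mcelog.
    # 3. Uncomment run-credentials-user.
    # 4. Uncomment daemon (daemon mode, as suggested by the ArchWiki).
    sed -E \
        -e 's/^[[:space:]]*(syslog(-error)?[[:space:]]*=.*)$/#\1/' \
        -e 's|^[[:space:]]*#[[:space:]]*logfile[[:space:]]*=.*$|logfile = /var/log/mcelog|' \
        -e 's/^[[:space:]]*#[[:space:]]*(run-credentials-user[[:space:]]*=.*)$/\1/' \
        -e 's/^[[:space:]]*#[[:space:]]*(daemon[[:space:]]*=.*)$/\1/' \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, you might run cp /etc/mcelog/mcelog.conf /tmp/mcelog.conf, then configure_mcelog /tmp/mcelog.conf, and compare the two files with diff before copying the result back as root. The result is written back with cat rather than mv so that the original file's ownership and permissions are left alone.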

Then restart the mcelog service:

systemctl restart mcelog

Next time a Machine Check Event occurs, it will be written to /var/log/mcelog. Here is some sample output:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 5
MISC b8a0000086 ADDR ffb07500
TIME 1434034181 Thu Jun 11 23:49:41 2015
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 69
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 5
MISC 78a0000086 ADDR ffb07500

This appears to indicate an error in the CPU's Level-2 cache.
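The flag lines in the output (Error overflow, Uncorrected error, and so on) come from the top bits of the STATUS value. As a rough illustration, the sketch below decodes bits 63 to 57 of a 16-digit hex STATUS string. The bit positions follow the layout of the IA32_MCi_STATUS register as I understand it from Intel's documentation, so treat them as an assumption to verify against the manual; decode_mci_status and print_flag are hypothetical helper names.

```shell
#!/bin/sh
# Print a label if the given mask bit is set in the given byte value.
# $1 = byte value (decimal), $2 = bit mask (decimal), $3 = label
print_flag() {
    if [ $(( $1 & $2 )) -ne 0 ]; then
        echo "$3"
    fi
}

# Decode the flag bits 63..57 of an MCi status value given as 16 hex digits,
# e.g. the STATUS field printed by mcelog.
decode_mci_status() {
    status=$1
    case $status in
        ''|*[!0-9a-fA-F]*) echo "not a hex value: $status" >&2; return 1 ;;
    esac
    if [ "${#status}" -ne 16 ]; then
        echo "expected 16 hex digits, got ${#status}" >&2
        return 1
    fi
    # Only the most significant byte is needed, which sidesteps
    # 64-bit overflow in shells that use signed integers.
    top=$(printf '%s' "$status" | cut -c1-2)
    byte=$((0x$top))
    print_flag "$byte" 128 "VAL (bit 63): status register valid"
    print_flag "$byte" 64 "OVER (bit 62): error overflow"
    print_flag "$byte" 32 "UC (bit 61): uncorrected error"
    print_flag "$byte" 16 "EN (bit 60): error reporting enabled"
    print_flag "$byte" 8 "MISCV (bit 59): MCi_MISC register valid"
    print_flag "$byte" 4 "ADDRV (bit 58): MCi_ADDR register valid"
    print_flag "$byte" 2 "PCC (bit 57): processor context corrupt"
}
```

Calling decode_mci_status ee0000000040110a with the STATUS value from the sample prints the VAL, OVER, UC, MISCV, ADDRV and PCC lines, which lines up with the flags mcelog reported.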

Here is my /etc/mcelog/mcelog.conf file:
