How to check memory error count in Linux

We have 8 DIMMs, 4 on each controller in this system.

sudo dmidecode -t memory  |grep -A10 'Locator: DIMM' | grep Serial.Number | grep -v NO.DIMM
    Serial Number: E22C60A2
    Serial Number: E12C4EA2
    Serial Number: DD2C61A2
    Serial Number: E32C4EA2
    Serial Number: E32C53A2
    Serial Number: DE2C5DA2
    Serial Number: E02C50A2
    Serial Number: E22C65A2
sudo dmidecode -t memory  |grep -A10 'Locator: DIMM' | grep Serial.Number | grep -v NO.DIMM | wc -l
8

Correctable error count

ls  -l /sys/devices/system/edac/mc/mc*/csrow*/*ce_count
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc0/csrow0/ce_count       - Correctable error count for this row
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count   - Correctable error count for this channel
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc1/csrow0/ce_count
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count
-r--r--r-- 1 root root 4096 Apr 21 01:48 /sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count

There are no correctable errors.

cat /sys/devices/system/edac/mc/mc*/csrow*/*ce_count
0
0
0
0
0
0
0
0
0
0

Uncorrectable errors

cat  /sys/devices/system/edac/mc/mc*/csrow*/ue_count
0
0

This is provided by the EDAC (Error Detection and Correction) kernel modules.

lsmod  | grep edac
sb_edac                27005  0
edac_core              57973  1 sb_edac
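
To get a single number instead of a column of zeros, the per-row counters can be added up. Below is a minimal sketch, assuming the same sysfs layout as shown above; edac_total is a made-up helper name, not part of any EDAC tooling. It sums only the per-row files (ce_count or ue_count), not the per-channel ch*_ce_count files, so nothing is counted twice.

```shell
#!/bin/sh
# edac_total: hypothetical helper that sums one kind of EDAC counter.
#   $1  counter file name: ce_count or ue_count
#   $2  EDAC root directory (optional, defaults to the sysfs path used above)
edac_total() {
    counter=$1
    root=${2:-/sys/devices/system/edac/mc}
    total=0
    for f in "$root"/mc*/csrow*/"$counter"; do
        [ -r "$f" ] || continue          # glob matched nothing, or file unreadable
        n=$(cat "$f")
        total=$((total + ${n:-0}))       # treat an empty file as 0
    done
    echo "$total"
}

echo "correctable:   $(edac_total ce_count)"
echo "uncorrectable: $(edac_total ue_count)"
```

On a machine without the EDAC modules loaded, the glob matches nothing and both totals are reported as 0, so a zero here only means something if lsmod shows the modules.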

There is a utility called edac-util.
http://fibrevillage.com/sysadmin/240-how-to-identify-defective-dimm-from-edac-error-on-linux-2

Good reference
http://www.admin-magazine.com/Articles/Monitoring-Memory-Errors


Remove interface from a bond

[cohesity@benjamin_ve ~]$ cat /proc/net/bonding/bond0 | grep Int
MII Polling Interval (ms): 100
Slave Interface: eno16784128
Slave Interface: eno33555456
[cohesity@benjamin_ve ~]$ sudo ifenslave -d bond0 eno33555456
[cohesity@benjamin_ve ~]$ cat /proc/net/bonding/bond0 | grep Int
MII Polling Interval (ms): 100
Slave Interface: eno16784128
[cohesity@benjamin_ve ~]$
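
The before/after check above can be wrapped in a small function that prints only the slave interface names. This is a sketch under the assumption that the bonding status file has the "Slave Interface: name" lines shown above; bond_slaves is a made-up name.

```shell
#!/bin/sh
# bond_slaves: hypothetical helper that lists the slave interfaces of a bond.
#   $1  bonding status file (optional, defaults to bond0's file used above)
bond_slaves() {
    # Split each line on ": " and print the value of the "Slave Interface" lines.
    awk -F': ' '/^Slave Interface:/ { print $2 }' "${1:-/proc/net/bonding/bond0}"
}

# Example use (not run here):
#   bond_slaves                              # slaves of bond0
#   bond_slaves /proc/net/bonding/bond1      # slaves of another bond
```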

Delete host key from known_hosts file

I normally delete the whole known_hosts file on my machine (the client, not the server) when I ssh to a server whose fingerprint has changed. You can delete just that host’s key instead, as follows.

ssh-keygen -R hostname
# Host hostname found: line 6
/Users/benjaminr/.ssh/known_hosts updated.
Original contents retained as /Users/benjaminr/.ssh/known_hosts.old

Of course, you should only do this when you know the server’s key really did change.

awk examples

The Basics

An awk script usually takes the form

awk 'pattern { action }' file

where both pattern and action are optional. If there is no pattern, the action is performed on all lines. If there is no action, the default action is to print all matching lines. Awk scripts are placed within single quotes when run on the command line to prevent the shell from interpreting them. Long awk scripts can be placed in a file and run as

awk -f the_script input_file
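
To make the pattern-only, action-only and combined forms concrete, here is a small sketch on made-up input. The sample data and the function names are illustrative only.

```shell
#!/bin/sh
# Made-up sample input: a name and a score on each line.
sample() {
    printf '%s\n' 'alice 90' 'bob 45' 'carol 72'
}

# Pattern only: the default action prints every line where field 2 exceeds 50.
pattern_only() { sample | awk '$2 > 50'; }

# Action only: with no pattern, the action runs on every line.
action_only() { sample | awk '{ print $1 }'; }

# Pattern and action: print the name only for lines where field 2 exceeds 50.
pattern_and_action() { sample | awk '$2 > 50 { print $1 }'; }
```

Calling pattern_only prints the alice and carol lines in full, action_only prints all three names, and pattern_and_action prints just alice and carol.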

We can pass multiple files to awk as well.

awk 'pattern { action }' file1 file2 file3
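
When several files are given, awk reads them one after another. The built-in variables FILENAME (the file currently being read) and FNR (the line number within that file) tell them apart. A minimal sketch, with label_lines as a made-up function name:

```shell
#!/bin/sh
# label_lines: hypothetical helper that prefixes every input line with the
# name of the file it came from and its line number within that file.
label_lines() {
    awk '{ print FILENAME ":" FNR ": " $0 }' "$@"
}

# Example use (not run here):
#   label_lines file1 file2 file3
```

Unlike NR, which keeps counting across all input, FNR starts again at 1 for each file.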

Examples:

Print all lines in a file (like cat file)
awk '{print $0}' file
awk '{print}' file
awk '1' file