Nothing new here, but I’ll use this post as a quick reminder of how to enable kernel dumps on Red Hat Enterprise Linux 4. Quite useful when you don’t know why your box keeps crashing this week when it was perfectly fine for the last 2 years :D
Red Hat historically used netdump to save a kernel dump file, but with RHEL 4 they introduced diskdump. That is good, as I don’t want to bother setting up a netdump server (even if there are a few interesting things in favour of netdump, like “what if your system doesn’t recover…”). The main difference between them is easy to understand: netdump writes the file over the network to a server (much like a syslog server), while diskdump writes it to a local disk.
Here are the few steps to follow:
Everything should be already installed, but in case it’s not on your system:
# up2date crash diskdumputils
crash is a utility used to analyse the dump file.
Check whether the diskdump kernel module is loaded; if it isn’t, load it.
# lsmod | grep diskdump
# modprobe diskdump
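If you want to script this check, here is a minimal sketch. The function name module_loaded is mine (hypothetical), and it assumes the module name is the first field of each line in /proc/modules, as it is in lsmod output. Unlike a plain grep, it only matches the exact module name.

```shell
# Succeed if the named module is listed in an lsmod-style file.
# $1 = module name, $2 = file to read (defaults to /proc/modules)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Example (as root), left as a comment so nothing is loaded by accident:
# module_loaded diskdump || modprobe diskdump
```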
Configure which partition you want to use. There are 2 choices: either a partition dedicated to this purpose, or a swap partition. If you don’t use a swap partition, you’ll have to format it. The partition you choose must be at least the size of your RAM.
# cat /proc/swaps
# vi /etc/sysconfig/diskdump
You can specify more than one partition by separating them with ‘:’, for example:
DEVICE=/dev/sda4:/dev/sda5
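The "at least the size of your RAM" rule is easy to check by hand, but here is a rough sketch of how it could be scripted. The function names are hypothetical, and I am assuming the usual layouts: MemTotal in kB in /proc/meminfo, and the size in 1 KiB blocks in the third column of /proc/partitions. It only compares the two numbers; it does not know about any extra space diskdump may want for its own headers.

```shell
# Print MemTotal in kB from a meminfo-style file (default /proc/meminfo).
mem_total_kb() {
    awk '$1 == "MemTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# Print the size in 1 KiB blocks of partition $1 (e.g. sda4) from a
# partitions-style file (default /proc/partitions).
part_size_kb() {
    awk -v p="$1" '$4 == p { print $3 }' "${2:-/proc/partitions}"
}

# Succeed if partition $1 is at least as large as RAM.
# Optional: $2 = partitions file, $3 = meminfo file.
dump_part_big_enough() {
    local mem size
    mem=$(mem_total_kb "${3:-}")
    size=$(part_size_kb "$1" "${2:-}")
    [ -n "$mem" ] && [ -n "$size" ] && [ "$size" -ge "$mem" ]
}

# Example:
# dump_part_big_enough sda4 || echo "sda4 is smaller than RAM"
```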
If you didn’t choose a swap partition, you have to format your device in order to use it as a dump device.
# service diskdump initialformat
Tell your system to start diskdump automatically at the next reboot and start it.
# chkconfig diskdump on
# service diskdump start
If you get a warning, it may just mean that your partition is not good (wrong partition specified in /etc/sysconfig/diskdump, or you forgot to format it, …).
Check that the module works fine. The output should look something like the following:
# cat /proc/diskdump
# sample_rate: 8
# block_order: 2
# fallback_on_err: 1
# allow_risky_dumps: 1
# dump_level: 0
# compress: 0
# total_blocks: 2097059
#
sda4 4401810 4192965
sda5 8594838 4192902
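To check from a script that your devices really got registered, something like the sketch below could do. It assumes /proc/diskdump looks like the sample above, i.e. parameter lines start with ‘#’ and every other line starts with a device name. The function name diskdump_devices is hypothetical.

```shell
# Print the devices listed in a /proc/diskdump-style file, one per line.
# Lines starting with '#' are treated as parameters and skipped.
diskdump_devices() {
    awk '!/^#/ && NF >= 1 { print $1 }' "${1:-/proc/diskdump}"
}

# Example:
# diskdump_devices | grep -qx sda4 || echo "sda4 is not registered"
```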
When a crash occurs, the data on the disk dump partition is not directly readable. You need to gather it into a file on a readable partition. This is done with the following command:
# savecore -vD /dev/sda4
You can add this command to /etc/rc.local if you want to run it automatically after a crash.
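If you listed several partitions in DEVICE, a possible sketch for /etc/rc.local is to loop over them. The helper split_devices is hypothetical, and I am assuming that /etc/sysconfig/diskdump can be sourced as a plain shell file and that running savecore on each device in turn is what you want.

```shell
# Print each device from a colon-separated DEVICE value on its own line.
split_devices() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# Example for /etc/rc.local, left as comments so that nothing runs
# unless you copy it deliberately:
# . /etc/sysconfig/diskdump
# for dev in $(split_devices "$DEVICE"); do
#     savecore -vD "$dev"
# done
```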
That’s it! Next time your kernel crashes, you’ll get a file called vmcore stored in /var/crash/127.0.0.1-\/. If for any reason the dump is not complete, the file will be named vmcore-incomplete instead.
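After a reboot you may want a quick look at what was saved. Here is a small sketch that walks the dump directories and tells you whether each one holds a vmcore or a vmcore-incomplete. The function name list_dumps is hypothetical, and it assumes one subdirectory per crash under /var/crash.

```shell
# List dumps found under $1 (default /var/crash), one line per dump
# directory, flagging incomplete ones.
list_dumps() {
    local base d
    base=${1:-/var/crash}
    for d in "$base"/*/; do
        [ -d "$d" ] || continue
        if [ -f "${d}vmcore" ]; then
            echo "complete: ${d}vmcore"
        elif [ -f "${d}vmcore-incomplete" ]; then
            echo "incomplete: ${d}vmcore-incomplete"
        fi
    done
}

# Example:
# list_dumps
```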
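You can test that it works by doing one of the following things. Either one will crash your server and therefore create a dump file. The first line does not crash anything by itself, it only enables the magic SysRq key that the keyboard combination relies on.
# echo 1 > /proc/sys/kernel/sysrq
# Alt-SysRq-C or
# echo c > /proc/sysrq-trigger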
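You can also quickly compile the following code [cc -c -I/usr/src/linux/include panic.c]
and load this nice module :D [insmod panic.o]
Note that this is the 2.4-style way of building a module; on a 2.6 kernel such as the one in RHEL 4, modules are normally built through the kernel build system (kbuild) and loaded as panic.ko.
#### panic.c #####
#define __KERNEL__
#define MODULE
#include <linux/module.h>
#include <linux/kernel.h>

/* Called when the module is loaded: panic right away to force a dump. */
int init_module(void)
{
	panic(" panic has been called");
	return 0; /* never reached */
}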
Now, in order to analyse your dump, crash requires the kernel-debuginfo package corresponding to the kernel you are running. You can find it there. Then you need to install it.
wget http://updates.redhat.com/enterprise/4AS/en/os/Debuginfo/i386/RPMS/kernel-debuginfo-2.6.9-89.EL.i686.rpm && \
su -c "rpm -Uvh kernel-debuginfo-2.6.9-89.EL.i686.rpm"
After your kernel panic and a reboot, you can now run crash with the following arguments (vmlinux_path vmcore_path).
# crash /usr/lib/debug/lib/modules/2.6.9-89.EL/vmlinux \
/var/crash/127.0.0.1-2009-06-07-14\:44/vmcore
OUTPUT
KERNEL: /usr/lib/debug/lib/modules/2.6.9-89.EL/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2009-06-07-14:44/vmcore
CPUS: 1
DATE: Tue Jun 8 15:02:44 2009
UPTIME: 00:51:24
LOAD AVERAGE: 0.05, 0.05, 0.00
TASKS: 81
NODENAME: rhxxx.test.xxx.org.uk
RELEASE: 2.6.9-89.EL
VERSION: #1 Fri Feb 24 16:44:51 EST 2006
MACHINE: i686 (3600 Mhz)
MEMORY: 1.5 GB
PID: 3514
COMMAND: "crash"
TASK: f562ecd0 [THREAD_INFO: f4b7d000]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
I’ll let you find out all the things you can do with crash (try “man crash”, or help once crash is running), but the commands ‘log’ and ‘sys’ will be the ones you are going to use the most.
You can find many more details here.
Enjoy :D