Monthly Archive for January, 2012

Monitor RAID Arrays and Get E-mail Alerts Using mdadm

On my desktop computer I use software RAID1 to protect against data loss from a hard drive failure. I have two hard drives, each with four identically sized partitions. Partition 1 on disk A is mirrored with partition 1 on disk B. Together they form the "multiple-device" device node md1, which can then be treated like any other block device. Partitions 2, 3, and 4 on the disks make up md2, md3, and md4 respectively.

You can use mdadm to configure a software raid in Linux. To see the status of the raid you can view the contents of the /proc/mdstat file. For my software raid the contents of the file should look like:

media:~ # cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb2[1] sda2[0]
      4194240 blocks [2/2] [UU]
md4 : active raid1 sdb4[2] sda4[0]
      867607168 blocks [2/2] [UU]
md1 : active raid1 sdb1[1] sda1[0]
      102336 blocks [2/2] [UU]
md3 : active raid1 sdb3[1] sda3[0]
      104857536 blocks [2/2] [UU]
unused devices: <none>

Note that the components of each RAID device are listed along with its status. For md1 it shows that sda1 and sdb1 are members of the array. It also shows that both sides of the mirror are good, as indicated by the [UU].

Periodically I check the status of the raid by checking /proc/mdstat, but not nearly often enough. The other day I was surprised to find that one of the disks in my software raid had failed. The /proc/mdstat showed me this:

media:~ # cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda2[0]
      4194240 blocks [2/1] [U_]
md4 : active raid1 sda4[0]
      867607168 blocks [2/1] [U_]
md1 : active raid1 sda1[0]
      102336 blocks [2/1] [U_]
md3 : active raid1 sda3[0]
      104857536 blocks [2/1] [U_]
unused devices: <none>

Basically, the disk /dev/sdb had been removed from the array for some reason. I know this because /dev/sdb is not listed in any of the arrays and the status of each array is [U_], which means only one side of the mirror is active.
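The [UU] versus [U_] check described above is easy to script. The following is only a minimal sketch: the function name degraded_arrays is made up for illustration, and it assumes the status field looks like the examples in this post (a bracketed run of U and _ characters).

```shell
# Print the name of every md array whose status field contains "_",
# meaning at least one member of the array is missing.
# Reads /proc/mdstat unless another file is given as the first argument.
degraded_arrays() {
    awk '
        /^md/                  { name = $1 }    # remember the array this block belongs to
        /\[[U_]*_[U_]*\]/      { print name }   # status such as [U_] or [_U]
    ' "${1:-/proc/mdstat}"
}

degraded_arrays
```

On a healthy system this prints nothing. With the degraded output shown above it would list each of the four arrays, one per line.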

After combing through /var/log/messages a bit I found some errors that explain what happened:

kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: ata3.00: failed command: FLUSH CACHE EXT
kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
kernel: ata3.00: status: { DRDY }
kernel: ata3: hard resetting link
kernel: ata3: link is slow to respond, please be patient (ready=0)
kernel: ata3: SRST failed (errno=-16)
kernel: ata3: hard resetting link
kernel: ata3: link is slow to respond, please be patient (ready=0)
kernel: ata3: SRST failed (errno=-16)
kernel: ata3: hard resetting link
kernel: ata3: link is slow to respond, please be patient (ready=0)
kernel: ata3: SRST failed (errno=-16)
kernel: ata3: limiting SATA link speed to 1.5 Gbps
kernel: ata3: hard resetting link
kernel: ata3: SRST failed (errno=-16)
kernel: ata3: reset failed, giving up
kernel: ata3.00: disabled
kernel: ata3.00: device reported invalid CHS sector 0
kernel: ata3: EH complete
kernel: sd 2:0:0:0: [sdb] Unhandled error code
kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
kernel: sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 0d 03 27 80 00 00 08 00
kernel: end_request: I/O error, dev sdb, sector 218310528
kernel: end_request: I/O error, dev sdb, sector 218310528
kernel: md: super_written gets error=-5, uptodate=0
kernel: md/raid1:md3: Disk failure on sdb3, disabling device.
kernel: <1>md/raid1:md3: Operation continuing on 1 devices.

From the timestamps on the messages (not shown above) I found out that it had been over a month since this had occurred. A failure on the other disk during this time could have left me with data loss (luckily, I also back up my important files to an offsite server, but let's pretend I don't).

I decided at this point that I need a better way to be notified when things like this happen. It turns out you can configure mdadm to monitor the raid arrays and notify you via email when errors occur. To achieve this effect I added the following line to the /etc/mdadm.conf file:
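The configuration line itself is missing from this copy of the post. mdadm's MAILADDR directive is the setting that tells monitor mode where to send alerts, so the line would have looked something like the following. The address is a placeholder, not the one from the original post.

```
# /etc/mdadm.conf
# Placeholder address; substitute the mailbox that should receive alerts.
MAILADDR you@example.com
```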


Now, as long as mdadm is configured to run and monitor the arrays (on SUSE it is the mdadmd service), then you will get email alerts when things go wrong.

To verify your emails are working you can use the following command, which will send out a test email:

sudo mdadm --monitor --scan --test -1

With email monitoring in place you should get fair warning before a second failure costs you data. Make sure you send out the test email and verify it isn't getting filtered out as spam.


Dusty Mabe

Recover Space By Finding Deleted Files That Are Still Held Open.

The other day I was trying to clean out some space on an almost full filesystem that I use to hold some video files. The output from df looked like:

media:~ # df -kh /docs/videos/
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vgvolume-videos  5.0G  4.2G  526M  90% /docs/videos

I then found the largest file I wanted to delete (a 700M avi video I had recently watched), and removed it. df should now report that I freed up some space right? NOPE!

media:~ # df -kh /docs/videos/
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vgvolume-videos  5.0G  4.2G  526M  90% /docs/videos

Why wasn't I able to recover the space on the filesystem? At first I didn't know. I then decided to unmount and remount the filesystem to see if the changes would take effect. During this process, I found out that I couldn't unmount the fs because the device was busy:

media:~ # umount /docs/videos/
umount: /docs/videos: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

Ahh, so a program still has some files open on the fs. lsof lets us find out which files.

media:~ # lsof /docs/videos/
COMMAND     PID      USER   FD   TYPE DEVICE  SIZE/OFF NODE NAME
bash      10889 dustymabe  cwd    DIR  253,6      4096    2 /docs/videos
vlc       11244 dustymabe  cwd    DIR  253,6      4096    2 /docs/videos
vlc       11244 dustymabe  11r    REG  253,6 732297098   14 /docs/videos/video1.avi (deleted)
xdg-scree 11281 dustymabe  cwd    DIR  253,6      4096    2 /docs/videos
xprop     11285 dustymabe  cwd    DIR  253,6      4096    2 /docs/videos

So lsof shows us which files are open, but it also told me something else that was key to my investigation. The file I had deleted was still being held open by VLC media player (I had recently been watching the video and vlc was still up and paused in the middle of playback). lsof marks the file as "(deleted)" as well. After closing vlc the space was released back to the filesystem :)

media:~ # df -kh /docs/videos/
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vgvolume-videos  5.0G  3.5G  1.2G  75% /docs/videos


As a side note, it is worth mentioning that files which have been deleted but are still held open by running processes are listed as "(deleted)" when performing an ls -l in the proc file system. For example, to find the file that I deleted that was still being held open by vlc, I would use ls -l /proc/11244/fd/ | grep deleted. An example of the output is shown below:

media:~ # ls -l /proc/11244/fd/ | grep deleted
lr-x------ 1 dustymabe users 64 Jan 22 21:58 11 -> /docs/videos/video1.avi (deleted)
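The same check can be run across every process instead of a single PID. This is a rough sketch for Linux only; the function name list_deleted_open_files is made up for illustration, and it relies on the kernel appending " (deleted)" to the link target as shown above.

```shell
# Walk every file descriptor under /proc and print the ones whose target
# has been deleted. Run as root to see other users' processes; entries
# that cannot be read (or that vanish mid-scan) are skipped.
list_deleted_open_files() {
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}

list_deleted_open_files
```

Each output line contains the PID in the /proc path, so you know which program to close. lsof also has a +L1 option that selects open files with a link count below one, which covers much the same ground.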


Create a screencast of a terminal session using scriptreplay.

I recently ran into an issue where I needed to demo a project without actually being present for the demo. I thought about recording (into some video format) a screencast of my terminal window and then having my audience play it at the time of the demo. This would have worked just fine, but, as I was browsing the internet searching for exactly how to record a screencast of this nature, I ran across a blog post talking about how to play back terminal sessions using the output of the script program. This piqued my interest for several reasons:
  1. I have used script many times to log results of tests.
  2. This method of recording a shell session is much more efficient than recording it into video because it only stores the actual text of the output as well as some timing information.

In order to "record" using script you store the terminal output and the timing information to two different files using a command like the following:

script -t 2> screencast.timing screencast.log

The -t in the command causes script to output timing data to standard error. The 2> screencast.timing causes standard error to be redirected to the file screencast.timing. The screencast.log file will hold everything printed on the terminal during the course of the screencast session.

After running the command, execute a few programs (ls, echo "Hi Planet", etc...) and then type exit to end the session.

Now the screencast is stored in the two files. You can play back the screencast using:

scriptreplay screencast.timing screencast.log

And Voila. As long as your target audience has a box with script/scriptreplay installed then they can view your screencast!

I have included a screencast.timing and a screencast.log file you can download and use with scriptreplay if you would like to try this out. The files combined are under 5K in size for a 1 minute 40 second screencast; however, I think the size of the files depends more on how much data was output than on the length of the screencast.

Even better than using my screencast, create one yourself!


Hi Planet! – SSH: Disable checking host key against known_hosts file.

Hi Everyone! Since this is my first post it is going to be short and sweet. :)

I work on a daily basis with Linux servers that must be installed, configured, re-installed, configured, etc... Over and over, develop and test. Our primary means of communication with these servers is through ssh. Every time a server is re-installed it generates a new SSH host key, and thus you will always get a "Man in the Middle Attack" warning from SSH like:

[root@fedorabook .ssh]# ssh
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
08:50:e8:e4:1b:17:fd:69:08:bf:44:f2:c4:e4:8a:27.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending key in /root/.ssh/known_hosts:1
RSA host key for has changed and you have requested strict checking.
Host key verification failed.

You then have to open the ~/.ssh/known_hosts file and remove the offending key.
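For a one-off cleanup, ssh-keygen can remove the stale entry without opening the file by hand. The sketch below exercises ssh-keygen -R against a scratch known_hosts file so nothing in your real ~/.ssh is touched; the host name labserver.example.com and the throwaway key are purely illustrative.

```shell
# Build a scratch known_hosts file containing one entry for a made-up host.
demo_dir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -f "$demo_dir/hostkey"   # throwaway key standing in for a host key
printf 'labserver.example.com %s\n' "$(cut -d' ' -f1,2 "$demo_dir/hostkey.pub")" \
    > "$demo_dir/known_hosts"

# Remove every key recorded for that host from the scratch file.
ssh-keygen -R labserver.example.com -f "$demo_dir/known_hosts"
```

Without the -f option, ssh-keygen -R works on the default ~/.ssh/known_hosts. That still has to be repeated after every re-install, which is the chore the rest of this post tries to avoid.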

Since this is quite a mind-numbing and time-wasting task, I decided to win back those precious seconds. While perusing the ssh_config man page, I noticed in the description of the StrictHostKeyChecking option that:

"The host keys of known hosts will be verified automatically in all cases."

This means that there is virtually no way to make it ignore the fact that the key has changed. At least there is no "designed" way.

I then started looking at the file that the remote key is checked against: the ~/.ssh/known_hosts file.

From what I can tell it is also impossible to disable writing remote host keys to this file.

So here we are with:

  1. No way to stop writing remote host keys into the known_hosts file
  2. No way to ignore the fact that the key in the known_hosts file doesn't match the key in the (newly reinstalled) target server.

What is the solution to this?

Well, since we can't disable writing new keys to the known_hosts file and we can't disable checking keys that are in the known_hosts file, why don't we just make the known_hosts file always be empty? Yep, that's right. Let's just point the known_hosts file to /dev/null.

Turns out you can do this by setting the UserKnownHostsFile option in the ~/.ssh/config file.
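The configuration line did not survive in this copy of the post. Going by the description above, it would be the following entry (a reconstruction, not the original text):

```
# ~/.ssh/config
UserKnownHostsFile /dev/null
```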


Voila! Now you will never be bothered by the same message again. It isn't all fruit and berries though. Now, since your known_hosts file is always empty, you will always be presented with the following message every time you ssh to any server:

[root@fedorabook .ssh]# ssh
The authenticity of host ' (' can't be established.
RSA key fingerprint is 08:50:e8:e4:1b:17:fd:69:08:bf:44:f2:c4:e4:8a:27.
Are you sure you want to continue connecting (yes/no)?

In other words, you just traded one pain for another. However, there is a solution for this as well :). We can make SSH automatically write new host keys to the known_hosts file by setting StrictHostKeyChecking to "no" in the ~/.ssh/config file.
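This configuration line is also missing from this copy of the post. Based on the sentence above, it would be (again a reconstruction, not the original text):

```
# ~/.ssh/config
StrictHostKeyChecking no
```

Both options can also be placed under a Host block in ~/.ssh/config so that they apply only to the lab machines rather than to every server you connect to.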


And now it is smooth sailing: you can connect away without the impediment of interactive warnings and error messages. Beware, however, that there are some security implications to performing the modifications that I have suggested. The web page here has an overview of SSH security, the different ssh options, and the implications of each. Of course, it suggests that you shouldn't set StrictHostKeyChecking=no, but in my case, working on lab/test machines without sensitive data on them, I decided to take the risk.

