SysAdmin's Journey

Nagios Check_dell_warranty Plugin Updated

It’s been blogged about recently how cool the check_dell_warranty plugin by Erinn Looney-Triggs is. It solves a very real problem for us - sometimes servers run so well you forget to make sure that you renew your support contracts. However, it wasn’t quite right for us - we have some older Dells that have RHEL 4 with an older (incompatible) python on them still. Also, the plugin wouldn’t work without configuring sudo. Well, like any other sysadmin would do, I fixed it! With a very small amount of code, I added the ability to specify the serial number on the command line with a -s parameter. This gives you a lot of flexibility:

  • It lets you run this plugin on older machines with older Python interpreters
  • It eliminates the dependency requiring a recent dmidecode
  • It eliminates the sudo dependency, no need to configure sudo
  • It gives you the ability to have your Nagios server check the warranty status on multiple hosts without using ssh/nrpe/etc.
  • If you don’t use the -s flag, the plugin will function like before, using sudo and dmidecode to determine the service tag.

Erinn also found a bug at the same time I did where the check was working properly, but it would never return a WARNING or CRITICAL state. It’s fixed now, so if you downloaded it before, make sure to get the most recent version. Erinn was kind enough to include my changes in the main script, so go download it!
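Since -s lets one machine check any service tag, the Nagios server can walk a list of tags on its own. Here is a minimal sketch of that idea. The plugin path, the "hostname tag" file format, and the function name are my assumptions for illustration; only the -s flag comes from the plugin as described above, and the exit codes are assumed to follow the usual Nagios convention (0 OK, 1 WARNING, 2 CRITICAL).

```sh
#!/bin/sh
# Sketch: run the warranty check for a list of service tags from one host.
# PLUGIN is a hypothetical install path; adjust it to wherever the plugin lives.
PLUGIN=${PLUGIN:-/usr/local/nagios/libexec/check_dell_warranty}

check_warranties() {
    # $1: file with one "hostname service_tag" pair per line (assumed format)
    worst=0
    while read host tag; do
        # skip blank lines and comments
        case "$host" in ''|'#'*) continue ;; esac
        # stdin is redirected so the plugin cannot swallow the rest of the list
        output=`"$PLUGIN" -s "$tag" < /dev/null`
        rc=$?
        echo "$host ($tag): $output"
        # remember the most severe exit code seen so far
        if [ "$rc" -gt "$worst" ]; then worst=$rc; fi
    done < "$1"
    return $worst
}
```

Calling check_warranties /path/to/tags.txt prints one line per host and returns the worst status seen, so it can be wrapped in a single Nagios check or run from cron.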

Using the Sun StorageTek A1000 Under Solaris 10

We have an old Sun StorageTek A1000 sitting around that we use as scratch space for our Amanda backups. Sure, it’s old and slow, but it works very well for what we use it for. After upgrading to Solaris 10, the mountpoints worked, but I was getting a lot of complaints when loading the rdriver module from the Raid Manager application. Upon reboot into Solaris 10, I got a huge number of these messages spewed to the console:

May 21 23:24:47 myhost krtld: [ID 819705 kern.notice] /kernel/drv/sparcv9/rdriver: undefined symbol
May 21 23:24:47 myhost krtld: [ID 826211 kern.notice] 'dev_get_dev_info'
May 21 23:24:47 myhost krtld: [ID 472681 kern.notice] WARNING: mod_load: cannot load module 'rdriver'
May 21 23:24:47 myhost genunix: [ID 370176 kern.warning] WARNING: forceload of drv/rdriver failed

It turns out that the A1000 was EOL before the Solaris 10 release, so I guess I can’t blame Sun for that. But, thanks to a post over on the Sun forums, everything still works with some configuration. First, you need to tell the system not to load the rdriver and rdnexus modules:

rem_drv rdriver
rem_drv rdnexus

Next, you need to tell the software not to use the multipath features of the driver (the A1000 has only one controller, so multipathing is impossible anyway). Make a backup of /etc/osa/rmparams, then change these two lines:

Rdac_SupportDisabled=FALSE
Rdac_SinglePathSupportDisabled=FALSE

to look like these two lines:

Rdac_SupportDisabled=TRUE
Rdac_SinglePathSupportDisabled=TRUE
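If you'd rather script the backup and the edit, something like the following should do it. This is a sketch, not a tested procedure for every Raid Manager release: the function name and the .orig suffix are my own choices, and it assumes the two settings appear exactly as shown above, at the start of a line.

```sh
#!/bin/sh
# Sketch: back up an rmparams file, then flip the two Rdac settings to TRUE.
disable_rdac() {
    # $1: path to rmparams (normally /etc/osa/rmparams)
    params=$1
    # keep an untouched copy next to the original
    cp -p "$params" "$params.orig" || return 1
    # read from the backup and write over the original, which keeps the
    # original file's ownership and permissions without needing sed -i
    sed -e 's/^Rdac_SupportDisabled=FALSE/Rdac_SupportDisabled=TRUE/' \
        -e 's/^Rdac_SinglePathSupportDisabled=FALSE/Rdac_SinglePathSupportDisabled=TRUE/' \
        "$params.orig" > "$params"
}
```

Run it as root with disable_rdac /etc/osa/rmparams, then diff the file against the .orig copy to confirm that only those two lines changed.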

Reboot, and all is well!

LiveUpgrade From Solaris 9 to Solaris 10

Here’s how to leverage LiveUpgrade to safely upgrade from Solaris 9 to Solaris 10 using a spare disk. No data is ever deleted, and to roll back to Solaris 9, all you need is one command and a reboot. Let’s continue on from the example in Unmirroring a RAID 1 Root Volume on Solaris - we have two disks, c1t0d0 and c1t1d0. c1t0d0 is our Solaris 9 disk, c1t1d0 is our spare.

Phase One: Prepare the Solaris 9 environment

Before anything else, we need to prepare the Solaris 9 environment with some packages and patches. Start by checking Sun’s InfoDoc 206844 and making sure you have all of the required patches installed. Next, for LiveUpgrade to work properly, you need to install the LiveUpgrade packages from your Solaris 10 install media on your Solaris 9 box. First, remove the existing packages:

pkgrm SUNWlucfg SUNWluu SUNWlur

Then, install the new packages from your Solaris 10 media (if you’re using CD’s, it’s on disc 2):

cd $SOLARIS10MEDIA/Solaris_10/Tools/Installers
./liveupgrade20

Next, let’s copy the disk label from c1t0d0 over to c1t1d0 giving us the exact same disk layout on the new disk as the old disk:

prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

That completes the prep work for our LiveUpgrade. Now, we need to create our boot environment.

Phase Two: Create the New Boot Environment

First, let’s make some assumptions. Since we’re upgrading from Solaris 9 to Solaris 10, we’ll be upgrading from UFS to UFS file systems. Also, since we are upgrading from one disk to another, we will copy all filesystems from c1t0d0 to c1t1d0 - no filesystems will be shared. In order to create our new boot environment, we will use the ‘lucreate’ command. Here are the flags we’ll use:

Flag Description

-c Sets the current boot environment name to this name. In this example, we use Solaris9

-n Sets the newly created environment’s name to this name. In this example, we use Solaris10

-m The -m option is the most critical part of the lucreate command. It specifies the filesystems in the new environment. To create more than one filesystem, you use the -m flag more than once. By using different variations on the -m flag, you can reorganize, resize, and merge filesystems, but that’s beyond the scope of this article. The value to the -m flag is: mountpoint:device:fs_options. From the previous example in Unmirroring a RAID 1 Root Volume on Solaris, we have 5 filesystems: /, /usr/local/, /apps, /export/home/, and swap.

Using the above options for our scenario, we end up with:

lucreate -c Solaris9 -n Solaris10 -m /:/dev/dsk/c1t1d0s0:ufs -m -:/dev/dsk/c1t1d0s1:swap \
-m /usr/local:/dev/dsk/c1t1d0s3:ufs -m /export/home:/dev/dsk/c1t1d0s4:ufs \
-m /apps:/dev/dsk/c1t1d0s5:ufs
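With more than a handful of filesystems, typing the -m options by hand gets error-prone. The sketch below assembles the same command from a small table of "mountpoint slice fstype" rows and only prints it, so you can review it before running anything. The function name and the row format are my own invention; the disk, slices, and boot environment names are the ones from this example.

```sh
#!/bin/sh
# Sketch: build the -m options for lucreate from "mountpoint slice fstype" rows.
TARGET_DISK=c1t1d0   # the spare disk from this example

build_lucreate() {
    # stdin: one "mountpoint slice fstype" row per line; swap uses "-" as its mountpoint
    cmd="lucreate -c Solaris9 -n Solaris10"
    while read mnt slice fstype; do
        # ignore blank lines
        [ -n "$mnt" ] || continue
        cmd="$cmd -m $mnt:/dev/dsk/${TARGET_DISK}s$slice:$fstype"
    done
    # print the command for review instead of executing it
    echo "$cmd"
}

build_lucreate <<EOF
/ 0 ufs
- 1 swap
/usr/local 3 ufs
/export/home 4 ufs
/apps 5 ufs
EOF
```

The printed line carries the same options as the hand-written command above, just without the line continuations.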

Depending on the number and size of files in your source boot environment, this may take a while. The lucreate process gives you output for each step, letting you know what it’s doing.

Phase Three: Upgrade the New Boot Environment to Solaris 10

Now that phase two is done, we have a bootable copy of our current Solaris 9 environment. Now, we need to apply the upgrade to Solaris 10 to our new boot environment. For simplicity, I have an exported Solaris 10 install DVD extracted on another server that I use the automounter to access. Now, we’ll run the ‘luupgrade’ command and tell it to install the upgrade to the new boot environment we just created:

luupgrade -u -n Solaris10 -s /net/install.mydomain.com/export/jumpstart/install/sparc_10

Again, this will take some time, and the process will give you output as it clicks along.

Phase Four: Mark the New Boot Environment as Active and Boot Into It

Now, our Solaris 10 boot environment actually has a copy of our Solaris 9 environment with the upgrade to Solaris 10 within it. To boot into that environment, we need to mark it as active and reboot. These instructions cover SPARC machines; for x86/x64, see the Sun documents referred to in the summary of this article - there are a couple of differences. To make our new boot environment active, we’ll use the ‘luactivate’ command:

luactivate Solaris10
init 6

Tricky, huh? ;-) Should something not go completely right during the upgrade, you can roll back to your previous boot environment simply by substituting ‘Solaris9’ for ‘Solaris10’ in the above ‘luactivate’ command. If something really went wrong with the upgrade, and you didn’t boot successfully, don’t worry. The ‘luactivate’ command above should have given you some output that you should copy/paste someplace. Here’s an example:

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Enter the PROM monitor (ok prompt).

2. Change the boot device back to the original boot environment by typing:

     setenv boot-device disk:a

3. Boot to the original boot environment by typing:

     boot

If that doesn’t do it, see the Sun docs in the summary of this article; they will coach you through booting off the CD/DVD and reactivating the old environment.

Summary

As an admin coming from Linux, the LiveUpgrade suite is a breath of fresh air for giving you an easy upgrade path that’s just as easy to undo as it is to do. We’ve only begun to scratch the surface of what you can do with LiveUpgrade. You can use it to migrate from a UFS root to a ZFS root, use it to install and test patches, install Flash Archives with it, reconfigure and resize partitions, and on and on. For a much more comprehensive look into what all LiveUpgrade can do, check out the Solaris Live Upgrade and Upgrade Planning guide on http://docs.sun.com - the current release is for Solaris 10 5/09.

Unmirroring a RAID 1 Root Volume on Solaris

It happens fairly often that you need to create a software mirror using SVM on Solaris. A smaller percentage of the time, you need to create a SVM mirror of your root partition. It doesn’t happen very often at all, but there are cases where you want to unmirror your root partition on Solaris. I’ll get into the why later; here’s the howto. First, we need to define the disk setup of our server. The following table shows the current SVM setup.

Mount Point    Mirror Device    c1t0d0 Slice/Submirror    c1t1d0 Slice/Submirror
/              d1               slice 0/d10               slice 0/d11
/usr/local     d2               slice 3/d20               slice 3/d21
/export/home   d3               slice 4/d30               slice 4/d31
/apps          d4               slice 5/d40               slice 5/d41
swap           d5               slice 1/d50               slice 1/d51
-              MetaDB Slices    slice 6                   slice 6

As you can see, we have 5 individual mirrors, one of which is for swap. I don’t recommend mirroring swap, but I include it here because there is an important caveat you need to catch if you do have mirrored swap. We have two disks: c1t0d0, and c1t1d0. We have the metadb’s stored on slice 6 of each disk. Our end goal is to boot from c1t0d0, and have c1t1d0 available for whatever we like.

Disclaimer

I used these instructions, and they worked great for me. I’ve used them on both Solaris 9 and Solaris 10. If you embark on such a task, make sure to have a complete, full backup before you proceed!

Step One: Detach Submirrors

First, we need to “break” the mirror, by removing all of the submirrors that are contained on c1t1d0. In our case, we have mirrors 1-5, and the submirror contained on c1t1d0 is always the same as the mirror device with a trailing 1. This makes for a nice one liner:

for i in 1 2 3 4 5; do metadetach d${i} d${i}1; done

This code removes submirror 11 from mirror 1, submirror 21 from mirror 2, and so on.

Step Two: de-metaroot

The proper way to create a mirrored root volume is to use the metaroot tool to modify /etc/vfstab and /etc/system for you. The good thing about this is that you can use the same tool to de-configure it too. Keeping in mind that we want our root slice to be c1t0d0s0, we run:

metaroot /dev/dsk/c1t0d0s0

Step Three: Update vfstab

Now, we need to edit /etc/vfstab and replace all of the mirror device mounts with their c1t0d0 counterparts. If your original vfstab looked like this:

...
/dev/md/dsk/d5  -       -       swap    -       no      -
/dev/dsk/c1t0d0s0  /dev/rdsk/c1t0d0s0 /       ufs     1       no      -
/dev/md/dsk/d4  /dev/md/rdsk/d4 /apps   ufs     2       yes     -
/dev/md/dsk/d3  /dev/md/rdsk/d3 /export/home    ufs     2       yes     -
/dev/md/dsk/d2  /dev/md/rdsk/d2 /usr/local      ufs     2       yes     -
...

Then your new vfstab should look something like this:

...
/dev/dsk/c1t0d0s1       -       -       swap    -       no      -
/dev/dsk/c1t0d0s0  /dev/rdsk/c1t0d0s0 /       ufs     1       no      -
/dev/dsk/c1t0d0s5  /dev/rdsk/c1t0d0s5 /apps   ufs     2       yes     -
/dev/dsk/c1t0d0s4  /dev/rdsk/c1t0d0s4 /export/home    ufs     2       yes     -
/dev/dsk/c1t0d0s3  /dev/rdsk/c1t0d0s3 /usr/local      ufs     2       yes     -
...
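If you have a lot of entries, the substitution can be done with sed instead of by hand. The sketch below hard-codes the mirror-to-slice mapping from the table at the top of this post (d1=s0, d2=s3, d3=s4, d4=s5, d5=s1) and works on a copy read from stdin rather than touching /etc/vfstab directly. The function name is made up, and I'm assuming your sed understands POSIX character classes; if it doesn't, substitute a literal space and tab inside the brackets.

```sh
#!/bin/sh
# Sketch: rewrite SVM metadevice paths in a vfstab to plain c1t0d0 slices.
# Mapping for this example only: d1=s0, d2=s3, d3=s4, d4=s5, d5=s1.
unmirror_vfstab() {
    # Reads a vfstab on stdin and writes the rewritten copy to stdout.
    # \(r*\) preserves dsk vs. rdsk; requiring whitespace after the device
    # name keeps d1 from also matching d10 or d11.
    sed -e 's|/dev/md/\(r*\)dsk/d1\([[:space:]]\)|/dev/\1dsk/c1t0d0s0\2|g' \
        -e 's|/dev/md/\(r*\)dsk/d2\([[:space:]]\)|/dev/\1dsk/c1t0d0s3\2|g' \
        -e 's|/dev/md/\(r*\)dsk/d3\([[:space:]]\)|/dev/\1dsk/c1t0d0s4\2|g' \
        -e 's|/dev/md/\(r*\)dsk/d4\([[:space:]]\)|/dev/\1dsk/c1t0d0s5\2|g' \
        -e 's|/dev/md/\(r*\)dsk/d5\([[:space:]]\)|/dev/\1dsk/c1t0d0s1\2|g'
}
```

A cautious way to use it is unmirror_vfstab < /etc/vfstab > /tmp/vfstab.new, followed by a diff against the original and a manual copy into place once the result looks right.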

Step Four: Configure your Dump Device

Here’s the caveat for mirrored swap - you’re probably using /dev/md/dsk/d5 for your dump device. Let’s fix that now. First run

dumpadm | grep '/md/'

If that returns any output, then run this (using your single-disk slice for swap):

dumpadm -s /var/crash/`hostname` -d /dev/dsk/c1t0d0s1

Step Five: Reboot and Verify

Cross your fingers, and do a

init 6

Once you’re back up, look at the output of

df -h && swap -l

and make sure there are no references to any ‘md’ devices.
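If you want that check to be scriptable rather than eyeballed, a tiny filter will do. This is just a sketch: the function name is hypothetical, and it simply looks for the string /dev/md/ in whatever text it is fed.

```sh
#!/bin/sh
# Succeeds only when no SVM metadevice paths appear in the text on stdin.
# Any offending lines are printed so you can see what is still mirrored.
no_md_devices() {
    if grep '/dev/md/'; then
        return 1
    fi
    return 0
}
```

On the Solaris host you could run { df -h; swap -l; } | no_md_devices && echo "no md devices in use".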

Step Six: Remove the Mirrors, Remaining Submirrors, and MetaDB’s

Now that we are running in a single disk environment, we need to remove the mirrors and submirrors. Again, ripe for a one-liner:

for i in 1 2 3 4 5; do metaclear -r d${i}; done

At this point, ‘metastat’ should return no mirrors. Now, we can remove the metadb’s from slice 6 on both disks. Only do this if you’re not using SVM for anything else!

metadb -df /dev/dsk/c1t1d0s6
metadb -df /dev/dsk/c1t0d0s6

Summary

Well, that covers the entire process. You should now have a free disk that you can use for whatever you like!

WTF Are You Doing to Your Keyboard?

So, I have yet to upgrade my laptop to Ubuntu Jaunty, but I saw this article on how to re-enable Ctrl+Alt+Backspace in Jaunty come through my feed reader. A little more research dug up this Ubuntu Wiki entry proclaiming that “a number of users have complained about accidentally restarting their X-Server”.

Now, maybe I’m getting old, but I can’t tell you how many times I’ve been saved a reboot by that handy Ctrl+Alt+Backspace shortcut. My rant? What in the hell are you trying to do that you “accidentally kill X” by happening upon hitting that key combination? Was it really so many people that you had to kill the shortcut for everyone? There is a line that can be crossed when listening to your users, and I think Ubuntu has just crossed it. What are your thoughts?

Breaking Bad Habits - Don't Use Seq in Your Shell Scripts

Like most, I learned shell scripting by following examples. Well, unfortunately, most of the samples I learned from used the ‘seq’ binary to execute a simple for loop like so:

for i in `seq 1 10`; do
echo $i
done

I discovered why this is bad today - not all Unixes (Solaris and Darwin included) come with it. Not to mention we’re forking a process where we don’t need it. On bash, use the built-in brace expansion instead:

for i in {1..10}; do
echo $i
done

For ksh and other shells, instead of using a for, use a while loop with an incrementing counter if the integers are too numerous to list in the loop header itself.
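Here is roughly what that while loop looks like. It's a minimal sketch using POSIX arithmetic expansion, which ksh and bash both understand; the function name is just for illustration.

```sh
#!/bin/sh
# Print the integers from $1 to $2 without calling seq.
count_up() {
    i=$1
    while [ "$i" -le "$2" ]; do
        echo "$i"
        # arithmetic expansion increments the counter inside the shell itself
        i=$((i + 1))
    done
}

count_up 1 10
```

One caveat: the legacy Bourne /bin/sh on Solaris doesn't understand $(( )). There, i=`expr $i + 1` does the same job, at the cost of a fork per iteration.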

The Nagios Fork: Did Two Wrongs Make a Right???

It’s an item that I feel hasn’t got much press, at least in the limited RSS entries I’ve had time to scan lately: Nagios has been forked. I’ve been using Nagios long enough that I actually used NetSaint for a bit, so I have some mixed feelings about Icinga. In general, I’m all for forks when they are indeed needed - FOSWiki is a great example. But forks shouldn’t be taken lightly, in many ways they are like a divorce - they should be a last resort, not a quick way out. Personally, I think that the fork will ultimately either fail or merge back into Nagios, but read on for why I think the Icinga fork is a case of two wrongs making a right.

First, let me state that this is all my personal opinion. I’ve been around Nagios for a long time, and while I don’t know the developers involved personally, I’ve read enough of all of their emails over the years to have a good feel for what they are about. All this seems to have come about because of a few reasons:

  • Ethan Galstad, the sole person with commit access to Nagios, has become a bottleneck in the project. It took forever for Nagios 3.0 to come about, and development has slowed even more since then.
  • People seem to fear the commercialization of Nagios. Ethan’s involvement in Nagios Enterprises appears to make some people nervous. I think this is all FUD - there’s many examples of commercial OSS out there, and many of them are perfectly community-friendly. If Ethan finally wants to make some money over what he’s developed over the past 10 years, then more power to him. If he and his company were to become “evil”, then you fork the code, but not before.
  • Ethan, like most developers, prioritized bugs and features based upon what he felt like working on the most. This led many to feel that their requests were going unheard.

Netways is the company that is sponsoring the fork; they are all great coders, and have committed a lot of their resources to Nagios. However, I feel they made a critical mistake - they treated the fork like a coup or a crusade. Ethan states that he was never directly approached about the fork before the announcement “Nagios is dead! Long live Icinga” was made on the mailing list. Part of OSS is open communication, and I think that Netways should have tried to discuss things first. Going back to the divorce analogy, they filed for divorce before the spouse knew there was a problem.

So, the title makes reference to two wrongs and a right. I think this may actually be one of the rare occasions where two wrongs do make a right. The two wrongs? Easy!

  • Ethan was wrong for taking the community too lightly, and not allowing the community to have more input into the direction of Nagios. He had far too little participation in the mailing lists to be the sole dictator of the code base.

  • Netways was wrong starting a fork for the wrong reasons. Don’t misunderstand me here, I think nearly all of their reasons are valid, I just don’t think that a fork should have been the first step to resolve them.

So, what’s the right here? Well, it appears as though this may have been the wake up call that Ethan may have needed. I encourage you to read all of Ethan’s posts on his blog, but he is already taking steps to resolve some of the more obvious issues. First and foremost, he’s appointed Ton Voon and Andreas Ericsson to be core developers. Ton and Andreas are excellent developers, and likely have committed more of their time to the development and support of Nagios and its plugins than Ethan himself has over the last couple years. The project could not be in more competent hands than in theirs. He has committed to the setup of a bug/issue/request tracker for Nagios, and has created a site that allows end-users to vote on what features they want to see in Nagios. Unfortunately, my crystal ball tells me that Icinga will be a casualty, but it has already served its purpose without a single release - it has pushed Nagios further in a few weeks than anything else has in the past few years. Perhaps two wrongs did make a right?

Personal note: this is the first big opinion story I’ve ever written here, and it feels odd. Usually I reserve space on my blog for howto’s and other facts instead of opinion. However, Nagios has done so much for me over the years, it’s hard not to voice my opinion on a piece of software that I truly consider the single most important piece of software I’ve ever used in my role as a sysadmin. More than likely if you’ve read this far, you’re a sysadmin too and likely have some strong opinions on this topic - please do share them!

SMF Manifests for Endeca IAP 6.x

In doing some recent upgrades, we migrated our Endeca installation to Solaris 10. I got the itch to write some SMF manifests, and here are my completed XML and method files. First, download EndecaSMF.tar.gz, and extract it somewhere. Change to the ./endeca/ directory. Next you need to edit a few things in the two xml files. If you’re running the daemons under a user other than ‘endeca’, edit the following lines, replacing ‘endeca’ with the user you created. Note that you have to edit both the start and the stop methods for a total of eight changes (4 in each XML file):

<method_credential user="endeca"/>
<envvar name="USER" value="endeca"/>
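If you don't feel like making eight edits by hand, sed can swap the user for you. This is a sketch that assumes the two lines appear in the manifests exactly as quoted above; the function name is made up, and it reads a manifest on stdin rather than editing the file in place.

```sh
#!/bin/sh
# Sketch: rewrite the service user in an SMF manifest read from stdin.
set_manifest_user() {
    # $1: account name that should replace "endeca"
    sed -e "s/<method_credential user=\"endeca\"/<method_credential user=\"$1\"/" \
        -e "s/<envvar name=\"USER\" value=\"endeca\"/<envvar name=\"USER\" value=\"$1\"/"
}
```

For example, set_manifest_user search < endeca-eac.xml > endeca-eac.xml.new rewrites one file for a hypothetical ‘search’ account; diff the result against the original before moving it into place.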

Next, edit the path to your Endeca installation. The XML files come preconfigured with ‘/apps/endeca’. This is the value you passed to ‘--target=’ in the setup shell scripts. If you used something other than ‘/apps/endeca’, then do a search and replace through both XML files. Now, run the following as root:

cp svc-endeca* /lib/svc/method/
chown root:bin /lib/svc/method/svc-endeca*
chmod 755 /lib/svc/method/svc-endeca*
svccfg -v import endeca-eac.xml 
svccfg -v import endeca-workbench.xml

If your Endeca services are running, stop them now. To start the Endeca Application Controller (HTTP Service), run:

svcadm enable endeca/application-controller

If you want the Workbench server to start on every boot, run

svcadm enable endeca/workbench

That’s it!

Installing NRPE 2.12 From Source as a SMF Managed Service in Solaris 10

Installing NRPE on Solaris 10 involves just a bit more than your normal './configure && make && make install' routine. However, all the dependencies are likely present on a freshly installed system; you just have to tell NRPE where to find them. There's one file you need to patch, and then it will install. From there it's easy to plug into SMF! First, let's make sure some directories are present, and create our Nagios user:

# mkdir -p /usr/local
# groupadd nagios
# useradd -m -c "nagios system user" -d /usr/local/nagios -g nagios nagios

Next, download and extract the source code to NRPE:

$ cd /tmp/
$ /usr/sfw/bin/wget http://superb-east.dl.sourceforge.net/sourceforge/nagios/nrpe-2.12.tar.gz
$ gzip -dc nrpe-2.12.tar.gz | tar -xvf -
$ cd nrpe-2.12

Now, we need to tell the configure script where to find the openssl libraries, and make sure that GCC is in our path:

$ PATH=$PATH:/usr/sfw/bin:/usr/ccs/bin ./configure --with-ssl=/usr/sfw/ --with-ssl-lib=/usr/sfw/lib/

That should run just fine. Before we build, we need to apply a quick fix to nrpe.c. If you don't do this, you'll get an error from make that says "nrpe.c:617: error: 'LOG_AUTHPRIV' undeclared (first use in this function)".

$ perl -pi -e 's/LOG_AUTHPRIV/LOG_AUTH/; s/LOG_FTP/LOG_DAEMON/' src/nrpe.c

Now, we should be okay to build it:

$ PATH=$PATH:/usr/sfw/bin:/usr/ccs/bin make 

Then, install it as root:

# PATH=$PATH:/usr/sfw/bin:/usr/ccs/bin make install

Either copy the nrpe.cfg sample included in the source code, or drop your own into /usr/local/nagios/etc/nrpe.cfg. Stay logged in as root for the rest of the steps, where we'll set NRPE up to run under SMF. First, we need to set up the service and present it to inetd:

echo "nrpe 5666/tcp # NRPE" >> /etc/services
echo "nrpe stream tcp nowait nagios /usr/sfw/sbin/tcpd /usr/local/nagios/bin/nrpe \
 -c /usr/local/nagios/etc/nrpe.cfg -i" >> /etc/inet/inetd.conf

Now, tell SMF to pull in the inetd config:

inetconv

At this point, the SMF service is available, but we want to use TCP wrappers so that only our Nagios server can talk to NRPE (substitute $NAGIOS_IP with the IP of your Nagios server):

inetadm -m svc:/network/nrpe/tcp:default tcp_wrappers=TRUE
echo "nrpe: LOCAL, $NAGIOS_IP" >> /etc/hosts.allow
echo "nrpe: ALL" >> /etc/hosts.deny

Finally, fire up the service:

svcadm enable nrpe/tcp

That's it! Nagios should be able to monitor your Solaris 10 box now. Someday, I'll make a package for this, but you can pretty well copy and paste the code here to get up and running.