SysAdmin's Journey

Teaching Java How to Commit Suicide

At $work, we have a lot of Java processes that are run via cron and other wrappers that do some pretty critical tasks. The apps have been written so that the whole thing is wrapped in a try/catch that calls System.exit(1) should something not go right. Our wrapper scripts watch for a non-zero exit code and alert Nagios if something went wrong. This works great except when a VM encounters an OutOfMemoryError (OOM). The Java VM attempts to continue on, but if the main thread hits this error, the entire VM exits. However, the application code that exits with a status of 1 never gets called, so the application ends up dying with a status of 0. Well, Sun (Oracle now, I guess) gave us a new option in Java 6, backported to 1.4.2_12 and up, that lets us tell Java to run a shell command when it encounters an OOM. By adding the option

-XX:OnOutOfMemoryError="kill -9 %p"

to our Java command line, the VM will execute a shell command that calls kill with the PID of the VM as its argument. The -9 signal causes the VM to die with a non-zero exit status, so our wrappers pick up the error and alert the right people. Note: this feature was never backported to Java 5 - sorry!
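To show how the pieces fit together, here's a minimal sketch of the kind of wrapper we use - the jar path, service name, and Nagios command file below are hypothetical, and your alerting mechanism may well be different:

#!/bin/bash
# Hypothetical wrapper: run the Java job, then alert Nagios on any non-zero exit.
# The OnOutOfMemoryError flag is the one discussed above; paths are examples only.
java -XX:OnOutOfMemoryError="kill -9 %p" -jar /apps/myjob/myjob.jar "$@"
STATUS=$?

if [ $STATUS -ne 0 ]; then
  # Submit a passive check result so Nagios pages the right people
  # (command file location varies by install).
  NOW=`date +%s`
  echo "[$NOW] PROCESS_SERVICE_CHECK_RESULT;`hostname`;myjob;2;myjob exited with status $STATUS" \
    >> /var/nagios/rw/nagios.cmd
fi

exit $STATUS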

Display a CCK Filefield or Imagefield Upload Widget on Your Own Custom Form

Took a fair amount of googling around to find the solution to this one. With the Node Gallery 3.x branch, we needed a way to quickly add an image to an existing gallery. We could have displayed the whole node form, but there are a lot of things on that form that we can just use the defaults for 99% of the time. We need just three fields filled in: Title, Caption, and the imagefield itself. To reuse the same imagefield widget that handles all the hard work for you on the node add form in your own custom form, first create a handler in hook_menu such as this:

$items['node/%node_gallery_gallery/upload'] = array(
    'title' => 'Upload New Image',
    'page callback' => 'drupal_get_form',
    'page arguments' => array('node_gallery_upload_image_form', 1),
    'access callback' => 'node_gallery_user_access',
    'access arguments' => array('upload', 1),
    'file' => 'node_gallery.pages.inc',
    'type' => MENU_LOCAL_TASK,
  );

Then, in node_gallery.pages.inc, you create the form function that does the work:

function node_gallery_upload_image_form($form_state, $gallery) {
  $imagetype = 'node_gallery_image';
  $form_id = $imagetype . '_node_form';
  
  module_load_include('inc', 'content', 'includes/content.node_form');
  $field = content_fields('field_node_gallery_image',$imagetype);
  
  $form['title'] = array(
    '#title' => t('Title'),
    '#type' => 'textfield',
    '#required' => TRUE,
    '#weight' => -10,
  );
  $form['body'] = array(
    '#title' => t('Caption'),
    '#type' => 'textarea',
    '#weight' => -9,
  );
  $form['type'] = array(
    '#type' => 'value',
    '#value' => $imagetype,
  );
  $form['gid'] = array(
    '#type' => 'value',
    '#value' => $gallery->nid,
  );
  // Register the CCK field definition with the form, mark the field required,
  // and append the imagefield widget built by content_field_form().
  $form['#field_info']['field_node_gallery_image'] = $field;
  $form['#field_info']['field_node_gallery_image']['#required'] = TRUE;
  $form += (array) content_field_form($form, $form_state, $field);
  
  $form['submit'] = array('#type' => 'submit', '#weight' => 10, '#value' => 'Save');
  
  return $form;
}

This is pretty straightforward, up until the three lines just before the submit button. Those three lines register the CCK field info on the form and then append the results of content_field_form() to our existing form. Still very easy, but I wasn't able to find any documentation on how to do this. Just in case you're curious, here's the submit handler for that form.

function node_gallery_upload_image_form_submit($form, &$form_state) {
  global $user;
  $image = new stdClass;
  $image->uid = $user->uid;
  $image->name = (isset($user->name) ? $user->name : '');
  $values = $form_state['values'];
  foreach ($values as $key => $value) {
    $image->$key = $value;
  }
  node_gallery_image_save($image);
}

Nothing new there. The end result is a nice looking, concise form that allows you to quickly upload an image to a gallery. Sweet!

Drupal, Meet Hudson; Hudson, Drupal...

At $work, we use Hudson extensively, and it rocks. For those who don't know already, Hudson is an implementation of continuous integration that is remarkably easy to use. I wrote about my first impressions of Hudson previously. Hudson's original audience was Java developers using Ant or Maven, but with plugins and some hacking, we can make it do some things for us as Drupal module contributors. I've been cutting my Drupal developer teeth by working pretty intensively on a few modules for Drupal - Node Gallery and its derivatives. We are hitting a crucial point in development where we are switching from the old way of defining fields on a node to using CCK. While the module is still in alpha, it's in use by quite a few sites - as of this writing it's number 465 on the list of Drupal modules. Not exactly the spotlight, but we can't go breaking things without making people angry either. I figured this would be the perfect place for Hudson - it will let you know when you break something.

Pieces of the Puzzle

Here are the pieces you'll need:

  • A Linux server with Java, a working Drupal install (that may get broken at times), and the cvs command-line utility.
  • These Drupal modules installed and enabled: drush, coder, and optionally simpletest.
  • Some time on your hands

The Shell Script

This is the most important piece of the setup. By utilizing Hudson’s environment variables, we can make this as portable as possible. By using the same script for all Hudson jobs, changing the script automatically changes all of our jobs at once. Let’s dive right in:

#!/bin/bash
#set -x

PHP=/usr/bin/php
DRUSH_PATH=/apps/drupal/drush
DRUPAL_PATH=/apps/drupal/drupal_core
MODULES_DIR=$DRUPAL_PATH/sites/ngdemo.sysadminsjourney.com/modules
SITE="http://ngdemo.sysadminsjourney.com/"

DRUSH="$PHP $DRUSH_PATH/drush.php -n -r $DRUPAL_PATH -i $DRUPAL_PATH -l $SITE"
EXITVAL=0

# Check our syntax
PHP_FILES=`/usr/bin/find $WORKSPACE -type f -exec grep -q '<?php' {} \; -print`
for f in $PHP_FILES; do
  $PHP -l $f
  if [ $? != 0 ]; then
    let "EXITVAL += 1"
    echo "$f failed PHP lint test, not syncing to ngdemo website."
    exit $EXITVAL
  fi
done

#Install the files
/usr/bin/rsync -a --delete $WORKSPACE/* $MODULES_DIR/$JOB_NAME

#Run update.php
$DRUSH updatedb -q --yes

#Run coder
CODER_OUTPUT=`$DRUSH coder no-empty`
if [ -n "`echo $CODER_OUTPUT | grep $JOB_NAME`" ]; then
  $DRUSH coder no-empty
  echo "Coder module reported errors."
  let "EXITVAL += 1"
fi

#Run potx
cd $MODULES_DIR/$JOB_NAME
../potx/potx-cli.php
if [ $? != 0 ]; then
  let "EXITVAL += 1"
  echo "POTX failed."
fi
if [ -e $MODULES_DIR/$JOB_NAME/general.pot ]; then
  cp $MODULES_DIR/$JOB_NAME/general.pot $MODULES_DIR/../files/$JOB_NAME.pot
fi

exit $EXITVAL

The first block of the script finds all files in $WORKSPACE (which is set by Hudson) that contain the '<?php' string, and runs php -l against each of them. If any file fails the lint check, the script reports it and exits immediately with a non-zero status, so broken code never gets synced to the ngdemo website.

The rsync line then installs the module into the site's modules directory under $JOB_NAME, which means you must name your Hudson project the same as the module name. Also note that your Hudson user needs write access to the specific module directory that it's installing into.

The drush updatedb line invokes update.php and answers yes to all questions.

The coder block runs the default code review from the coder module - you will have had to set this up initially via the web interface. It then scans through that output looking for any complaints about our $JOB_NAME, and if any are found, prints them to stdout and increments our exit value by 1. Note that we don't exit here, since it's a non-fatal error. However, Hudson will still treat it as a failed build and email everyone about it.

The potx block runs the Translation Template Extractor command-line utility against our module, then copies the resulting general.pot into the files directory. Again, the user running Hudson will need write access for this to work properly. If the potx-cli.php script exits uncleanly, we increment our exit value by one.

Last in my script, we simply exit with whatever value we have ended up with at this point. Again, if Hudson sees anything other than a zero, it will email everyone about it.

Since the modules I'm working on don't have Simpletest tests ready yet, I don't run them in this script. However, it's on the horizon, and they can be run easily using run-tests.sh. Note that there is a patch that will cause run-tests.sh to output its results in a JUnit-style format, which Hudson understands fully. If you implement this, I strongly recommend applying that patch.

Hudson Setup

Now that we have our script ready, we need to set up our Hudson job. Note that installing Hudson itself is outside the scope of this article - it's refreshingly easy and doesn't need repeating here.

There are two things you must do before creating the build task. First, set up your "E-mail Notification" section according to your mail server at http://myhudsonserver:8080/configure. Second, install the "URL Change Trigger" plugin by navigating to http://myhudsonserver:8080/pluginManager/available.

Once you install that plugin, create a new job. In my case, the job was named 'node_gallery', since that's the name of the Drupal module I was working with. Select 'Build a free-style software project' when asked. Under the "Source Code Management" section, select "CVS", and then fill in the CVSROOT of the project you're working with. In my case, it was ':pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal-contrib'. Next, fill in the path to the module in the "Module(s)" form - for me it's 'contributions/modules/node_gallery/'. If you're working with CVS HEAD, leave the Branch field empty; otherwise type in the branch name there.

IMPORTANT: to avoid abusing Drupal.org's already overloaded CVS server with CVS logins once every 5 minutes, we will point Hudson instead to the RSS feed for the CVS log messages. Make sure "Poll SCM" is unchecked, and check "Build when a URL's content changes". To obtain the URL, you need the node id of your project. There are many ways to find it, but the easiest is to go to the project's Drupal.org page and click on "View CVS Messages". From that page, click the orange RSS icon at the bottom left. Copy and paste that URL into the URL field in Hudson.

Under the Build section, click "Add build step" and select "Execute Shell". In the resulting "Command" textarea, type the full path to the shell script we set up above. The final section, "Post-build actions", is up to you, but you'll likely want to enable email notifications. Place a checkmark in the "E-mail Notification" box, and type in the email addresses of the desired recipients. Click Save, and you're done!

Hudson will start by doing a CVS checkout of your project, and will start running tests on it. It will email you when anything goes wrong, and will notify you again when the problem is resolved. It will only run these tests after someone commits code to CVS, so you will likely need to hit the "Build Now" link in the left nav a few times.

We've really only scratched the surface of what Hudson can do. You can track performance using JMeter, add all kinds of crazy plugins, require logins - the list goes on and on. While this helps, it's still nowhere near as helpful as an Ant/Maven job can be. Hopefully this article is enough to spark some interest from the Drupal community so that we can write some better continuous integration code in the future. Also, I'm far from being an expert on either Drupal or Hudson. I wrote my first code for Drupal in November of 2009, and I only tinker with Hudson on occasion at work. Hudson works so well, it's one of those "set it and forget it" apps. I would love for readers to leave comments on any mistakes I might have made, or possible improvements I may have missed!

ZFS in the Trenches Presentation at LISA 09

Just got the chance to finally sit down and watch Ben Rockwood's presentation at LISA 09: ZFS in the Trenches. If you are even thinking about ZFS and how it works, it's a very informative presentation. There is very little marketing-speak, and he very specifically targets sysadmins as his audience. Great stuff! One interesting note related to his comparison of fsstat vs. iostat: our Apache webservers routinely see about 5MB/sec of reads being asked of ZFS, but iostat on the disks shows that almost all of that traffic is being served from the ARC.
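If you want to see the same effect on your own boxes, a quick side-by-side of the two tools on Solaris makes the gap obvious (the 5-second interval is arbitrary):

# Logical reads/writes as seen at the filesystem layer, sampled every 5 seconds.
fsstat zfs 5
# Physical I/O actually hitting the disks over the same interval (non-idle devices only).
iostat -xnz 5
# If fsstat reports MB/sec of reads while iostat shows the disks nearly idle,
# those reads are being satisfied from the ARC rather than from disk.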

QuickTip: Fix Eclipse Galileo Buttons on Ubuntu 9.10

There's a nasty upstream bug in GTK present in Ubuntu 9.10 that makes Eclipse Galileo all but unusable - specifically, clicking many buttons with the mouse just stops working. You can use Tab and the spacebar to activate them, but that's not much of a workaround. All you need to do is set an environment variable before starting Eclipse:

export GDK_NATIVE_WINDOWS=true
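If you'd rather not remember to set that every time, a tiny launcher script does the trick - the Eclipse path here is just an example, so point it at your own install:

#!/bin/bash
# Work around the GTK button bug by forcing native GDK windows, then hand off to Eclipse.
export GDK_NATIVE_WINDOWS=true
exec ~/eclipse/eclipse "$@"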

Share Your Eclipse Plugins and Configurations Across Platforms

Over the years, I’ve come to know and love Eclipse. Though it has roots in Java, ironically, I use Eclipse for just about everything except for coding Java (if I wrote Java code, I’m sure I’d use Eclipse). Eclipse is great for browsing Subversion, coding PHP, coding Perl, and even coding shell scripts. For die hards like me, there’s the viPlugin that allows you to use all the vi commands you know and love within Eclipse. About the time you get your perfect Eclipse setup established, you buy a new laptop on a new platform. Or, in my case, I have three “primary” development workstations, each on a different OS. The rest of this article will show you how to hook Dropbox into your Eclipse installation, allowing you to share your plugins and configurations across different versions of Eclipse, on different machines, and even on different platforms.

Truth be told, this type of setup was actually easier to do with older versions of Eclipse. Since the move to the p2 provisioning system, it has become a little harder, but it's still very possible. After much googling, I finally came across this StackOverflow question that gave me the pieces I needed to set this all up.

A little prep work up front will save us a huge amount of maintenance time later. Note that I use Dropbox in this article, but any similar service should do. We'll set up our Linux install first, since we can script things a little more easily there. Go ahead and install Dropbox and Eclipse - they're both very straightforward installations.

Let's assume that our Dropbox directory is directly under our home directory, and our Eclipse installation is in ~/eclipse. Let's set up some environment variables and create our directory structure:

export DROPDIR=~/Dropbox
export ECLIPSEDIR=~/eclipse
cd $DROPDIR
mkdir eclipse-custom
cd eclipse-custom
# Create our shared extension dir
mkdir extensions
# Create our workspace dir
mkdir shared-workspace

With our directory structure in place, it's time to pick a plugin to install. Let's do PDT. The key here is that we start Eclipse pointing at a new configuration directory that lives in our Dropbox account, and install the new extension there. This forces Eclipse to install the plugin into the Dropbox directory instead of the local installation. Start Eclipse like so:

eclipse -configuration $DROPDIR/eclipse-custom/extensions/pdt/eclipse/configuration

Note that you can change the ‘pdt’ portion of that path to whatever you choose, but you must include the trailing eclipse/configuration portion. Once in Eclipse, go ahead and install PDT just as you normally would, then exit Eclipse.

Now that we’ve installed the PDT extension to a shared location, it’s time to point our local Eclipse installation to it. I wrote a quick script to do just that:

mkdir $ECLIPSEDIR/links
cd $DROPDIR/eclipse-custom/extensions
for d in `ls`; do
  echo "path=`pwd`/$d" > $ECLIPSEDIR/links/$d.link
done

This script creates a directory named 'links' in your local Eclipse installation, and for every shared extension it creates a file containing a single line with the path to that extension. Now start Eclipse. For some odd reason, the extensions wouldn't actually show up until after I restarted Eclipse a second time, so you may need to do the same. You should now see your plugin in Eclipse.
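As a concrete example, after running that loop, ~/eclipse/links/pdt.link would contain a single line along these lines (the path shown is illustrative - yours will reflect your own home directory and Dropbox location):

path=/home/user/Dropbox/eclipse-custom/extensions/pdt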

Please note that if you’re doing cross-platform development, you’ll save yourself some headache by not sharing the subclipse plugin. There’s too much of that plugin that depends on the underlying OS to share effectively.

NGINX Performs Well on Solaris 10 X86

Just a quick posting of some simple benchmarks today. Please note, these are not the be-all, end-all performance results that allow everyone to scream from atop yonder hill that Solaris performs better than Linux! This was just me doing a little due diligence. I like Solaris 10 and wanted to run it on our webservers. We're looking at using NGINX to serve up some static files, and I wanted to make sure it performed like it should on Solaris 10 before deploying it - you know, right tool for the job and all. So, disclaimers aside, here's what I found.

The Hardware

The hardware I tested was a Dell PowerEdge R610 with 12GB of RAM and two quad-core Nehalem CPUs. SATA disks were used with the internal RAID controller, but no RAID was configured.

The Benchmarks

I used ApacheBench, as shipped with Glassfish Webstack 1.5. Yes, I know there are all kinds of flaws with ApacheBench, but the key here isn't the benchmarking tool - it's that the tool and its configuration remain the same across runs. Here's the command line I used:

/opt/sun/webstack/apache2/2.2/bin/ab -n1000000 -k -c 2000 http://localhost/static/images/logo.jpg

CentOS 5.4

I installed CentOS 5.4, ran yum to get all the updates possible. I then installed NGINX 0.7.64 from source, and simply copied one image file into the document root. I did a few sysctl tweaks for buffers and whatnot, but found later that they didn’t impact the benchmark. Here’s what ApacheBench running on the local host had to say:

This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests


Server Software:        nginx/0.7.64
Server Hostname:        localhost
Server Port:            80

Document Path:          /static/images/logo.jpg
Document Length:        4404 bytes

Concurrency Level:      2000
Time taken for tests:   21.916 seconds
Complete requests:      1000000
Failed requests:        0
Write errors:           0
Keep-Alive requests:    990554
Total transferred:      4625275893 bytes
HTML transferred:       4407166476 bytes
Requests per second:    45629.29 [#/sec] (mean)
Time per request:       43.831 [ms] (mean)
Time per request:       0.022 [ms] (mean, across all concurrent requests)
Transfer rate:          206101.61 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   3.9      0     135
Processing:     0   43  67.8     27     676
Waiting:        0   43  67.7     27     676
Total:          0   44  68.1     27     676

Percentage of the requests served within a certain time (ms)
  50%     27
  66%     41
  75%     49
  80%     53
  90%     72
  95%    202
  98%    245
  99%    342
 100%    676 (longest request)

No matter how you slice it, that’s pretty darn fast. I knew that Solaris 10 had a completely rewritten TCP/IP stack optimized for multithreading, and that it should keep right up with Linux. However, NGINX uses different event models for Linux and Solaris 10 (epoll vs eventport), so I wanted to make sure there weren’t any major differences in performance.

Solaris 10

I installed Solaris 10 x86, ran pca to get all the updates possible. I then installed NGINX 0.7.64 from source, and simply copied one image file into the document root. Here’s what ApacheBench running on the local host had to say:

Server Software:        nginx/0.7.64
Server Hostname:        localhost
Server Port:            80

Document Path:          /static/images/logo.jpg
Document Length:        4404 bytes

Concurrency Level:      2000
Time taken for tests:   21.728 seconds
Complete requests:      1000000
Failed requests:        0
Write errors:           0
Keep-Alive requests:    991224
Total transferred:      4623536714 bytes
HTML transferred:       4405506168 bytes
Requests per second:    46023.73 [#/sec] (mean)
Time per request:       43.456 [ms] (mean)
Time per request:       0.022 [ms] (mean, across all concurrent requests)
Transfer rate:          207805.08 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1  71.9      0    4434
Processing:     0   42  57.2     29    1128
Waiting:        0   41  56.0     29    1128
Total:          0   43  98.6     29    4473

Percentage of the requests served within a certain time (ms)
  50%     29
  66%     35
  75%     42
  80%     50 
  90%     74
  95%    108
  98%    176
  99%    256
 100%   4473 (longest request)

Again, very impressive results. Overall, Solaris+NGINX appeared to be just a few milliseconds faster than CentOS+NGINX in most cases, but certainly not enough to change your mind about which platform to use. If you notice the 4.5-second longest request on the Solaris box, I'm pretty sure that's a TCP retransmit that I can work out with some ndd tuning.
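For what it's worth, if I were chasing that retransmit, these are the sort of ndd knobs I'd look at first - treat the values as placeholders for experimentation, not recommendations:

# Check the current TCP listen backlog limits on Solaris 10.
ndd -get /dev/tcp tcp_conn_req_max_q
ndd -get /dev/tcp tcp_conn_req_max_q0
# With 2000 concurrent keep-alive clients, a larger backlog is a reasonable
# first experiment (example values only).
ndd -set /dev/tcp tcp_conn_req_max_q 4096
ndd -set /dev/tcp tcp_conn_req_max_q0 4096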

The Verdict

NGINX is freaking fast. My hunch is that it's so fast that I'm actually running up against the limits of ApacheBench, not NGINX - but that's just a gut feeling. The verdict is that you won't be making a mistake going with either Linux or Solaris when setting up your NGINX server.

Ask SAJ: What to Do With Apache Logs > 50GB?

Our site at $work is generating Apache logs that, when combined sequentially into one file, are larger than 50GB in size for one day’s worth of traffic. AWStats' perl script pretty much chokes when working on this much data. Last I checked, Webalizer wasn’t much different, and probably wouldn’t scale up to that amount of data either. Does anyone out there have any advice on a commercial solution for Apache log analysis that can scale up like that?

Tip for "Split Components Across Domains" Performance Goal From Yahoo!

Just thought I’d pass this little tidbit out there - we fixed it by pure luck on the first try. Yahoo unselfishly provides a document titled Best Practices for Speeding Up Your Website. While some of the rules offered there aren’t applicable for all sites, it’s a great document and if you run a website, you should read it. At $work, part of our last code drop was to push out a feature that enabled “Split Components Across Domains”. From the article Performance Research, Part 4: Maximizing Parallel Downloads in the Carpool Lane:

Our rule of thumb is to increase the number of parallel downloads by using at least two, but no more than four hostnames. Once again, this underscores the number one rule for improving response times: reduce the number of components in the page.

I'm here to tell you, if you have AOL users surfing your site, do not use four hostnames. When we pushed this feature up to production, we had one hostname that served up the HTML, and we had four hostnames that served up imagery (all of these hostnames pointed back to the same IP, but splitting them allows a performance boost in the browser). For this example, let's say that www.mydomain.com is the HTML hostname; img0.mycontent.com, img1.mycontent.com, img2.mycontent.com, and img3.mycontent.com were the imagery servers. This most certainly improved performance on the client side, but we started receiving reports from a few users that they were no longer able to see any imagery on the site since we dropped the new code. We immediately knew what was causing the issue, but had no idea why, or how widespread it was.

Well, after poking around some of the "big boys'" websites such as Amazon, we noticed that while all of them separated their components as suggested by Yahoo!, all of them used only one hostname for the imagery. We quickly configured our webapp to use only www.mydomain.com for the HTML and img0.mycontent.com for the imagery. Once we did that, our AOL users were again able to see imagery.

Now, I have no idea how widespread the issue was. I know it was limited to users of the AOL browser, and I suspect it's probably a bug in a specific version of their browser. However, if your site needs to be compatible with the widest range of users possible, you may want to use just one separate hostname for splitting components. I hope this helps someone else!
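And if you do go the multiple-hostname route, it's worth a quick sanity check that every alias really resolves to the same place. Something as simple as this (using the illustrative hostnames from above) will do:

# Confirm each image hostname resolves to the same address as the primary site.
for h in www.mydomain.com img0.mycontent.com img1.mycontent.com img2.mycontent.com img3.mycontent.com; do
  echo -n "$h -> "
  dig +short $h
done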

Which Directory Server and Why?

One of my projects for 2010 is to get a reliable directory server in place. I was going to post a poll to my readers asking what they felt was the best DS, but Ben Rockwood beat me to it with his article Community Poll: What's your favorite Directory Server?. It's likely that most readers of this blog already read Ben's too, but if you don't, it's a great blog to subscribe to. If you have any input into the debate on which DS is best, head on over and leave a comment!