SysAdmin's Journey

Beginning of a New Journey

Bad news: this blog is no longer maintained. Good news: I have a new blog!

It’s no secret that this blog has been short on content for quite a while. Sometimes you just lose the urge to write. I didn’t feel like writing about the little things, and I wasn’t able to commit the time to write about the big ones. A little over a year ago, I took an opportunity to move into InfoSec at the same company. I’ve decided to write a bit about what I learn; come see me there at Blue Wanting Red.

It’s going to be a blog where I write mostly about Red Team topics, but from a Blue Team’s perspective. My goal will be to outline the attacker’s perspective, and then explain what the defender can do to protect themselves.

I’m certainly not the first person to do this, but it’s the first time in a long time I’ve felt the urge to write about anything technical. Also, sometimes reading about the same topic from a different author can cause the “click” to occur. If I can help that “click” for a few people, then it’s all worth it.

Replicating Hashicorp Vault in a Multi-DataCenter Setup

At the request of the authors of Vault, I've decided to take this content down. It was creating a lot of problems for people who aren't using Vault in exactly the same way I was, and it was likely causing more harm than good. If you need a way to reliably do Vault replication, you'll want to look into the Vault Premium package.

Using Git Submodules With Dynamic Puppet Environments

There comes a point in the lifecycle of every Puppet setup where you realize that you’re going to be much better off utilizing other people’s Puppet modules whenever possible. It’s what makes OSS great – why should I reinvent the wheel when I can help make your wheel even better? I’ve found what I think is a very productive setup – it leverages Git (specifically branches, submodules, and hooks), Gitolite permissions, and Puppet environments to create a workflow that a team of admins can use to iterate on new features without disturbing each other.

Most of the pieces to this puzzle are very well documented elsewhere; I’ll provide links where necessary.

Step 1: Establish Dynamic Environment Workflow

The first step is to go read “Git Workflow and Puppet Environments” written by Adrien Thebo of Puppet Labs. Once you’ve implemented that setup, you should be able to do the following from your workstation:

git clone git@git:puppet.git
cd puppet
git checkout -b mybrokenbranch
echo "this line breaks everything" >> manifests/site.pp
git commit -am 'Intentionally breaking things'
git push origin mybrokenbranch

At this point, you have a new environment named ‘mybrokenbranch’ on your Puppetmaster. You can test the setup by ssh’ing into one of the client machines and running:

puppet agent --test --environment mybrokenbranch --noop

That obviously won’t be a happy Puppet run. The key point is that your other environments are not impacted by the work of this one admin. Let’s delete the local and remote branch. From your workstation:

git checkout master
git branch -d mybrokenbranch
git push origin :mybrokenbranch

Note that the Puppetmaster reports that it has deleted the environment. Feel free to verify that by re-running the puppet agent command above on the client; it will complain about not having an environment.
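
For context, the heavy lifting in that setup is done by a server-side hook on the git repository. Below is a minimal sketch of the idea, assuming environments live in /etc/puppet/environments and the bare repo sits at /var/git/repositories/puppet.git – the hook from Adrien’s article is more robust and is what you should actually use:

#!/bin/sh
# Sketch only: recreate (or remove) a Puppet environment for each pushed branch.
# Paths and repo location are assumptions; written post-receive style for illustration.
ENVDIR=/etc/puppet/environments
REPO=/var/git/repositories/puppet.git
ZERO=0000000000000000000000000000000000000000

unset GIT_DIR   # keep the clone below from inheriting the hook's environment
while read oldrev newrev refname; do
  branch=${refname#refs/heads/}
  rm -rf "$ENVDIR/$branch"
  if [ "$newrev" != "$ZERO" ]; then
    # Branch created or updated: check it out into its own environment
    git clone --branch "$branch" "$REPO" "$ENVDIR/$branch"
  fi
done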

Step 2: Incorporate Git Submodules

With all that set up, let’s go ahead and implement support for git submodules. I have a pull request open with Adrien to add this functionality, but until it’s merged, you can use my fork on GitHub. Replace the update hook on your git server with the updated version. Now, let’s try pulling a git submodule into our repo. Again, from your workstation:

git checkout -b firewall
git submodule add git://github.com/puppetlabs/puppetlabs-firewall.git modules/firewall
git add .gitmodules modules/firewall
git commit -m 'Adding firewall submodule'
git push origin firewall

Note in the output that the Puppetmaster checks out the git submodule into the new environment. Go ahead and log into the Puppetmaster and look in your firewall environment; you should see all the manifests and whatnot there.

Here’s where I need to stamp a disclosure notice – git submodules aren’t all milk and honey. There are some funky situations you can get yourself into if you’re not careful. Thankfully, there aren’t many of those situations you can’t get yourself out of. I highly recommend reading the Pro Git chapter on submodules before doing anything with them.
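
One of those funky situations worth calling out: a fresh clone doesn’t populate submodule directories until you initialize them. A quick hypothetical example:

git clone git@git:puppet.git
cd puppet
git checkout firewall
git submodule update --init    # pulls modules/firewall at the commit recorded in the superproject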

Step 3: Implement Access Controls on Gitolite

This next step is entirely optional, but it works out well for us. We have a setup where I’m the only admin who can write to the master and testing branches of our git repo, but any sysadmin can create their own branch, test it, and delete it if need be. Setting up gitolite is far beyond the scope of this post, but if you have about an hour of free time, you can have it up and running. Below I’ve pasted the relevant snippet from gitolite.conf that enforces those permissions.

repo    puppet
    RW+                     = JustinEllison
    R                       = @SysAdmins Fisheye-puppet PuppetMaster
    -   master testing      = @SysAdmins
    RW+                     = @SysAdmins

Gitolite evaluates these rules in order: the deny rule (‘-’) stops @SysAdmins from pushing to the master and testing branches, while the final RW+ line lets them push any other branch.

Step 4: Profit!

To summarize it all, here’s the workflow for an admin to add a new feature in our Puppet setup:

  1. Create a new VM which will be the testing ground for the new feature.
  2. Create a local feature branch to implement the new feature in. The admin iterates on this branch (pushing the branch to origin) while getting things working with his VM.
  3. Once he’s happy with the results on his VM, he’s required to log in to another sandbox VM and run the agent against the same Puppet branch with the --noop flag to ensure nothing unintended happens.
  4. At this point, the positive and the negative have been tested, and he then asks me to merge the feature branch into master.
  5. I then do a
git diff ...origin/newfeature
     and we go over any changes before I merge it in.
  6. From there, we follow our normal deployment method of tagging a release, and manually checking out the tag on the Puppetmaster.
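
For completeness, a hypothetical version of that tag-and-deploy step (the tag name and the location of the production checkout are placeholders):

git tag -a 1.4.0 -m 'Release 1.4.0'
git push origin master --tags

# On the Puppetmaster, in the checkout that serves the production environment:
git fetch origin --tags
git checkout 1.4.0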

While it’s certainly not perfect, this workflow setup has allowed us to work together as a team while still implementing some best practices. In particular, the dynamic environments allow us to test our features extensively before releasing them into production. This is especially important in a team where the admins aren’t Ruby programmers that can write puppet-rspec tests.

Before the integration of git submodules with the dynamic environment workflow, we were manually merging external repos into our own setup, and it was an absolute nightmare. Now, to update our repo to use a new version of someone else’s module, we just create a new feature branch, update the submodule, test, and merge.
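
As a hypothetical example of that workflow, bumping the firewall submodule from Step 2 to a newer upstream commit looks roughly like this (the branch name and the commit or tag to pin are placeholders):

git checkout -b update-firewall
cd modules/firewall
git fetch origin
git checkout <commit-or-tag-to-use>    # pick the upstream version you want
cd ../..
git add modules/firewall
git commit -m 'Bump puppetlabs-firewall submodule'
git push origin update-firewall        # test the new environment, then merge to master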

What workflows do you and your team use that make life with Puppet better? Please share below.

VPS.net Review

I’ve been running a single node from VPS.net for about a year now. Please note that my specific experience has been in their Chicago Zone D data center, but if you check out their status page or search Twitter, you’ll find a lot of others having the same issues. While there are a lot of good things to write about, where they fail is in the area most important to me: availability.

The pros of using VPS.net include pricing, the control panel, and console-level access. As is typical for a VPS provider, they give you many “add-on” options such as backup, etc., that you can enable – I’ve not investigated them myself. Perhaps one of the nicest features is the ability to add server resources or “nodes” on the fly with minimal downtime.

However, it seems that VPS.net has made a horrible choice in selecting the SAN they use to back their VMs. Examine the graphic below: as you can see, I’m getting less than two nines worth of uptime from my node. Each and every time there’s been an issue, support has been quick to point out that they’ve had some sort of SAN issue and that the SAN is ‘resyncing’. The problem is that while the SAN is resyncing, I/O to my node is so horrible that I can’t cat a 500 byte file to stdout in less than 10 seconds. So, the node will respond to a ping, but it can’t serve up a static image via Apache. For all intents and purposes, that’s down in my book. The last SAN synchronization took the better part of two days, during which time my node was unusable.

In my experience, the SAN is the most important building block when architecting a service that’s meant to be highly available. Until VPS.net can address their SAN issues, they are likely to continue to have prolonged downtimes. Until that’s been fixed, there’s just no way I can recommend their services to anyone.
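
If you want to put rough numbers on that kind of degradation yourself, a couple of quick checks (the file paths here are just examples) tell the story:

time cat /var/www/html/robots.txt                              # should return instantly on healthy storage
dd if=/dev/zero of=/tmp/iotest bs=1M count=64 conv=fdatasync   # crude write-throughput check
rm -f /tmp/iotest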

It's Not You, It's Me: Call for Node Gallery Co-maintainers

There’s only a certain amount of bandwidth in a person’s day. As you get older, that bandwidth seems to become more and more constrained. Kids are extreme bandwidth hogs :) Over the years I’ve found that I have enough bandwidth in my life to deal with one obsession outside my day job at any given time. For the last couple of years, that obsession has been Drupal, and specifically Node Gallery. In my very biased opinion, it’s the most user-friendly and integrated gallery experience you can have with Drupal 6.x. Also IMHO, there’s a huge void in Drupal 7 with respect to butt-kicking gallery modules, one that’s begging to be filled with a Node Gallery 7.x branch. But I just can’t bring myself to run that one simple git command.

I’ve had several changes at work in the past year, and I’m no longer working with Drupal and PHP on a regular basis. I’ve become enthralled with Puppet as of late, and that’s proven to be the gateway drug to the devops movement for me. I’m reading books on Kanban, learning a bit of Ruby, building deployment pipelines, and soaking up anything I can on devops. It seems sysadmins who can code really do have a place in the world, and it appears to be in devops. It’s not burnout, it’s simply a matter of prioritizing demands on a limited resource. There’s just no time left over for Drupal anymore.

Back to the point of this post – Node Gallery needs a co-maintainer who can take the module into the 7.x branch. The recently released 6.x-3.x branch has proven to be quite stable, and would likely require only very minimal maintenance. You can take it for a spin on the demo site, or read all about its features on the project page. Here are some quick points:

  • It has a reported user base of just under 3,800 sites, which puts it at right around #400 on the top modules list.
  • It has a great user base that’s proven to be active in the issue queue. Many of the support requests have been resolved by members of the community who have never written a line of code. It has a strong German presence, and has been translated.
  • It integrates very tightly with Views, and supports bulk uploading with Plupload. It has its own access module in Node Gallery Access, as well as a handful of other modules (all of which are listed on the project page) it integrates with very well.
  • It’s been engineered to perform well from the start. If your server can handle the load of 100,000 nodes, there’s no reason it shouldn’t be able to handle 100,000 Node Gallery image nodes – even if those are all in one gallery.
  • The administration UI aims to provide a working gallery setup out-of-the-box that works for 90% of the users, yet provide enough buttons and knobs for the remaining 10% to be able to tweak what they need.
  • It runs the gamut of technologies in Drupal; making use of caching, Views integration, jQuery and jQuery UI, CCK, Node Access, Batch API, etc.
  • What differentiates Node Gallery from most other gallery modules is that each and every image in a gallery isn’t just a field, it’s an entire node. This opens up huge possibilities for interactions with other contrib modules. The original reason I selected Node Gallery was that it was the only way I could sell individual images using Ubercart.

Who I’m looking for:

  • This module is likely a bit complex for someone who’s never maintained a module before. If you’ve maintained your own Drupal module (either privately or on d.o), take a look at the code and make sure you can understand what’s going on.

  • Drupal 7 API experience is a must; experience in migrating D6 modules to D7 is a plus.
  • Ideally, you need to have an “itch that needs scratching” – in other words, you should probably have a need for an image gallery.

If you’d like to take a crack at bringing Node Gallery to Drupal 7, contact me, or file an issue in the issue queue.

Drupal on Heroku

Heroku has been around for a while now, but has primarily been a Rails host. Well, until recently anyway. With the announcement of their Facebook integration, many others have noted that *any* PHP app can at least parse on Heroku’s cedar stack. I’ll be honest, it took me longer to get Ruby and Rails set up on my MacBook than it did to get a proof-of-concept installation of Drupal up and running. Here’s what I did:

  1. Get ruby, rails, and the heroku gem installed and running. This page had me up and running pretty quickly on my Mac.
  2. Download and extract Drupal:
curl http://ftp.drupal.org/files/projects/drupal-6.22.tar.gz | tar -xzvf -; cd drupal-6.22
  3. Initialize your git repo:
git init
  4. Here’s what makes all this proof-of-concept only. Many of the features used in Drupal core’s .htaccess file assume that the webhost has enabled the “AllowOverride All” option. Heroku doesn’t allow this; it only allows a small subset of overrides. DOING THIS WILL MORE THAN LIKELY COMPROMISE THE SECURITY OF YOUR DRUPAL INSTALL. Open up .htaccess in your editor, and comment out any line that starts with these strings (a sed one-liner for this is sketched after this list):
    • Order
    • Options
    • DirectoryIndex
    • php_value
  5. Add Drupal to git, and commit:
git add .; git commit -m 'initial commit'
  6. Create your heroku application. You’ll need to have signed up for a free account on http://www.heroku.com and give the following command your login credentials:
heroku create --stack cedar
  7. Push your code up to heroku (note the URL it gives you back):
git push heroku master
  8. Now, we need to set up the Postgres instance:
heroku addons:add shared-database
  9. Let’s display our Postgres credentials:
heroku config
  10. You can now hit your Drupal instance at the URL given to you by your last git push. Install as you normally would, selecting Postgres as your database, and filling in the user, password, database, and host given to you by ‘heroku config’. Make sure to change the host from localhost under the “Advanced” fieldset.
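
If you’d rather script the .htaccess edit in step 4 than do it by hand, a sed one-liner along these lines works (double-check the result, and keep the security caveat above in mind):

sed -i.bak -E 's/^[[:space:]]*(Order|Options|DirectoryIndex|php_value)/#&/' .htaccess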

At this point, you can poke around your install, and start seeing what all else is broken :) ‘heroku logs -t’ is your friend. If you don’t believe me, here’s a D7 instance, and here’s a D6 one.

Seriously, the .htaccess point is a deal-breaker. Unless someone with more time on their hands than I have can suggest a more secure configuration (or Heroku allows Drupal to override all), there are some serious security ramifications to commenting out those lines in .htaccess.

Drupal is definitely slow on Heroku’s free plan, but I mean, it’s free; what did you expect? Drupal 6 seemed to work throughout, but I noticed when getting D7 up and running that I couldn’t hit some “heavy” URLs like /admin/configure and /admin/reports/status. I could get into other sub-menus such as /admin/configure/development/performance. We all know D7 takes a fair amount of horsepower to run, and horses aren’t free :). The whole point of Heroku is being able to scale your app by dragging a slider in a web UI, and there’s no reason to believe that Drupal wouldn’t start running much faster given more resources from a non-free plan.
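
If I remember the cedar stack correctly, you don’t even need the slider – scaling is a one-liner from the CLI (the dyno count here is arbitrary):

heroku ps:scale web=2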

The point of this blog post was to just jot down my notes and save someone else a little time in getting started – hopefully the community can come up with some ideas so we have another awesome choice in Drupal hosting!

Selecting the Right CDN for YOUR Website

At one of my jobs, we recently went through the process of selecting a CDN (Content Delivery Network) for our site. While the first rule of CDN’s is that “any CDN is better than no CDN”, it can be argued that certain CDN’s are a better fit in certain situations than others. This post is basically a summary of the process we went through when selecting our CDN. By no means is this a statement that “XYZ is better than ABC”; it’s simply documentation of the process we went through in order to select the right one for our business. While most CDN’s are compatible with Drupal via excellent contrib modules such as CDN, the information presented in this article applies to any website and isn’t Drupal-specific.

To illustrate the importance of a CDN with real numbers: one image fetched from our data center to our office takes about 323ms. That same image fetched from Seattle takes 483ms, and from Washington, DC, it takes 599ms. The worst cases appear overseas – fetching the same image from Paris takes 1,141ms on average.
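
For reference, curl’s timing variables are an easy way to take measurements like these yourself (the URL below is a placeholder):

curl -o /dev/null -s -w 'dns: %{time_namelookup}s  connect: %{time_connect}s  total: %{time_total}s\n' \
  http://www.example.com/images/sample.jpg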

A Content Delivery Network (CDN) shortens the distance between your static content and the end-user. While the text on most web pages is dynamic, most images, JavaScript, and CSS are static. These static objects make up a large percentage of the total bytes downloaded for each page view. By using a CDN, you place static content as close to the end-user as possible, which in turn decreases the page load time an end-user experiences by leaps and bounds.

Pre-selection Criteria

There’s a plethora of CDN’s to choose from, and if you don’t filter the initial list down to five or fewer providers, you’ll end up spending months in evaluation time. By defining specific must-have features, we were able to limit the initial number of companies to compare to four. Many CDN’s provide value-add services above and beyond static objects, such as “Dynamic Site Acceleration” – this evaluation looked solely at serving up static file content, e.g. JPEG, GIF, CSS, and JavaScript.

The filtering properties we used to limit scope were:

  1. The CDN must provide “origin pull” or “reverse proxy” support. If the CDN receives a request for a file that doesn’t exist at the edge, it applies a customer-defined URL rewrite to the request, and proxies the request to the origin site. If the image exists at the origin, the edge server caches the image locally and serves it to the client from there. For example, the CDN host name might be cdn.example.com (which points to the edge), and the origin site (my server) would be www.example.com. If I point my browser to http://cdn.example.com/logo.gif, and that file doesn’t exist at the edge, the CDN will make a request for http://www.example.com/logo.gif. If that file exists, it is fetched and cached. If it doesn’t exist, a 404 is returned to the client. The trade off is that you don’t have to pre-seed static content to the CDN, but the first user request for a static object takes a bit longer to complete (because it results in two requests instead of one). Once the edge network’s cache is primed, there is no performance difference between origin pull and CDN origin.

  2. The CDN must propagate cache-related HTTP headers from the origin to the end-user. We’ve gone to great lengths to use versioned filenames so that we can set far-future Expires headers on 99% of our static content, as recommended by Yahoo’s “Best Practices for Speeding Up Your Website”. This results in far fewer HTTP requests to render a page the end-user has already visited, ultimately decreasing page response time. Some CDN’s that offer origin pull do not proxy these headers back (the curl sketch after this list is a quick way to check).

  3. The CDN must use GZip compression on text-based content. Most CDN’s support this, but it’s something you definitely want to check. When serving up static text-based content such as CSS or JavaScript, the CDN can and should compress it for you before sending it to the end-user. Compression makes the overall page content smaller, and therefore faster to render.

  4. Response time must be consistent and fast. Performance is a tricky thing. While having the fastest response time overall didn’t guarantee that a CDN would “win”, consistently poor relative performance would guarantee that a CDN would “lose”. Try not to focus too much on performance numbers – most of the CDN’s will have a standard deviation of less than ten milliseconds between each other. In our research we found out quickly that there are a lot of features more important to us than 5ms worth of response time.

  5. 100% uptime SLA. Since a CDN is, at its most basic level, a geographically distributed cluster of cache servers, it should be implied that a CDN can provide 100% uptime. If one POP goes down, requests should be automatically routed to the next nearest POP. If your CDN doesn’t offer this, you need a new CDN.

  6. Company financial strength and solvency. This is something often overlooked in evaluations, but it was very important to us. There are a lot of CDN’s out there, and we found only two or three that could put in writing that they are a profitable corporation. Our implementation required a fair amount of work, and it would take us some time to switch to another CDN. If your CDN goes dark in the middle of the night, how long will it take you to switch?
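
A couple of quick, hand-rolled checks for criteria 2 and 3 above (the CDN hostname and file path are placeholders):

# Are the origin's caching headers visible at the edge?
curl -s -o /dev/null -D - http://cdn.example.com/css/site-v42.css | grep -iE 'cache-control|expires'

# Is text content compressed when the client asks for it?
curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' http://cdn.example.com/css/site-v42.css | grep -i 'content-encoding'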

Important Features

Whereas failing to meet any one of the requirements above would exclude a CDN from our comparison, the following features were key points of consideration. Not meeting them all wouldn’t exclude a CDN, but on the flip side, implementing them all would put the CDN in very good standing.

  1. Price. While high prices weren’t going to scare us away, bang for the buck played a large part in our decision. We weren’t interested in paying a premium for brand recognition.

  2. Strong international presence. Our guests include international clients, and poor static object performance for those clients was the key motivation for implementing a CDN in the first place.

  3. Contract terms. Some CDN’s do month-to-month, some do 12 month, others require longer as you negotiate price.

  4. Overage fees. CDN’s meter you on the amount of bandwidth you consume. You pay for a “bucket”. No CDN turns your service off after you exceed that bucket; they just bill you for overages. The good CDN’s will bill you at the same per-GB rate that you pay for your monthly bucket. Some CDN’s charge as much as 2x for overages. Avoid those.

  5. Traffic accounting. One other thing often overlooked with origin pull CDN’s is whether or not the traffic between the edge POP’s and the customer origin is counted as traffic against your total. Some CDN’s count it against your bucket, others don’t.

  6. Setup fees. CDN’s vary wildly on their setup fees. Some are free, some charge more than $5,000. Make sure you incorporate that cost into your decision.

  7. User Interface. All CDN’s offer some form of web-based interface. The quality of the interface differs greatly between CDN’s. I could swear that some of the interfaces I saw were written in CGI Perl in the late 90’s. Other interfaces offered everything a customer could ever want, including detailed analytics and reporting. Key questions to ask are “If I get a bad image out on the edge, how do I purge it?” and “How do I tell how much bandwidth is being consumed throughout the CDN at any particular point in time?”

External Reporting Data

We chose to invest in one month’s worth of reporting from CloudHarmony’s CloudReports service. This gave us a quick way to examine the performance of CDN’s from an actual end-user browser behind a real cable/DSL/dialup connection (not from a datacenter somewhere). While some might view those reports as expensive, we found it quite a bargain to have another independent view into the performance of a vast majority of CDN’s.

The Contenders

Given the above requirements, coupled with the performance data provided by CloudHarmony, we were able to refine our list of CDN’s to consider. In alphabetical order:

  1. Akamai
  2. CacheFly
  3. EdgeCast
  4. LimeLight

First elimination: Akamai

Akamai is to the CDN market what Bose is to the home audio market. While it’s not inherently a bad product, you’re paying a huge premium for the brand name. While we never got so far as to set up a demo account, the performance data provided by CloudHarmony and other sources didn’t favor them at all. My personal opinion (which is little more than a wild guess) on why Akamai doesn’t perform as well is their product’s age. Their network is by far the largest one out there, and I can guess that keeping up with the latest optimizations and protocols is a huge undertaking.

When speaking with Akamai, I got the impression that they really don’t care to sell their static object delivery product by itself. Their reps focused mostly on trying to upsell their Dynamic Site Acceleration product. While DSA might indeed be a great product, it wasn’t what we were interested in.

In the end, the best price I could get out of Akamai was more than twice that of the next most expensive CDN in our comparison, and they wanted a 3 year contract at that price. I’m just not that into paying twice as much for an equal product, so they were eliminated. If we should move to a Dynamic Site Acceleration type of service later, Akamai will definitely be re-evaluated at that time.

Second elimination: LimeLight

LimeLight Networks is the second-largest CDN provider, behind Akamai. It’s fitting that they are right behind Akamai, because they came across like a smaller Akamai to me. Their pricing is much more competitive than Akamai’s, and performance appeared to be quite good across the board. They supposedly have a nice web and reporting interface, but I was unable to get a demo set up without filling out paperwork that would have required approval from our legal department. Therein lies the problem with LimeLight – getting them to do anything outside the everyday norm was like pulling teeth. Like Akamai, LimeLight is also focused on the upsell and seemed to me generally disinterested in selling their static delivery service.

If for some reason we had to switch away from our primary choice, my second choice would likely be LimeLight Networks, but only after I was able to obtain a demo account so that I could verify that their performance was within an acceptable range and evaluate the functionality of their user interface.

Independent Performance comparisons

I was able to easily procure demo accounts from EdgeCast and CacheFly, so I set up some performance testing of our own, using Pingdom to download a typical JPEG image from each Pingdom POP via the origin pull setup. Note that since Pingdom’s servers are in data centers and not in actual residences, this isn’t a measure of end-to-end performance, but rather an apples-to-apples way to compare response times from various regions around the world (the “Data Center” rows below are the same image served straight from our origin data center, with no CDN in front). The executive summary here is that while EdgeCast “edged” out CacheFly, the real message is that any CDN is so much better than none at all:

CDN   US/Non-US   Location   # of Polls   Avg Response Time (ms)   Max Response Time (ms)   StdDev (ms)
CacheFly Non-US Amsterdam 2, Netherlands 289 68 4202 285.98
Copenhagen, Denmark 259 158 461 36.02
Frankfurt, Germany 287 41 567 32.38
London 2, UK 290 29 2489 145.26
London, UK 284 29 127 11.30
Madrid, Spain 259 201 586 31.36
Manchester, UK 281 129 1709 184.87
Montreal, Canada 286 105 3084 250.63
Paris, France 286 143 521 60.11
Stockholm, Sweden 286 54 882 80.88
Non-US Total 2807 94 4202 157.88
US Atlanta, Georgia 289 16 398 23.52
Chicago, IL 288 56 2615 158.33
Dallas 4, TX 286 40 960 74.61
Dallas 5, TX 289 26 1506 89.08
Dallas 6, TX 291 47 1473 132.25
Denver, CO 289 216 925 72.18
Herndon, VA 288 473 3472 196.13
Houston 3, TX 289 107 382 18.15
Las Vegas, NV 288 74 3044 180.60
Los Angeles, CA 289 12 92 11.52
New York, NY 289 175 2571 152.29
San Francisco, CA 287 28 231 24.17
Seattle, WA 288 174 1083 108.41
Tampa, Florida 267 68 3048 214.49
Washington, DC 286 163 1547 141.67
US Total 4303 112 3472 170.11
CacheFly Total 7110 105 4202 165.61
EdgeCast Small Non-US Amsterdam 2, Netherlands 284 62 381 27.49
Copenhagen, Denmark 254 126 1148 87.72
Frankfurt, Germany 284 40 318 19.05
London 2, UK 284 26 975 59.59
London, UK 283 23 191 14.38
Madrid, Spain 252 176 1174 112.31
Manchester, UK 275 86 1494 118.26
Montreal, Canada 283 163 601 59.56
Paris, France 283 94 1537 140.76
Stockholm, Sweden 271 162 967 81.87
Non-US Total 2753 94 1537 99.35
US Atlanta, Georgia 284 129 523 34.51
Chicago, IL 284 26 463 35.86
Dallas 4, TX 277 30 339 25.79
Dallas 5, TX 284 26 581 50.32
Dallas 6, TX 284 23 430 33.68
Denver, CO 281 244 2169 150.12
Herndon, VA 280 24 301 20.44
Houston 3, TX 281 115 441 40.02
Las Vegas, NV 281 56 559 34.32
Los Angeles, CA 283 11 94 8.45
New York, NY 284 72 1134 161.16
San Francisco, CA 280 23 118 11.01
Seattle, WA 282 131 3571 333.38
Tampa, Florida 260 166 4977 303.29
Washington, DC 282 83 686 111.97
US Total 4207 77 4977 148.63
EdgeCast Small Total 6960 84 4977 131.64
Data Center Non-US Amsterdam 2, Netherlands 292 837 1344 35.72
Copenhagen, Denmark 262 990 4195 297.90
Frankfurt, Germany 291 867 1533 57.14
London 2, UK 291 725 1065 25.95
London, UK 290 811 1114 49.50
Madrid, Spain 262 1005 1765 75.84
Manchester, UK 281 899 8928 580.11
Montreal, Canada 291 342 412 11.52
Paris, France 293 1128 2680 230.78
Stockholm, Sweden 292 1063 4056 367.89
Non-US Total 2845 864 8928 326.71
US Atlanta, Georgia 291 316 1017 63.48
Chicago, IL 290 170 253 7.02
Dallas 4, TX 292 191 3214 253.67
Dallas 5, TX 292 145 263 14.52
Dallas 6, TX 291 147 358 22.93
Denver, CO 291 71 272 14.63
Herndon, VA 293 316 487 15.43
Houston 3, TX 293 177 372 19.66
Las Vegas, NV 290 246 3194 392.02
Los Angeles, CA 291 303 1188 57.60
New York, NY 290 346 1120 123.55
San Francisco, CA 293 229 519 22.28
Seattle, WA 290 489 1078 170.33
Tampa, Florida 270 331 4105 247.99
Washington, DC 290 595 1511 235.84
US Total 4347 271 4105 208.20
Data Center Total 7192 506 8928 390.52

… and to really drive the point home for the PHB’s, we consolidate the data and give a very telling graph:

Third elimination: CacheFly

CacheFly is an up-and-comer in the CDN arena. They have very aggressive pricing, and very good performance as well. If the site in question were a popular blog or community website and very price sensitive, I would select CacheFly as my first-choice CDN. However, where they fall short is in reporting and their web interface. The best way to contact their support department is via email or a web-based form. Their web interface left a huge amount to be desired, and they have very little documentation on how to use it. There is no reporting whatsoever – you get raw log files and have to write your own reporting scripts on top of that data. I couldn’t help but wonder about all the “what ifs”. What if we get an incorrect image cached and need to have it cleared from their network? If we see a DDoS at the CDN, how do we know? These and other similar questions are what ultimately eliminated CacheFly.

In CacheFly’s defense, I was told that they were working on a complete refactor of the user interface and was offered a chance to help beta it, but I was under time constraints and declined. The issues I had with the UI may or may not be present at the time of this writing.

The winner (for us): EdgeCast

It may appear from reading this post that I used a process of elimination to find the “lesser of all evils”, but understand that’s just the writing style I chose to convey the process. EdgeCast didn’t win by default – they earned it. Here’s why:

  • EdgeCast is routinely in the top tier of CDN’s in terms of performance.
  • Their support is very knowledgeable and responsive.
  • The sales reps care about your business and are willing to work with you.
  • They offer the most features of any CDN I evaluated. One such feature is “rollover”: if you don’t use all your allotted transfer for one month, the remainder gets added to the next month’s allotment. This is perfect for a business with holiday traffic spikes such as ours.
  • While they aren’t the cheapest CDN, they are certainly affordable, and offered the best “bang for the buck” for the feature set we needed.
  • Their UI is fully functional, offering configuration, reporting, and analytics in an easy-to-use fashion. The UI includes a full rules engine (for an additional charge) that allows you to apply actions such as cache purges and header changes based upon conditions like client IP, HTTP request header, etc.
  • Last but certainly not least, the company is one of only two profitable CDN’s in the market today.

IT’S NOT THE DESTINATION, IT’S THE JOURNEY!!!

Please don’t read this article and walk away saying “Justin recommends EdgeCast, that’s who we’re going with”. For one, if you’re letting my blog posts make business decisions for you instead of doing due diligence, then you’re doing it wrong.

For our very specific needs EdgeCast was the best fit. For your needs, you will very possibly arrive at a completely different decision, and that’s great. By all means, blog about it. What I’m trying to convey is that there are a lot of points of comparison when going through your evaluation, and not all of them are obvious. It’s hard to get an objective point of view when doing this on your own – this is my best attempt at documenting what I came across.

Hopefully if you haven’t implemented a CDN for your busy sites, this post will motivate you to do so. If you’re unhappy with your current CDN, perhaps this post has given you some insight on how to find a replacement. If you’re happy with your current CDN, please leave comments as to why.

Lastly, I was in no way influenced monetarily or otherwise by any vendors, and none of the links in this article contain referral ID’s. This is all my personal opinion and in no way represents the opinion of my employers.

Lead SysAdmin Position Available

There’s a blog post to follow with the when/why, etc., but without further ado: I’m moving to a new position at Buckle, and that means we need a new Lead SysAdmin. It’s a great job at a great company, in a great place to raise a family (Kearney, NE). You get paid well, get a good yearly budget for new toys and equipment, and it’s overall a very fun position. If interested, drop me a line, and I’ll make sure your resume gets the proper attention. To apply online, click here, and search for jobs within 5 miles of zip code 68845. The job title is “Web Development - Lead Systems Administrator”. Here’s the job posting:

JOB DETAIL

Job Title: Web Development - Lead Systems Administrator

Location: Buckle Corporate Office & Distribution Center, 2407 W 24th Street, Kearney, Nebraska 68845

Job Description: Lead Systems Administrator

Position Summary: The Lead Systems Administrator will be responsible for the deployment and maintenance of Unix/Linux systems and application software in multiple environments. The ideal candidate will possess a deep understanding of large scale Unix deployments and will lead the team responsible for the infrastructure serving all e-commerce and intranet applications. Additionally, this person must be able to function effectively in a fast-paced environment where projects range from maintenance to upgrades to new deployments and technologies. Our Systems Administrators also serve as Network Administrators for the smaller networks their systems reside in, so strong knowledge of Ethernet, TCP/IP, and network security is required.

Responsibilities:

  • Maintain all servers and workstations on the WSD team, including production, development, and staging servers for the e-commerce platform and company intranet
  • Set up, maintain, and manage an enterprise-class backup strategy for WSD team servers and workstations
  • Automate tasks via custom scripting
  • Assist in architecting and designing solid server solutions

Requirements:

  • Expertise in setting up robust and reliable server architectures. Additionally, a large appetite for automating the mundane is preferred and will be encouraged.
  • In-depth knowledge of technologies that include but are not limited to: Linux systems administration, Java VM tuning, WebLogic administration, Apache HTTPD/NGINX administration, all layers of TCP/IP, subnetting, IPSEC VPNs, ISC BIND, load balancing and clustering technologies, shell scripting, Nagios monitoring and RRD administration, the RPM packaging format, and patching best practices
  • A bachelor’s degree in Computer Science or another discipline
  • Minimum 4-5 years of previous system administration experience in a professional environment

Compensation: Market/negotiable; relocation assistance is a possibility for the right candidate.

Case Study in Migrating XML to Drupal Using Migrate

Sorry for the lack of posts as of late – a massive upgrade operation at $DAYJOB has had me out of commission for a few weeks. Also, I’ve had the great fortune to be part of a migration to Drupal, which exposed me to migrate and friends. Yes, I said “great fortune” in the same sentence as “migration” without using a negative - that’s just how awesome this module is. My first impression when looking at the documentation for migrate was that it didn’t seem complete. While it’s true that the documentation could be better (what module couldn’t use better documentation?), the problem is that no two migrations are alike. Because of this, the best documentation is not going to be written by the module authors; it will be written by the module users - they are the ones who come up with the recipes to fill the cookbook. There are several good reasons why there aren’t many recipes available:

  • Developers don’t like doing migrations. It can be painful, and often takes quite a bit of time.
  • Users don’t like migrations. They see a migration of data as something easily done, and they often get sticker shock when presented with estimates for a large migration.
  • Migration code is written in a flurry before the site is active. Right before launch, development crescendos, and then the code is often never used again (because no two migrations are the same).

This being my first migration, I vowed that I would document my experience, because I learned so much from it. In this particular migration, we had to migrate a huge XML file into about 2,200 nodes across 3 content types. Read on for my contribution to the cookbook!

First, some discussion on the general workflow and some design decisions. Since I had to get the XML into the database before I could run the migrate, I wrote a command line script to do just that. When you need to manipulate data between your source and destination (i.e., change all references to www.olddomain.com to www.newdomain.com), you usually have to do this via the hooks that the migrate module provides. In my case, there were a few places where doing the data munging in the command line script was much easier than doing it within the hooks. The downside of making transformations within the command line script is that with every change, I had to re-run the script. This wasn’t a big deal, as the XML-to-MySQL script took around 15 seconds to complete.

I also quickly discovered that if you have fewer than 10 entities of one type (Story content type, user, etc.), it’s usually better to just hand-migrate them. The most straightforward migration will take 1 hour at a minimum to set up and test – if it will take less than that to copy/paste, save your time and do it the less sexy way.

Since we had to transform the XML into MySQL tables, and there was a lot of data in the XML that we didn’t need, I decided the best way to dynamically change what we import and what we don’t was by using hook_install() and Drupal’s DB schema API. By naming the MySQL table columns the same as the XML attributes, we can add and remove data to be transformed quite easily.

Lastly, I need to reiterate that this was my first migration. What I describe here works for me, but may very well not be the best way to do it. Also, I will not duplicate what you can learn from the migrate module documentation, so make sure to read that first. Let me know any suggestions you may have in the comments.

Install Module Dependencies

The first step is to install the module dependencies. You’ll need Views, Schema, and Table Wizard (tw). You’ll also want to install Migrate, and Migrate Extras if you want to do any work with CCK fields. I must admit that I hadn’t seen Table Wizard before this project, but it will always be present in my dev installs from here on out. If you find yourself using SQLyog, phpMyAdmin, or some other tool to simply look at data in the database, be sure to check it out.

Create Our Custom Module

As I mentioned above, we are relying on the Drupal Schema API to make a lot of this easy, so let’s make a custom module that sets up our schemas for us. We’ll call this module my_import. Create a new directory in your modules directory, and name it my_import. First, create my_import.info with this inside:

name = My Import
description = "An import module."
core = 6.x
dependencies[] = migrate
dependencies[] = migrate_extras
dependencies[] = content
dependencies[] = path_redirect
package = Database

Nothing too wild here, just requiring some modules that we’ll be using later. Now, create my_import.install in the same directory with this inside:

<?php
function my_import_schema() {
  $schema = array();
  
  $schema['clickability_articles'] = array(
    'fields' => array(
      'id' => array(
        'type' => 'int',
        'not null' => TRUE,
        'description' => t('Clickability article ID'),
      ),
      'createDate' => array(
        'type' => 'datetime',
        'not null' => TRUE,
        'description' => t('Clickability article creation date.'),
      ),
      'editDate' => array(
        'type' => 'datetime',
        'not null' => TRUE,
        'description' => t('Clickability article edit date'),
      ),
      'title' => array(
        'type' => 'text',
        'not null' => TRUE,
        'description' => t('Clickability article title'),
      ),
      'author' => array(
        'type' => 'text',
        'not null' => FALSE,
        'description' => t('Clickability content author (optional)'),
      ),
      'articleauthor' => array(
        'type' => 'text',
        'not null' => TRUE,
        'description' => t('Clickability article author'),
      ),
      'summary' => array(
        'type' => 'text',
        'not null' => TRUE,
        'description' => t('Clickability article short summary'),
      ),
      'body' => array(
        'type' => 'text',
        'not null' => TRUE,
        'size' => 'big',
        'description' => t('Clickability article body'),
      ),
      'placement' => array(
        'type' => 'text',
        'not null' => FALSE,
        'description' => t('Clickability article related article placement lists'),
      ),
      'thumbnail' => array(
        'type' => 'text',
        'not null' => FALSE,
        'description' => t('Clickability article thumbnail'),
      ),
      'image' => array(
        'type' => 'text',
        // @todo: Some articles do not have an image, but we require Master Image to be set.
        'not null' => FALSE,
        'description' => t('Clickability article image'),
      ),
      'image2' => array(
        'type' => 'text',
        'not null' => FALSE,
        'description' => t('Clickability article image page 2'),
      ),
      'image3' => array(
        'type' => 'text',
        'not null' => FALSE,
        'description' => t('Clickability article image page 3'),
      ),
      'master_image_byline_title' => array(
        'type' => 'text',
        'not null' => FALSE,
        'description' => t('Byline/accreditation caption for the master image'),
      ),
      'tags' => array(
        'type' => 'text',
        'not null' => FALSE,
        'description' => t('Comma-separated tags derived from the old website sections'),
      ),
      'status' => array(
        'type' => 'text',
        'not null' => TRUE,
        'description' => t('Clickability article status'),
      ),
      'websitePlacements' => array(
        'type' => 'text',
        'not null' => TRUE,
        'description' => t('Comma-separated list of old website placement URLs'),
      ),
      ),
    'primary key' => array('id'),
  );
  return $schema;
}

function my_import_install() {
  $ret = drupal_install_schema('my_import');
  return $ret;
}

function my_import_uninstall() {
  $ret = drupal_uninstall_schema('my_import');
  return $ret;
}

When I created the schema, I took care to make sure that the column names in my table exactly matched the attributes and elements I was looking to pull out of the XML file. This saves a lot of coding later. Any time you change the schema, you can either create a hook_update_N() function, or just change the schema and disable + uninstall + reinstall the custom module. I did the latter with a drush alias and it worked well. The hook_install() and hook_uninstall() functions simply add and remove the tables.
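
That disable/uninstall/install cycle is easy to script with drush; here’s a sketch of what it might look like (the drush alias, module path, and XML export path are all placeholders):

drush @mysite dis my_import -y
drush @mysite pm-uninstall my_import -y
drush @mysite en my_import -y
# then, from the Drupal root, re-run the import script against the XML export
php sites/all/modules/my_import/myimport.php -f /path/to/clickability-export.xml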

Setup the Command Line Script to Import the XML into the DB

Create the file myimport.php in your module directory, and paste in the following:

#!/usr/bin/php
<?php
// get the path to our XML file
$args = getopt("f:");

// Bootstrap Drupal
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Make sure my_import is enabled
if (!module_exists('my_import')) { 
  echo "I need the my_import module enabled!  Exiting.\n";
  exit(1);
}

/*
 * Make sure our media directory exists.
 * We will import from this directory into whatever directory filefield is configured for
 * so we should remove this dir when done with the migration.
 */
$media_dir = file_directory_path() .'/migrated';
echo "Media dir = $media_dir\n";
if (! is_dir($media_dir)) {
  mkdir($media_dir);
}

// Slurp in our XML file.  If your XML file is huge, watch your PHP memory limits
$xml = simplexml_load_file($args['f']);
echo "XML Loaded.\n";

$rowcount = 0;
// Here we iterate over each child of the root of the XML, which in our case is a Article
foreach ($xml->children() as $content) {
  // Setup our $obj object which represents a row in the DB, and use some caching to 
  // not abuse drupal_get_schema().
  $obj = new stdClass;
  static $schema = array();
  
  // Dereference our child from the parent $xml, or xpath performance sucks hard
  $content = simplexml_load_string($content->asXML());
  
  $table = NULL;
  $content_type = NULL;
  switch((string) $content['type']) { 
    // Add more case statements for more content types as needed
    case 'Article':
      $table = 'clickability_articles';
      $content_type = 'article';
      break;
    // All cases below are silently ignored - we are not importing them
    case 'Book Reviews':
    case 'Blog Topic':
    case 'Event':
    case 'Job': 
      // Ignored
      break;
    default:
      // Any content type not accounted for gets reported
      echo "Warning: unknown content of type ". $content['type'] ."\n";
  }
  if (isset($table)) {
    if (! isset($schema[$table])) {
      // Get the table schema from Drupal
      $schema[$table] = drupal_get_schema($table);
      // On first run, truncate the table
      $sql = "truncate table {$table}";
      db_query($sql);
      echo "$table truncated.\n";
    }
    // This function does the heavy lifting, creating the $obj object from the XML data
    $obj = xml2object($content, $schema[$table], $content['type']); 
    // There are some cases where $obj is intentionally null, only write to the db if not null
    if ($obj) {
      $ret = drupal_write_record($table, $obj);
      if ($ret) {
        $rowcount++;
      }
    }
  }
}
echo "Inserted $rowcount rows.\n";

function xml2object($xml, $tableschema, $content_type) {
  
  global $media_dir;
  $obj = new stdClass;
  // Our main iterator is the column names in the table
  foreach (array_keys($tableschema['fields']) as $field) {
    switch($field) {
      case 'master_image_byline_title':
        // This field is populated when we work with the images later on
        break;
      case 'id':
        $obj->$field = $xml[$field];
        break;
      case 'status':
        $obj->$field = (string)$xml->$field;
        break;
      // A Clickability placement roughly corresponds to a Drupal term
      case 'placement':
        $element =  array_pop($xml->xpath("//field[@name='$field']"));
        $obj->$field = (string)$element->row->value;
        $obj->$field = map_taxonomy($obj->$field, $content_type);
        break;
      case 'author':
        $element =  array_pop($xml->xpath("//field[@name='$field']"));
        $obj->$field = (string)$element->value;
        break;
      case 'image2':
      case 'image3':
        // Combine image2 and image3 elements in Clickability into our multivalue filefield as csv
        if ($content_type == "Article") {
          $mediaplacement = array_pop($xml->xpath("//mediaPlacement[@name='$field']"));
          // migrate module requires full path to filefield source
          $obj->$field = getcwd() .'/'. $media_dir .'/'. (string)$mediaplacement->media->path;
          if (substr($obj->$field, -1, 1) == '/')  {
            $obj->$field = NULL;
          }
          else {
            if (!empty($obj->image)) {
              $obj->image .= ",";
            }
            $obj->image .= $obj->$field;
          }
        }
        break;
      case 'thumbnail':
      case 'image':
        $mediaplacement = array_pop($xml->xpath("//mediaPlacement[@name='$field']"));
        // migrate module requires full path to filefield source
        $obj->$field = getcwd() .'/'. $media_dir .'/'. (string)$mediaplacement->media->path;
        // Check the schema.  If the field is required, then fill in a default, otherwise wipe it
        $required = $tableschema['fields'][$field]['not null'];
        // If the file path ends in a /, then the XML did not have an image for this article
        // -- if we require one, make a default
        if (substr($obj->$field, -1, 1) == '/')  {
          if ($required) {
            echo "$content_type with ID of ". $obj->id ." does not have a $field.  Adding test.gif.\n";
            $obj->$field .= "test.gif";
            touch($obj->$field);
          }
          else {
            // NOTE: We need this patch for this to work: http://drupal.org/node/780920
            $obj->$field = NULL;
          }
        }
        else {
          // Transfer the caption on the image in the XML to the CCK byline accreditation
          $obj->master_image_byline_title = (string)$mediaplacement->caption;
          // See if the file exists on the filesystem
          if (! file_exists($obj->$field)) {
            // Nope, let's fill it in with our default image
            echo $obj->$field ." does not exist, replacing with test.gif.\n";
            $obj->$field = preg_replace('#^(.*)/(.*)$#', '\1/test.gif', $obj->$field);
          }
          
          // Replace .bmp with .jpg
          $jpg = preg_replace('/\.bmp$/', '.jpg', $obj->$field);
          if ($jpg != $obj->$field) {
            if (file_exists($jpg)) {
              $obj->$field = $jpg;
            }
            else {
              // Tell the user what to do to create the image and exit.
              echo "ID ". $obj->id ." has a image of type bmp, and no jpg found on the file system.\n";
              echo "Create them by running 'mogrify -format jpg /path/to/*.bmp' and re-run this script.\n";
              exit(1);
            }
            
          }
        }
        break;
      // Any DB column not explicity defined above maps cleanly with the code below
      default:
        $obj->$field = (string)array_pop($xml->xpath("//field[@name='$field']"));
        break;
    }

  }
  
  // We assume it does not need imported until we prove otherwise
  $needs_imported = FALSE;
  $tags = array();
  $websitePlacements = array();
  foreach ($xml->xpath("//websitePlacement") as $websitePlacement) {
    // Only if the XML says the domain is www.newdomain.com do we need to import
    if ($websitePlacement->domain == 'www.newdomain.com') {
      $needs_imported = TRUE;
      
      // Grab the section from the websitePlacement (it comes back like '/some-section')
      $section = (string)$websitePlacement->section;
      
      // Convert the old "sections" into tag taxonomy, dropping the leading slash
      $tags[] = substr($section, 1);
      
      // Grab the old URL from the websitePlacement, and place it on an array
      $oldurl = $section .'/'. $obj->id .'.html';
      $websitePlacements[] = $oldurl;
      
      // If we do not have a placement yet, we try to set some form of taxonomy
      if (! isset($obj->placement)) {
        $taxo = map_taxonomy($section, $content_type);
        // NOTE: We need this patch for this to work: http://drupal.org/node/780920
        $obj->placement = $taxo;
      }
      // If the XML did not explicity tell us the createDate, we use the start date from the webSitePlacement
      if (empty($obj->createDate)) {
        $date = (string)$websitePlacement->startDate;
        $obj->createDate = substr($date, 0, strlen($date) -4);
        $obj->editDate = $obj->createDate;
      }
    }    
  }
  $obj->websitePlacements = implode(',', $websitePlacements);
  $obj->tags = implode(',', $tags);
  
  // Return the object only if we need it imported
  return $needs_imported ? $obj : NULL;
}

function map_taxonomy($oldtext, $content_type) {
  // Simple maps of Clickability placements to Drupal terms
  if ($content_type == 'Job') {
    return NULL;
  }
  if (preg_match('/building/i', $oldtext)) {
    return "Green Building";
  }
  if (preg_match('/(clean|energy)/i', $oldtext)) {
    return "Clean Energy";
  }
  if (preg_match('/financ/i', $oldtext)) {
    return "Finance";
  }
  if (preg_match('/food/i', $oldtext)) {
    return "Food & Farms";
  }
  if (preg_match('/marketing/i', $oldtext)) {
    return "Green Marketing";
  }
  if (preg_match('/recycled/i', $oldtext)) {
    return "Recycled Markets";
  }
  if (preg_match('/technol/i', $oldtext)) {
    return "Technology";
  }
  if (preg_match('/leaders/i', $oldtext)) {
    return "Business Leaders";
  }
  if (preg_match('/transportation/i', $oldtext)) {
    return "Transportation";
  }
  return NULL;
}

?>

Wow, that’s a lot of code. I’ve commented it pretty heavily, but here’s the “40,000 foot view” of what’s going on:

  • Lines 1-26: Nothing too fancy here. I should note that the script expects to be executed from your Drupal root directory. It grabs the path to the XML file from the command line and does some sanity checking.
  • Lines 28-30: Here we use PHP’s SimpleXML API to load the entire XML file into an object. If you have a huge XML file and/or small PHP memory limits, you will likely have to use XML Parser or another library. The power and convenience of SimpleXML is a pretty convincing argument to temporarily upping your memory limits in this case.
  • Lines 34-82: This is the main loop which iterates over each Article in the XML file. By looking at the content type in the XML record, we determine what table and content type to use for Drupal. The first time a schema is loaded, we truncate the table in the database. Once we determine some metadata about the record, we call xml2object() on line 72 which does most of the work for us. Once we have an object, we store it to the database.
  • Lines 84-222: Here we have the xml2object() function, and yes, it’s way too long and should be broken up. But hey, it’s migration code, who else will ever see it??? We’ll break it down more below.
  • Lines 89-182: This code runs a for loop around each column in the table. Since we’re using the Schema API here, we can safely assume that the column order specified in our install file will be duplicated when we fetch it in our script. For each column type in the table, it attempts to pull the data needed from the XML record, transform it if necessary, and store it in our $obj object. Read the code for details on what is happening to each field on the way in.
  • Lines 184-218: Now that we have iterated over all the fields in the schema, we can use the data stored in $obj to calculate other fields we need. Again, read the code for details, but here we are setting taxonomy terms, URLs for use with path_redirect, and filling in other data that may have been missing from the XML.
  • Lines 224-257: Is a simple example on how to statically map some data in the XML to return taxonomy terms in Drupal Now that we’ve got that out of the way, let’s create our module file.

Create my_import.module

Now that we’ve got the command line script out of the way, let’s create our module file. Create a file in your module directory named my_import.module. This file will contain the actual module used by Drupal and will implement some of the migrate module’s hooks. You might ask, why not deal with everything in the command line script? There are two primary reasons:

  1. You may come across a condition where you need the nid of the node (e.g. to create path redirects), or otherwise need to interact with the $node object. You can only get this information by implementing migrate’s hooks.
  2. I personally found it easier to manipulate taxonomy terms in the command line script and then rely upon the out-of-the-box code supplied with migrate to set up the node for me, but this has a drawback: any time you change the command line script, you must “clear” your imported data, re-run the command line script, and then re-import your data using the migrate module. If you make changes to your module instead, you only have two steps to test (clear and re-import).

Paste this code into my_import.module:
<?php
define('NUM_PARAGRAPHS_PER_PAGE', 6);

function my_import_migrate_prepare_user(&$user, $tblinfo, $row) {
  // Randomly assign passwords to users, forcing them to reset their password
  $errors = array();
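  // The /e modifier runs chr() on each captured digit, mapping 0-9 to the letters 'p' through 'y',
  // so a random six-digit number becomes a throwaway six-letter password.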
  $user['pass'] = preg_replace("/([0-9])/e","chr((\\1+112))",rand(100000,999999));
  return $errors;
}

function my_import_migrate_prepare_node(&$node, $tblinfo, $row) {
  $errors = array();
  // In Clickability, there were multiple states that represented "Published", here we map them.
  $status = $tblinfo->view_name .'_status';
  switch($row->$status) {
    case 'live':
    case 'APPROVED':
      $node->status = 1;
      break;
    default:
      $node->status = 0;
      break;
  } 
  
  if ($node->type == 'article') {
    // Paginate articles by inserting a pagebreak tag every 6th paragraph to emulate Clickability's pagination
    $paragraphs = preg_split('#<br />\s*<br />#s', $node->body);
    if (count($paragraphs) > NUM_PARAGRAPHS_PER_PAGE) {
      $node->body = '';
      $i = 1;
      foreach ($paragraphs as $paragraph) {
        if (($i % NUM_PARAGRAPHS_PER_PAGE) == 0) {
          $node->body .= $paragraph . "\n[pagebreak]\n";
        }
        else {
          $node->body .= $paragraph ."<br />\n<br />\n";
        }
        $i++;
      }
    }
    
  }
  return $errors;
}

function my_import_migrate_complete_node(&$node, $tblinfo, $row) {
  $errors = array();
  // Create redirects for old URLs
  $field = $tblinfo->view_name .'_websitePlacements';
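  // The command line script stored websitePlacements as a comma-separated list
  // (see the implode() call near the end of xml2object()).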
  foreach(explode(',', $row->$field) as $oldurl) {
    // Delete any old redirects
    if (substr($oldurl,0,1) == '/') {
      $oldurl = substr($oldurl,1);
    }
    path_redirect_delete(array('source' => $oldurl));
    $redirect = array(
      'source' => $oldurl,
      'redirect' => '/node/'. $node->nid,
      'type' => 301,
    );
    path_redirect_save($redirect);
  }
  return $errors;
}

Here’s the high-level breakdown; check the code and comments for the details.

  • Lines 5-10: Just a quick example of how to set a random password on any user that is imported.
  • Lines 12-24: hook_migrate_prepare_node() is executed before the node has been saved to the database, and is where the majority of your code should live. These lines set any article with a status of ‘live’ or ‘APPROVED’ to published in Drupal.
  • Lines 26-41: This code uses some regex magic to create a pagebreak every 6th paragraph. This is what Clickability did, and the client wanted to keep this on their migrated articles.
  • Lines 47-65: hook_migrate_complete_node() is called after the node has been saved to the database, so it has a nid at this point. The client wished to migrate their old URLs to Drupal – in order to do that, we must have the nid to know where to redirect to.

Create sample XML

Finally, let’s create some sample data so we can see how this all meshes together. Create the file content.xml in your module directory, and paste this into it:

<?xml version="1.0" encoding="utf-8"?>
<cmPublishImport>
  <content type="Article" id="7241321">
    <field name="title"><![CDATA[Donec risus purus]]></field>
    <field name="author">
      <value><![CDATA[Me]]></value>
    </field>
    <field name="articleauthor"><![CDATA[Me]]></field>
    <field name="date"><![CDATA[2007-04-29]]></field>
    <field name="summary"><![CDATA[Donec risus purus]]></field>
    <field name="body"><![CDATA[Donec risus purus, euismod eu volutpat ac, pharetra non nulla. Vestibulum quis neque lacus. Donec sit amet tortor nisi. Nam et lectus nec turpis consequat rhoncus. Proin porttitor, quam nec faucibus pulvinar, arcu magna facilisis erat, eu imperdiet risus tortor ac quam. Praesent non justo ac nisl ultricies condimentum a eget arcu. Nam in mi est. Donec risus orci, imperdiet ut tempus et, pulvinar nec diam. Donec eleifend pulvinar aliquam. Nulla faucibus turpis nec neque scelerisque convallis. Fusce gravida pulvinar quam, sit amet faucibus risus sodales ornare. Nullam arcu risus, lacinia vel faucibus at, auctor eget diam. Quisque a neque ac tellus bibendum luctus fringilla in lacus. Praesent id nunc eu dolor adipiscing consequat vel eget leo. Donec velit mi, pharetra quis tincidunt id, laoreet et dolor. Vestibulum fringilla rutrum arcu at accumsan.<br/>
<br/>  
Cras pellentesque sagittis mi. Pellentesque cursus nisl id nunc suscipit luctus. Duis pellentesque rhoncus sodales. Nullam dictum augue ac diam fermentum vel feugiat mauris euismod. Mauris nec metus eu sem tristique euismod. Etiam lorem est, accumsan vitae bibendum sit amet, tempus sit amet urna. Nullam lobortis adipiscing convallis. Nullam scelerisque sagittis tellus vitae interdum. Integer eget interdum nunc. Nam ligula orci, bibendum ac mattis eget, mollis at massa.<br/>
<br/>  
Vestibulum sodales elit vel est feugiat vitae dapibus erat ultricies. Proin auctor quam sit amet nisi aliquet pharetra. Curabitur tristique quam vel tortor gravida scelerisque. Morbi laoreet aliquet mi, sed imperdiet mauris mattis et. Praesent non quam nec lorem dapibus semper. Quisque vulputate neque et turpis placerat bibendum. Phasellus suscipit urna eget augue ullamcorper ultricies. Curabitur hendrerit dui sit amet elit elementum nec venenatis orci tempor. Fusce semper vestibulum odio vitae porta. Mauris non tellus non mi hendrerit suscipit in sed ante. Donec arcu neque, tristique ut elementum sed, suscipit at leo. Curabitur eget enim quis leo scelerisque laoreet et eget augue. Fusce posuere est ac felis fringilla consectetur. Nulla elit magna, pharetra sit amet tincidunt sed, tristique sed mi. Nam iaculis, elit sit amet condimentum blandit, massa neque pharetra justo, non ornare ligula ante non leo. Praesent ullamcorper suscipit tempus. In varius, neque eget volutpat posuere, velit odio luctus turpis, ac varius nulla erat sit amet justo. Quisque convallis mollis pharetra. Aliquam porta dolor quis nunc tempor vitae pharetra lectus ultricies. Fusce egestas sagittis sapien, sit amet pharetra sem ullamcorper a.<br/>
<br/>  
Ut dui tortor, porta eu ultrices sed, interdum vitae lectus. Integer facilisis velit sit amet dui ultricies lobortis. Fusce ut malesuada tellus. Aenean in nibh at lorem iaculis dictum vitae in nulla. Etiam dapibus lacinia eleifend. Aliquam erat volutpat. Nullam sit amet sapien ut risus consequat posuere eu quis quam. In lobortis fringilla felis quis pretium. Suspendisse non nisl libero, non tempor justo. Nunc volutpat nulla vitae lacus tincidunt feugiat congue sapien commodo. Suspendisse venenatis aliquet ante in hendrerit. Sed lectus ligula, gravida id tincidunt et, feugiat non justo.<br/>
<br/>  
Sed metus tellus, vestibulum in mollis quis, imperdiet et velit. Praesent suscipit elit et mi rutrum sit amet gravida augue iaculis. Etiam nec tellus nec augue porttitor pharetra. Vivamus feugiat mollis est, eu aliquam neque tempus a. Ut magna mauris, sollicitudin in ornare non, lacinia a lacus. Aenean porttitor magna ac sem ornare pellentesque. Aliquam mattis dolor in metus molestie ut feugiat mi auctor. Etiam laoreet pulvinar ipsum id bibendum. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Quisque porttitor convallis lacus, nec pretium leo varius non. Morbi non dapibus diam. Sed nec venenatis diam. Cras mollis porta tempor.<br/>
<br/>  
Donec ornare mi sed tellus porta luctus. Nulla euismod venenatis ante, in rhoncus felis ornare non. Cras tempor venenatis est at gravida. Etiam imperdiet dolor vitae ipsum lacinia imperdiet. Maecenas purus lorem, rhoncus non porttitor in, semper nec quam. Integer ullamcorper facilisis ultrices. Vivamus porttitor lacinia augue in venenatis. Quisque interdum euismod tellus, et consectetur nunc dictum sit amet. Maecenas pulvinar placerat mauris, quis auctor purus pellentesque at. Vestibulum vulputate, tellus id eleifend posuere, ligula erat hendrerit orci, nec lobortis tortor sapien ut ligula. Donec id augue leo, non consectetur nisl. In viverra dictum lorem eget blandit. Etiam tempus nisl ac nibh viverra id cursus eros luctus. Duis ut tellus nisi.<br/>
<br/>  
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Aliquam quis justo risus, eget eleifend nibh. Morbi quis dolor nulla, sed cursus metus. Vestibulum vel ipsum non erat tincidunt luctus et eget sapien. Nunc vel justo ante, vel auctor purus. Proin vulputate bibendum placerat. Fusce vel tincidunt nunc. Praesent at eros in dolor faucibus blandit et vitae magna. Fusce arcu nisl, sollicitudin sed accumsan sed, rhoncus at tellus. Ut ut mauris vel ipsum bibendum ullamcorper eget sed neque. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Sed dui massa, imperdiet sit amet lacinia id, rutrum sed orci. Proin pharetra risus eu risus gravida convallis id ac mi. Etiam a neque ut lacus convallis accumsan non eget arcu. Sed blandit velit id lectus tincidunt ut aliquet mi egestas. Aliquam cursus odio vitae turpis suscipit mollis et aliquam purus. Suspendisse pretium tincidunt porttitor. Nunc vestibulum, lacus at auctor laoreet, orci lectus volutpat diam, ut mollis risus lectus a ligula.<br/>
<br/>  
Phasellus id urna sit amet elit pretium viverra sit amet eget felis. Maecenas sed arcu sed eros fringilla commodo id non magna. Maecenas urna mauris, cursus vel mattis et, volutpat in purus. Nulla sapien orci, faucibus sed tincidunt tincidunt, tristique id lacus. Nullam tortor libero, porttitor eget faucibus eget, vehicula a enim. Suspendisse malesuada consectetur mattis. Integer dapibus dignissim tempor. In viverra luctus orci sed placerat. Suspendisse aliquam mattis diam mattis dapibus. Aenean suscipit purus eu ipsum dignissim in aliquet urna mollis. Duis nibh magna, fringilla eu ultrices posuere, sodales sed felis. Proin varius dignissim sem a consequat. Pellentesque facilisis felis vel mi malesuada placerat. Curabitur gravida euismod mi in molestie. Vivamus sit amet dui leo. Praesent mi justo, bibendum at rutrum ac, bibendum ut felis. Nullam nec dolor dui, quis imperdiet nulla. Morbi semper pulvinar risus.]]></field>
    <mediaPlacement name="image">
      <media id="577286">
        <path>images/1.jpg</path>
        <caption><![CDATA[Cows at Three Mile Canyon provide resources such as methane and compost for on-farm operations.]]></caption>
      </media>
    </mediaPlacement>
    <status>live</status>
    <websitePlacement>
      <domain>www.newdomain.com</domain>
      <section>/foodandfarms</section>
      <startDate dateFormat="yyyy-MM-dd HH:mm:ss zzz">2007-04-29 14:00:00 PDT</startDate>
    </websitePlacement>
    <websitePlacement>
      <domain>www.olddomain.com</domain>
      <section>/greenmarketing</section>
      <startDate dateFormat="yyyy-MM-dd HH:mm:ss zzz">2007-04-29 14:00:00 PDT</startDate>
    </websitePlacement>
  </content>
</cmPublishImport>

Enable the my_import Module and Run the Command Line Script

Now (finally), it’s time for some action. Enable your newly created my_import module, and jump out to the shell. Assuming your Drupal root is at /var/www/drupal, cd into that directory. Create the new directory sites/default/files/migrated/images, and place a jpg named 1.jpg in that directory. Now run the import script:

php5 ./sites/all/modules/my_import/myimport.php  -f ./sites/all/modules/my_import/content.xml

With luck, the script will succeed, and you will have 1 row of data in your clickability_articles table! If not, fix the error (if you’re using the sample data, let me know what went wrong and I’ll fix it). If you’d like to double-check the result from code, there’s a quick sketch below; after that, it’s on to Table Wizard configuration.
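Here’s what a minimal sanity check might look like. The clickability_articles table name comes from the walkthrough above; the check_import.php file name and the bootstrap lines are my own assumptions about running a throwaway script from the Drupal root, so treat this as a sketch rather than anything official.

<?php
// check_import.php - hypothetical helper, run as `php5 check_import.php`
// from your Drupal root after myimport.php has finished.
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Count the rows the import script wrote into our custom table.
$count = db_result(db_query('SELECT COUNT(*) FROM {clickability_articles}'));
print "clickability_articles contains $count row(s)\n";
?>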

Expose the Table to Table Wizard

All the hard work is done now - we can use a web UI from here on out. Visit /admin/content/tw in your browser, and under the “Add Existing Tables” fieldset, select the tables you imported with myimport.php. If your tables are huge (50K+ rows), you may want to select “Skip full analysis”. Click the “Add tables” button. At this point, that’s all we need from Table Wizard, but I strongly encourage you to click around a bit. The table analysis can tell you some handy things about your data.

Create the Migrate Content Set

In the previous step, we essentially built a view that we can provide to the Migrate module. Now we need to tell Migrate how to use that view. Visit the Migrate settings at /admin/content/migrate/settings. If you can, implement the changes it recommends to .htaccess, as they will speed up the import considerably. Also, make sure to expand the “Migration support implemented in the XYZ module” fieldsets and enable the support you need for your import.

Now, visit the dashboard at /admin/content/migrate. Expand the “Add a content set” fieldset, and fill in the values. When choosing the value for “Source view from which to import content”, scroll down towards the bottom of the list. All Table Wizard views are prefixed with “tw: ”, so the one we’re looking for here is “tw: clickability_articles (clickability_articles)”. You can leave “View arguments” and “Weight” at their defaults. The next screen is where the real magic happens. By interrogating the view, Migrate presents you with a field-mapping form that allows us to pick the source column from a dropdown to assign to each node element. If you have a setting that should remain constant across all imported records (“Node: Input format” is usually a good example), you can type in a default value here. The rest should be fairly self explanatory. Click “Submit changes”, and you’ll be taken back to the dashboard.

Run the Import, Clear the Import, Wash, Rinse, Repeat

Now, the way I did my testing was to choose one row from the source table to import. Grab its primary key and copy it to the clipboard. Check the box under “Import” for our content set, then expand the “Execute” fieldset. Paste the ID into the “Source IDs:” text field, and click the Run button. With any luck, you will be returned to the dashboard and the content set will show 1 record imported. Hopefully there will be no errors, but if there are, find and fix the problem. You can view the mapping from the old primary key to the new node ID by going back to /admin/content/tw and looking for a view named migrate_map_si_articles. This table is created by the Migrate module – it uses it to track what has been imported and what nid each imported node received. Grab that nid, and load up /node/[nid].

If it looks good, we can do a bigger import. Go back to the Migrate dashboard, and this time check the “Clear” checkbox next to the content set. Expand “Execute”, make sure all fields are blank, and click the Run button. This will “unimport” the row we just imported. Now, depending on your row count, you may want to import all rows and see what happens. Since I was dealing with thousands of nodes, I did an import of just 100 nodes to make sure things were okay. To do this, instead of specifying “Source IDs”, place the number 100 in “Sample Size” and click Run. To import everything, leave all fields blank.

The power to quickly and easily remove all changes made by the migrate module is huge. This “safety net” lets you work on the import within the same development sandbox as your designers and themers. They’ll appreciate having something other than “Lorem Ipsum” to look at!

Run to the Nearest Pub and Celebrate the Completion of Your Migration

If I have to explain this to you, you’re in the wrong field of work!

Summary

This post is my longest to date, and there’s a good chance I missed some things. By all means, let me know in the comments if you find any holes and I’ll get it corrected. I hope this case study helps some other Drupalers out there - when I first started this project I couldn’t find any examples on how to get XML into Drupal using the Migrate module. Now Google has some spider food :)

My Thoughts and Ramblings on DrupalConSF 2010

I had the great pleasure of attending my first DrupalCon this week. Held in downtown San Francisco at the Moscone Center, it was my opinion that this was Drupal’s “homecoming”. While Drupal wasn’t “born” in San Francisco, it seems to be the city that has the strongest following. The attendance numbers didn’t lie - I’m pretty sure they broke 3,000 attendees. I made this trip solo – I only knew three people that were going, and those three were only acquaintances I’d met via email/IM a few months before. When I left, I didn’t come home with “leads” or “contacts”, I came home with friends and role models, many of whom I plan on staying in touch with. I met most of the authors of the Drupal books I’ve read, associated faces with the podcasts and RSS feeds I subscribe to, and I even had the opportunity to quickly say thanks to Dries and shake his hand.

For those who didn’t know, archive.org has made the sessions available for download, so be sure to check those out. Read on for my “takeaways” from DCSF2010.

Please note that these are just what come to mind; I’m sure I’m forgetting huge topics. Please forgive me in advance for those!

Larry Garfield is my favorite presenter of the conference

Larry Garfield works for Palantir.net, and is one of the few people I’ve listened to who is immensely intelligent, yet speaks well and can even make a crowd genuinely laugh out loud. I attended his “Objectifying PHP” and “Views for Developers” sessions, and left feeling motivated and enlightened. My thanks go out to him, as he very obviously put a lot of preparation time into his presentations.

Drupal is methodically (pun intended) implementing OO

As evidenced by Larry Garfield’s “Objectifying PHP” and John VanDyk’s “Batch vs Queue” sessions, Drupal is refactoring portions of core into classes and methods where it fits. I’m part of the camp that welcomes the change, and can’t wait. I can’t help but wonder if we’ll alienate some contrib module authors in the process, but I’m sure it will bring the overall quality of contrib modules up a few notches.

David Strauss knows what he’s talking about

I’ve been in IT/Networking/Programming/etc. for about 20 years now. While I don’t claim to be the smartest person in the group at any point in time, I consider myself pretty well rounded. It’s been a long time since someone was able to truly talk so far over my head that I couldn’t keep up, but David Strauss of Four Kitchens did just that at the Chapter Three open house party. We discussed HipHop PHP, operating systems, configuration management, and god knows what else. I must have looked like a deer in headlights!

HipHop PHP will eventually run Drupal

I can say this because David Strauss is the one working on it. Enough said.

Microsoft is playing it smart

Instead of trying to compete with Drupal, they’re finally trying to help Drupal. I’m a hardcore anything-but-Microsoft OS kinda guy, but I can’t dispute that there are a lot of shops out there that already have well-versed SQL Server and IIS admins. Microsoft announced that they now have a native SQL Server driver for PHP, and that Drupal can now run on MS SQL Server. This will be a huge boon for getting Drupal into Microsoft-centric enterprises - there’s no longer a need to have a MySQL guy. Oh, and giving away free alcohol never hurt either :)

MongoDB will have a large impact on Drupal 7

Chx gave an excellent presentation - “MongoDB: Humongous Drupal”. He covered a lot about SQL, and how over the years it has become “best practice” to de-normalize tables to improve performance. We’ve all done that, but have you ever pondered that doing so breaks one of the fundamental rules of relational databases? While MongoDB is perfectly suited for logging and caching in Drupal, the biggest win is with Fields in Drupal 7. Each field you create results in a new table that must be added to a JOIN when building a node. Shops with a lot of fields on their nodes will likely see huge gains in performance by moving to MongoDB for those tables.

Big Drupal is Big

Hey, did you hear that Drupal powers whitehouse.gov? Seriously, there’s been a lot of progress in the past year with regards to making Drupal scale. Project Mercury from the great folks at Chapter Three makes Big Drupal easy, and is now supported on Amazon’s EC2, Rackspace, and Linode thanks to my stackscript. There was a huge amount of interest in Mercury and how it all works at the conference. The BOF session was great - unfortunately I missed the sessions where it was discussed in more detail.

Chapter Three rocks

Two out of the three people I knew coming into DCSF currently work for Chapter Three, and the third person used to work for them. Special thanks to Greg Coit and Kevin Montgomery for taking me under their wing and introducing me to all their colleagues. I also had the pleasure to meet Josh Koenig, albeit briefly. Seems the partner/CTO of one of the leading Drupal shops is a little busy at a DrupalCon. I ended up meeting a few other guys that I clicked really well with and hope to keep tabs on: Jeff Graham of FunnyMonkey, Rob Wohleb of Xomba.com, and Aaron Levy of Chapter Three - thanks for the beer and discussion!

Git will change Drupal.org

The migration to Git can’t happen fast enough for me. Aside from the ability to commit code on a plane, contrib modules will benefit greatly. When all is said and done, every new issue on drupal.org will have its own repository that any user will be able to commit to. Once the issue is resolved, the fix will be merged back into the main module repo. That should break down even more barriers for new contrib authors getting into Drupal development.

Dries Buytaert is really tall

Yes, Dries is very tall - at least 6'4" if I had to guess, but this is actually just a way for me to remind you that I shook Dries' hand :) I was more than a little starstruck!

Overall, I had a blast, and can’t wait for the next DrupalCon in the states. I heard it’s in Chicago – count me in! If you ever get the chance, I absolutely recommend that you attend.