Most of the pieces of this puzzle are well documented elsewhere; I’ll provide links where necessary.
The first step is to go read “Git Workflow and Puppet Environments” written by Adrien Thebo of Puppet Labs. Once you’ve implemented that setup, you should be able to do the following from your workstation:
git clone git@git:puppet.git
cd puppet
git checkout -b mybrokenbranch
echo "this line breaks everything" >> manifests/site.pp
git commit -am 'Intentionally breaking things'
git push origin mybrokenbranch
At this point, you have a new environment named ‘mybrokenbranch’ on your Puppetmaster. You can test the setup by ssh'ing into a client machine and running:
puppet agent --test --environment mybrokenbranch --noop
That obviously won’t be a happy Puppet run. The key point is that your other environments are not impacted by this one admin’s work. Now let’s delete the local and remote branch. From your workstation:
git checkout master
git branch -d mybrokenbranch
git push origin :mybrokenbranch
Note that the Puppetmaster reports that it’s deleted the environment. Feel free to verify that by re-running the above command on the Puppet client; it will complain about not having an environment.
With all that set up, let’s go ahead and implement support for git submodules. I have a pull request off to Adrien to implement this functionality, but until he merges it in, you can use my fork on github. Replace the update hook with the updated version on your git server. Now, let’s try pulling a git submodule into our repo. Again, from your workstation:
git checkout -b firewall
git submodule add git://github.com/puppetlabs/puppetlabs-firewall.git modules/firewall
git add .gitmodules modules/firewall
git commit -m 'Adding firewall submodule'
git push origin firewall
Note in the output that the Puppetmaster is checking out the git submodule into the new environment. Go ahead and log into the Puppetmaster, and look in your firewall environment, you should see all the manifests and whatnot there.
Here’s where I need to stamp a disclosure notice – git submodules aren’t all milk and honey. There are some funky situations you can get yourself into if you’re not careful. Thankfully, there aren’t many of those situations you can’t get yourself out of. I highly recommend reading the Pro Git chapter on submodules before doing anything with them.
This next step is entirely optional, but works out well for us. We have a setup where I’m the only admin who can write to the master and testing branches of our git repo, but any sysadmin can create their own branch, test it, and delete it if need be. Setting up gitolite is far beyond the scope of this post, but if you have about an hour of free time, you can have it set up and running. Below I’ve pasted the relevant snippet from gitolite.conf that enforces those permissions.
repo puppet
RW+ = JustinEllison
R = @SysAdmins Fisheye-puppet PuppetMaster
- master testing = @SysAdmins
RW+ = @SysAdmins
To summarize it all, here’s the workflow for an admin to add a new feature in our Puppet setup: the admin creates a feature branch, tests it in its dynamic environment, and then asks for a review. I compare the branch against master:
git diff ...origin/newfeature
We go over any changes, and I merge it in. From there, we follow our normal deployment method of tagging a release and manually checking out the tag on the Puppetmaster. A sketch of the full cycle follows.
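Here’s a minimal sketch of that full cycle; the branch and tag names are placeholders:
git checkout -b newfeature
git commit -am 'Add new feature'
git push origin newfeature
puppet agent --test --environment newfeature --noop   # on a test client
git diff ...origin/newfeature                          # review
git checkout master && git merge newfeature
git tag -a 1.1 -m 'Release 1.1'
git push origin master 1.1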
While it’s certainly not perfect, this workflow setup has allowed us to work together as a team while still implementing some best practices. In particular, the dynamic environments allow us to test our features extensively before releasing them into production. This is especially important in a team where the admins aren’t Ruby programmers that can write puppet-rspec tests.
Before the integration of git submodules with the dynamic environment workflow, we were manually merging external repos into our own setup, and it was an absolute nightmare. Now, to update our repo to use a new version of someone else’s module, we just create a new feature branch, update the submodule, test, and merge.
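As a sketch, that cycle looks like this (the branch name and module version are placeholders; the submodule path matches the firewall example above):
git checkout -b update-firewall
cd modules/firewall && git fetch origin && git checkout 1.0.0
cd ../..
git commit -am 'Update firewall submodule to 1.0.0'
git push origin update-firewall
# test in the update-firewall environment, then merge as usual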
What workflows do you and your team use that make life with Puppet better? Please share below.
What differentiates Node Gallery from most other gallery modules is that each and every image in a gallery isn’t just a field – it’s an entire node. This opens up huge possibilities for interactions with other contrib modules. I originally selected Node Gallery because it was the only way I could sell individual images using Ubercart. Who I’m looking for:
This module is likely a bit complex for someone who’s never maintained a module before. If you’ve maintained your own Drupal module (either privately or on d.o), take a look at the code and make sure you can understand what’s going on.
curl http://ftp.drupal.org/files/projects/drupal-6.22.tar.gz | tar -xzvf -; cd drupal-6.22
git init
Here’s what makes all this proof-of-concept only. Many of the features used in Drupal core’s .htaccess file assume that the webhost has enabled the “AllowOverride All” option. Heroku doesn’t allow this; it permits only a small subset of overrides. DOING THIS WILL MORE THAN LIKELY COMPROMISE THE SECURITY OF YOUR DRUPAL INSTALL. Open up .htaccess in your editor, and comment out any line that starts with these strings:
Add Drupal to git, and commit:
git add .; git commit -m 'initial commit'
heroku create --stack cedar
git push heroku master
heroku addons:add shared-database
heroku config
At this point, you can poke around your install, and start seeing what all else is broken :) ‘heroku logs -t’ is your friend. If you don’t believe me, here’s a D7 instance, and here’s a D6 one.
Seriously, the .htaccess point is a deal-breaker. Unless someone with more time on their hands than I have can suggest a more secure configuration (or Heroku allows Drupal to override all), there are some serious security ramifications to commenting out those lines in .htaccess.
Drupal is definitely slow on the free plan for Heroku, but I mean, it’s free; what did you expect? Drupal 6 seemed to work throughout, but I noticed when getting D7 up and running that I couldn’t hit some “heavy” URLs like /admin/config and /admin/reports/status. I could get into other sub-menus such as /admin/config/development/performance. We all know D7 takes a fair amount of horsepower to run, and horses aren’t free :). The whole point of Heroku is being able to scale your app by dragging a slider in a web UI, and there’s no reason to believe that Drupal wouldn’t run much faster given more resources from a non-free plan.
The point of this blog post was to just jot down my notes and save someone else a little time in getting started – hopefully the community can come up with some ideas so we have another awesome choice in Drupal hosting!
To illustrate the importance of a CDN with real numbers: one image fetched from our data center to our office takes about 323ms. That same image fetched from Seattle takes 483ms, and from Washington DC, 599ms. The worst cases appear overseas – fetching the same image from Paris takes on average 1,141ms for just that one image.
A Content Delivery Network (CDN) shortens the distance between your static content and the end-user. While the text on most web pages is dynamic, most images, JavaScript, and CSS are static. These static objects make up a large percentage of the total bytes downloaded for each page view. By using a CDN, you place static content as close to the end-user as possible, which in turn decreases the page load time an end-user experiences by leaps and bounds.
There’s a plethora of CDNs to choose from, and if you don’t filter the initial list down to five or fewer providers, you’ll end up spending months in evaluation time. By defining specific must-have features, we were able to limit the initial number of companies to compare to four. Many CDNs provide value-add services above and beyond static objects, such as “Dynamic Site Acceleration” – this evaluation looked solely at serving up static file content, e.g. JPEG, GIF, CSS, and JavaScript.
The filtering properties we used to limit scope were:
The CDN must provide “origin pull” or “reverse proxy” support. If the CDN receives a request for a file that doesn’t exist at the edge, it applies a customer-defined URL rewrite to the request, and proxies the request to the origin site. If the image exists at the origin, the edge server caches the image locally and serves it to the client from there. For example, the CDN host name might be cdn.example.com (which points to the edge), and the origin site (my server) would be www.example.com. If I point my browser to http://cdn.example.com/logo.gif, and that file doesn’t exist at the edge, the CDN will make a request for http://www.example.com/logo.gif. If that file exists, it is fetched and cached. If it doesn’t exist, a 404 is returned to the client. The trade-off is that you don’t have to pre-seed static content to the CDN, but the first user request for a static object takes a bit longer to complete (because it results in two requests instead of one). Once the edge network’s cache is primed, there is no performance difference between origin pull and a CDN-hosted origin.
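A rough way to watch origin pull happen, using the example host name above (timings are illustrative – the first request should be the slow one):
time curl -so /dev/null http://cdn.example.com/logo.gif   # miss: edge proxies to www.example.com
time curl -so /dev/null http://cdn.example.com/logo.gif   # hit: served from the edge cache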
The CDN must propagate cache-related HTTP headers from the origin to the end-user. We’ve gone to great lengths to use versioning of filenames so that we can set far-future Expires headers on 99% of our static content, as recommended by Yahoo’s “Best Practices for Speeding Up Your Website”. This results in far fewer HTTP requests to render a page the end-user has already requested previously, ultimately decreasing page response time. Some CDNs that offer origin pull do not proxy these headers back.
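Checking whether the headers survive the trip is quick (host name from the example above; the exact headers you see depend on your origin configuration):
curl -sI http://cdn.example.com/logo.gif | egrep -i 'expires|cache-control'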
The CDN must use GZip compression on text-based content. Most CDNs support this, but it’s something you definitely want to check. When serving up static text-based content such as CSS or JavaScript, the CDN can and should compress it for you before sending it to the end-user. Compression makes the overall page content smaller, and therefore faster to render.
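Verifying compression is one request away (the CSS path is a placeholder):
curl -sI -H 'Accept-Encoding: gzip' http://cdn.example.com/css/main.css | grep -i content-encoding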
Response time must be consistent and fast. Performance is a tricky thing. While having the fastest response time overall didn’t guarantee that a CDN would “win”, consistently poor relative performance would guarantee a CDN would “lose”. Try not to focus too much on performance numbers – most of the CDNs will have a standard deviation of less than ten milliseconds between each other. In our research we quickly found that there are a lot of features more important to us than 5ms worth of response time.
100% uptime SLA. Since a CDN is, at its most basic level, a geographically distributed cluster of cache servers, it should be implied that a CDN can provide 100% uptime. If one POP goes down, requests should be automatically routed to the next nearest POP. If your CDN doesn’t offer this, you need a new CDN.
Company financial strength and solvency. This is something often overlooked in evaluations, but it was very important to us. There are a lot of CDNs out there, and we found only 2 or 3 that could put in writing that they are a profitable corporation. Our implementation required a fair amount of work, and it would take us some time to switch to another CDN. If your CDN goes dark in the middle of the night, how long will it take you to switch?
Whereas not meeting any of the above requirements meant exclusion from our comparison, the following features were key points of consideration. Missing one of them wouldn’t exclude a CDN; on the flip side, implementing them all would put the CDN in very good standing.
Price. While high prices weren’t going to scare us away, bang for the buck played a large part in our decision. We weren’t interested in paying a premium for brand recognition.
Strong international presence. Our guests include international clients, and poor static object performance for those clients was the key motivation for implementing a CDN in the first place.
Contract terms. Some CDNs do month-to-month, some do 12 months, and others require longer terms as you negotiate price.
Overage fees. CDNs meter the amount of bandwidth you consume; you pay for a “bucket”. No CDN turns your service off after you exceed that bucket – they just bill you for overages. The good CDNs will bill you at the same per-GB rate that you pay for your monthly bucket. Some CDNs charge as much as 2x for overages. Avoid those.
Traffic accounting. One other thing often overlooked with origin-pull CDNs is whether the traffic between the edge POPs and the customer origin is counted against your total. Some CDNs count it against your bucket, others don’t.
Setup fees. CDNs vary wildly on their setup fees. Some are free, some charge more than $5,000. Make sure you incorporate that cost into your decision.
User interface. All CDNs offer some form of web-based interface, and the quality differs greatly between them. I could swear that some of the interfaces I saw were written in CGI Perl in the late 90s. Other interfaces offered everything a customer could ever want, including detailed analytics and reporting. Key questions to ask are “If I get a bad image out on the edge, how do I purge it?”, and “How do I tell how much bandwidth is being consumed throughout the CDN at any particular point in time?”
We chose to invest in one month’s worth of reporting from CloudHarmony’s CloudReports service. This gave us a quick way to examine CDN performance from the actual end-user browser behind a real cable/DSL/dialup connection (not from a datacenter somewhere). While some might view those reports as expensive, we found it quite a bargain to have another independent view into the performance of a vast majority of CDNs.
Given the above requirements, coupled with the performance data provided by CloudHarmony, we were able to refine our list of CDNs to consider. In alphabetical order:
Akamai is to the CDN market what Bose is to the home audio market. While it’s not inherently a bad product, you’re paying a huge premium for the brand name. While we never got so far as to set up a demo account, the performance data provided by CloudHarmony and other sources didn’t favor them at all. My personal opinion (which is little more than a wild guess) on why Akamai doesn’t perform as well is their product’s age. Their network is by far the largest one out there, and I can only guess that keeping up with the latest optimizations and protocols is a huge undertaking.
When speaking with Akamai, I got the impression that they really don’t care to sell their static object delivery product by itself. Their reps focused mostly on trying to upsell their Dynamic Site Acceleration product. While DSA might indeed be a great product, it wasn’t what we were interested in.
In the end, the best price I could get out of Akamai was more than twice that of the next most expensive CDN in our comparison, and they wanted a 3 year contract at that price. I’m just not that into paying twice as much for an equal product, so they were eliminated. If we should move to a Dynamic Site Acceleration type of service later, Akamai will definitely be re-evaluated at that time.
LimeLight Networks is the 2nd-largest CDN provider, behind Akamai. It’s fitting that they are right behind Akamai, because they came across like a smaller Akamai to me. Their pricing is much more competitive than Akamai’s, and performance appeared to be quite good across the board. They supposedly have a nice web and reporting interface, but I was unable to get a demo set up without filling out paperwork that would have required approval from our legal department. Therein lies the problem with LimeLight – getting them to do anything outside the everyday norm was like pulling teeth. Like Akamai, LimeLight is also focused on the upsell, and seemed to me generally disinterested in selling their static delivery service.
If for some reason we had to switch away from our primary choice, my second choice would likely be LimeLight Networks, but only after obtaining a demo account so that I could verify that their performance was within an acceptable range and evaluate the functionality of their user interface.
I was able to easily procure demo accounts from EdgeCast and CacheFly, so I set up some performance testing of our own using Pingdom to download a typical JPEG image from each Pingdom POP using the origin-pull setup. Note that since Pingdom’s servers are in data centers and not in actual residences, this isn’t a measure of end-to-end performance, but rather a way to compare apples-to-apples response times from various regions around the world. The executive summary here is that while EdgeCast “edged” out CacheFly, the real message is that any CDN is so much better than none at all:
| CDN | US/Non-US | Location | # of Polls | Avg Response Time (ms) | Max Response Time (ms) | StdDev (ms) |
|---|---|---|---|---|---|---|
| CacheFly | Non-US | Amsterdam 2, Netherlands | 289 | 68 | 4202 | 285.98 |
| | | Copenhagen, Denmark | 259 | 158 | 461 | 36.02 |
| | | Frankfurt, Germany | 287 | 41 | 567 | 32.38 |
| | | London 2, UK | 290 | 29 | 2489 | 145.26 |
| | | London, UK | 284 | 29 | 127 | 11.30 |
| | | Madrid, Spain | 259 | 201 | 586 | 31.36 |
| | | Manchester, UK | 281 | 129 | 1709 | 184.87 |
| | | Montreal, Canada | 286 | 105 | 3084 | 250.63 |
| | | Paris, France | 286 | 143 | 521 | 60.11 |
| | | Stockholm, Sweden | 286 | 54 | 882 | 80.88 |
| | | **Non-US Total** | 2807 | 94 | 4202 | 157.88 |
| | US | Atlanta, Georgia | 289 | 16 | 398 | 23.52 |
| | | Chicago, IL | 288 | 56 | 2615 | 158.33 |
| | | Dallas 4, TX | 286 | 40 | 960 | 74.61 |
| | | Dallas 5, TX | 289 | 26 | 1506 | 89.08 |
| | | Dallas 6, TX | 291 | 47 | 1473 | 132.25 |
| | | Denver, CO | 289 | 216 | 925 | 72.18 |
| | | Herndon, VA | 288 | 473 | 3472 | 196.13 |
| | | Houston 3, TX | 289 | 107 | 382 | 18.15 |
| | | Las Vegas, NV | 288 | 74 | 3044 | 180.60 |
| | | Los Angeles, CA | 289 | 12 | 92 | 11.52 |
| | | New York, NY | 289 | 175 | 2571 | 152.29 |
| | | San Francisco, CA | 287 | 28 | 231 | 24.17 |
| | | Seattle, WA | 288 | 174 | 1083 | 108.41 |
| | | Tampa, Florida | 267 | 68 | 3048 | 214.49 |
| | | Washington, DC | 286 | 163 | 1547 | 141.67 |
| | | **US Total** | 4303 | 112 | 3472 | 170.11 |
| | | **CacheFly Total** | 7110 | 105 | 4202 | 165.61 |
| EdgeCast Small | Non-US | Amsterdam 2, Netherlands | 284 | 62 | 381 | 27.49 |
| | | Copenhagen, Denmark | 254 | 126 | 1148 | 87.72 |
| | | Frankfurt, Germany | 284 | 40 | 318 | 19.05 |
| | | London 2, UK | 284 | 26 | 975 | 59.59 |
| | | London, UK | 283 | 23 | 191 | 14.38 |
| | | Madrid, Spain | 252 | 176 | 1174 | 112.31 |
| | | Manchester, UK | 275 | 86 | 1494 | 118.26 |
| | | Montreal, Canada | 283 | 163 | 601 | 59.56 |
| | | Paris, France | 283 | 94 | 1537 | 140.76 |
| | | Stockholm, Sweden | 271 | 162 | 967 | 81.87 |
| | | **Non-US Total** | 2753 | 94 | 1537 | 99.35 |
| | US | Atlanta, Georgia | 284 | 129 | 523 | 34.51 |
| | | Chicago, IL | 284 | 26 | 463 | 35.86 |
| | | Dallas 4, TX | 277 | 30 | 339 | 25.79 |
| | | Dallas 5, TX | 284 | 26 | 581 | 50.32 |
| | | Dallas 6, TX | 284 | 23 | 430 | 33.68 |
| | | Denver, CO | 281 | 244 | 2169 | 150.12 |
| | | Herndon, VA | 280 | 24 | 301 | 20.44 |
| | | Houston 3, TX | 281 | 115 | 441 | 40.02 |
| | | Las Vegas, NV | 281 | 56 | 559 | 34.32 |
| | | Los Angeles, CA | 283 | 11 | 94 | 8.45 |
| | | New York, NY | 284 | 72 | 1134 | 161.16 |
| | | San Francisco, CA | 280 | 23 | 118 | 11.01 |
| | | Seattle, WA | 282 | 131 | 3571 | 333.38 |
| | | Tampa, Florida | 260 | 166 | 4977 | 303.29 |
| | | Washington, DC | 282 | 83 | 686 | 111.97 |
| | | **US Total** | 4207 | 77 | 4977 | 148.63 |
| | | **EdgeCast Small Total** | 6960 | 84 | 4977 | 131.64 |
| Data Center (no CDN) | Non-US | Amsterdam 2, Netherlands | 292 | 837 | 1344 | 35.72 |
| | | Copenhagen, Denmark | 262 | 990 | 4195 | 297.90 |
| | | Frankfurt, Germany | 291 | 867 | 1533 | 57.14 |
| | | London 2, UK | 291 | 725 | 1065 | 25.95 |
| | | London, UK | 290 | 811 | 1114 | 49.50 |
| | | Madrid, Spain | 262 | 1005 | 1765 | 75.84 |
| | | Manchester, UK | 281 | 899 | 8928 | 580.11 |
| | | Montreal, Canada | 291 | 342 | 412 | 11.52 |
| | | Paris, France | 293 | 1128 | 2680 | 230.78 |
| | | Stockholm, Sweden | 292 | 1063 | 4056 | 367.89 |
| | | **Non-US Total** | 2845 | 864 | 8928 | 326.71 |
| | US | Atlanta, Georgia | 291 | 316 | 1017 | 63.48 |
| | | Chicago, IL | 290 | 170 | 253 | 7.02 |
| | | Dallas 4, TX | 292 | 191 | 3214 | 253.67 |
| | | Dallas 5, TX | 292 | 145 | 263 | 14.52 |
| | | Dallas 6, TX | 291 | 147 | 358 | 22.93 |
| | | Denver, CO | 291 | 71 | 272 | 14.63 |
| | | Herndon, VA | 293 | 316 | 487 | 15.43 |
| | | Houston 3, TX | 293 | 177 | 372 | 19.66 |
| | | Las Vegas, NV | 290 | 246 | 3194 | 392.02 |
| | | Los Angeles, CA | 291 | 303 | 1188 | 57.60 |
| | | New York, NY | 290 | 346 | 1120 | 123.55 |
| | | San Francisco, CA | 293 | 229 | 519 | 22.28 |
| | | Seattle, WA | 290 | 489 | 1078 | 170.33 |
| | | Tampa, Florida | 270 | 331 | 4105 | 247.99 |
| | | Washington, DC | 290 | 595 | 1511 | 235.84 |
| | | **US Total** | 4347 | 271 | 4105 | 208.20 |
| | | **Data Center Total** | 7192 | 506 | 8928 | 390.52 |
… and to really drive the point home for the PHBs, we consolidated the data into a very telling graph:
CacheFly is an up-and-comer in the CDN arena. They have very aggressive pricing, and very good performance as well. If the site in question were a popular blog or community website and very price-sensitive, I would select CacheFly as my first-choice CDN. However, where they fall short is in reporting and their web interface. The best way to contact their support department is via email or a web-based form. Their web interface left a huge amount to be desired, and they have very little documentation on how to use it. There is no reporting whatsoever – you get raw log files and have to write your own reporting scripts on top of that data. I couldn’t help but wonder about all the “what ifs”. What if we get an incorrect image cached and need to have it cleared from their network? If we see a DDoS at the CDN, how do we know? These and other similar questions are what ultimately eliminated CacheFly.
In CacheFly’s defense, I was told that they were working on a complete refactor of the user interface and was offered a chance to help beta test it, but I was under time constraints and declined. The issues I had with the UI may no longer be present as of this writing.
Reading this post, it may appear that I used the process of elimination to find the “lesser of all evils”, but understand that’s just the writing style I chose to convey the process. EdgeCast didn’t win by default – they won outright. Here’s why:
Please don’t read this article and walk away saying “Justin recommends EdgeCast, that’s who we’re going with”. For one, if you’re letting my blog posts make business decisions for you instead of doing due diligence, then you’re doing it wrong.
For our very specific needs EdgeCast was the best fit. For your needs, you will very possibly arrive at a completely different decision, and that’s great. By all means, blog about it. What I’m trying to convey is that there are a lot of points of comparison when going through your evaluation, and not all of them are obvious. It’s hard to get an objective point of view when doing this on your own – this is my best attempt at documenting what I came across.
Hopefully if you haven’t implemented a CDN for your busy sites, this post will motivate you to do so. If you’re unhappy with your current CDN, perhaps this post has given you some insight on how to find a replacement. If you’re happy with your current CDN, please leave comments as to why.
Lastly, I was in no way influenced monetarily or otherwise by any vendors, and none of the links in this article contain referral ID’s. This is all my personal opinion and in no way represents the opinion of my employers.
The first step is to install module dependencies. You’ll need Views, Schema, and Table Wizard (tw). You’ll also want to install Migrate, and Migrate Extras if you want to do any work with CCK fields. I must admit that I hadn’t seen Table Wizard before this project, but it will always be present in my dev installs from here on out. If you find yourself using SQLyog, phpMyAdmin, or some other tool to simply look at data in the database, be sure to check it out.
As I mentioned above, we are relying on the Drupal Schema API to make a lot of this easy, so let’s make a custom module that sets up our schemas for us. We’ll call this module my_import. Create a new directory in your modules directory, and name it my_import. First, create my_import.info with this inside:
name = My Import
description = "An import module."
core = 6.x
dependencies[] = migrate
dependencies[] = migrate_extras
dependencies[] = content
dependencies[] = path_redirect
package = Database
Nothing too wild here, just requiring some modules that we’ll be using later. Now, create my_import.install in the same directory with this inside:
<?php
function my_import_schema() {
$schema = array();
$schema['clickability_articles'] = array(
'fields' => array(
'id' => array(
'type' => 'int',
'not null' => TRUE,
'description' => t('Clickability article ID'),
),
'createDate' => array(
'type' => 'datetime',
'not null' => TRUE,
'description' => t('Clickability article creation date.'),
),
'editDate' => array(
'type' => 'datetime',
'not null' => TRUE,
'description' => t('Clickability article edit date'),
),
'title' => array(
'type' => 'text',
'not null' => TRUE,
'description' => t('Clickability article title'),
),
'author' => array(
'type' => 'text',
'not null' => FALSE,
'description' => t('Clickability content author (optional)'),
),
'articleauthor' => array(
'type' => 'text',
'not null' => TRUE,
'description' => t('Clickability article author'),
),
'summary' => array(
'type' => 'text',
'not null' => TRUE,
'description' => t('Clickability article short summary'),
),
'body' => array(
'type' => 'text',
'not null' => TRUE,
'size' => 'big',
'description' => t('Clickability article body'),
),
'placement' => array(
'type' => 'text',
'not null' => FALSE,
'description' => t('Clickability article related article placement lists'),
),
'thumbnail' => array(
'type' => 'text',
'not null' => FALSE,
'description' => t('Clickability article thumbnail'),
),
'image' => array(
'type' => 'text',
// @todo: Some articles do not have an image, but we require Master Image to be set.
'not null' => FALSE,
'description' => t('Clickability article image'),
),
'image2' => array(
'type' => 'text',
'not null' => FALSE,
'description' => t('Clickability article image page 2'),
),
'image3' => array(
'type' => 'text',
'not null' => FALSE,
'description' => t('Clickability article image page 3'),
),
'master_image_byline_title' => array(
'type' => 'text',
'not null' => FALSE,
'description' => t('Clickability master image byline/accreditation'),
),
'tags' => array(
'type' => 'text',
'not null' => FALSE,
'description' => t('Clickability article tags'),
),
'status' => array(
'type' => 'text',
'not null' => TRUE,
'description' => t('Clickability article status'),
),
'websitePlacements' => array(
'type' => 'text',
'not null' => TRUE,
'description' => t('Clickability article website placements'),
),
),
'primary key' => array('id'),
);
return $schema;
}
function my_import_install() {
$ret = drupal_install_schema('my_import');
return $ret;
}
function my_import_uninstall() {
$ret = drupal_uninstall_schema('my_import');
return $ret;
}
When I created the schema, I took care to make sure that the column names in my table exactly matched the attributes and elements I was looking to pull out of the XML file. This saves a lot of coding later. Any time the schema changes, you can create a hook_update_N() function, or just change the schema and disable+uninstall+install the custom module. I did the latter with a drush alias and it worked well. The hook_install() and hook_uninstall() functions simply add and remove the tables.
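For completeness, here’s a minimal sketch of the hook_update_N() route – the ‘subtitle’ column is purely hypothetical:
function my_import_update_6001() {
  $ret = array();
  // Hypothetical example: add a new optional text column to the staging table.
  db_add_field($ret, 'clickability_articles', 'subtitle', array(
    'type' => 'text',
    'not null' => FALSE,
    'description' => t('Clickability article subtitle'),
  ));
  return $ret;
}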
Create the file myimport.php in your module directory, and paste in the following:
#!/usr/bin/php
<?php
// get the path to our XML file
$args = getopt("f:");
// Bootstrap Drupal
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);
// Make sure my_import is enabled
if (!module_exists('my_import')) {
echo "I need the my_import module enabled! Exiting.\n";
exit(1);
}
/*
* Make sure our media directory exists.
* We will import from this directory into whatever directory filefield is configured for
* so we should remove this dir when done with the migration.
*/
$media_dir = file_directory_path() .'/migrated';
echo "Media dir = $media_dir\n";
if (! is_dir($media_dir)) {
mkdir($media_dir);
}
// Slurp in our XML file. If your XML file is huge, watch your PHP memory limits
$xml = simplexml_load_file($args['f']);
echo "XML Loaded.\n";
$rowcount = 0;
// Here we iterate over each child of the root of the XML, which in our case is a Article
foreach ($xml->children() as $content) {
// Setup our $obj object which represents a row in the DB, and use some caching to
// not abuse drupal_get_schema().
$obj = new stdClass;
static $schema = array();
// Dereference our child from the parent $xml, or xpath performance sucks hard
$content = simplexml_load_string($content->asXML());
$table = NULL;
$content_type = NULL;
switch((string) $content['type']) {
// Add more case statements for more content types as needed
case 'Article':
$table = 'clickability_articles';
$content_type = 'article';
break;
// All cases below are silently ignored - we are not importing them
case 'Book Reviews':
case 'Blog Topic':
case 'Event':
case 'Job':
// Ignored
break;
default:
// Any content type not accounted for gets reported
echo "Warning: unknown content of type ". $content['type'] ."\n";
}
if (isset($table)) {
if (! isset($schema[$table])) {
// Get the table schema from Drupal
$schema[$table] = drupal_get_schema($table);
// On first run, truncate the table
$sql = "truncate table {$table}";
db_query($sql);
echo "$table truncated.\n";
}
// This function does the heavy lifting, creating the $obj object from the XML data
$obj = xml2object($content, $schema[$table], $content['type']);
// There are some cases where $obj is intentionally null, only write to the db if not null
if ($obj) {
$ret = drupal_write_record($table, $obj);
if ($ret) {
$rowcount++;
}
}
}
}
echo "Inserted $rowcount rows.\n";
function xml2object($xml, $tableschema, $content_type) {
global $media_dir;
$obj = new stdClass;
// Our main iterator is the column names in the table
foreach (array_keys($tableschema['fields']) as $field) {
switch($field) {
case 'master_image_byline_title':
// This field is populated when we work with the images later on
break;
case 'id':
$obj->$field = $xml[$field];
break;
case 'status':
$obj->$field = (string)$xml->$field;
break;
// A Clickability placement roughly corresponds to a Drupal term
case 'placement':
$element = array_pop($xml->xpath("//field[@name='$field']"));
$obj->$field = (string)$element->row->value;
$obj->$field = map_taxonomy($obj->$field, $content_type);
break;
case 'author':
$element = array_pop($xml->xpath("//field[@name='$field']"));
$obj->$field = (string)$element->value;
break;
case 'image2':
case 'image3':
// Combine image2 and image3 elements in Clickability into our multivalue filefield as csv
if ($content_type == "Article") {
$mediaplacement = array_pop($xml->xpath("//mediaPlacement[@name='$field']"));
// migrate module requires full path to filefield source
$obj->$field = getcwd() .'/'. $media_dir .'/'. (string)$mediaplacement->media->path;
if (substr($obj->$field, -1, 1) == '/') {
$obj->$field = NULL;
}
else {
if (!empty($obj->image)) {
$obj->image .= ",";
}
$obj->image .= $obj->$field;
}
}
break;
case 'thumbnail':
case 'image':
$mediaplacement = array_pop($xml->xpath("//mediaPlacement[@name='$field']"));
// migrate module requires full path to filefield source
$obj->$field = getcwd() .'/'. $media_dir .'/'. (string)$mediaplacement->media->path;
// Check the schema. If the field is required, then fill in a default, otherwise wipe it
$required = $tableschema['fields'][$field]['not null'];
// If the file path ends in a /, then the XML did not have an image for this article
// -- if we require one, make a default
if (substr($obj->$field, -1, 1) == '/') {
if ($required) {
echo "$content_type with ID of ". $obj->id ." does not have a $field. Adding test.gif.\n";
$obj->$field .= "test.gif";
touch($obj->$field);
}
else {
// NOTE: We need this patch for this to work: http://drupal.org/node/780920
$obj->$field = NULL;
}
}
else {
// Transfer the caption on the image in the XML to the CCK byline accreditation
$obj->master_image_byline_title = (string)$mediaplacement->caption;
// See if the file exists on the filesystem
if (! file_exists($obj->$field)) {
// Nope, let's fill it in with our default image
echo $obj->$field ." does not exist, replacing with test.gif.\n";
$obj->$field = preg_replace('#^(.*)/(.*)$#', '\1/test.gif', $obj->$field);
}
// Replace .bmp with .jpg
$jpg = preg_replace('/\.bmp$/', '.jpg', $obj->$field);
if ($jpg != $obj->$field) {
if (file_exists($jpg)) {
$obj->$field = $jpg;
}
else {
// Tell the user what to do to create the image and exit.
echo "ID ". $obj->id ." has a image of type bmp, and no jpg found on the file system.\n";
echo "Create them by running 'mogrify -format jpg /path/to/*.bmp' and re-run this script.\n";
exit(1);
}
}
}
break;
// Any DB column not explicity defined above maps cleanly with the code below
default:
$obj->$field = (string)array_pop($xml->xpath("//field[@name='$field']"));
break;
}
}
// We assume it does not need to be imported until we prove otherwise
$needs_imported = FALSE;
$tags = array();
$websitePlacements = array();
foreach ($xml->xpath("//websitePlacement") as $websitePlacement) {
// Only if the XML says the domain is www.newdomain.com do we need to import
if ($websitePlacement->domain == 'www.newdomain.com') {
$needs_imported = TRUE;
// Grab the section from the websitePlacement
$section = (string)$websitePlacement->section;
// Convert the old "sections" into tag taxonomy (strip the leading slash)
$tags[] = substr($section, 1);
// Build the old URLs and place them on an array
$oldurl = $section .'/'. $obj->id .'.html';
$websitePlacements[] = $oldurl;
// If we do not have a placement yet, we try to set some form of taxonomy
if (! isset($obj->placement)) {
$taxo = map_taxonomy($section, $content_type);
// NOTE: We need this patch for this to work: http://drupal.org/node/780920
$obj->placement = $taxo;
}
// If the XML did not explicity tell us the createDate, we use the start date from the webSitePlacement
if (empty($obj->createDate)) {
$date = (string)$websitePlacement->startDate;
$obj->createDate = substr($date, 0, strlen($date) -4);
$obj->editDate = $obj->createDate;
}
}
}
$obj->websitePlacements = implode(',', $websitePlacements);
$obj->tags = implode(',', $tags);
// Return the object only if we need it imported
return $needs_imported ? $obj : NULL;
}
function map_taxonomy($oldtext, $content_type) {
// Simple maps of Clickability placements to Drupal terms
if ($content_type == 'Job') {
return NULL;
}
if (preg_match('/building/i', $oldtext)) {
return "Green Building";
}
if (preg_match('/(clean|energy)/i', $oldtext)) {
return "Clean Energy";
}
if (preg_match('/financ/i', $oldtext)) {
return "Finance";
}
if (preg_match('/food/i', $oldtext)) {
return "Food & Farms";
}
if (preg_match('/marketing/i', $oldtext)) {
return "Green Marketing";
}
if (preg_match('/recycled/i', $oldtext)) {
return "Recycled Markets";
}
if (preg_match('/technol/i', $oldtext)) {
return "Technology";
}
if (preg_match('/leaders/i', $oldtext)) {
return "Business Leaders";
}
if (preg_match('/transportation/i', $oldtext)) {
return "Transportation";
}
return NULL;
}
?>
Wow, that’s a lot of code. I’ve commented it pretty heavily, but here’s the “40,000 foot view” of what’s going on:
Now, create a file in your module directory named my_import.module. This file contains the actual module used by Drupal and implements some of the migrate module’s hooks. You might ask: why not deal with everything in the command-line script? There are two primary reasons why:
<?php
define('NUM_PARAGRAPHS_PER_PAGE', 6);
function my_import_migrate_prepare_user(&$user, $tblinfo, $row) {
// Randomly assign passwords to users, forcing them to reset their password
$errors = array();
$user['pass'] = preg_replace("/([0-9])/e","chr((\\1+112))",rand(100000,999999));
return $errors;
}
function my_import_migrate_prepare_node(&$node, $tblinfo, $row) {
$errors = array();
// In Clickability, there were multiple states that represented "Published", here we map them.
$status = $tblinfo->view_name .'_status';
switch($row->$status) {
case 'live':
case 'APPROVED':
$node->status = 1;
break;
default:
$node->status = 0;
break;
}
if ($node->type == 'article') {
// Paginate articles by inserting a pagebreak tag every 6th paragraph to emulate Clickability's pagination
$paragraphs = preg_split('#<br />\s*<br />#s', $node->body);
if (count($paragraphs) > NUM_PARAGRAPHS_PER_PAGE) {
$node->body = '';
$i = 1;
foreach ($paragraphs as $paragraph) {
if (($i % NUM_PARAGRAPHS_PER_PAGE) == 0) {
$node->body .= $paragraph . "\n[pagebreak]\n";
}
else {
$node->body .= $paragraph ."<br />\n<br />\n";
}
$i++;
}
}
}
return $errors;
}
function my_import_migrate_complete_node(&$node, $tblinfo, $row) {
$errors = array();
// Create redirects for old URLs
$field = $tblinfo->view_name .'_websitePlacements';
foreach(explode(',', $row->$field) as $oldurl) {
// Delete any old redirects
if (substr($oldurl,0,1) == '/') {
$oldurl = substr($oldurl,1);
}
path_redirect_delete(array('source' => $oldurl));
$redirect = array(
'source' => $oldurl,
'redirect' => '/node/'. $node->nid,
'type' => 301,
);
path_redirect_save($redirect);
}
return $errors;
}
Here’s the high-level breakdown; check the code and comments for the details.
Finally, let’s create some sample data so we can see how this all meshes together. Create the file content.xml in your module directory, and paste this in it:
<?xml version="1.0" encoding="utf-8"?>
<cmPublishImport>
<content type="Article" id="7241321">
<field name="title"><![CDATA[Donec risus purus]]></field>
<field name="author">
<value><![CDATA[Me]]></value>
</field>
<field name="articleauthor"><![CDATA[Me]]></field>
<field name="date"><![CDATA[2007-04-29]]></field>
<field name="summary"><![CDATA[Donec risus purus]]></field>
<field name="body"><![CDATA[Donec risus purus, euismod eu volutpat ac, pharetra non nulla. Vestibulum quis neque lacus. Donec sit amet tortor nisi. Nam et lectus nec turpis consequat rhoncus. Proin porttitor, quam nec faucibus pulvinar, arcu magna facilisis erat, eu imperdiet risus tortor ac quam. Praesent non justo ac nisl ultricies condimentum a eget arcu. Nam in mi est. Donec risus orci, imperdiet ut tempus et, pulvinar nec diam. Donec eleifend pulvinar aliquam. Nulla faucibus turpis nec neque scelerisque convallis. Fusce gravida pulvinar quam, sit amet faucibus risus sodales ornare. Nullam arcu risus, lacinia vel faucibus at, auctor eget diam. Quisque a neque ac tellus bibendum luctus fringilla in lacus. Praesent id nunc eu dolor adipiscing consequat vel eget leo. Donec velit mi, pharetra quis tincidunt id, laoreet et dolor. Vestibulum fringilla rutrum arcu at accumsan.<br/>
<br/>
Cras pellentesque sagittis mi. Pellentesque cursus nisl id nunc suscipit luctus. Duis pellentesque rhoncus sodales. Nullam dictum augue ac diam fermentum vel feugiat mauris euismod. Mauris nec metus eu sem tristique euismod. Etiam lorem est, accumsan vitae bibendum sit amet, tempus sit amet urna. Nullam lobortis adipiscing convallis. Nullam scelerisque sagittis tellus vitae interdum. Integer eget interdum nunc. Nam ligula orci, bibendum ac mattis eget, mollis at massa.<br/>
<br/>
Vestibulum sodales elit vel est feugiat vitae dapibus erat ultricies. Proin auctor quam sit amet nisi aliquet pharetra. Curabitur tristique quam vel tortor gravida scelerisque. Morbi laoreet aliquet mi, sed imperdiet mauris mattis et. Praesent non quam nec lorem dapibus semper. Quisque vulputate neque et turpis placerat bibendum. Phasellus suscipit urna eget augue ullamcorper ultricies. Curabitur hendrerit dui sit amet elit elementum nec venenatis orci tempor. Fusce semper vestibulum odio vitae porta. Mauris non tellus non mi hendrerit suscipit in sed ante. Donec arcu neque, tristique ut elementum sed, suscipit at leo. Curabitur eget enim quis leo scelerisque laoreet et eget augue. Fusce posuere est ac felis fringilla consectetur. Nulla elit magna, pharetra sit amet tincidunt sed, tristique sed mi. Nam iaculis, elit sit amet condimentum blandit, massa neque pharetra justo, non ornare ligula ante non leo. Praesent ullamcorper suscipit tempus. In varius, neque eget volutpat posuere, velit odio luctus turpis, ac varius nulla erat sit amet justo. Quisque convallis mollis pharetra. Aliquam porta dolor quis nunc tempor vitae pharetra lectus ultricies. Fusce egestas sagittis sapien, sit amet pharetra sem ullamcorper a.<br/>
<br/>
Ut dui tortor, porta eu ultrices sed, interdum vitae lectus. Integer facilisis velit sit amet dui ultricies lobortis. Fusce ut malesuada tellus. Aenean in nibh at lorem iaculis dictum vitae in nulla. Etiam dapibus lacinia eleifend. Aliquam erat volutpat. Nullam sit amet sapien ut risus consequat posuere eu quis quam. In lobortis fringilla felis quis pretium. Suspendisse non nisl libero, non tempor justo. Nunc volutpat nulla vitae lacus tincidunt feugiat congue sapien commodo. Suspendisse venenatis aliquet ante in hendrerit. Sed lectus ligula, gravida id tincidunt et, feugiat non justo.<br/>
<br/>
Sed metus tellus, vestibulum in mollis quis, imperdiet et velit. Praesent suscipit elit et mi rutrum sit amet gravida augue iaculis. Etiam nec tellus nec augue porttitor pharetra. Vivamus feugiat mollis est, eu aliquam neque tempus a. Ut magna mauris, sollicitudin in ornare non, lacinia a lacus. Aenean porttitor magna ac sem ornare pellentesque. Aliquam mattis dolor in metus molestie ut feugiat mi auctor. Etiam laoreet pulvinar ipsum id bibendum. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Quisque porttitor convallis lacus, nec pretium leo varius non. Morbi non dapibus diam. Sed nec venenatis diam. Cras mollis porta tempor.<br/>
<br/>
Donec ornare mi sed tellus porta luctus. Nulla euismod venenatis ante, in rhoncus felis ornare non. Cras tempor venenatis est at gravida. Etiam imperdiet dolor vitae ipsum lacinia imperdiet. Maecenas purus lorem, rhoncus non porttitor in, semper nec quam. Integer ullamcorper facilisis ultrices. Vivamus porttitor lacinia augue in venenatis. Quisque interdum euismod tellus, et consectetur nunc dictum sit amet. Maecenas pulvinar placerat mauris, quis auctor purus pellentesque at. Vestibulum vulputate, tellus id eleifend posuere, ligula erat hendrerit orci, nec lobortis tortor sapien ut ligula. Donec id augue leo, non consectetur nisl. In viverra dictum lorem eget blandit. Etiam tempus nisl ac nibh viverra id cursus eros luctus. Duis ut tellus nisi.<br/>
<br/>
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Aliquam quis justo risus, eget eleifend nibh. Morbi quis dolor nulla, sed cursus metus. Vestibulum vel ipsum non erat tincidunt luctus et eget sapien. Nunc vel justo ante, vel auctor purus. Proin vulputate bibendum placerat. Fusce vel tincidunt nunc. Praesent at eros in dolor faucibus blandit et vitae magna. Fusce arcu nisl, sollicitudin sed accumsan sed, rhoncus at tellus. Ut ut mauris vel ipsum bibendum ullamcorper eget sed neque. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Sed dui massa, imperdiet sit amet lacinia id, rutrum sed orci. Proin pharetra risus eu risus gravida convallis id ac mi. Etiam a neque ut lacus convallis accumsan non eget arcu. Sed blandit velit id lectus tincidunt ut aliquet mi egestas. Aliquam cursus odio vitae turpis suscipit mollis et aliquam purus. Suspendisse pretium tincidunt porttitor. Nunc vestibulum, lacus at auctor laoreet, orci lectus volutpat diam, ut mollis risus lectus a ligula.<br/>
<br/>
Phasellus id urna sit amet elit pretium viverra sit amet eget felis. Maecenas sed arcu sed eros fringilla commodo id non magna. Maecenas urna mauris, cursus vel mattis et, volutpat in purus. Nulla sapien orci, faucibus sed tincidunt tincidunt, tristique id lacus. Nullam tortor libero, porttitor eget faucibus eget, vehicula a enim. Suspendisse malesuada consectetur mattis. Integer dapibus dignissim tempor. In viverra luctus orci sed placerat. Suspendisse aliquam mattis diam mattis dapibus. Aenean suscipit purus eu ipsum dignissim in aliquet urna mollis. Duis nibh magna, fringilla eu ultrices posuere, sodales sed felis. Proin varius dignissim sem a consequat. Pellentesque facilisis felis vel mi malesuada placerat. Curabitur gravida euismod mi in molestie. Vivamus sit amet dui leo. Praesent mi justo, bibendum at rutrum ac, bibendum ut felis. Nullam nec dolor dui, quis imperdiet nulla. Morbi semper pulvinar risus.]]></field>
<mediaPlacement name="image">
<media id="577286">
<path>images/1.jpg</path>
<caption><![CDATA[Cows at Three Mile Canyon provide resources such as methane and compost for on-farm operations.]]></caption>
</media>
</mediaPlacement>
<status>live</status>
<websitePlacement>
<domain>www.newdomain.com</domain>
<section>/foodandfarms</section>
<startDate dateFormat="yyyy-MM-dd HH:mm:ss zzz">2007-04-29 14:00:00 PDT</startDate>
</websitePlacement>
<websitePlacement>
<domain>www.olddomain.com</domain>
<section>/greenmarketing</section>
<startDate dateFormat="yyyy-MM-dd HH:mm:ss zzz">2007-04-29 14:00:00 PDT</startDate>
</websitePlacement>
</content>
</cmPublishImport>
Now (finally), it’s time for some action. Enable your newly created my_import module, and jump out to the shell. Assuming your Drupal root is at /var/www/drupal, cd into that directory. Create the new directory sites/default/files/migrated/images, and place a jpg named 1.jpg in that directory.
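Those setup steps as shell commands (the source image path is a placeholder):
cd /var/www/drupal
mkdir -p sites/default/files/migrated/images
cp /path/to/any.jpg sites/default/files/migrated/images/1.jpg
Now run the import script: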
php5 ./sites/all/modules/my_import/myimport.php -f ./sites/all/modules/my_import/content.xml
With luck, the script will succeed, and you will have one row of data in your clickability_articles table! If not, fix the error (if you’re using the sample data, let me know what went wrong and I’ll fix it).
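To confirm the row landed without opening a database client, you can query the table with drush (assuming drush is available, as mentioned earlier – sqlq is shorthand for sql-query):
drush sqlq 'SELECT id, title, status FROM clickability_articles'
Next up, Table Wizard configuration.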
All the hard work is done now – we can use a web UI from here on out. Visit /admin/content/tw in your browser, and under the “Add Existing Tables” fieldset, select the tables you imported with myimport.php. If your tables are huge (50K+ rows), you may want to select “Skip full analysis”. Click the “Add tables” button. At this point, that’s all we need from Table Wizard, but I strongly encourage you to click around a bit. The table analysis can tell you some handy things about your data.
In the previous step, we essentially built a view that we can provide to the Migrate module. Now we need to tell Migrate how to use the view. Visit the Migrate settings at /admin/content/migrate/settings. If you can, implement the changes it recommends to .htaccess, as they will speed up the import considerably. Also, make sure to expand the “Migration support implemented in the XYZ module” fieldsets and enable the support you need for your import. Now, visit the dashboard at /admin/content/migrate. Expand the “Add a content set” fieldset, and fill in the values. When choosing the value for “Source view from which to import content”, scroll down towards the bottom of the list. All Table Wizard views are prefixed with “tw: ”, so the one we’re looking for here is “tw: clickability_articles (clickability_articles)”. You can leave “View arguments” and “Weight” at their defaults. The next screen is where the real magic happens. By interrogating the view, Migrate presents a map-fields form that lets us assign source columns, chosen from dropdowns, to various node elements. If you have a setting that should remain constant across all imported records (“Node: Input format” is usually a good example), you can type in a default value here. The rest should be fairly self-explanatory. Click “Submit changes”, and you’ll be taken back to the dashboard.
Now, the way I did my testing was to choose one row from the source table to import. Grab its primary key and copy it to the clipboard. Check the box under “Import” for our content set, then expand the “Execute” fieldset. Paste the ID into the “Source IDs:” text field, and click the Run button. With any luck, you will be returned to the dashboard, and the content set will show 1 imported. Hopefully there will be no errors, but if there are, find and fix the problem.

You can view the mapping from old primary key to node ID by going back to /admin/content/tw and looking for a view named migrate_map_si_articles. This table is created by the Migrate module – it uses it to track what has been imported and what NID the imported nodes have. Grab that nid, and load up /node/[nid]. If it looks good, then we can do a bigger import. Go back to the Migrate dashboard, and this time click the “Clear” checkbox next to the content set. Expand “Execute”, make sure all fields are blank, and click the Run button. This process will “unimport” the row we just imported.

Now, depending on your row count, you may want to import all rows and see what happens. Since I was dealing with thousands of nodes, I did an import of just 100 nodes to make sure things were okay. To do this, instead of specifying “Source IDs”, place the number 100 in “Sample Size”, and click Run. To import everything, leave all fields blank. The power to quickly and easily remove all changes made by the Migrate module is huge. Because of this “safety net”, you can work on the import within the same development sandbox as your designers and themers. They’ll appreciate having something other than “Lorem Ipsum” to look at!
If I have to explain this to you, you’re in the wrong field of work!
This post is my longest to date, and there’s a good chance I missed some things. By all means, let me know in the comments if you find any holes and I’ll get it corrected. I hope this case study helps some other Drupalers out there - when I first started this project I couldn’t find any examples on how to get XML into Drupal using the Migrate module. Now Google has some spider food :)
For those who didn’t know, archive.org has made the sessions available for download, so be sure to check those out. Read on for my “takeaways” from DCSF2010.
Please note that these are just what come to my mind, I’m sure I’m forgetting huge topics. Please forgive me in advance for those!
Larry Garfield works for Palantir.net, and is one of the few people I’ve listened to who is immensely intelligent, yet speaks well and can even make a crowd genuinely laugh out loud. I attended his “Objectifying PHP” and “Views for Developers” sessions, and left feeling motivated and enlightened. My thanks go out to him, as he very obviously put a lot of preparation time into his presentations.
As evidenced by Larry Garfield’s “Objectifying PHP” and John VanDyk’s “Batch vs Queue” sessions, Drupal is refactoring portions of core into classes and methods where it fits. I’m part of the camp that welcomes the change and can’t wait. I can’t help but wonder if we’ll alienate some contrib module authors in the process, but I’m sure it will bring the overall quality of contrib modules up a few notches.
I’ve been in IT/networking/programming/etc. for about 20 years now. While I don’t claim to be the smartest person in the group at any point in time, I consider myself pretty well-rounded. It’s been a long time since someone was able to talk so far over my head that I couldn’t keep up, but David Strauss of Four Kitchens did just that at the Chapter Three open house party. We discussed HipHop PHP, operating systems, configuration management, and god knows what else. I must have looked like a deer in headlights!
I can say this because David Strauss is the one working on it. Enough said.
Instead of trying to compete with Drupal, they’re finally trying to help Drupal. I’m a hardcore anything-but-Microsoft OS kind of guy, but I can’t dispute that there are a lot of shops out there that already have well-versed SQL Server and IIS admins. Microsoft announced that they now have a native SQL Server driver for PHP, and that Drupal can now run on MS SQL Server. This will be a huge boon for getting Drupal into Microsoft-centric enterprises – there’s no longer a need to have a MySQL guy. Oh, and giving away free alcohol never hurt either :)
Chx gave an excellent presentation – “MongoDB: Humongous Drupal”. He covered a lot about SQL, and how over the years it’s become “best practice” to de-normalize tables to improve performance. We’ve all done that, but have you ever pondered that doing so breaks one of the fundamental rules of relational databases? While MongoDB is perfectly suited for logging and caching in Drupal, the biggest win is with Fields in Drupal 7. Each field you create results in a new table that must be added to a JOIN when building a node. Shops with a lot of fields on their nodes will likely see huge performance gains by moving those tables to MongoDB.
Hey, did you hear that Drupal powers whitehouse.gov? Seriously, there’s been a lot of progress in the past year with regards to making Drupal scale. Project Mercury from the great folks at Chapter Three makes Big Drupal easy, and is now supported on Amazon’s EC2, Rackspace, and Linode.
Two out of the three people I knew coming into DCSF currently work for Chapter Three, and the third person used to work for them. Special thanks to Greg Coit and Kevin Montgomery for taking me under their wing and introducing me to all their colleagues. I also had the pleasure to meet Josh Koenig, albeit briefly. Seems the partner/CTO of one of the leading Drupal shops is a little busy at a DrupalCon. I ended up meeting a few other guys that I clicked really well with and hope to keep tabs on: Jeff Graham of FunnyMonkey, Rob Wohleb of Xomba.com, and Aaron Levy of Chapter Three - thanks for the beer and discussion!
The migration to Git can’t happen fast enough for me. Aside from the ability to commit code on a plane, contrib modules will benefit greatly. When all is said and done, every new issue on drupal.org will have its own repository that any user will be able to commit to. Once the issue is resolved, the fix will be merged back into the main module repo. That should break down even more barriers for new contrib authors getting into Drupal development.
Yes, Dries is very tall - at least 6'4" if I had to guess, but this is actually just a way for me to remind you that I shook Dries' hand :) I was more than a little starstruck!
Overall, I had a blast, and can’t wait for the next DrupalCon in the states. I heard it’s in Chicago – count me in! If you ever get the chance, I absolutely recommend that you attend.
… a drop-in replacement for your Drupal website hosting service that delivers break-through performance. Mercury can serve two-hundred times more pages per second and generate pages three times faster than standard hosting services.
Mercury achieves this by using open-source technologies like so many ingredients of a complex dish – a little Varnish here, a dash of Memcached there, a hint of the Alternative PHP Cache, a healthy dose of Tomcat and Solr, all based upon the Pressflow distribution of Drupal. None of it is anything you couldn’t do yourself – many before Chapter Three had done it, actually. However, they were the first to tie it all together using BCFG2, and release an Amazon EC2 AMI image of it. As word spread, many liked the idea of Mercury, but wanted to brew their own non-EC2 instance. While they posted a wiki article on how to do it yourself, they went to work on native support for RackSpace.

When I read Josh Koenig’s post on the Linode blog stating he wanted to bring Mercury to Linode, I made a mental note. Some time passed, I became much more involved in Drupal, and I decided to volunteer to write the StackScript. Josh said okay, put me in touch with Greg Coit, their resident sysadmin, and we went to work. Fast forward a couple weeks, and we’ve announced a beta! The StackScript is quite complete – it supports Ubuntu Jaunty and Karmic, and can use the current stable branch or the soon-to-be-released 1.1 development branch. Once Lucid is released, we’ll test to make sure it works there as well.

I want to thank Greg for all his help. We found some bugs in Ubuntu, some quirks in the memcached init script, and fixed many bugs and added some features to their BCFG2 bazaar repo. Thanks also go out to Josh for his oversight and guidance. It was a great time, a great learning experience, and I came out of it with some new colleagues (and some free beers at DrupalConSF). Feel free to read up on my experiences with Linode, and if you like what you see, click on one of the many links to Linode from my blog. If you sign up and stay a customer for 90 days (trust me, you will), I’ll get $20 credited to my account. Feel free to comment below about the StackScript and let me know about any issues you might find.
If you don’t get any traffic after a few seconds, go hit your /cron.php page - this should generate some traffic for you to see. Here we can see that our host is making a bunch of outbound requests to master.drupal.org. This is because the “Update Status” module is checking to see what upgrades are available for us. What if you see traffic and don’t know what module is causing it? grep to the rescue! In order to find out which module was making the calls to TinyURL, we ran the following command:
grep -R 'tinyurl.com' /path/to/drupal/sites/all/modules/*
This returned one hit, from /path/to/drupal/sites/all/modules/service_links/service_links.module. By disabling the short links feature within the module we decreased page load time by 7 seconds!
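If you don’t already have a favorite tool for watching that outbound traffic in the first place, plain old tcpdump does the job. A minimal sketch, assuming your public interface is eth0 (run it as root):
# Show outbound HTTP connections as they're opened (SYN packets only)
tcpdump -nn -i eth0 'tcp dst port 80 and tcp[tcpflags] & tcp-syn != 0'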
Finnix is the distro of Linode’s choice for ‘rescue’ operations on your server. Think of it as a Swiss Army knife - it’s a very powerful tool that takes very little setup. For more on Finnix, check out the Linode Library article. First, we need to set up a very small, 20MB ext3 disk that will house our installation kernel and initrd. Set up another ext3 disk of 100MB to be mounted at /boot for PV-GRUB. Finally, set up the raw disk that will be used for the OS installation. Since we’ll be using LVM, you can easily add to and resize your disk later, so don’t overdo it. I went with 5GB for my root disk. If you’re following along, here’s what you should have:

Now, set up a Finnix configuration profile. Click on “Create a new configuration profile”, and type “Finnix Rescue” for the label. For the kernel, select “Recovery - Finnix (kernel)”. For /dev/xvda, select the “Recovery - Finnix (iso)”. For /dev/xvdb, select the “CentOS 5.4 Install Disk”. For the uncompressed initrd image, select “Recovery - Finnix (initrd)”. Leave the other settings at defaults, and save the profile. Here’s what it should look like:
Next, boot Finnix from your Linode control panel. Click on the console tab, and launch the AJAX console. Once at the console, mount the install disk, change into it, and fetch the Xen-enabled kernel and initrd from your favorite mirror:
mount /dev/xvdb /mnt/xvdb
cd /mnt/xvdb
for f in initrd.img vmlinuz; do
wget http://mirror.unl.edu/centos/5.4/os/i386/images/xen/${f}
done
cd
umount /mnt/xvdb
Now, shut down Finnix from the Linode control panel.
Create a new profile, and name it CentOS 5.4. For the kernel, select “pv-grub-x86_32”. For /dev/xvda, select “CentOS 5.4 PV-GRUB Boot”. For /dev/xvdb, choose “CentOS 5.4 OS Disk”. For /dev/xvdh, select our “CentOS 5.4 Install Disk”. Point the root device to a custom location: “/dev/mapper/VolGroup00-LogVol00”. Leave the rest as defaults; here’s a screenshot:
Save the profile, and boot it. Note that it won’t boot automatically; we have to point GRUB in the right direction first. You’ll be greeted by a scary-looking ‘grubdom>’ prompt. Now, we need to tell GRUB to boot our install kernel and initrd:
root (hd2)
kernel (hd2)/vmlinuz
initrd (hd2)/initrd.img
boot
Note that if you want to do a kickstart install, you would append ks=http://my.com/this.ks to the kernel line above. More on this later. Once the kernel loads, you’ll be presented with the familiar anaconda text-based installer. Choose your language, and your installation type. I prefer HTTP from a mirror. If you choose to do the same, use the mirror hostname for the Web site name, and the path to the directory that contains all the release notes - usually it’s /centos/5.4/os/i386/. Anaconda will fetch the stage2 image, then launch the installer.

Here’s where it gets cool - it will give you the choice to “Start VNC”. If you choose this option, you can connect to your Linode via VNC (note it launches on display 1, not 0) and complete the installation via a GUI. Install as you would any other CentOS installation. Make note of where your root partition is. The installer may complain about /dev/xvdh being a loop device; tell anaconda to ignore it. Exclude /dev/xvda from any partitioning - we’ll set that up manually later.
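As an aside, connecting to that VNC session is a one-liner with any VNC client from your workstation; a sketch using vncviewer (substitute your own Linode’s IP address, and note the :1 for display 1):
vncviewer 123.45.67.89:1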
Once you click the “Reboot” button on the installer, you’ll be disconnected from VNC. Your machine will be restarted, but it will stick at the grubdom> prompt one more time - that’s okay. Use it to tell GRUB to boot CentOS using the boot partition the installer set up for us:
grubdom> root (hd1,0)
grubdom> kernel (hd1,0)/vmlinuz-2.6.18-164.el5xen
grubdom> initrd (hd1,0)/initrd-2.6.18-164.el5xen.img
grubdom> boot
You will then boot into CentOS - exit the system settings GUI; you can run it again later by running system-config-securitylevel-tui. Now we need to set up our boot disk so that PV-GRUB knows how to boot our kernel and we aren’t prompted at every reboot. Linode uses PV-GRUB to boot our kernel, and it looks for a ‘boot’ directory directly on /dev/xvda. For more details, see the Linode Wiki. Run this as root, and make sure your block devices match mine before copy/pasting:
mkdir /mnt/newboot
mkfs.ext3 /dev/xvda
mount /dev/xvda /mnt/newboot
rsync -av /boot /mnt/newboot
cd /mnt/newboot/boot
ln -s . boot
cd
umount /mnt/newboot
umount /boot
e2label /dev/xvdb1 oldboot
e2label /dev/xvda /boot
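The two e2label commands are what make the switch stick: the stock CentOS fstab mounts /boot by label rather than by device, so relabeling /dev/xvda as ‘/boot’ points the existing fstab entry at our new partition without editing the file. The relevant line in /etc/fstab (the CentOS 5 default) looks like this:
LABEL=/boot    /boot    ext3    defaults    1 2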
That’s it - reboot, and you should be up and running! You can create an LVM physical volume out of the old boot partition on /dev/xvdb1, or just leave it around unused.
We can use kickstart to really streamline the process. Follow steps 1, 2, and 3 above, but on step 4, replace the kernel line with this:
kernel (hd2)/vmlinuz ks=http://www.sysadminsjourney.com/assets/files/linode-minimal.ks
That’s it! The kickstart file handles partitioning and setting up the right boot partition, as well as disabling services that you don’t need on a Linode. Make sure you look through the file at http://www.sysadminsjourney.com/assets/files/linode-minimal.ks - the root password it sets is right there in plain text, so change yours immediately after the install! Stay tuned for my CentOS StackScripts!
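If you’d rather roll your own kickstart than trust mine, the partitioning section is the interesting part. Here’s a minimal sketch of directives consistent with the layout above - illustrative only, not a copy of the actual file:
# Sketch only: wipe the OS disk, then build /boot plus an LVM root to match
# the /dev/mapper/VolGroup00-LogVol00 root device we pointed PV-GRUB at earlier
clearpart --all --initlabel --drives=xvdb
part /boot --fstype ext3 --size=100 --ondisk=xvdb
part pv.01 --size=1 --grow --ondisk=xvdb
volgroup VolGroup00 pv.01
logvol / --vgname=VolGroup00 --name=LogVol00 --fstype ext3 --size=1 --grow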
VBO was attractive enough that we decided to offload the bulk/batch operations of Node Gallery to VBO. Integration was, for the most part, surprisingly easy - VBO “speaks” in Drupal Actions, so by writing actions, we were writing VBO integration.
There’s one undocumented use case for VBO that was critical for us. Most VBO actions you will find perform one action on a set of nodes, one at a time. Often, that one action sets a value of some sort on said nodes. In the case of Node Gallery, we wanted to be able to assign different weight values (used for sorting) to a bunch of nodes. The key here is that we aren’t assigning a weight of ‘2’ to every selected node; we want to assign a weight of 2 to node #1, 3 to node #2, 8 to node #3, and so on. While not straightforward, it’s definitely achievable.
The general idea is to have VBO display a list of nodes to the admin. The admin places a checkmark next to the nodes whose weight he wishes to change, selects “Change the image’s weight” from the action dropdown, and clicks submit. We then draw a form that includes some summary information about the nodes and a select box with each node’s current weight. The admin sets the weight he wants for each node, then clicks submit. VBO takes over from there, assigning each node the proper weight. Let’s get into the code - first we implement hook_action_info(), telling Drupal that we have actions to provide:
<?php
/**
* Implementation of hook_action_info().
*/
function node_gallery_action_info() {
return array(
'node_gallery_change_image_weight_action' => array(
'description' => t('Change image weight (sort order)'),
'type' => 'node',
'behavior' => array('changes_node_property'),
'configurable' => TRUE,
'hooks' => array(
'node' => array('presave'),
),
),
);
}
The only real items of note in the hook above are setting ‘configurable’ to TRUE, and setting ‘behavior’ to ‘changes_node_property’. Setting ‘configurable’ allows us to display a custom form, and setting the behavior tells VBO that we’ll be modifying the node. In turn, VBO will call node_save() on each node after it’s been processed. Next, we define our configurable action’s form function:
<?php
function node_gallery_change_image_weight_action_form($context = array()) {
//We're being called from VBO - we can do extra validation
if ($context['view']->plugin_name == 'bulk') {
//@todo: Add imagefield support in our sort form, and theme it with draggable items
$sql = "SELECT n.nid, n.title, ngi.weight FROM {node} n " .
"INNER JOIN {node_gallery_images} ngi ON n.nid = ngi.nid " .
"WHERE n.nid IN (". db_placeholders($context['selection']) .")";
$result = db_query($sql,$context['selection']);
$delta = count($context['selection']) > 20 ? intval(count($context['selection'])/2) : 10;
$form['node_gallery_change_image_weight_action']['#tree'] = TRUE;
while ($image = db_fetch_object($result)) {
$form['node_gallery_change_image_weight_action'][$image->nid]['title'] = array(
'#type' => 'item',
'#value' => $image->title,
);
$form['node_gallery_change_image_weight_action'][$image->nid]['weight'] = array(
'#type' => 'weight',
'#title' => t('Weight'),
'#default_value' => $image->weight,
'#delta' => $delta,
);
}
}
//We're called from a standard advanced action where we assign one weight to all nodes
else {
$form['node_gallery_change_image_weight_action'] = array(
'#type' => 'weight',
'#title' => t('Weight'),
    '#description' => t('When listing images in a gallery, heavier items will sink and the lighter items will be positioned near the top'),
'#delta' => 10,
);
if (isset($context['imageweight'])) {
$form['node_gallery_change_image_weight_action']['#default_value'] = $context['imageweight'];
}
}
return $form;
}
To define your form function, simply append ‘_form’ to your action name and you have the function name. Nothing too wild and crazy in the form function above, but there are two key points: setting ‘#tree’ to TRUE keeps the submitted values nested per nid instead of flattened, and checking $context['view']->plugin_name tells us whether we’re being invoked from VBO (one weight per node) or from a standard advanced action (one weight for everything).
Next, we define our submit function (you can define a validate function if needed). Our submit function will pull the important data from the submitted form and assemble it into a concise array that our action can use. Here’s our submit function:
<?php
function node_gallery_change_image_weight_action_submit($form, $form_state) {
//We're setting all nodes to the same weight
if (is_numeric($form_state['values']['node_gallery_change_image_weight_action'])) {
$weight = $form_state['values']['node_gallery_change_image_weight_action'];
}
//VBO is passing us a set of nids
else {
foreach ($form_state['values']['node_gallery_change_image_weight_action'] as $nid => $val) {
$weight[$nid] = $val['weight'];
}
}
return array('imageweight' => $weight);
}
The key here is that if we are passed in the “single value” form, we stick the value into the variable $weight as a simple scalar. If we are passed in form data from the VBO “multi value” form, then $weight becomes an associative array where the key is the nid, and the value is the weight for that node.
Finally, we define our action function. Our action is pretty simple, because it will only be called with one node and one value. This is a key thing to remember when writing code for VBO - even though you are working with batches of nodes, VBO is essentially one big loop around the actions - it executes the action once for each node. So, in our action, we simply check whether the value of the $context['imageweight'] index that we passed from our submit function is an integer or an array, and perform the correct operation on the node to assign it its new weight. Once this function returns, VBO will call node_save() for us.
<?php
function node_gallery_change_image_weight_action(&$node, $context = array()) {
if (in_array($node->type, (array)node_gallery_get_types('image'))) {
//All nodes are set to the same weight
if (is_numeric($context['imageweight'])) {
$node->weight = $context['imageweight'];
}
//VBO is sending us a list of nodes to modify with different weights
else {
$node->weight = $context['imageweight'][$node->nid];
}
}
}
While not always obvious, there aren’t too many bulk-operation scenarios that VBO can’t handle. Hats off to infojunkie for writing such a helpful module that’s also easy to integrate with!
[error] (13)Permission denied: proxy: AJP: attempt to connect to 10.x.x.x:7009 (virtualhost.virtualdomain.com) failed
I thought for sure it was proxy permissions, but nothing I did fixed the issue. Then it hit me: SELinux! Why I always think of SELinux last when it’s responsible for 90% of my problems, I’ll never know. SELinux on RHEL/CentOS by default ships so that httpd processes cannot initiate outbound connections, which is just what mod_proxy attempts to do. If this is your problem, you’ll see something like this in /var/log/audit/audit.log:
type=AVC msg=audit(1265039669.305:14): avc: denied { name_connect } for pid=4343 comm="httpd" dest=7009
scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:port_t:s0 tclass=tcp_socket
To fix this, first test by setting the boolean dynamically (not permanent yet):
/usr/sbin/setsebool httpd_can_network_connect 1
If that works, you can set it so that the default policy is changed and this setting will persist across reboots:
/usr/sbin/setsebool -P httpd_can_network_connect 1
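You can confirm the boolean’s state before and after (and catch any typo in its name) with getsebool:
/usr/sbin/getsebool httpd_can_network_connect
# expected output: httpd_can_network_connect --> on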
Hope this saves others some time!