Lessons from last night's Speed of Light Outage

You hopefully didn't notice, but the site took a pretty serious hit late last night and was down from about 11:30PM until about 4PM today. At various points during the outage, some articles were available, some were duplicated, and all of them were out of order. This also affected the Atom feed for subscribers. It's all back to normal now. Here's what happened.

Around 11:15PM last night, I had just finished writing a rebuttal to Matt Neuburg's post “Lion is a Quitter”. The editing happens in a homemade app called Editor. I saved my article and began the publishing process. Something on the server crashed (an uncaught error) and the article didn't make it into the publishing system.

As you may recall, in an earlier article I described the basic layout of Colophon, the publishing engine I wrote, which drives this site (among others):

Articles are written in plain text files, formatted as markdown, along with a second JSON file, which describes some metadata about the article (title, creation and modification dates, etc.). These files combined are the raw source to all the articles of my website.

Instead of storing articles directly in a master database, they're stored in normal text files, which can be edited from any text editor. To publish them, I add them to my website's git repository and push the changes to my server. This triggers a migration script.

The script reads through all the changed files, then adds them to a database. The database is what's queried when you request a page. The database is more or less transient, in the sense I don't really care if it gets screwed up, because I can just wipe it and remigrate all the articles again with no harm done. This has happened countless times before.
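To make the "wipe it and remigrate" idea concrete, here is a minimal sketch in Python. It is not Colophon's actual code: the `.md`/`.json` file pairing, the `articles` table, the `title` and `created` metadata keys, and the use of SQLite are all assumptions made for illustration.

```python
import json
import sqlite3
from pathlib import Path


def remigrate(source_dir: str, db_path: str) -> int:
    """Wipe the articles table and rebuild it from the files on disk.

    Returns the number of articles migrated. The files are the source of
    truth; the database is disposable and rebuilt from scratch each time.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Throw away whatever was there before: the database is transient.
        conn.execute("DROP TABLE IF EXISTS articles")
        conn.execute(
            "CREATE TABLE articles ("
            "slug TEXT PRIMARY KEY, title TEXT, created TEXT, body TEXT)"
        )
        count = 0
        for text_path in sorted(Path(source_dir).glob("*.md")):
            meta_path = text_path.with_suffix(".json")
            if not meta_path.exists():
                continue  # no metadata file, so skip this article
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
            conn.execute(
                "INSERT INTO articles (slug, title, created, body) "
                "VALUES (?, ?, ?, ?)",
                (
                    text_path.stem,  # hypothetical: file name doubles as slug
                    meta.get("title", ""),
                    meta.get("created", ""),
                    text_path.read_text(encoding="utf-8"),
                ),
            )
            count += 1
        conn.commit()
        return count
    finally:
        conn.close()
```

Because the table is dropped and rebuilt on every run, calling `remigrate` twice in a row leaves the database in the same state as calling it once.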

Up to this point, none of this was without precedent. I re-migrated all my articles, and things should have been fine. But it turns out there was a second, more nefarious problem I wasn't aware of: my host had updated some of the software Colophon depends on, introducing a bug in the date parsing code. (The dates in my metadata files can be natural language. If I wanted, I could use a date such as “yesterday at noon” and it would parse correctly. Obviously, that would be silly because “yesterday” is a moving target. But natural-language parsing makes for more readable dates.) This is what caused the articles to start appearing out of order.
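One way to catch a dependency update that silently changes date parsing is to check the parser against a few known dates before migrating anything. The sketch below is an assumption-laden illustration, not what Colophon does: the fixture strings are made up, and `stand_in_parse` is a hypothetical placeholder for the real natural-language parser, which isn't shown here.

```python
from datetime import datetime
from typing import Callable

# Illustrative fixtures: inputs paired with the values they must parse to.
KNOWN_DATES = {
    "2011-08-01 12:00": datetime(2011, 8, 1, 12, 0),
    "2011-12-31 23:59": datetime(2011, 12, 31, 23, 59),
}


def parser_is_sane(parse: Callable[[str], datetime]) -> bool:
    """Return True only if `parse` gives the expected result for every fixture."""
    for text, expected in KNOWN_DATES.items():
        try:
            if parse(text) != expected:
                return False
        except Exception:
            # A parser that raises on a known-good input is also broken.
            return False
    return True


def stand_in_parse(text: str) -> datetime:
    # Hypothetical stand-in for the real parser; it only handles one format.
    return datetime.strptime(text, "%Y-%m-%d %H:%M")
```

A migration script could call `parser_is_sane(...)` first and refuse to touch the database if it returns `False`, which turns a confusing ordering bug into an immediate, obvious failure.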

The articles appeared multiple times because it took me a while to track this down, and all the while things were getting duplicated because my script wasn't ready to handle that case.
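The exact mechanism behind the duplicates isn't spelled out above, but a common guard against this kind of problem is to key each article on a stable identifier and overwrite instead of append. Here is a minimal sketch, assuming SQLite and a hypothetical `slug` column as the key; the table layout and function names are illustrative only.

```python
import sqlite3


def ensure_schema(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) articles table if it doesn't exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "slug TEXT PRIMARY KEY, title TEXT, created TEXT, body TEXT)"
    )


def upsert_article(
    conn: sqlite3.Connection, slug: str, title: str, created: str, body: str
) -> None:
    """Insert the article, or replace the existing row with the same slug."""
    # The PRIMARY KEY on slug makes a second insert of the same article
    # replace the old row instead of adding a duplicate.
    conn.execute(
        "INSERT OR REPLACE INTO articles (slug, title, created, body) "
        "VALUES (?, ?, ?, ?)",
        (slug, title, created, body),
    )
    conn.commit()
```

With this shape, migrating the same article any number of times leaves exactly one row for it, so a half-finished or repeated migration can't multiply articles.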

The bigger problem

The real issue here isn't so much the nature of Colophon, or my host changing things underneath me. It's the lack of good tools.

Having rolled my own publishing software means I have to roll my own tools, too. It's not a perfect solution, but it's one I enjoy. I wouldn't recommend it to many other people, though.

The tool I currently have is old and doesn't entirely work on Lion. It now also has to support another website, which it isn't cut out to do. I'll report back when I've fixed this problem.

Speed of Light