Wednesday, November 21, 2012

The Software Bathtub Curve


(Disclaimer: please do not use software in the bathtub.)

Professor Hirsch speaks frequently about the bathtub curve observed in semiconductor failure rates. A microchip (unlike, say, a pair of scissors) doesn't follow a simple pattern of gradually wearing out over time until it fails. Instead, the observed likelihood of failure falls into three distinct phases: first, a brief initial period when failure rates are high, mostly caused by manufacturing defects. Next comes a long, steady period when failure rates are very low. Eventually, though, the physical device starts to break down, and failures increase again. The shape of the graph looks like a bathtub.
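If you prefer numbers to prose, a bathtub-shaped failure rate is often approximated as the sum of three hazards: a decreasing one (infant mortality), a constant one (random failures), and an increasing one (wear-out). Here's a toy sketch of that idea -- my own illustration, with made-up parameters, not anything from Professor Hirsch:

```python
# Toy model of a bathtub-shaped failure rate: a decreasing Weibull hazard
# (infant mortality) + a constant hazard (random failures) + an increasing
# Weibull hazard (wear-out). All parameters are invented for illustration.

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) of a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    infant_mortality = weibull_hazard(t, shape=0.5, scale=1.0)   # falls over time
    random_failures = 0.02                                       # flat floor
    wear_out = weibull_hazard(t, shape=5.0, scale=10.0)          # rises late in life
    return infant_mortality + random_failures + wear_out

if __name__ == "__main__":
    for t in (0.1, 0.5, 1, 2, 5, 8, 10, 12):
        print(f"t = {t:>4}: failure rate = {bathtub_hazard(t):.3f}")
```

Run it and the printed failure rate drops, flattens out, then climbs again: high walls at both ends, a long flat bottom in between.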

[Figure: Semiconductor failure rates over time -- the bathtub curve]

A case can be made that software follows a bathtub curve, too, though here the measure isn't failure rate but user satisfaction (always!), so the curve is flipped upside down. In the initial phase (beta and early release), bugs abound, performance is slow, and the application lacks refinement, having not yet benefited from extensive user feedback. Satisfaction is low but increasing. Then, after a few releases of polish, users settle into a comfortable working relationship with the app, and satisfaction stays consistently high. Eventually, though, evolving user needs or a changing technology environment mean the software no longer does the job, and it must be EOL'd or (preferably) upgraded. Differences in methodology will affect the scale of this curve, both in time and in application scope: in an agile process, the cycle runs faster and across smaller slices of functionality. But the pattern holds nonetheless: a (hopefully) rapid resolution of initial shortcomings, followed by a comparatively longer period of stability, and finally a decline.

[Figure: Software user satisfaction over time]

It's important to remember this when rolling out new software features: you may be tempted to make a big splash by announcing a new release loudly, and to all of the most important users (the ones who have been demanding those features most adamantly). Unfortunately, that creates the biggest exposure right when the software is at its most vulnerable. A better approach is a gradual one: let the new functionality "burn in" under safe conditions, with sympathetic users and in non-mission-critical situations. When problems are found, they can be addressed calmly and efficiently, and as the software matures, the user base is allowed to grow. It's true that this approach lacks zazz, and may therefore pose challenges for marketing and sales objectives (so compromises must be made). But over the long haul, I've found that a cautious approach to deployment creates higher user satisfaction overall -- and that drives customer loyalty, which is good for everyone.
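One simple way to implement that kind of gradual exposure is a percentage-based feature gate. Here's a rough sketch -- the feature name, user ID, and helper function are just for illustration, not anything we actually shipped:

```python
# Sketch of a percentage-based rollout gate. Deterministically bucket each
# user into 0-99 based on a hash of (feature, user id), then compare the
# bucket to the current rollout percentage.
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket per user, per feature
    return bucket < percent

# Start with a small, sympathetic audience, then widen as the feature burns in:
# e.g. 5% in week one, 25% a few weeks later, 100% once it's boring.
ROLLOUT_PERCENT = 5

if in_rollout("user-1234", "new-checkout-flow", ROLLOUT_PERCENT):
    print("show the new feature")
else:
    print("show the existing behavior")
```

The deterministic hash keeps each user's experience stable from visit to visit, which matters when you're asking those early users to report problems.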


Wednesday, November 14, 2012

A CDN of one

On theatermania, we recently had a good-but-scary thing happen: a particular piece of content "went viral." (Is that still a thing?) It involved politics, sexuality, and musical theater, so the blogosphere lit up like crazy. Traffic shot up to 5-10x normal, with all of the new traffic hitting that one page. Now, we've got some nice caching mechanisms in place, and most assets are served by a CDN, but we're still working toward infinite* scalability. For now, all page requests involve at least one database round-trip, and the db was getting hammered with increasing intensity.

So I had a bit of an idea. There was only the one page lighting up like that, so we loaded it in a browser, copied the HTML source, and saved it to a file. Then we uploaded that file to a directory of static content and added a single matching pattern -- that page's precise URL -- to the .htaccess file. Apache short-circuited those requests straight to the saved source and bypassed all dynamic processing. The requests were still hitting the app servers, but they required hardly any effort to fulfill. Most of the dynamic content on that page is loaded via third-party Ajax calls (e.g. Disqus), so it kept updating in real time; anything rendered server-side could wait until we manually refreshed the saved page source.
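The rule itself was just an exact-match rewrite; something along these lines, where the URL and snapshot filename are made up for illustration:

```apache
# Hypothetical rule in the spirit of the one described above: an exact-match
# rewrite sends requests for the hot page straight to the saved HTML snapshot,
# skipping all dynamic processing.
RewriteEngine On
RewriteRule ^news/some-viral-story\.html$ /static/some-viral-story.html [L]
```

Because the pattern matches only that one URL, everything else on the site keeps flowing through the normal dynamic pipeline.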

It's not a scalable strategy for scalability: too much human intervention, and it only works if you catch a spike on the upswing. It's a quick-and-dirty thing, and not a reason to slow down at all on implementing a true scaling mechanism. But I'm going to keep it filed away in my bag of tricks just in case.