Wednesday, November 21, 2012

The Software Bathtub Curve


(Disclaimer: please do not use software in the bathtub.)

Professor Hirsch speaks frequently about the bathtub curve observed in semiconductor failure rates. A microchip (unlike, say, a pair of scissors) doesn't follow a simple pattern of gradually wearing out over time until it fails. Instead, the observed likelihood of failure follows three distinct phases: first, a brief initial period when failure rates are high, mostly due to manufacturing defects; then a long, steady period when failure rates are very low; and then, eventually, the physical media starts to break down and failures climb again. Plotted over time, the curve looks like a bathtub.

[Figure: Semiconductor failure rates]

A case can be made that software follows a bathtub curve, too. However, rather than failure rate, quality is measured in terms of user satisfaction (always!). In the initial phase (beta and early release), bugs abound, performance is slow, and the application lacks refinement, having not yet benefited from extensive user feedback. Satisfaction is low but increasing. Then, after a few releases to polish it up, users settle into a comfortable working relationship with the app, and satisfaction is consistently high. Eventually, though, evolving user needs or a changing technology environment mean the software no longer does the job and must be EOL'd or (preferably) upgraded. Differences in methodology will affect the scale of this curve, both in time and in application scope: in an agile process, the cycle runs faster and across smaller slices of functionality. But the pattern holds nonetheless: a (hopefully) rapid resolution of initial shortcomings, followed by a comparatively longer period of stability.

[Figure: Software user satisfaction]

It's important to remember this when rolling out new software features: you may be tempted to make a big splash by announcing a new release loudly and to all the most important users (the ones who have been demanding those features most adamantly). Unfortunately, that creates the biggest exposure right when the software is at its most vulnerable. Better to take a gradual approach, allowing the new functionality to "burn in" under safe conditions, with sympathetic users and in non-mission-critical situations. When problems are found, they can be addressed calmly and efficiently. As the software matures, the user base is allowed to grow. It's true that this approach lacks zazz, and may therefore pose challenges for marketing and sales objectives (so compromises must be made). But over the long haul, I've found that a cautious approach to deployment creates higher user satisfaction overall -- and that drives customer loyalty, which is good for everyone.


Wednesday, November 14, 2012

A CDN of one

On TheaterMania, we recently had a good-but-scary thing happen: a particular piece of content "went viral." (Is that still a thing?) It involved politics, sexuality, and musical theater, so the blogosphere lit up like crazy. Traffic shot up to 5-10x normal, with all of the new traffic hitting that one page. Now, we've got some nice caching mechanisms in place, and most assets are served by a CDN, but we're still working toward infinite* scalability. For now, every page request involves at least one database round-trip, and the db was getting hammered with increasing intensity.

So I had a bit of an idea. There was only the one page lighting up like that, so we loaded it in a browser, copied the HTML source, and saved it to a file. Then we uploaded that file to a directory of static content, and added a single matching pattern -- that page's precise URL -- to the .htaccess file. Apache short-circuited those requests right to the saved source and bypassed all dynamic processing. The requests were still all hitting the app servers, but they required hardly any effort to fulfill. Most of the dynamic content is loaded via third-party Ajax calls (e.g. disqus), so it was still updating in real time, and for anything that was processed server-side, it was okay if it only got updated when we manually refreshed the page source.
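
For posterity, the .htaccess change was nothing fancy. Here's a minimal sketch of the idea using mod_rewrite -- the real rule, URL, and filename were of course different, so treat these as made up:

    # Serve the saved page source for the one hot URL, bypassing dynamic processing.
    # (Hypothetical path and filename, for illustration only.)
    RewriteEngine On
    RewriteRule ^news/that-viral-story/?$ /static/that-viral-story.html [L]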

It's not a scalable strategy for scalability: too much human intervention, and it only works if you catch a spike on the upswing. It's a quick-and-dirty thing, and not a reason to slow down at all on implementing a true scaling mechanism. But I'm going to keep it filed away in my bag of tricks just in case.

Tuesday, October 23, 2012

Proposing a new HTTP request header: max_timeout

I'm certainly not going to confess that my sites sometimes have technical problems. But a... friend... of mine tells me that servers can get bolloxed up. With a site like OvationTix (er, I mean: OshmationShmix), the most likely source of a site-wide problem is congestion in the database. In such instances, the site will not be completely inaccessible, but it will be seriously slow. Users will say "the site's down," while engineers will say, "it's not down, exactly..." Situations like this can be frustrating, and they become especially problematic when dealing with SLAs -- if pages that usually load in 2 seconds are instead loading in 2 minutes, is that "downtime," requiring remediation according to the SLA? I've seen some contracts stipulating that all pages must respond in under 3 seconds. That's good, but still too coarse-grained: what about a page (like a big report) that typically takes 20 seconds -- and that users expect to take a long time?

For this and other situations, I'm proposing a new HTTP request header: max_timeout (in milliseconds). Once the timeout is reached, the browser will treat the request as failed and inform the user accordingly. For users, this resolves ambiguity: currently, if a page loads more slowly than they're used to, they don't know whether it's likely to load in another second or two or whether it's stalled forever. For servers, the timeout can be used to abort processing of requests that exceed the threshold. (If the user isn't waiting for the response, there's no reason to keep executing the request.) Servers could also use this value to prioritize incoming requests: those that expect a subsecond response go to the top of the pile, and those that say "hey, no rush" can be delayed.
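
To make the server side concrete, here's a minimal sketch -- in Python/WSGI, purely illustrative, since the header is only a proposal -- of how a server might read max_timeout and expose a deadline to downstream code:

    # Illustrative only: "max_timeout" is the header proposed above, not a standard,
    # and 'request.deadline' is just a key name invented for this sketch.
    import time

    class MaxTimeoutMiddleware(object):
        """Read the proposed max_timeout header and expose a deadline to the app."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            # A request header "max_timeout: 5000" shows up in WSGI as HTTP_MAX_TIMEOUT.
            raw = environ.get('HTTP_MAX_TIMEOUT')
            deadline = None
            if raw is not None:
                try:
                    deadline = time.time() + int(raw) / 1000.0
                except ValueError:
                    pass  # malformed header; ignore it
            # Long-running handlers can check this deadline periodically and abort
            # expensive work once the user has already given up.
            environ['request.deadline'] = deadline
            return self.app(environ, start_response)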

How would the timeout be specified? We need to allow different requests to specify different timeouts. For the first request to a site, the timeout could be delivered via DNS (not saying this is easy to implement, but...): a response from a DNS server would contain the destination IP as well as the recommended timeout for the browser to send in the request header. For subsequent pages, a parameter on an <a> tag or Ajax call would instruct the browser on how to construct the request header.
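
For illustration, the markup side might look something like this -- the attribute name is invented for the sketch, and none of this is implemented anywhere:

    <!-- Hypothetical: the page author hints that this report is allowed to take 20 seconds -->
    <a href="/reports/year-end" data-max-timeout="20000">Year-end report</a>

    <!-- ...which the browser would translate into a request header: -->
    GET /reports/year-end HTTP/1.1
    max_timeout: 20000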

If a critical mass of sites/browsers/servers implement this request header, users will come to trust that the Internet won't leave them hanging -- if a server is taking longer to respond than it's supposed to, we'll time out the response so they can go about their business. Or, we could offer a prompt: "Your request has been cancelled because the website did not respond in a timely manner. Would you like to try again with no time constraint? [Yes|No]" Either way, having the confidence of a time limit would discourage users from abandoning early, or hitting "refresh" because the site feels slow. To that end, browsers could even refuse to re-issue a command until the timeout was reached. (Hang on, we're working on it...)

Wednesday, September 26, 2012

Client-side includes

Here's a future tech to keep an eye on: the "seamless" attribute for iframes. (WHATWG spec) It basically acts as a client-side include for HTML (and so, really, isn't much like a frame at all). So, just as we've always done server-side, we can now ("now" meaning "lord knows when"... browser support, ugh) break a page into fragments, serve each one from a different URI, and let the browser recombine them. This gives us great control over caching and distribution. We could, for example, use one frame for the navigation, served from a CDN and refreshed only every 24 hours; one for the article content, cached for one hour (in case of updates); and one for the social media stuff, never cached. Unlike with regular iframes, all of this can happen within one DOM space, sharing styles, scripts, etc.
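
As a sketch, the markup might look like this (the fragment URLs are invented), with each fragment's own Cache-Control headers deciding how long the CDN and browser hold onto it:

    <!-- Hypothetical fragment URLs; each one sets its own caching headers -->
    <iframe seamless src="/fragments/nav.html"></iframe>           <!-- cache for 24 hours -->
    <iframe seamless src="/fragments/article-1234.html"></iframe>  <!-- cache for 1 hour -->
    <iframe seamless src="/fragments/social.html"></iframe>        <!-- never cached -->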

Of course, much of the same can be achieved with javascript templating and AJAX fetches. But that's not always appropriate, and it always adds a layer of complexity that might be overkill. This is a conceptually simpler approach, and I like having it as an option. Unfortunately, we don't really have it as an option yet. There's no browser support; it's not even mentioned on caniuse yet. So, future-Dave: let's watch this one as it emerges. It looks like a good trick for scalability on sites where the content on a page has a mix of freshness requirements. (And isn't that pretty much every site?)

Saturday, September 8, 2012

Defaulting on performance

In creating a highly performant Web environment, I've frequently been hampered by the very conservative default limits that exist in many technologies. In fact, I'd say that at least 4 out of every 5 performance bottlenecks I've encountered have come not from the system actually being overwhelmed, but from a self-imposed limit. Systems have numerous built-in constraints on concurrency, memory usage, and more that prevent them from using all of their available resources and realizing their true potential. Servers, you need a life coach. I'm here for you.

Of course, these limits were placed there by very wise people with very good reasons. For example, on *NIX operating systems, the designers were considering the needs of a multi-user situation: in a university CS lab where dozens of naive and/or mischievous undergrads are sharing clock cycles of a single CPU, it's important that no one user is allowed to chew up all the resources and bring down the system. But a Web server typically runs just one user, and has much more capacity than it is configured to use by default. When you're chasing down an international superspy in your Bugatti Veyron (who hasn't been there?), you've gotta pop out the electronic speed limiter and go for it. Likewise, when your NCIS slash fiction goes viral (who hasn't been there?), you need to goose the config and let those servers fly.

So, here is an incomplete list of some of the configuration-imposed constraints I've encountered over the years that are typically easy to relieve -- if you do it before the crowds arrive (a few sample settings follow the list):
  • Linux: iptables' ip_conntrack default database size is too small [see racker hacker]
  • MySQL: max_connections default is low [see electric toolbox]
  • Apache: prefork module is single-threaded [see serverfault]
  • MySQL: query cache is disabled by default [it's not always a good idea to turn it on, but when it's good, it can be very good. see mysqlperformanceblog]
  • Tomcat: the JDBC default connection pool size is small [see tomcat 6 docs, but also, tomcat 7's new hotness]
  • Java: default JVM memory settings don't take advantage of available memory, and default garbage collection can create long "stop-the-world" pauses [this is a deep topic, but here's an intro]
  • Linux: default max open files is too low [see stackoverflow]
  • MySQL: back_log default is low [see MySQL documentation]
  • MySQL: innodb_thread_concurrency default is low [see stackexchange -- though be aware, setting it to "0" (infinite) might be too much]
  • Linux: net.core.somaxconn, net.core.netdev_max_backlog are too low [see nginx.org]
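
To make a few of these concrete, here are the kinds of settings involved. The values are illustrative, not recommendations -- tune them to your own hardware and workload:

    # /etc/sysctl.conf (Linux) -- illustrative values
    net.core.somaxconn = 1024
    net.core.netdev_max_backlog = 2048
    fs.file-max = 100000

    # /etc/security/limits.conf -- raise per-process open file limits
    *    soft    nofile    65535
    *    hard    nofile    65535

    # my.cnf (MySQL) -- illustrative values
    max_connections = 500
    back_log = 128
    innodb_thread_concurrency = 16

    # JVM flags (e.g. in Tomcat's setenv.sh) -- illustrative values
    JAVA_OPTS="-Xms2g -Xmx2g -XX:+UseConcMarkSweepGC"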

Tuesday, August 28, 2012

Who is your first customer, and why?

More than once recently, I've found myself giving advice in the form of this question from a favorite professor: "Who is your first customer, and why?" This was something he posed to the class repeatedly, in the context of creating a go-to-market strategy for a startup. But these days -- or, heck, fifteen years ago -- what's good for a startup is also good for a person, and for any project you might take on.

So, it's a valuable question to ask when planning out a project: who is your first [ customer | user | reader | fraggle ], and why? It's natural and healthy to dream big and aim for a million users. But until the case is made to your investor (and yourself) that customer #1 is ready and willing, it's hard to believe that the throngs will follow.

Wednesday, August 1, 2012

Share the load

For load-testing OvationTix, we've tried a few approaches over the years. The first time around, we used HP LoadRunner, which is an enterprise-level tool with a price to match. It was pretty easy to use, and we got the data we needed, but it was too expensive to become a part of our ongoing development process. Ideally, we'll load-test every release before deploying, and I don't want cost concerns to intimidate us into holding back from deploying good code when it's ready to go.

So we moved to jmeter, running in Amazon EC2 cloud instances, which of course was cheaper. I set up some (admittedly clunky) Windows instances -- a controller and some generators -- and went to work. Again, we got the data we needed, but this time the workflow was cumbersome. We had to launch the generators, hope they booted correctly, figure out their IPs, copy those back to the controller, and then fire up the scripts; and even then, we had problems with the test data saturating the connection between the generators and the controller. It was fair, but not great.

For this year, we made it our goal to have a smoothly automated system -- still based around jmeter, which we like. First, we tried BlazeMeter. It's a jmeter PaaS, which is a really cool idea: it promises to take care of the infrastructure so we can focus on writing the tests. It's not bad at all, and I think we may use it in the future, but for now the costs were higher than we wanted, there were too many limitations on usage (the price tiers control things like ramp-up time, max users, etc.), and the reporting wasn't as transparent as we wanted.

Finally, we found jmeter-ec2, which is a wrapper around Amazon's API that automates launching Linux micro instances, deploying resources to those instances, firing up the test, and aggregating results. It's a lightweight script that runs in a shell and eliminates the need for a dedicated controller -- instead, each generator controls its own virtual users, and the condensed results are sent back to the shell, which makes for much less traffic between the instances (and therefore no saturation). The data collected isn't as deep as with the other approaches, but for our purposes, that's okay. We're mostly interested in simply finding out how many users we can throw at the site before it crashes. Since our plan is to take over the world, our target for concurrent users is currently 7,057,131,972. Wish us luck.

Friday, July 6, 2012

Cache Cachet

One of our goals for TheaterMania is to achieve infinite* scalability. I would like to feel deeply confident that we could handle as much load as could possibly be thrown at us, because we have infinite* scalability. Why the asterisk? Because I'm only really looking to scale reads, and only reads of non-personalized data. There are, of course, ways to scale out writes and personalized reads (e.g. for logged-in users), but the nature of the application is that those are much less essential -- and besides, that would be an isolated project, so let's do first things first.

So then, infinite* scalability: of course, it's about caching. The approach we've decided on is to render complete HTML pages and store them on a CDN. Any personalization can happen via AJAX calls; as long as those calls fail gracefully, the server handling dynamic content can crash and the core content of the site stays live, served by the CDN. For a lot of static content, we use Amazon S3 as a sort of cheapo CDN, but it's not really designed to serve massively parallel requests (I'm not sure what would happen if we tried), and it won't request content updates automatically from an origin server. Fortunately, true CDNs abound, and our plan is to leverage one. The next step is to comparison-shop CloudFront, CloudFlare, and ??? (Akamai?). I'm hoping that since our needs are relatively modest -- we don't need ultra-low latency or global edge servers -- we can find one that fits our budget.
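
Mechanically, most of this comes down to response headers: the origin marks fully rendered pages as publicly cacheable and lets the CDN do the rest. Something like this (values illustrative):

    Cache-Control: public, s-maxage=3600, max-age=300

Here the CDN would keep the page for an hour and browsers for five minutes, while the personalized bits arrive separately via AJAX and are never cached.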

Our challenge then will be to make sure we really understand the cache-manipulation API. As Gautam told me, "when you cache complete pages, you have to be sure you have a very reliable cache-busting mechanism." Wise.
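
In that spirit, whatever CDN we pick needs a programmatic invalidation hook we can call whenever a page's content changes. A rough sketch, assuming we ended up on CloudFront and Python's boto library -- the distribution ID and path are placeholders, and the exact API call is worth double-checking against the boto docs:

    # Rough sketch: bust the CDN's cached copy of a page when its content changes.
    from boto.cloudfront import CloudFrontConnection

    def bust_page_cache(path):
        conn = CloudFrontConnection('ACCESS_KEY_ID', 'SECRET_ACCESS_KEY')
        # Ask CloudFront to drop its cached copy; the next request re-fetches
        # the freshly rendered page from the origin.
        conn.create_invalidation_request('DISTRIBUTION_ID', [path])

    bust_page_cache('/shows/some-show-page')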

Monday, May 21, 2012

301 means 301

Ever since I took control of my own DNS, I've been doing a lot of redirecting: bouncing people around to temporary sites, or adding special subdomains. (My host, dnsmadeeasy.com, has a feature called "HTTP redirection records" that lets me serve the redirect straight from DNS, which is convenient.)

One mistake I've made a few times, though, is using a 301 (Moved Permanently) when I should have used a 302 ("Found," a/k/a Moved Temporarily). The problem is that because 301s are permanent, browsers are allowed to cache them. That means once you establish a 301, it can be very hard to undo, because it may already be cached in users' browsers. 302s, meanwhile, are loose and flexible: the browser will re-request the original URI every time, and if the redirect has been removed or changed, the browser will pick that up.

So use 301s with care. Start with a 302, and make sure it works -- and make sure you really really want this to be permanent -- before locking it down as a 301.
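
If you're serving the redirect from Apache rather than from a DNS-level record, the difference is a single word -- but the behavioral difference is huge (paths and target URLs are placeholders):

    # Temporary (302): browsers re-check on every visit, so it's easy to change later.
    Redirect temp /box-office http://example.com/temporary-landing-page

    # Permanent (301): browsers may cache it indefinitely -- be sure before committing.
    Redirect permanent /box-office http://example.com/new-home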

Tuesday, May 15, 2012

Dave's Simple Rules

  1. Engage the user. Respect the user. Create an environment for collaborative discovery.
  2. Complexity ≠ Sophistication. Seek elegance.
  3. Find a question. Find the answer. Share the answer. Find another question.
  4. Be excellent today.