Web Scale


Development at Streamy is always done with the mindset of “will it scale” in the back of our minds.

Generally speaking, scalability deals with the ability for a software system to handle increasing load when given additional resources.  Increased load could mean more concurrency, a larger data set, or increased complexity.  Additional resources refers to hardware and either scaling vertically (upgrading a single node) or scaling horizontally (adding additional nodes).  While vertical scale is important in terms of squeezing all the performance you can out of each node, and rapidly dropping hardware prices means even cheap nodes are powerful, the key to achieving true scalability is the ability to horizontally scale, or distribute.

Distributed systems are becoming more and more mainstream as the web has flourished.  Search engines index billions of web pages and social networks support millions of concurrent users.  Content continues to evolve from being static, to dynamic, to the current emphasis on personalization and customization.  The adoption of AJAX and COMET have further increased requirements for concurrency.  And all of this must be highly-available and low-latency (sub-second).

This is what I call Web Scale.

So what exactly is it?

It’s a new set of technical requirements borne out of Web 2.0, and the move towards distributed systems and cloud computing as the solution.  It encompasses all the different aspects of today’s web applications: the data, the storage, the caches, the web servers, the communications, the realtime queries, the batch queries, and everything in between.  It marks a departure from the vertical scaling of relational databases and web servers as the solution to scale towards a new world of horizontal scalability: distributed hash tables, consistent hashing, column-orientation, horizontal partitioning, eventual consistency, elastic computing, and every other buzzword you can think of.

Who has solved it?

Google. They deserve a great deal of credit for being the first to really achieve web scale.  Long before web sites really considered their architecture as part of their competitive advantage, Google embraced the notion and invented their own solutions.  They subsequently published a number of papers describing their efforts:  The Google File System, MapReduce, and BigTable.

Amazon.  The enormous amount of work that was done in order to achieve scalability for the world’s premier e-commerce site is obvious; one need only look at their extensive and fast-growing elastic storage and computing services like S3 and EC2.  An e-commerce site becoming a service provider?  There’s only so much cost-savings a good architecture can give an e-commerce site, so it only makes sense that they try to profit from their proprietary systems.  The CTO of Amazon, Werner Vogels, has an excellent blog AllThingsDistributed.  Read through some of his posts and you can quickly see that Amazon is a company that understands scaling, distribution, and the right way to go about design and engineering in today’s Web Scale world.

Who struggles with it?

Facebook.   Today’s prime example for what happens when you don’t build for scale early on: the only short-term solution to rapid growth is to throw money at the problem.  Utilizing both horizontal and vertical scale, Facebook has enormous clusters of MySQL and Memcached to deal with storing user data and serving user queries.  As reported nearly a year ago, they already had close to 2,000 MySQL boxes and 1,000 Memcached boxes in addition to their 10,000 web servers.

They have been playing catch-up ever since, slowly developing (and open-sourcing in some cases) distributed systems to deal with their enormous scale.  Though almost always plagued by the lack of community and direct support from Facebook engineers, they have some very interesting projects including Cassandra and Hive.  More recently, they seem to finally have solved their photo storage cost issues with Haystack, saving them from needing to buy an additional $2M+ server every month just to keep up.  The difference in architectures is well described in Niall Kennedy’s post.

Twitter.  Considering how well known the fail whale is, it’s clear to most that Twitter has been plagued by slow responses and downtime, even to this day.  Though extremely simple in its requirements, the personalized views, emphasis on search, and massive use of the API by developers creates huge amounts of load for the microblogging (or is it nanoblogging?) service.

FriendFeed.  A company led by ex-Googlers, and to my knowledge without major technical issues, seems to be going down a path of scaling that seems clunky and backwards.  Generally speaking, scaling a database means letting go of some of the traditional restrictions, first things like normalization and secondary indexes, and then more significantly by relaxing ACID-compliance or adding eventual consistency.  As outlined in a blog post by one of their founders, Bret Taylor, their attempt at a schema-less storage system atop MySQL seems to be a good idea and good effort gone wrong.

When what you’re after is schema-less storage, and the need for partitioning/distribution, why would you base it on a system completely tied to schemas and full-blown transactions?  You’re bringing with you all the things you don’t want, at the expense of performance and flexibility, because they “trust” MySQL and are already familiar with it.  Good reasons, no doubt, but the whole thing appears misguided.  In any case, I bet that it works and performance is acceptable.  But it’s not always about finding a solution that works.  Flexibility, simplicity, and of course, additional scalability, are also important and something so confusing to do something so simple just ask for you to not want to touch it once it works :)

Note:  This is not to say that these companies are “doing it wrong”.  I point out these examples because they are cases of costs gone out of control, continued performance and uptime issues, approaches I personally would not recommend, etc.  MySQL + Memcached is certainly part of the Web Scale tool box and in a great deal of use cases is satisfactory.  For more information on relational databases and how they compare to something like HBase, check out my presentation on Hadoop and HBase vs RDBMS.

So, how does Streamy solve it?

Stay tuned!  This will be the topic of an upcoming series of posts over the next few weeks.

, , , , , , , , , , , , ,

  1. #1 by Lee F at April 23rd, 2009

    Good write up. I am looking forward to details on how you achieve web scale for Streamy.

    Your categorization these companies seems right on to me except in the case of FriendFeed. Your analysis seems to rely solely on your belief that using any SQL based product will be slower than one that is not SQL. This is a myth I have run into where I work (one of the companies you have mentioned) and has led to the the use of a lot of BDB based systems that aren’t really that performant.

    The main thing people do not seem to know is that with many SQL products (mysql, oracle, sqlite, etc) you can turn off just about all the ACID and transactional properties, create a simple key/blob table, and you have an extremely fast datastore. For example, with mysql you can have the transaction log flush to disk about every second instead of with every user commit. This is definitely not a transactional system anymore because if your box crashes you may lose a second or so of write data. However because writes are now batched by mysql instead of doing disk I/O for every single write, this can take write performance from a few hundred writes/second to several thousand.

    In my team we use a distributed storage solution backed by a mysql/innodb and configured with the majority of the ACID/transactional stuff turned off. We dump the output of our Hadoop jobs with millions of rows into it for the rest of our production systems to use. It is extremely fast, reliable, and is just about the only production system our company uses that can handle the write load (Most datastores seem to be all about read speed at the cost of write speed).

    I encourage you to test this out for yourself or at least think again about the SQL means slow and clunky paradigm.

  2. #2 by Jonathan Gray at April 24th, 2009

    @Lee Thanks for reading the article.

    As for FriendFeed, it’s not so much that I think anything SQL is slow or clunky. This is certainly not the case. Though never big on MySQL (in previous years it was seriously lacking in extensibility and had corruption issues), I’m a huge proponent of PostgreSQL and many times steer people away from more complicated approaches to stick with a RDBMS.

    In this case, and I probably didn’t point this out clearly enough above, the issue is the complexity of what they’re building on top of it. From what I gather, it’s not a simple task to add new nodes to their distributed system. And they actually introduce weak consistency into the system. They went out to build something to distribute, but what they ended up with does not seem all that easy. Rather than build up software around MySQL, my approach would be to look for something that does less and maybe handles the distribution for me, or does less but has strong consistency, etc…

    Their approach is an interesting one, and like I said, I’m sure that it performs as well as they need it to. I’m really not questioning the performance of it, since it’s being sharded, they should be able to get all the performance they need with sufficient equipment. But is it extensible, can it scale linearly by adding nodes and without touching code, is this in addition to a normal SQL database, how does it handle node failures, is there rebalancing, are there additional caching layers outside this, etc…

    Again, my distaste for their approach is not a distaste for SQL, but rather based on my own preferences and personal “best practices”. RDBMS are wonderful, but when I think scale, distribution, key-value, etc., I do not think RDBMS.

  3. #3 by Jonathan Gray at April 24th, 2009

    One thing I did not touch on here that I plan to write about in the future is how your data storage and query system affects your business and products.

    An RDBMS that does not need to support much load is like an infinitely extensible system that will do anything you tell it. A scaled RDBMS is a much different beast. Changes to schema make a huge difference. Adding a feature might just mean one more index on this table, but if that table has billions of rows, or is very write heavy, well then maybe it’s a no-go.

    I like complexity to live in my application and business logic where I can be as efficient as possible and make use of domain-specific optimizations. The querying of my data should be straightforward and I should be able to store whatever I want.

    As I said, a system like FriendFeed’s seems like one of those “don’t shake the beast” systems that are very delicate once you have them up and running.

    That being said, I think it’s a neat approach, they are certainly a group of very smart guys and I’m sure it’s working great for them; it’s just not the approach I would take.

  4. #4 by Lee F at April 26th, 2009

    Hi Jonathan,

    I see. So its the complexity of mysql not any assumed slowness? There is certainly many complexities to mysq, no argument there. But the nice part is many of those complexities are optional. The FriendFeed solution is basically using a single table with key/value columns and an index. No schemas, no joins, no foreign keys, etc. This dramatically reduces the complexity you have to deal with. Combine this with fast performance (innodb) and an excellent concurrent client connection library (mysql) and mysql starts to look like an excellent building block for a distributed system. Also, most companies consider mysql a dependable, ‘production’ system whereas anything new you create needs to be tested by fire. As you probably know, a production ready technology is a huge value that is usually overlooked by people not use to running large 24×7 services.

    For FriendFeed, schemas are handled at the application/library layer here and adding/removing indexes is of course quite easy as that is one of the main driving forces behind this system. I am not sure where you see the fragility as I didn’t see any mentioned of how they handle replication and consistency between mirrors, or how they handle new nodes…?
    They did add inconsistency to the system by choosing to keep the alternate indexes on different boxes than the primary index. There are good reasons I can see for doing this but is unrelated to their use of mysql. They could have chosen to keep primary and alternate indexes on the same box for the same entity… but again that’s a different issue (and interesting in itself).

  5. #5 by Jonathan Gray at April 27th, 2009

    I’ll have to go back to my original disclaimer. I commend them for understanding their requirements and doing the work required to custom-engineer a solution.

    Their approach is not one I recommend; it’s not something that I would do.

    There are a number of reasons but at the core, the problems MySQL is good at solving are not being used and there’s complexity at the wrong layer.

    Yes, you can strip down MySQL and have a very good storage engine and client library. There are MANY good storage engines and client libraries that, rather than having disabled relational/transactional features, actually have other features like automated distribution, fault-tolerance, parallel processing, only on-disk, only in-memory, mix of both… whatever you might be after.

    And I generally want to abstract distribution from the application layer. I have not re-read the post in some time, but it seems they were using some kind of hashing, perhaps consistent hashing. I really don’t want that kind of logic near my app.

    I’m hoping that developers stop only thinking of their data in terms of SQL schemas and queries. Though useful, it can be very helpful to instead think in data structures. If you are really optimizing, you care about physical representation of your data and want control at that level.

    I’d rather build-on-top than strip-away.

(will not be published)