<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: Web Scale</title>
	<atom:link href="http://devblog.streamy.com/2009/04/14/web-scale/feed/" rel="self" type="application/rss+xml" />
	<link>http://devblog.streamy.com/2009/04/14/web-scale/</link>
	<description></description>
	<pubDate>Fri, 12 Mar 2010 04:08:16 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Insights on Parallel and Distributed Systems &#171; Metaphysical Developer</title>
		<link>http://devblog.streamy.com/2009/04/14/web-scale/comment-page-1/#comment-194</link>
		<dc:creator>Insights on Parallel and Distributed Systems &#171; Metaphysical Developer</dc:creator>
		<pubDate>Tue, 01 Sep 2009 01:12:15 +0000</pubDate>
		<guid isPermaLink="false">http://devblog.streamy.com/?p=49#comment-194</guid>
		<description>[...] 31, 2009 in Uncategorized    At least two trends are making parallel and distributed programming come to focus: computers with multiple cores [...]</description>
		<content:encoded><![CDATA[<p>[...] 31, 2009 in Uncategorized    At least two trends are making parallel and distributed programming come to focus: computers with multiple cores [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Gray</title>
		<link>http://devblog.streamy.com/2009/04/14/web-scale/comment-page-1/#comment-18</link>
		<dc:creator>Jonathan Gray</dc:creator>
		<pubDate>Mon, 27 Apr 2009 16:20:40 +0000</pubDate>
		<guid isPermaLink="false">http://devblog.streamy.com/?p=49#comment-18</guid>
		<description>I'll have to go back to my original disclaimer.  I commend them for understanding their requirements and doing the work required to custom-engineer a solution.

Their approach is not one I recommend; it's not something that I would do.  

There are a number of reasons, but at the core, the things MySQL is good at are not being used and there's complexity at the wrong layer.

Yes, you can strip down MySQL and have a very good storage engine and client library.  There are MANY good storage engines and client libraries that, rather than having disabled relational/transactional features, actually have other features like automated distribution, fault-tolerance, parallel processing, only on-disk, only in-memory, mix of both... whatever you might be after.

And I generally want to abstract distribution from the application layer.  I have not re-read the post in some time, but it seems they were using some kind of hashing, perhaps consistent hashing.  I really don't want that kind of logic near my app.

I'm hoping that developers stop thinking of their data only in terms of SQL schemas and queries.  Though that view is useful, it can be very helpful to think in data structures instead.  If you are really optimizing, you care about the physical representation of your data and want control at that level.

I'd rather build-on-top than strip-away.</description>
		<content:encoded><![CDATA[<p>I&#8217;ll have to go back to my original disclaimer.  I commend them for understanding their requirements and doing the work required to custom-engineer a solution.</p>
<p>Their approach is not one I recommend; it&#8217;s not something that I would do.  </p>
<p>There are a number of reasons, but at the core, the things MySQL is good at are not being used and there&#8217;s complexity at the wrong layer.</p>
<p>Yes, you can strip down MySQL and have a very good storage engine and client library.  There are MANY good storage engines and client libraries that, rather than having disabled relational/transactional features, actually have other features like automated distribution, fault-tolerance, parallel processing, only on-disk, only in-memory, mix of both&#8230; whatever you might be after.</p>
<p>And I generally want to abstract distribution from the application layer.  I have not re-read the post in some time, but it seems they were using some kind of hashing, perhaps consistent hashing.  I really don&#8217;t want that kind of logic near my app.</p>
<p>I&#8217;m hoping that developers stop thinking of their data only in terms of SQL schemas and queries.  Though that view is useful, it can be very helpful to think in data structures instead.  If you are really optimizing, you care about the physical representation of your data and want control at that level.</p>
<p>I&#8217;d rather build-on-top than strip-away.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lee F</title>
		<link>http://devblog.streamy.com/2009/04/14/web-scale/comment-page-1/#comment-15</link>
		<dc:creator>Lee F</dc:creator>
		<pubDate>Mon, 27 Apr 2009 05:51:42 +0000</pubDate>
		<guid isPermaLink="false">http://devblog.streamy.com/?p=49#comment-15</guid>
		<description>Hi Jonathan,

I see.  So it's the complexity of mysql, not any assumed slowness?  There are certainly many complexities to mysql, no argument there.  But the nice part is many of those complexities are optional.  The FriendFeed solution is basically using a single table with key/value columns and an index.  No schemas, no joins, no foreign keys, etc.  This dramatically reduces the complexity you have to deal with. Combine this with fast performance (innodb) and an excellent concurrent client connection library (mysql) and mysql starts to look like an excellent building block for a distributed system.  Also, most companies consider mysql a dependable, 'production' system whereas anything new you create needs to be tested by fire.  As you probably know, a production-ready technology is a huge value that is usually overlooked by people not used to running large 24x7 services.


For FriendFeed, schemas are handled at the application/library layer here and adding/removing indexes is of course quite easy as that is one of the main driving forces behind this system.  I am not sure where you see the fragility as I didn't see any mention of how they handle replication and consistency between mirrors, or how they handle new nodes...?  
They did add inconsistency to the system by choosing to keep the alternate indexes on different boxes than the primary index.  I can see good reasons for doing this, but that is unrelated to their use of mysql.  They could have chosen to keep primary and alternate indexes on the same box for the same entity... but again that's a different issue (and interesting in itself).</description>
		<content:encoded><![CDATA[<p>Hi Jonathan,</p>
<p>I see.  So it&#8217;s the complexity of mysql, not any assumed slowness?  There are certainly many complexities to mysql, no argument there.  But the nice part is many of those complexities are optional.  The FriendFeed solution is basically using a single table with key/value columns and an index.  No schemas, no joins, no foreign keys, etc.  This dramatically reduces the complexity you have to deal with. Combine this with fast performance (innodb) and an excellent concurrent client connection library (mysql) and mysql starts to look like an excellent building block for a distributed system.  Also, most companies consider mysql a dependable, &#8216;production&#8217; system whereas anything new you create needs to be tested by fire.  As you probably know, a production-ready technology is a huge value that is usually overlooked by people not used to running large 24&#215;7 services.</p>
<p>For FriendFeed, schemas are handled at the application/library layer here and adding/removing indexes is of course quite easy as that is one of the main driving forces behind this system.  I am not sure where you see the fragility as I didn&#8217;t see any mention of how they handle replication and consistency between mirrors, or how they handle new nodes&#8230;?<br />
They did add inconsistency to the system by choosing to keep the alternate indexes on different boxes than the primary index.  I can see good reasons for doing this, but that is unrelated to their use of mysql.  They could have chosen to keep primary and alternate indexes on the same box for the same entity&#8230; but again that&#8217;s a different issue (and interesting in itself).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Gray</title>
		<link>http://devblog.streamy.com/2009/04/14/web-scale/comment-page-1/#comment-11</link>
		<dc:creator>Jonathan Gray</dc:creator>
		<pubDate>Fri, 24 Apr 2009 16:18:39 +0000</pubDate>
		<guid isPermaLink="false">http://devblog.streamy.com/?p=49#comment-11</guid>
		<description>One thing I did not touch on here that I plan to write about in the future is how your data storage and query system affects your business and products.

An RDBMS that does not need to support much load is like an infinitely extensible system that will do anything you tell it.  A scaled RDBMS is a much different beast.  Changes to schema make a huge difference.  Adding a feature might just mean one more index on this table, but if that table has billions of rows, or is very write heavy, well then maybe it's a no-go.

I like complexity to live in my application and business logic where I can be as efficient as possible and make use of domain-specific optimizations.  The querying of my data should be straightforward and I should be able to store whatever I want.

As I said, a system like FriendFeed's seems like one of those "don't shake the beast" systems that are very delicate once you have them up and running.

That being said, I think it's a neat approach.  They are certainly a group of very smart guys and I'm sure it's working great for them; it's just not the approach I would take.</description>
		<content:encoded><![CDATA[<p>One thing I did not touch on here that I plan to write about in the future is how your data storage and query system affects your business and products.</p>
<p>An RDBMS that does not need to support much load is like an infinitely extensible system that will do anything you tell it.  A scaled RDBMS is a much different beast.  Changes to schema make a huge difference.  Adding a feature might just mean one more index on this table, but if that table has billions of rows, or is very write heavy, well then maybe it&#8217;s a no-go.</p>
<p>I like complexity to live in my application and business logic where I can be as efficient as possible and make use of domain-specific optimizations.  The querying of my data should be straightforward and I should be able to store whatever I want.</p>
<p>As I said, a system like FriendFeed&#8217;s seems like one of those &#8220;don&#8217;t shake the beast&#8221; systems that are very delicate once you have them up and running.</p>
<p>That being said, I think it&#8217;s a neat approach.  They are certainly a group of very smart guys and I&#8217;m sure it&#8217;s working great for them; it&#8217;s just not the approach I would take.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Gray</title>
		<link>http://devblog.streamy.com/2009/04/14/web-scale/comment-page-1/#comment-10</link>
		<dc:creator>Jonathan Gray</dc:creator>
		<pubDate>Fri, 24 Apr 2009 16:11:29 +0000</pubDate>
		<guid isPermaLink="false">http://devblog.streamy.com/?p=49#comment-10</guid>
		<description>@Lee  Thanks for reading the article.

As for FriendFeed, it's not so much that I think anything SQL is slow or clunky.  This is certainly not the case.  Though never big on MySQL (in previous years it was seriously lacking in extensibility and had corruption issues), I'm a huge proponent of PostgreSQL and many times steer people away from more complicated approaches to stick with an RDBMS.

In this case, and I probably didn't point this out clearly enough above, the issue is the complexity of what they're building on top of it.  From what I gather, it's not a simple task to add new nodes to their distributed system.  And they actually introduce weak consistency into the system.  They went out to build something to distribute, but what they ended up with does not seem all that easy.  Rather than build up software around MySQL, my approach would be to look for something that does less and maybe handles the distribution for me, or does less but has strong consistency, etc...

Their approach is an interesting one, and like I said, I'm sure that it performs as well as they need it to.  I'm really not questioning its performance; since it's being sharded, they should be able to get all the performance they need with sufficient equipment.  But is it extensible, can it scale linearly by adding nodes and without touching code, is this in addition to a normal SQL database, how does it handle node failures, is there rebalancing, are there additional caching layers outside this, etc...

Again, my distaste for their approach is not a distaste for SQL, but rather based on my own preferences and personal "best practices".  RDBMS are wonderful, but when I think scale, distribution, key-value, etc., I do not think RDBMS.</description>
		<content:encoded><![CDATA[<p>@Lee  Thanks for reading the article.</p>
<p>As for FriendFeed, it&#8217;s not so much that I think anything SQL is slow or clunky.  This is certainly not the case.  Though never big on MySQL (in previous years it was seriously lacking in extensibility and had corruption issues), I&#8217;m a huge proponent of PostgreSQL and many times steer people away from more complicated approaches to stick with an RDBMS.</p>
<p>In this case, and I probably didn&#8217;t point this out clearly enough above, the issue is the complexity of what they&#8217;re building on top of it.  From what I gather, it&#8217;s not a simple task to add new nodes to their distributed system.  And they actually introduce weak consistency into the system.  They went out to build something to distribute, but what they ended up with does not seem all that easy.  Rather than build up software around MySQL, my approach would be to look for something that does less and maybe handles the distribution for me, or does less but has strong consistency, etc&#8230;</p>
<p>Their approach is an interesting one, and like I said, I&#8217;m sure that it performs as well as they need it to.  I&#8217;m really not questioning its performance; since it&#8217;s being sharded, they should be able to get all the performance they need with sufficient equipment.  But is it extensible, can it scale linearly by adding nodes and without touching code, is this in addition to a normal SQL database, how does it handle node failures, is there rebalancing, are there additional caching layers outside this, etc&#8230;</p>
<p>Again, my distaste for their approach is not a distaste for SQL, but rather based on my own preferences and personal &#8220;best practices&#8221;.  RDBMS are wonderful, but when I think scale, distribution, key-value, etc., I do not think RDBMS.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lee F</title>
		<link>http://devblog.streamy.com/2009/04/14/web-scale/comment-page-1/#comment-9</link>
		<dc:creator>Lee F</dc:creator>
		<pubDate>Fri, 24 Apr 2009 06:31:32 +0000</pubDate>
		<guid isPermaLink="false">http://devblog.streamy.com/?p=49#comment-9</guid>
		<description>Good write up.  I am looking forward to details on how you achieve web scale for Streamy.

Your categorization of these companies seems right on to me except in the case of FriendFeed.  Your analysis seems to rely solely on your belief that using any SQL based product will be slower than one that is not SQL.  This is a myth I have run into where I work (one of the companies you have mentioned) and has led to the use of a lot of BDB based systems that aren't really that performant.

The main thing people do not seem to know is that with many SQL products (mysql, oracle, sqlite, etc) you can turn off just about all the ACID and transactional properties, create a simple key/blob table, and you have an extremely fast datastore.  For example, with mysql you can have the transaction log flush to disk about every second instead of with every user commit.  This is definitely not a transactional system anymore because if your box crashes you may lose a second or so of write data.  However because writes are now batched by mysql instead of doing disk I/O for every single write, this can take write performance from a few hundred writes/second to several thousand.  

In my team we use a distributed storage solution backed by mysql/innodb and configured with the majority of the ACID/transactional stuff turned off.  We dump the output of our Hadoop jobs with millions of rows into it for the rest of our production systems to use.  It is extremely fast, reliable, and is just about the only production system our company uses that can handle the write load (most datastores seem to be all about read speed at the cost of write speed).

I encourage you to test this out for yourself or at least think again about the SQL means slow and clunky paradigm.</description>
		<content:encoded><![CDATA[<p>Good write up.  I am looking forward to details on how you achieve web scale for Streamy.</p>
<p>Your categorization of these companies seems right on to me except in the case of FriendFeed.  Your analysis seems to rely solely on your belief that using any SQL based product will be slower than one that is not SQL.  This is a myth I have run into where I work (one of the companies you have mentioned) and has led to the use of a lot of BDB based systems that aren&#8217;t really that performant.</p>
<p>The main thing people do not seem to know is that with many SQL products (mysql, oracle, sqlite, etc) you can turn off just about all the ACID and transactional properties, create a simple key/blob table, and you have an extremely fast datastore.  For example, with mysql you can have the transaction log flush to disk about every second instead of with every user commit.  This is definitely not a transactional system anymore because if your box crashes you may lose a second or so of write data.  However because writes are now batched by mysql instead of doing disk I/O for every single write, this can take write performance from a few hundred writes/second to several thousand.  </p>
<p>In my team we use a distributed storage solution backed by mysql/innodb and configured with the majority of the ACID/transactional stuff turned off.  We dump the output of our Hadoop jobs with millions of rows into it for the rest of our production systems to use.  It is extremely fast, reliable, and is just about the only production system our company uses that can handle the write load (most datastores seem to be all about read speed at the cost of write speed).</p>
<p>I encourage you to test this out for yourself or at least think again about the SQL means slow and clunky paradigm.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
