Archive for June, 2007


Angora fire smoking us out

by Jeremy Cole on Tuesday, June 26th, 2007 at 15:06:57 in General

The Angora fire at South Lake Tahoe is generating crazy amounts of smoke, which is now headed north into Reno. It’s getting kind of crazy here now…

This is the fire as of yesterday on NASA’s MODIS Aqua satellite:

You can see the fire (outlined in red) at the bottom left, and Reno city to the top right.

And the view from our backyard today:

My top 5 wishes for MySQL

by Jeremy Cole on Wednesday, June 20th, 2007 at 13:56:11 in MySQL, MySQL Tips, Proven Scaling

Jay Pipes, Stewart Smith, Marten Mickos, Frank Mash, and Kevin Burton have all gotten into it, and Marten suggested that I should write my top five. I’m usually not into lists, but this sounds interesting, so here goes!

My top 5 wishes for MySQL are:

1. Restore some of my sanity

OK, well, this actually has several sub-wishes…

a. Global and session INFORMATION_SCHEMA

This is just starting to become interesting, but I’ve told MySQL several times that it’s a mistake to mix session-scoped and global-scoped things together in INFORMATION_SCHEMA. I can only hope they will listen.

b. Namespace-based configuration

A long time ago I started writing a proposal for this, but really anything would be better than today’s jumbled mess. This would also allow plugins and storage engines to bring in not a random smattering of variables, but an entire namespace.

c. Better memory management

I’ve also started writing a proposal for this. Right now nothing really constrains memory within MySQL, and there is no effective way to be sure that you both have enough memory configured in various variables for your needs, and that you don’t run out of memory (or start swapping, or crash on 32-bit systems, etc).

d. Stop writing ugly hacks

We now have FEDERATED, BLACKHOLE, and soon a whole boatload of new stuff that’s quite hacky, and although useful in some situations, the general public should stay away from them. Yet, I continually see them come up in all different situations where people think they are a good idea. FEDERATED should have been implemented as Oracle’s DATABASE LINK feature, which is much more user-friendly, safer, and just generally better. BLACKHOLE was created to solve a stupid deficiency in replication, to allow relay slaves to not need a complete copy of the data. Why not just allow replication to pass on logs raw, or write a log proxy?

e. Fix subquery optimization

Subqueries have been available in the same broken state for over 4 years now. Why are subqueries in an IN (...) clause still optimized in an incredibly stupid and slow way, such as to make them useless? We have customers tripping over this all the time. MySQL can check off “subqueries” on the to-do list, since they do in fact work. The SQL standard doesn’t say anything about not sucking.
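For anyone who has not tripped over this yet, the usual workaround is to rewrite the IN (...) subquery as a join. Below is a minimal sketch of that rewrite. It uses Python's built-in sqlite3 module purely so that it runs anywhere; SQLite's optimizer is not MySQL's, and the tables and data are invented for illustration. It only shows that the two forms return the same rows, not how MySQL executes either one.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'US'), (2, 'DE'), (3, 'US');
        INSERT INTO orders VALUES (10, 1, 9.5), (11, 2, 20.0), (12, 3, 5.0), (13, 1, 1.0);
    """)
    return conn


# The form that people naturally write.
IN_SUBQUERY = """
    SELECT o.id FROM orders o
    WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.country = 'US')
    ORDER BY o.id
"""

# The join rewrite. It is equivalent here only because customers.id is unique;
# joining on a non-unique column could duplicate rows.
JOIN_REWRITE = """
    SELECT o.id FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.country = 'US'
    ORDER BY o.id
"""


def order_ids(conn, sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(order_ids(conn, IN_SUBQUERY))
    print(order_ids(conn, JOIN_REWRITE))
```

Both queries return the same order IDs; the point of the rewrite is only to give the optimizer a form it handles well.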

2. Parallel query

Parallel execution of a single query is really a requirement for larger reporting/data warehouse systems. MySQL just doesn’t make good use of having lots of disks available in its current state. Parallel execution of single queries could solve this (for reads) to a large extent.

3. Less dependency on latency in writing

MySQL (especially InnoDB, and assuming failsafes¹ ON) is really dependent on the latency of the underlying IO system to get reasonable performance on writes. MySQL really ought to batch syncs to disk together better, but this is complex because of the storage engine model and replication using different logs than the storage engines. This means two things practically:

  • You must have a battery-backed write cache to make any decent use of your disks for writes whatsoever
  • Once a system fills its battery-backed write cache with random writes, performance degrades much more than it should
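The latency dependence is easy to see with back-of-the-envelope arithmetic. The sketch below is a simplified model, not a measurement: it assumes commits are serialized behind syncs, and the latency figures (8 ms for a bare disk, 0.25 ms for a battery-backed write cache) are illustrative assumptions rather than numbers from any real system.

```python
def max_commits_per_second(sync_latency_ms, commits_per_sync=1):
    """Upper bound on commit rate when each sync takes sync_latency_ms.

    commits_per_sync models batching: how many commits share one sync.
    """
    if sync_latency_ms <= 0 or commits_per_sync < 1:
        raise ValueError("latency must be positive and batch size at least 1")
    syncs_per_second = 1000.0 / sync_latency_ms
    return syncs_per_second * commits_per_sync


if __name__ == "__main__":
    # Assumed latencies, for illustration only.
    print(max_commits_per_second(8.0))        # one sync per commit, bare disk
    print(max_commits_per_second(0.25))       # one sync per commit, write cache
    print(max_commits_per_second(8.0, 10))    # bare disk, 10 commits per sync
```

Under these assumptions the bare disk tops out at 125 commits per second, the write cache at 4000, and batching ten commits per sync lifts the bare disk to 1250. That is why better batching of syncs would matter so much.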

4. Better GIS support

MySQL is being left behind by PostGIS and Oracle Spatial and missing a large segment of the market because the GIS support is so terrible. Nobody can figure out how to get data in and out, which I’ve tried to address (as a hobby project) with libmygis. But there are still too many inherent limitations in the GIS support to make it really useful for serious projects:

  • Spatial indexes only exist in MyISAM, so you cannot use spatial indexes and transactions
  • Currently only 2-dimensional types are supported, while many users need N-dimensional support (or at least 3)
  • Lack of non-MBR-based spatial relationship functions means things like CONTAINS() are really lame
  • Spatial types store everything as a 2-dimensional DOUBLE, so every point costs 16 bytes, while many systems do not need that much accuracy for most things — a choice of lower resolution types would be nice
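To make the MBR complaint concrete, here is a small sketch in plain Python (no MySQL involved) comparing a bounding-rectangle test against an exact point-in-polygon test, plus the storage arithmetic for a 2-dimensional point. The function names and the triangle are invented for illustration; the exact test is a standard ray-casting check and does not treat points lying exactly on an edge specially.

```python
import struct


def mbr(polygon):
    """Return the minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def mbr_contains(polygon, point):
    """MBR-based containment: true whenever the point is inside the rectangle."""
    min_x, min_y, max_x, max_y = mbr(polygon)
    x, y = point
    return min_x <= x <= max_x and min_y <= y <= max_y


def polygon_contains(polygon, point):
    """Exact containment using ray casting (even-odd rule)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_size_bytes(fmt):
    """Bytes needed to pack one (x, y) pair with the given struct format."""
    return struct.calcsize(fmt)


if __name__ == "__main__":
    triangle = [(0, 0), (10, 0), (0, 10)]
    print(mbr_contains(triangle, (8, 8)), polygon_contains(triangle, (8, 8)))
    print(point_size_bytes("<dd"), point_size_bytes("<ff"))
```

The point (8, 8) lies inside the triangle's bounding rectangle but outside the triangle itself, so the MBR test says "contained" while the exact test does not. Two doubles pack into 16 bytes, while two single-precision floats would take 8.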

5. Better binary logs and log management

This is a big nebulous topic, but MySQL’s binary log format sucks, the tools (mysqlbinlog and the SQL commands in MySQL) to deal with them suck, and the replication protocol sucks. Yes, it all works. Yes, I tell all of our customers to use it (and they should!). However, overall there could be a lot to gain by fixing things up. I would like to see:

  • The log format should have each entry checksummed to catch corruption and transmission errors
  • The logs need internal indexes over the entries, in order to be able to scan forwards and backwards, and quickly find a given entry
  • Each entry needs a unique transaction ID instead of basing everything on log positions and filenames
  • A proper C library needs to exist to access the logs, so that new tools can be written and existing ones extended
  • Log archival, and later log-serving tools for the archived logs need to be written (but they would be much easier given the above library)
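As a rough illustration of the first three wishes, here is a sketch of a log whose entries carry a transaction ID and a checksum, with an index built over them. This is a made-up format for illustration only: it is not MySQL's binary log format, and the header layout, function names, and the choice of CRC32 are all assumptions.

```python
import struct
import zlib

# Hypothetical entry header: 64-bit transaction ID, payload length, CRC32 of payload.
HEADER = struct.Struct("<QII")


def encode_entry(txn_id, payload):
    """Serialize one entry as header plus payload; the CRC covers the payload only."""
    return HEADER.pack(txn_id, len(payload), zlib.crc32(payload)) + payload


def decode_entries(log):
    """Yield (offset, txn_id, payload) per entry; raise ValueError on corruption."""
    offset = 0
    while offset < len(log):
        if len(log) - offset < HEADER.size:
            raise ValueError(f"truncated header at offset {offset}")
        txn_id, length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = bytes(log[start:start + length])
        if len(payload) != length or zlib.crc32(payload) != crc:
            raise ValueError(f"corrupt entry at offset {offset}")
        yield offset, txn_id, payload
        offset = start + length


def build_index(log):
    """Map each transaction ID to the byte offset of its entry."""
    return {txn_id: offset for offset, txn_id, _ in decode_entries(log)}


if __name__ == "__main__":
    log = encode_entry(1, b"INSERT") + encode_entry(2, b"UPDATE")
    print(build_index(log))
```

With an index like this, a tool can jump straight to a transaction by ID instead of by filename and position, and a flipped byte in any payload is caught on read rather than silently replayed.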

¹ By failsafes, I mean innodb_flush_log_at_trx_commit=1 and sync_binlog=1.

UPDATE: Ronald Bradford, Alan Kasindorf, Jim Winstead, and Jonathon Coombes posted their 5 wishes, and Antony Curtis suggests that it’s not useful to wish.

UPDATE: Antony Curtis finally gave in, and Paul McCullagh, Peter Zaitsev, and Konstantin Osipov also got in on it.

MySQL’s newest marketing fluff on scale out

by Jeremy Cole on Monday, June 11th, 2007 at 06:40:15 in MySQL, Proven Scaling, Scalability

MySQL today launched their newest marketing effort, “The 12 Days of Scale-Out”, which is quite timely to our most recent discussions. Zack Urlocker has been busy plugging it onto Planet MySQL. Day one is about Booking.com, “Europe’s largest online hotel travel reservations agency”. Sounds exciting! This could be really interesting!

Only one problem: There is no actual content in their stories. Maybe I missed some link that says “Click here to read the full story”, but I don’t think so.

In summary, we’ve got:

  • Headline blurb with plug for “MySQL Enterprise Unlimited”
  • Small marketing blurb from Booking.com about what they do
  • Big headline text about the site’s Alexa ranking showing solid growth
  • Lead-in paragraph with blurb for MySQL’s consulting, er, “professional services” group
  • One buzzword-heavy paragraph, containing a single run-on sentence which is somewhat technical
  • One longer paragraph touting MySQL Enterprise’s benefits
  • A big list of all 12 days
  • A remarkably silly looking “sticker” to contact MySQL about MySQL Enterprise Unlimited
  • A buzzword-heavy blurb about what scale-out is
  • A couple of links for forums and a webinar
  • A picture of a “reference” implementation of scale-out (NOT Booking.com’s implementation)
  • Eleven sales links

So, to count that up, we have a single, quite small paragraph containing a single run-on sentence, which is unique to Booking.com and relates to scale out.

Wait, what? Why do we care to wait 12 days to get a single sentence about each scale-out story? Let’s hope that days two through twelve are more meaty, but if so, day one is an awfully sad way to start things out. I’m not hopeful. But heck, maybe this entry can convince some writers over at MySQL to spend a few long nights. Maybe.

Scaling out AND up, a compromise

by Jeremy Cole on Sunday, June 10th, 2007 at 18:13:58 in MySQL, Proven Scaling, Scalability, Technology

You might have noticed that there’s been quite a (mostly civil, I think) debate about RAID and scaling going on recently:

I’d like to address some of the—in my opinion—misconceptions about “scaling out” that I’ve seen many times recently, and provide some of my experience and opinions.

It’s all about compromise.

Human time is expensive. Having operations, engineering, etc. deal with tasks (such as re-imaging a machine) when fixing a problem that could have been a 30-second disk swap is an inefficient use of human resources. Don’t cut corners where it doesn’t make sense. This calls back to Brian’s comments about the real cost of your failed $200 part.

Scaling out doesn’t mean using crappy hardware. I think people take the “scale out” model (that they’ve often only read about from outdated conference presentations) to quite an extreme. They think scaling out means using desktop-class, bad hardware, and just buying a ton of them. That model doesn’t work, and it’s hell to maintain in the long term.

Compromise. One of the key points in the scale-out model: size the physical hardware reasonably to achieve the best compromise between scaling out and scaling UP. This is the main reason that I assert RAID is not going anywhere… it is often simply the best and cheapest way to achieve the performance and reliability that you need in each physical machine in order to make the scale out model work.

Use commodity hardware. You often hear the term “commodity hardware” in reference to scale out. While crappy hardware is also commodity, what this means is that instead of getting stuck on the low-end $40k machine, with thoughts of upgrading to the $250k machine, and maybe later the $1M machine, you use data partitioning and any number of let’s say $5k machines. That doesn’t mean a $1k single-disk crappy machine as said above. What does it mean for the machine to be “commodity”? It means that the components are standardized, common, and the price is set by the market, not by a single corporation. Use commodity machines configured with a good balance of price vs. performance.

Use data partitioning (sharding). I haven’t talked much about this in my previous posts, because it’s sort of a given. My participation in the HiveDB project and my recent talks on “Scaling and High Availability Architectures” at the MySQL Conference and Expo should say enough about my feelings on this subject. Nonetheless I’ll repeat a few points from my talk: data partitioning is the only game in town, cache everything, and use MySQL replication for high availability and redundancy.
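For readers new to the idea, the simplest form of data partitioning is routing each key to one of N databases. The sketch below uses a stable hash for that; this is just one common approach, chosen here for brevity, and is not a description of how HiveDB does it. The host names are hypothetical.

```python
import hashlib

# Hypothetical shard hosts, for illustration only.
SHARDS = ["db0.example.com", "db1.example.com", "db2.example.com", "db3.example.com"]


def shard_for(key, shard_count):
    """Map a key to a shard number in [0, shard_count) using a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    digest is used to keep the mapping identical across processes and machines.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def host_for(key, shards=SHARDS):
    """Return the host that owns the given key."""
    return shards[shard_for(key, len(shards))]


if __name__ == "__main__":
    for user_id in (1, 2, 42, 1000):
        print(user_id, host_for(user_id))
```

The catch with plain modulo hashing is that changing the number of shards remaps most keys, which is one reason a lookup directory (key to shard) is often preferred once data has to move between shards.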

Nonetheless, RAID is cheap. I’ve said it several times already, just to be sure you heard me correctly: RAID is a cheap and efficient way to gain both performance and reliability out of your commodity hardware. For most systems, engineering time, operations time, etc., is going to be a lot more expensive to get the same sort of reliability out of a non-RAID partitioned system versus a RAID partitioned system. Yes, other components will fail, but in a sufficiently large data-centric system with server class hardware, disks will fail 10:1 or more over anything else.

That is all, carry on.

Update: Sebastian Wallberg has translated this entry to German. Thanks Sebastian!

RAID: Alive and well in the real world

by Jeremy Cole on Thursday, June 7th, 2007 at 23:23:49 in MySQL, Proven Scaling, Scalability, Technology

Kevin Burton wrote a sort-of-reply to my call for action in getting LSI to open source their CLI tool for the LSI MegaRAID SAS aka Dell PERC 5/i, where he asserted that “RAID is dying”. I’d like to assert otherwise. In my world, RAID is quite alive and well. Why?

  • RAID is cheap. Contrary to popular opinion, RAID isn’t really that expensive. The controller is cheap (only $299 for Dell’s PERC 5/i, with BBWC, if you pay full retail). The “2x” disk usage in RAID 10 is really quite debatable, since those disks aren’t just wasting space, they are also improving read (and subsequently write) performance.
  • Latency. The battery-backed write cache is a necessity. If you want to safely store data quickly, you need a place to stash it that is reliable¹. This is one of the main reasons (or only reasons, even) for using hardware RAID controllers.
  • Disks fail. Often. If anything, we should have learned that from Google. Automatic RAID rebuild is a proven and effective way to manage this without sinking a huge amount of time and/or resources into managing disk failures. RAID turns a disk failure into a non-event instead of a crisis.
  • Hot swap ability. If you forgo hardware RAID, but make use of multiple disks in the machine, there’s a very good chance you will not be able to hot swap a failed disk. Most hot-swappable disk controllers are RAID controllers. So, if you want to hot-swap your disks, you likely end up paying the cost for the controller anyway.

I don’t think it’s fair for anyone to say “Google doesn’t use RAID”. For a few reasons:

  1. I would be willing to bet there are a number of hardware RAIDs spread across Google (feel free to correct me if I’m wrong, Googlers, but I very much doubt I am). Google has many applications. Many applications with different needs.
  2. As pointed out by a commenter on Kevin’s entry, Google is, in many ways, its own RAID. So even in applications where they don’t use real RAID, they are sort of a special case.

In the latter half of his entry, Kevin mentions some crazy examples using single disks running multiple MySQL daemons, etc., to avoid RAID. He seems fixated on “performance” and talks about MBps, which is, in most databases, just about the least important aspect of “performance”. What his solution does not address, and in fact where it makes matters worse, is latency. Running four MySQL servers against four disks individually is going to make absolutely terrible use of those disks in the normal case.

One of the biggest concerns our customers, and many other companies have, is power consumption. I like to think of hardware in terms of “critical” and “overhead” components. Most database servers are bottlenecked on disk IO, specifically on latency (seeks). This means that their CPUs, power supplies, etc., are all “overhead” — components necessary to support the “critical” component: disk spindles. The less overhead you have in your overall system, the better, obviously. This means you want to make the best use (in terms of seek capacity) of your disks possible, and minimize downtime, in order to make the best use of the immutable overhead.

RAID 10 helps in this case by making the best use of the available spindles, spreading IO across the disks so that as long as there is work to be done, in theory, no disk is underutilized. This is exactly something you cannot accomplish using single disks and crazy multiple-daemon setups. In addition, in your crazy setup, you will waste untold amounts of memory and CPU by handling the same logical connection multiple times. Again, more overhead.
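A toy model makes the spindle-utilization argument concrete. The sketch below is idealized and rests on loud assumptions: it considers random reads only (RAID 10 writes hit both halves of a mirror), it assumes striping spreads load perfectly, and the per-disk capacity and per-daemon demand figures are invented for illustration.

```python
def served_independent(demands, per_disk_capacity):
    """Seeks/sec served when each workload is pinned to its own single disk.

    A busy workload is capped by its one disk, even if other disks sit idle.
    """
    return sum(min(demand, per_disk_capacity) for demand in demands)


def served_pooled(demands, per_disk_capacity):
    """Seeks/sec served when the same disks are pooled and load is spread evenly.

    Idealized: total demand is capped only by the combined capacity.
    """
    return min(sum(demands), per_disk_capacity * len(demands))


if __name__ == "__main__":
    # Assumed numbers: four disks at 150 seeks/sec each, one hot workload.
    demands = [400, 50, 50, 50]
    print(served_independent(demands, 150))
    print(served_pooled(demands, 150))
```

With these made-up numbers, four single-disk daemons serve 300 seeks per second while three disks sit mostly idle, whereas the pooled array serves 550 of the 550 requested. Real arrays will fall short of the ideal, but the direction of the difference is the point.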

What do I think is the future, if RAID is not dying? Better RAID, faster disks (20k anyone? 30k? Bring it on!), bigger battery-backed write caches, and non-spinning storage, such as flash.

¹ There’s a lot to be said for treating the network as “reliable”, for instance with Google’s semi-synchronous replication, but that is not available at this time, and isn’t really a viable option for most applications. Nonetheless, I would still assert that RAID is cheap compared to the cost (in terms of time, wasted effort, blips, etc.) of rebuilding an entire machine/daemon due to a single failed disk.

Help convince Dell to leverage LSI to Open Source MegaCli

by Jeremy Cole on Wednesday, June 6th, 2007 at 13:21:24 in MySQL, Rants, Technology

I’ve just submitted “Leverage LSI to Open Source MegaCli” to the Dell IdeaStorm website:

Dell makes some awesome and affordable hardware. Many new Dell machines have the PERC 5/i SAS RAID controller, which is a rebranded LSI MegaRAID SAS.

LSI makes some nice RAID cards. Dell likes LSI. Dell made a deal with LSI to provide the chips for their fancy new PERC 5/i cards.

We buy machines with these cards in them. We need to monitor our RAIDs, rebuild them, and do all manner of other maintenance tasks. We do not expect LSI to provide perfect tools. LSI is a hardware vendor, and it’s understandable that they provide terrible *software*. What is NOT understandable, though, is why LSI’s terrible tools are closed source.

What is further incomprehensible is why Dell is willing to accept this situation on behalf of their enterprise customers. Has anyone from Dell even tried to use the tools LSI provides, and Dell recommends, to manage a RAID array on Linux?

MegaCli is the worst command-line utility I have ever seen, bar none. But, we don’t expect LSI to make it better, we expect LSI to OPEN SOURCE it. That way we software professionals can spend our own time to make them better. We need better tools. We are willing to work for free. Give us the source, or give us good documentation, but give us something.

We’re willing to provide infinite amounts of value to both Dell and LSI. Dell has enough clout with LSI to make this happen. Please make it happen.

Signed,

Jeremy Cole
Open Source Database Guy

Please go there and “promote” this if you care about Dell and RAID!