On disk performance and MySQL replication

If you’re not using replication, the only thing you have to worry about is MyISAM and InnoDB performance. There are a lot of things you can easily do to get more (or less!) performance out of either of them. So, you get your system tuned and handling 1500 updates per second, easy as pie. You’ve got a single IDE, SATA, or SCSI disk, or maybe two using software RAID.

Then, due to natural progression, needed scalability, or catastrophic failure, you realize that you should probably be using replication, so that you can have more than one of these systems working at once.

The first step in setting up replication is normally to turn on binary logging (by enabling log-bin) on the future master. This ensures that every change to the database is logged so that it can be sent to your future slaves and executed there. Here, though, you find your first obstacle.
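
As a concrete sketch, enabling it looks something like this in my.cnf (the log path and server-id value here are hypothetical; adjust them for your own setup):

    [mysqld]
    # Log every change to the databases, for shipping to slaves.
    log-bin   = /var/lib/mysql/binlog/mysql-bin
    # Each server participating in replication needs a unique ID.
    server-id = 1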

You enable binary logging, and all of a sudden, performance goes to hell. Your server was handling 1500 updates per second, but now it’s doing only 150. You freak out. You disable binary logging, and things return to normal. What’s happening?

In order to replicate to slaves, and to maintain consistency, MySQL’s binary logging mechanism writes the binary log “event” for a transaction as part of the commit process. Before a transaction can return “OK” for a COMMIT command (or an implicit commit), the transaction must have been logged to disk in the binary log. In order to ensure that the OS doesn’t cheat and cache the write, MySQL calls fsync on the binary log as part of this commit.

As long as the system is also doing some reads, a few other writes, etc., you can assume that any time you write data and force it to disk with fsync, it will require a disk seek in order to be written. Disk seeks on most disks take anywhere from 4-8ms each, so a single disk can only do anywhere from 125-250 of them per second, on average.
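
You can measure this limit on your own hardware with a synchronous-write test. Here’s a rough sketch using GNU dd (the target path is hypothetical, and oflag=dsync needs a reasonably recent coreutils); dividing count by the elapsed time it reports gives your synced-write rate:

    # Append 1,000 4KB blocks, syncing each write to disk, much as
    # MySQL's per-commit fsync on the binary log does.
    dd if=/dev/zero of=/path/on/that/disk/testfile bs=4096 count=1000 oflag=dsync

On a disk with 4-8ms seeks, that works out to 1000ms / 4-8ms, or roughly the same 125-250 synced writes per second.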

Are things starting to come together now? Enabling binary logging drops your update rate to ~150 per second because each of those updates must be written to the binary log and flushed to disk, and you’re limited by the number of seeks the disk can physically do in a given second. The only way to get more updates per second through the binary logging mechanism is to do fewer disk seeks. There are a few ways to accomplish this:

  • Unsafe syncing — You can use the sync_binlog=n option, with n > 1, so that fsync isn’t called for every single transaction (see the sketch just after this list). This is the easiest option, but it means that if your server crashes (power failure, kernel panic, etc.), you can lose up to n committed transactions. That’s generally bad.
  • RAID with battery-backed write cache (BBWC) — This option, though the most expensive, is the most common. The BBWC is some amount of RAM, usually 128MB or so, on the RAID card itself, that is used to cache writes. The RAID card will acknowledge a write once it has been saved in the RAM, but before it has been written to the disk.
  • Separate disk — Put the binary log on its own disk(s)¹ using Linux’s ext3 filesystem with a sizable journal, mounted with the options data=journal,commit=10 (see the sketch just after this list).
  • Ramdisk journal — Put the binary log on a shared disk, but on its own volume, using Linux’s ext3 filesystem, with the journal on a battery-backed hardware ramdisk device. I’ll be writing some more blog entries about this quite soon. :)
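
As a sketch, the first and third options look something like this (the sync_binlog value, device name, and mount point are all hypothetical):

    # In my.cnf: fsync the binary log only every 100th write, rather
    # than every write (sync_binlog=1) or never (sync_binlog=0).
    [mysqld]
    sync_binlog = 100

    # Mount a dedicated ext3 volume for the binary log:
    mount -o data=journal,commit=10 /dev/sdc1 /var/lib/mysql-binlog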

I hope you’ve learned something. Feel free to leave me some feedback!

¹ Yes, its own disk. If anything else touches the disk, ever, this strategy fails.

6 thoughts on “On disk performance and MySQL replication”

  1. Hi Jud,

    Good question! Actually, in the “Separate disk” scenario, the binary log needs to be on its own disk. If you put it on a disk shared with any other database activity, it may not keep up.

    To clarify, let’s look at what happens when you do this, and the reasoning behind the mount options data=journal,commit=10 that I mentioned.

    The data=journal option forces all data being written to disk to go through the journal first. The default is that only metadata is journaled, but that means bad things for our sequential writes: the journal is in a different place on disk than the binary log itself, so the disk must seek between them.

    The commit=10 option tells ext3 to flush the journal to disk at least every 10 seconds, instead of the default 5. This allows up to 10 seconds’ worth of binary log to accumulate in the journal before Linux writes it out to the actual file (which causes some unwanted seeking, but that’s unavoidable).

    If you have the binary log on its own disk, each time MySQL needs to append a transaction to the binary log, the disk’s head will already be parked in the correct place, at the ext3 journal, to write without a seek. If there is any other activity, read or write, taking place on the same disk, the disk heads will have moved from the journal and parked somewhere else.
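
    As a sketch of that setup (the device, journal size, and mount point here are hypothetical):

        # Create ext3 on the dedicated disk with a large (128MB) journal:
        mke2fs -j -J size=128 /dev/sdc1

        # /etc/fstab entry for the binary log volume:
        /dev/sdc1  /var/lib/mysql-binlog  ext3  data=journal,commit=10  0 0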

    Technically, an unjournaled filesystem such as ext2 will give similar performance for sequential committed writes, but it isn’t really optimal because of the filesystem repair time necessary after an unclean restart.

    Using this method with ext3, you can get 500-1500 transactions per second committed, without any sort of battery-backed write cache. It works!

    Regards,

    Jeremy

  2. Jeremy,

    Beware :) If you’re speaking about 500-1500 transactions per second on disk without a battery-backed cache, you’re not speaking about ACID transactions. What I mean is, they will likely be lost on a power failure. Due to physical limits, you should see only some 100-200 commits per second even with fully sequential writes (the disk needs to do a full rotation per commit anyway).

    The difference you’re observing probably comes from the cache policy on the drive itself: quite frequently, sequential writes are accumulated in the drive’s cache, while seeks cause it to flush.

    Also, did you try data=writeback?

  3. MySQL Performance Blog » Blog Archive » Jeremy Cole on MySQL Replication

    FYI, sync_binlog defaults to 0, which means no forced flushing until the OS decides to flush.

    http://dev.mysql.com/doc/refman/4.1/en/server-system-variables.html

    sync_binlog

    If the value of this variable is positive, the MySQL server synchronizes its binary log to disk (using fdatasync()) after every sync_binlog writes to the binary log. Note that there is one write to the binary log per statement if autocommit is enabled, and one write per transaction otherwise. The default value is 0, which does no synchronizing to disk. A value of 1 is the safest choice, because in the event of a crash you lose at most one statement or transaction from the binary log. However, it is also the slowest choice (unless the disk has a battery-backed cache, which makes synchronization very fast). This variable was added in MySQL 4.1.3.

    If the value of sync_binlog is 0 (the default), no extra flushing is done. The server relies on the operating system to flush the file contents occasionally, as for any other file.

  5. Hi Jeremy,
     Maybe this isn’t the best place to post my question, but I guess you’ll know.
     Do you know if there is a length limit in MySQL replication?
     With “show binlog events in ‘Ulisses.000003’ from 22392415 limit 1;” the query comes back incomplete, and because of it my slave went down.
     Any tips?
     Thanks!
     Jonatas – Brazil
