"That which is overdesigned, too highly specific, anticipates outcome; the anticipation of outcome guarantees, if not failure, the absence of grace."
-- William Gibson, All Tomorrow's Parties
Adventures on the Sun: Part II

In which our intrepid sysadmin may as well have been eating wall candy as doing work for all it got him.

I spent all of Friday and most of last night trying to figure out why pgbench was giving me what seemed to be totally insane results.

  • >1000 tps on SATA 3.0Gb/s (Linux 2.6, XFS, metadisk + LVM)
  • ~200 tps on SATA 3.0Gb/s (Solaris 10 11/06, UFS)
  • ~500 tps on SATA 3.0Gb/s (Solaris 10 11/06, ZFS mirror)
  • <100 tps on 10x U320 SCSI (Solaris 10 11/06, ZFS RAID-Z)

Setting recordsize=8192 on the ZFS filesystem holding the database is helpful. Or whatever block size your PostgreSQL was compiled with (8 kB is the default).

It basically came down to Postgres fsync() absolutely murdering ZFS's write performance. I couldn't understand why XFS would perform 10 times as well as ZFS on blocking writes. Totally brainlocked.
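If you want to see the fsync() penalty without dragging pgbench into it, a rough sketch like the following does the job. It is not what I ran at the time; the function names, the 8 kB block size (chosen to match the PostgreSQL default) and the write count are all assumptions for illustration. It writes the same blocks twice, once calling fsync() after every write and once not, and reports writes per second for each.

```python
import os
import tempfile
import time


def timed_writes(path, count, block_size=8192, sync=True):
    """Write `count` blocks to `path`, optionally fsync()ing after each one.

    Returns the achieved rate in writes per second.
    """
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync:
                # Ask the OS to push the data to stable storage before
                # continuing, as a database does when it commits.
                os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed if elapsed > 0 else float("inf")


def compare(directory=None, count=200):
    """Run both variants in a scratch directory under `directory`.

    `directory` is a hypothetical knob: point it at the filesystem you
    want to test, or leave it as None to use the default temp location.
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "fsync_probe.dat")
        return {
            "fsync_per_write": timed_writes(path, count, sync=True),
            "no_fsync": timed_writes(path, count, sync=False),
        }


if __name__ == "__main__":
    for name, rate in compare().items():
        print(f"{name}: {rate:.0f} writes/sec")
```

Point `directory` at a path on each filesystem in turn and compare the `fsync_per_write` numbers. It is a crude probe, not a substitute for pgbench, but it isolates the one thing that was hurting.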

And of course the bonnie++ benchmarks were exactly what I expected: the SCSI array was kicking ass all over the place... except with -b (no write buffering, so an fsync() after every write) enabled, while the lone SATA drive just sort of sucked air next to it on any OS/filesystem.

The zpool iostat <pool> <interval> command is seriously useful. Give it -v to get per-disk stats.

Anyway, I was banging my head against the wall literally the entire day. The ticket for "Test PgSQL on Sol10" is more pages long than I'd like to think about. Finally I bitched about it in #slony, not out of any particular thought that I'd get an answer, just frustration. mastermind pointed out that, hey, cheap SATA drives have, you know... write caches. And like? SCSI drives? They have it like, turned off by default, because reliability is far more important than performance for most people. (Valley-girlese added to relate how stupid I felt once he mentioned it.) And anyway, that's what battery-backed controllers are for: So you get the perf and reliability.
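A back-of-the-envelope check would have caught this hours earlier. The sketch below leans on a common rule of thumb, and it is only that: with the write cache off and a single writer, a spinning disk can complete at most roughly one synchronous commit per platter rotation. The function names and the RPM values are illustrative assumptions; I am not claiming those are the speeds of the drives above.

```python
def max_synced_commits_per_sec(rpm):
    """Rule-of-thumb ceiling: one synchronous commit per platter rotation."""
    return rpm / 60.0


def looks_cached(observed_tps, rpm):
    """True if the observed rate beats the rotational ceiling.

    For a single uncached drive and a single writer, that suggests a
    write cache somewhere is acknowledging writes before they hit disk.
    """
    return observed_tps > max_synced_commits_per_sec(rpm)


if __name__ == "__main__":
    # Assumed spindle speeds, purely for illustration.
    for rpm in (7200, 10000, 15000):
        ceiling = max_synced_commits_per_sec(rpm)
        print(f"{rpm} rpm: about {ceiling:.0f} synced commits/sec at best")
```

By that yardstick, a single drive reporting over 1000 tps on synchronous commits is almost certainly answering from cache, whatever its spindle speed.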

Once he said "write cache", it clicked, total epiphany, and I felt like a complete jackass for spending more than ten hours fighting with what amounts to bad data. In the process I read a fair amount on good database architecture, though, and that will be very helpful in the coming weeks for making our discards database(s) not suck.

Live and learn, I suppose.

The whole experience has really driven home the point that getting information out of a Solaris box is so much less tedious than a Linux box, though. While I was under the impression that the SATA drives were not actually lying to me about their relative performance, I kept thinking "These Sol10 tools are so frakking nice, I don't want to get stuck with Linux tools for this stuff!" Especially the ZFS interfaces.

February 10, 2007 4:26 PM