-- William Gibson, All Tomorrow's Parties
For the last week and a half I've been reading up on Solaris 10 again. The last time I touched it, about a year ago, I was just screwing around with no real interest in using it in a production environment. After reading a few posts over at Theo Schlossnagle's blog regarding choosing Solaris over Linux, and his OSCON slides on the same subject, both relating to PostgreSQL performance, I became much more interested in Solaris 10.
(hdp made noises about how the company Schlossnagle works for wrote OmniMTA, which is what the Cloudmark Gateway product uses, among other things; evidently it's a small enough world, after all.)
We have a service at work which stores spam for 30 days. We refer to the messages as "discards", because the system has decided you probably don't want to see them, but it's not like we're going to drop the things on the floor. The thing is, it's insanely slow, right to the very edge of usability (and probably beyond for the vast majority of people). Getting results out of the database takes minutes.
There are a number of issues with the system as a whole, but evidently Postgres configuration is not one of them (jcap, my predecessor, set the service up properly, and a PgSQL Guy agreed there wasn't much else that could be done on the service end). So that leaves hardware and OS optimizations. The hardware is fine, save for the fact that it's running on metadisk, which is running on SATA (read: bloody slow, especially for PgSQL, which is a pig for disk I/O). We'll be fixing that with a SCSI JBOD and a nice SCSI RAID card RSN. The OS is Linux, and has also been optimized... to a point. Screwing with the scheduler would probably get me something. However, based on my own research (I've read both the Basic Administration and Advanced Administration books over at docs.sun, as well as numerous websites, etc), and Schlossnagle's posts, I've made up my mind that Solaris is the way to go here. So what sold me?
Well, there's the standard features all new to Solaris 10:
- StorageTek Availability Suite (I can't seem to get away from network block devices... we use DRBD right now, and frankly I've really come to hate it; but the basic idea is sound enough and far too useful to ignore)
- Fault Management
- Zones (not very useful to me in this case)
- Service Management Facility (while not a deal-breaker or -maker, it's incredibly nice being able to define service dependencies and milestones; it also ties into FM)
- DTrace (for me, this is a deal-maker; check out the DTraceToolkit for examples why, compared to debugging problems under Linux, it's a huge win for any admin)
- Trusted Extensions (while really interesting and hardcore, not something I care much about just yet)
- Stability (not only in terms of the system itself, but the APIs, ABIs, and the like; you can use any device driver written for the last three major versions of Solaris in 10 -- compare not only the technology there, but the philosophy behind it, to any freenix)
- RBAC (while not something I'm going to use immediately, it's something that I really want to utilize moving forward)
That's a fair feature-set that should get any admin to perk up and take notice. Of course, if it weren't for OpenSolaris I wouldn't care. Solaris 8 and 9 are sturdy and well-known systems, but I have no interest in them. They don't get me anything except service contracts and annoying interfaces. With OpenSolaris, Sun is actively making progress in the same friendly directions freenixes have always tried for -- while adding some seriously engineered and robust tech into the mix. It's a nice change. A more open development model, with lots of incremental releases (building into an official Solaris 10 release every six months or so), gives me the warm fuzzies.
So, now that the advertisement is out of the way, what are my impressions after a week of configuring and using it?
Well, Solaris with great new features is still Solaris. Config files are in strange places for legacy reasons; there are symlinks to binaries (or actual binaries) in /etc; SunSSH is... well. SunSSH (perhaps sometime soon they'll just switch over to OpenSSH and all the evil politicking can be forgotten, yes?). /home is not /home, because it's really /export/home.
Commands exist in odd locations that aren't in my path by default, and logging is strange. In short, it's a commercial UNIX. It's steeped in history, and the reasons for things are not always immediately clear. The documentation (from docs.sun, OpenSolaris, and the man pages) is excellent. I am not coming to Solaris as a total newb, but I haven't used it particularly extensively either; the learning curve is expectedly high.
As always, UNIX is UNIX. Nothing changes but the dialect and where they put the salad fork.
So, I've got this core system that does lots of really great stuff, some of which is confusing and maybe not so great, but overall it's a pretty obvious win. Unfortunately it has a bunch of tools I'm not used to or don't like, and it lacks a lot of tools I require. So I need to go out and find a third-party package management utility. Well, you've got Sun Freeware, which is pretty basic. There's Blastwave, which has a large repository of software and a trivial way of interfacing with it all, but seems to have some QA issues (that's an old impression and may have become invalidated).
And then there's pkgsrc, the NetBSD packages collection. And you know what? It's pretty great. After bootstrapping pkgsrc's gcc from Sun Freeware's (Sun packages gcc now, so you have access to a compiler with the OS -- this was not true before -- but apparently Sun's gcc is Weird and not to be trusted), I was building packages with no issues whatsoever. OpenSSH, Postfix, PgSQL, vim7... Anyway, with an hour's worth of work (which only ever needs to be done once, on one system, to build the packages), you've got all the programs you're used to using, or require. Suddenly the weird and craggy vastness of Solaris -- expat from the world of commercial UNIX -- becomes much more friendly and livable.
A couple of simple hints about your environment: set TERM=dtterm and add TERMINFO=/usr/share/lib/terminfo. The former seems to be the proper terminal type for xterm or xterm-like terminals, and the latter fixes pgup/pgdown in pagers and vim, though not in Sun's vi.
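In a Bourne-style shell profile, that amounts to:

```shell
# Terminal setup for xterm-alikes on Solaris 10
TERM=dtterm
TERMINFO=/usr/share/lib/terminfo
export TERM TERMINFO
```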
It's also easy to create your own packages -- something we've been wanting to do at work for a long time (before I started, certainly). Moving our current custom "packaging" system to pkgsrc would be tedious, but certainly something we could automate with some work. Standardizing on it would be a big win not just for the Solaris servers, but for the architecture as a whole. So, a double win.
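For the curious, a pkgsrc package is just a directory holding a Makefile (plus DESCR, PLIST, and distinfo). Here's a minimal sketch; the category, package name, and maintainer address are all hypothetical:

```shell
# Create a skeletal pkgsrc package Makefile (all names are examples only)
mkdir -p pkgsrc/local/hello
cat > pkgsrc/local/hello/Makefile <<'EOF'
# $NetBSD$

DISTNAME=      hello-2.1.1
CATEGORIES=    local
MASTER_SITES=  ${MASTER_SITE_GNU:=hello/}

MAINTAINER=    admin@example.org
COMMENT=       GNU greeting program

.include "../../mk/bsd.pkg.mk"
EOF
```

After filling in DESCR and PLIST and generating distinfo (bmake makesum), a bmake package in that directory spits out an installable binary package.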
(I would be remiss not to mention Nexenta, a project which merges GNU software into OpenSolaris's base. It's very, very interesting, especially in that they use Ubuntu's repos, but regardless of the purported stability of their alpha releases, I can't say I am very interested in running it on my servers. Still, it's definitely something someone who wants to give Solaris 10 a poke without too much effort should take a look at. The same way that Ubuntu is there for people who want to try out Linux. I imagine, frankly, that eventually they will occupy the same ecological niche.)
As you might have guessed, I'm quite happy with my week and change of testing. All the basic management is similar to my BSD experience, and the vast wealth of information I can trivially get out of the system compared to other UNIXes makes it hard to argue against. pkgsrc means that not only I but also our developers have access to everything we need or are used to. The default firewall is ipf, which I'm not thrilled about (pf is my bag), but it's certainly usable, and no doubt an improvement over whatever came before.
My next step is to take a Reduced Networking installation and build it up into a Postgres server running Slony-1 for replication services. I expect it to go pretty smoothly. The step after that will be a JumpStart server to live next to my (now beloved) FAI server.
There are a few things I need to pursue before we roll to live testing, including root RAID (the X2100's nvidia "RAID" is not supported by Solaris 10, weirdly enough). ZFS root is apparently possible, though tedious and weird. It would give me a trivial way to do mirroring, though. An install server would probably make it easier to do (though that's just a guess). Barring that, I'm guessing that a ZFS pool for system volumes (/usr, /var) and a ZFS pool for data/applications would be good enough, with / mirrored in the typical way (which certainly appears to be simple), until ZFS on root becomes common and supported.
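For reference, the "typical way" I have in mind is a Solaris Volume Manager root mirror, which sketches out something like this (device names and slices are examples, not our actual layout):

```shell
# SVM root mirror sketch -- c0t0d0 is the live disk, c0t1d0 the new mirror
metadb -a -f -c 2 c0t0d0s7 c0t1d0s7   # state database replicas on both disks
metainit -f d11 1 1 c0t0d0s0          # submirror over the existing root slice
metainit d12 1 1 c0t1d0s0             # submirror on the second disk
metainit d10 -m d11                   # one-way mirror containing root
metaroot d10                          # updates /etc/vfstab and /etc/system
# reboot, then attach the second half:
metattach d10 d12
```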
I expect I'll drop some more updates as I move forward. Hopefully with good news for our PgSQL performance. ;-)
<bda> "Every block is checksummed to prevent silent data corruption, and the data is self-healing in replicated (mirrored or RAID) configurations. If one copy is damaged, ZFS will detect it and use another copy to repair it."
* bda sighs.
<bda> It's so dreamy.
<kitten-> You really need a girlfriend.
<bda> I doubt she'd come with fault management and a scriptable kernel debugger.
<kitten-> I suppose you're right.
Andrew hooked me up with a Wii tonight and I played around with it for a couple hours. Fun stuff. The Virtual Console has Toe Jam & Earl. I'd forgotten how totally unsane that game is (chickens with tomato cannons!).
Only complaint is that it's freezing downstairs, which is not very conducive to jumping around like an idiot playing boxing. cough.
I set up my Mii. Wii code is 4389 8290 7785 1813.
Must. Get. Zelda.
Continuing on my "pkgsrc is pretty awesome" schtick, here's how easy it is to get it running on OS X. The most annoying part is downloading XCode (if you haven't got it already).
Once you have the DevTools/XCode installed, you'll need to create a case-sensitive volume for pkgsrc. Until somewhat recently you couldn't resize volumes, and this would have been far more annoying. However, these days it's pretty trivial.
Another option would be to create a disk image and run pkgsrc out of that. The documentation suggests this course. It has the added bonus of being portable (pkgsrc on your iPod?); I'm just doing this on my laptop and don't want to have to deal with managing a dmg, so I'm going the resize route.
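If the dmg route appeals to you, it should only take a couple of commands (size and volume name here are arbitrary):

```shell
# Create and mount a 4 GB case-sensitive HFS+ disk image
hdiutil create -size 4g -fs "Case-sensitive Journaled HFS+" \
    -volname pkgsrc pkgsrc.dmg
hdiutil attach pkgsrc.dmg    # shows up at /Volumes/pkgsrc
```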
[root@selene]:[~]# diskutil list
   #:                   TYPE  NAME     SIZE       IDENTIFIER
   0:  GUID_partition_scheme           *74.5 GB   disk0
   1:                    EFI           200.0 MB   disk0s1
   2:              Apple_HFS  selene   74.2 GB    disk0s2
[root@selene]:[~]# diskutil resizeVolume disk0s2 70G "Case-sensitive Journaled HFS+" pkgsrc 4.2G
Once it's done resizing the volume, you'll need to reboot.
The reboot is required because we're monkeying around with the boot volume. If you're doing this on an external disk, you can just refresh the disk and carry on without rebooting.
[root@selene]:[~]# diskutil list
   #:                   TYPE  NAME     SIZE       IDENTIFIER
   0:  GUID_partition_scheme           *74.5 GB   disk0
   1:                    EFI           200.0 MB   disk0s1
   2:              Apple_HFS  selene   70.0 GB    disk0s2
   3:              Apple_HFS           4.1 GB     disk0s3
Well, leetsauce, our volume exists. But it's not mounted, because resizeVolume doesn't actually format it.
[root@selene]:[~]# diskutil eraseVolume "Case-sensitive Journaled HFS+" pkgsrc disk0s3
Started erase on disk disk0s3
Finished erase on disk disk0s3 pkgsrc
And now it shows up happily:
/dev/disk0s2 on / (local, journaled)
devfs on /dev (local)
fdesc on /dev (union)
automount -nsl  on /Network (automounted)
automount -fstab  on /automount/Servers (automounted)
automount -static  on /automount/static (automounted)
/dev/disk0s3 on /Volumes/pkgsrc (local, journaled)
And it is indeed case-sensitive:
[root@selene]:[/Volumes/pkgsrc]# touch foo Foo
[root@selene]:[/Volumes/pkgsrc]# ls -l ?oo
-rw-r--r-- 1 root wheel 0 Feb 8 02:09 Foo
-rw-r--r-- 1 root wheel 0 Feb 8 02:09 foo
If you care to, you can change the mountpoint using Netinfo Manager, but I'm lazy and just symlinked /Volumes/pkgsrc -> /usr/pkg. You could also bootstrap with --prefix=/Volumes/pkgsrc, but as I'm using pkgsrc on other OSes, I like having it all in the same place. Personal preference.
(When Netinfo Manager stops SPODing at startup I'll probably change the mountpoint.)
[root@selene]:[~]# cd /usr/pkg
[root@selene]:[/usr/pkg]# curl -O ftp://ftp.NetBSD.org/pub/pkgsrc/pkgsrc-2006Q3/pkgsrc-2006Q3.tar.gz
[root@selene]:[/usr/pkg]# tar -xzf pkgsrc-2006Q3.tar.gz
[root@selene]:[/usr/pkg]# cd pkgsrc/bootstrap
And off it goes.
===> bootstrap started: Thu Feb 8 02:37:50 EST 2007
===> bootstrap ended: Thu Feb 8 02:40:15 EST 2007
[root@selene]:[/usr/pkg/pkgsrc/bootstrap]# mkdir ../../etc/
[root@selene]:[/usr/pkg/pkgsrc/bootstrap]# cp /usr/pkg/pkgsrc/bootstrap/work/mk.conf.example ../../etc/mk.conf
Then add /usr/pkg/bin:/usr/pkg/sbin to your PATH.
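That is, somewhere in your shell profile:

```shell
# Put the pkgsrc binaries ahead of the stock tools
PATH=/usr/pkg/bin:/usr/pkg/sbin:$PATH
export PATH
```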
The main reason I did this was to upgrade vim to 7.x, because I want tab support (I never really got the hang of managing buffers, much to the dismay of all my Elite Vim hax0r Friends).
[root@selene]:[/usr/pkg]# cd pkgsrc/editors/vim
[root@selene]:[/usr/pkg/pkgsrc/editors/vim]# bmake package
And after downloading a few billion patches and compilation...
[root@selene]:[/usr/pkg/pkgsrc/editors/vim]# which vim
[root@selene]:[/usr/pkg/pkgsrc/editors/vim]# vim --version
VIM - Vi IMproved 7.0 (2006 May 7, compiled Feb 8 2007 02:59:09)
The LSI SCSI card for our new discards database server (Sun X2100 M2) came in today. I ran up to UniCity to get some cables from Harry (thanks, Harry!) and plugged the Dell PowerVault 210S we got into the thing. All the disks showed up happily in
cfgadm -al, but they weren't configured yet. Running cfgadm to configure them didn't seem to do much for me, so I decided to be lazy, touch /reconfigure, and reboot. Once that was done, all seemed to be quite happy.
The goal here is to create a ZFS pool on the JBOD for Postgres to live on. As you can see, it is super easy:
[root@mimas]:[~]# zpool create data2 raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0
[root@mimas]:[~]# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0p3  ONLINE       0     0     0
            c2d0p3  ONLINE       0     0     0

errors: No known data errors

  pool: data2
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data2        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c3t0d0   ONLINE       0     0     0
            c3t1d0   ONLINE       0     0     0
            c3t2d0   ONLINE       0     0     0
            c3t3d0   ONLINE       0     0     0
            c3t4d0   ONLINE       0     0     0
            c3t5d0   ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c3t12d0  ONLINE       0     0     0
            c3t13d0  ONLINE       0     0     0

errors: No known data errors
[root@mimas]:[~]# zfs create data2/postgresql
[root@mimas]:[~]# zfs set mountpoint=/var/postgresql2 data2/postgresql
[root@mimas]:[~]# mv /var/postgresql/* /var/postgresql2/
[root@mimas]:[~]# zfs umount data/postgresql
[root@mimas]:[~]# zfs destroy data/postgresql
[root@mimas]:[~]# zfs umount data2/postgresql
[root@mimas]:[~]# zfs set mountpoint=/var/postgresql data2/postgresql
[root@mimas]:[~]# zfs mount data2/postgresql
[root@mimas]:[~]# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
data               885M   124G  24.5K  /data
data/pkg           885M   124G   885M  /usr/pkg
data2             17.1G   346G  44.8K  /data2
data2/postgresql  17.1G   346G  17.1G  /var/postgresql
RAID-Z is explained in the zpool man page.
While I'm a little worried that we may really want hardware RAID, I didn't want to spend the extra cash. If we do end up needing the extra performance, trust me, the system can be repurposed easily. ;-)
In which our intrepid sysadmin may as well have been eating wall candy as doing work for all it got him.
I spent all of Friday and most of last night trying to figure out why pgbench was giving me what seemed to be totally insane results.
- >1000 tps on SATA 3.0Gb/s (Linux 2.6, XFS, metadisk + LVM)
- ~200 tps on SATA 3.0Gb/s (Solaris 10 11/06, UFS)
- ~500 tps on SATA 3.0Gb/s (Solaris 10 11/06, ZFS mirror)
- <100 tps on 10x U320 SCSI (Solaris 10 11/06, ZFS RAID-Z)
Hint: setting recordsize=8192 on the ZFS filesystem is helpful (or whatever blocksize your PostgreSQL was compiled with).
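On my setup (filesystem name from above; adjust for yours), that's just:

```shell
# Match the ZFS recordsize to PostgreSQL's 8K page size, ideally before
# loading any data -- the setting only affects newly written files
zfs set recordsize=8k data2/postgresql
zfs get recordsize data2/postgresql
```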
It basically came down to Postgres's fsync() calls absolutely murdering ZFS's write performance. I couldn't understand why XFS would perform ten times as well as ZFS on blocking writes. Totally brainlocked.
And of course the bonnie++ benchmarks were exactly what I expected: the SCSI array was kicking ass all over the place... except with -b enabled, while the lone SATA drive just sort of sucked air next to it on any OS/filesystem.
The zpool iostat <pool> <interval> command is seriously useful. Give it -v to get per-disk stats.
Anyway, I was banging my head against the wall literally the entire day. The ticket for "Test PgSQL on Sol10" is more pages long than I'd like to think about. Finally I bitched about it in #slony, not out of any particular thought that I'd get an answer, just frustration. mastermind pointed out that, hey, cheap SATA drives have, you know... write caches. And like? SCSI drives? They have it like, turned off by default, because reliability is far more important than performance for most people. (Valley-girlese added to relate how stupid I felt once he mentioned it.) And anyway, that's what battery-backed controllers are for: so you get the perf and the reliability.
Once he said "write cache", it clicked, total epiphany, and I felt like a complete jackass for spending more than ten hours fighting with what amounts to bad data. In the process I read a fair amount on good database architecture, though, and that will be very helpful in the coming weeks for making our discards database(s) not suck.
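If you want to check for yourself whether a drive is lying to you, the knobs are easy enough to find (device names here are examples):

```shell
# Linux: query, then disable, an ATA drive's write cache
hdparm -W /dev/sda
hdparm -W0 /dev/sda
# Solaris: format -e, select the disk, then: cache -> write_cache -> disable
```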
Live and learn, I suppose.
The whole experience has really driven home the point that getting information out of a Solaris box is so much less tedious than out of a Linux box, though. Even while I was under the impression that the SATA drives weren't lying to me about their relative performance, I kept thinking "These Sol10 tools are so frakking nice, I don't want to get stuck with Linux tools for this stuff!" Especially the ZFS interfaces.
Note the "Behind Schedule" section:
- Basic infrastructure projects failed.
- Medical projects failed.
- Security projects failed.
- Oil production is almost nominal.
And yet this "embassy" we're dropping $600,000,000 on will have "the biggest swimming pool in Iraq, a state-of-the-art gymnasium, a cinema, restaurants offering delicacies from favourite US food chains, tennis courts and a swish American Club for evening functions."
Ah, well. "Government's work is God's work."
Presumably the in-house harem service is being billed as a few gold-plated toilet seats.
Smith will begin lensing next month on "Reaper," an ABC Television Studio-produced hour about a 21-year-old slacker who ends up becoming Satan's bounty hunter, retrieving souls lost from Hell.
A useful entry for people who are confused by how OpenSolaris turns into Solaris 10, and what the differences actually are.
dragorn pointed this out the other day. After having to endure the fist-shaking of my co-workers at svn/svk, perhaps they will be satiated. Or not.
OpenSolaris also uses hg, it seems.
hdp noted that since our databases are now entirely InnoDBized (all the MyISAMisms have been ripped out), we no longer need to lock the entire database to get a consistent dump.
Others have written about MySQL backups, so I will just say that it makes me happy I no longer have to entirely rely on the validity of our replicas to get good backups. (Or lock the master for half an hour to get a dump.)
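With InnoDB everywhere, the dump command comes down to something like this (database name is hypothetical):

```shell
# --single-transaction reads from a consistent snapshot instead of taking
# table locks; this is only safe when all tables are transactional
mysqldump --single-transaction --quick somedb > somedb.sql
```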
(Of course, all may not be light and fairies in the land of MySqueel, but if you've used it extensively for any period of time, I guess you know that already.)
And yes, it terrifies me to consider that the replicas are out of sync with the master, but errors definitely creep in. It saddens me.
In happier news, there is a new Slony-I site in the works, it seems.
After work tonight I had to go fix a box at ATX which had blown a drive. By "fix" I mean I just pulled the drive and stopped all the services that lived on it. It's the old mirrorshades box, which sees only little use these days. I'll have to go sort it out tomorrow. The only partition on the blown drive that matters is /var/spool (and of course all the system was really doing was MX...), and it shows up half the time, so hopefully I'll be able to just throw it in the freezer and copy the data off or something.
It may motivate me to throw Solaris 10 on the box and move the rest of the MXes I have to support over to it (philtered, ghetto, PWF). Yay for zones, maybe.
After getting the box online, I walked down to Borders to pick up some books to tide me over while waiting for the rest of the Peter Watts books to come in (Blindsight is a GREAT read. You should all go read it. It is free. But then you should buy it because reading stuff online sucks and it is absolutely worth giving him money for.)
Apparently a "new" PKD book came out last month: Voices from the Street.
If only I could harness the EM field that surrounds me and kills hardware dead, I could totally zap brains that are annoying me.
Version 1.3.34-4 of Apache in the Debian Linux distribution contains a hole that allows a local user to access a root shell if the webserver has been restarted manually. This bug does not exist in the upstream apache distribution, and was patched in specifically by the Debian distribution. The bug report is located at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=357561 . At the time of writing (over a month since the root hole was clarified), there has been no official acknowledgement. It is believed that most of the developers are tied up in more urgent work, getting the TI-86 distribution of Debian building in time for release.
Unlike every other daemon, apache does not detach from its controlling tty on startup, and allows it to be inherited by a cgi script (for example, a local user's CGI executed using suexec). When apache is manually restarted, the inherited ctty is the stdin of the (presumably root) shell that invoked the new instance of apache. Any process is permitted to invoke the TIOCSTI ioctl on the fd corresponding to its ctty, which allows it to inject characters that appear to come from the terminal master. Thus, a user-created CGI script can inject arbitrary input into the shell that spawned apache and have it executed.
As a Debian user, this concerns me greatly, as any non-privileged user would be able to install non-free documentation (GFDL) on any system I run.