Recently, I was at a customer site to upgrade a
Sun Fire E2900 SPARC-based server: installing two Fibre Channel HBAs, connecting two
Oracle StorEdge ST2540 disk arrays, and installing an up-to-date patch set.
After breaking the system mirror, we started patching. During patch 28 of 188 (the recommended patch cluster), the system hit a kernel panic, obviously caused by some UFS problems. First idea: boot into single-user mode and see what happens:
{0} ok boot -s
resetting...
[...]
Rebooting with command: boot
TL = 1, TT = 10. ERROR: Illegal Instruction
TSTATE= 0x1400 [ccr = 0x0, asi = 0x0, pstate = 0x14, cwp = 0x0]
TPC= 0000000000006004
TNPC= fffffffffd159118
TICK= 8000004be5bbfe34, TICKCMP = 8000000000000000
debugger entered
{0} ok
*ouch* Well then... there is a mirrored system disk, so let's boot that one:
{0} ok reset-all
resetting...
[...]
{0} ok boot disk1 -s
WARNING: rmalloc_picky: Invalid numeric argument: size = 0x0
WARNING: rmalloc_picky: Invalid numeric argument: size = 0x0
WARNING: rmalloc_picky: Invalid numeric argument: size = 0x0
WARNING: rmalloc_picky: Invalid numeric argument: size = 0x0
WARNING: rmalloc_picky: Invalid numeric argument: size = 0x0
WARNING: rmalloc_picky: Invalid numeric argument: size = 0x0
Boot load failed
{0} ok
Looks like somebody forgot to write a valid boot block to the mirror disk with installboot(1M). Bad luck. Next, we tried to boot from a Solaris 10 DVD:
{0} ok reset-all
resetting...
[...]
{0} ok boot cdrom -s
Boot device: /pci@18,700000/pci@4/ide@2/cdrom@0,0:b File and args: -s
SunOS Release 5.10 Version Generic 64-bit
Copyright 1983-2005 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
ide: no major number
Cannot load drivers for /ssm@0,0/pci@18,700000/pci@4/ide@2/cdrom@0,0:b
Can't load the root filesystem
{0} ok
Well then... even more bad luck. We then tried several things to get the machine back online: swapping disks, removing HBAs and such. Nothing helped. Around 9pm we decided to stop and get some sleep.
The next day, we brought another, more current Solaris 10 DVD with us. You never know... We also fetched the install server notebook from our office. Booting the machine from the net was easy. Then we ran fsck(1M) on the system disk. Oh well... the filesystem looked really damaged: lots of duplicate inodes and lots of purged directory entries, including pretty important system files. Another attempt to boot the repaired filesystem ended in the same RED_state alert as above.
Well... a second boot from the net, an fsck of the boot mirror, a boot block written onto it, and hey! The machine boots off the system mirror. Neat. Of course, fsck had purged some important entries on this disk as well. We checked for missing files by cross-checking the files on disk against the package database with pkgchk(1M), and copied as many of them as possible from another Solaris 10 system at the same patch level. Some device nodes were recreated by another reconfigure boot, and finally we could start installing patches on the system, which even restored some more missing files.
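For the record, the repair steps above can be sketched roughly like this. It is a sketch under assumptions, not a transcript of what we typed: the slice name and platform argument are hypothetical placeholders, the function names are made up for illustration, and the pkgchk filter assumes the usual two-line report format ("ERROR: <path>" followed by an indented "pathname does not exist"). The first helper only prints the commands, so nothing touches a disk until you have reviewed them.

```shell
#!/bin/sh
# Sketch only: helpers for the mirror repair described above.
# Device and platform names are hypothetical placeholders.

# Print (do not run) the commands to check the mirror slice and
# write a UFS boot block onto it. Review them before running by hand.
repair_mirror_cmds() {
    slice=$1       # e.g. c0t1d0s0, a hypothetical root slice on the mirror disk
    platform=$2    # platform name, normally the output of `uname -i`
    echo "fsck -F ufs /dev/rdsk/$slice"
    echo "installboot /usr/platform/$platform/lib/fs/ufs/bootblk /dev/rdsk/$slice"
}

# Read pkgchk-style output on stdin and print the paths reported as missing.
# Assumes each problem is reported as "ERROR: <path>" followed by an
# indented reason line such as "pathname does not exist".
missing_paths() {
    awk '
        /^ERROR: /                { path = substr($0, 8); next }
        /pathname does not exist/ { if (path != "") print path }
    '
}
```

On the repaired system, something like `repair_mirror_cmds c0t1d0s0 "$(uname -i)"` prints the two commands, and `pkgchk -n 2>&1 | missing_paths` should give a plain list of files to copy over from the healthy machine. Other pkgchk complaints (size, permissions, checksum) are deliberately ignored here, since we only cared about files that fsck had thrown away.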
A couple of hours later, the system was up and running, the new LUNs were configured on the storage arrays, and a ZFS pool was created with zpool(1M). Another happy customer. The broken system disk will be replaced by a spare disk in a couple of days. I won't trust the old disk anymore, although it might not be broken at all.
Lessons learned:
UFS is a pretty robust filesystem, and we still prefer Solaris Volume Manager on system disks. It makes this kind of troubleshooting much simpler.