So, this article is about the Zettabyte File System (ZFS), a relatively new file system and one that is supposedly very resilient to errors. It has some great features to prevent data loss and is touted as the be-all and end-all of file systems. Well, my experience is varied, both good and bad, but the reality for me is that if you want to take advantage of some of its features, make sure you have backups... of everything, because ultimately your data is not safe.
So, a little history: many years ago I decided I needed a storage device to keep all my important stuff (and my less important stuff), and I was drawn to ZFS with its oft-touted 'self healing' features, so I built a new server with 16 x 3TB drives and configured it as RAIDZ2 (the sorta-equivalent of RAID6) across 15 disks with one hot spare. I started moving my data from the myriad of external drives onto the storage array. Then I suffered my first issue... a drive died, silently at first. When I eventually spotted it I immediately issued a replace using the hot spare – which turned out to be not so 'hot', as you had to manually switch it in – and the array recovered after a week-long resilver process.
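For anyone who hasn't driven ZFS before, the setup and the manual spare swap look roughly like this – a sketch only, with made-up da0 to da15 device names rather than my real layout:

    # create a 15-disk RAIDZ2 vdev plus one hot spare
    zpool create storage raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
        da8 da9 da10 da11 da12 da13 da14 spare da15
    # when a drive dies, the "hot" spare has to be swapped in by hand
    zpool replace storage da3 da15
    # then sit back and watch the resilver (a week, in my case)
    zpool status -v storage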
Several years went by without incident; big 6kVA UPSes kept power outages from causing problems, and regular scrubs seemed to work. A minor issue happened on another drive failure, where the metadata was reported corrupt and to fix it I had to blow away a directory and its contents – not important, as they could easily be recovered/rebuilt. That should have been my warning that all is not as safe as it should be, but I had bought into the hype and carried on blindly. Then came the undersea cable fire in Malta (well, the fire wasn't undersea, but it was at the end of the undersea cable, directly under the 4 substations which supplied my area). Out went the power for some 12+ hours; the generator ran out of fuel, the UPSes all ran out of battery and the host died, hot and hard. It was not pretty, and on return of power things were not happy. The reboot was fine for the RAID set, but ZFS was not happy: the pool wouldn't mount, in fact it wouldn't do anything, especially import. After calling one of the FreeBSD devs it was suggested I try "zpool import -FfX storage", and after a week of watching the drive activity I got the entire pool back without any visible errors.
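For reference – and it's worth understanding this before you ever need it – that incantation is made of three parts: -f forces the import, -F is the recovery mode that rewinds the pool to an earlier transaction group (throwing away the last few seconds of writes), and -X is the 'extreme rewind' option that lets it search much further back for a transaction group it can actually use:

    # force-import, rewinding to the newest usable transaction group,
    # searching aggressively (-X) if the most recent ones are damaged
    zpool import -FfX storage
    # then see what, if anything, was lost
    zpool status -v storage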
I was sold. That was it: this was the file system that was unbreakable. I threw caution to the wind, and instead of keeping an identical second server I started using the more useful features of ZFS. I created ZVOLs and built VMs, I created 'Time Machine' drives for my Macs, and started to reap the short-term benefits of ZFS.
Then I moved across the world, from Malta back home to Australia, and shipped the servers through an international removalist. The servers, despite being prepared and packaged properly, arrived damaged. Three (yes, three) drives were damaged, and since RAID6 (and RAIDZ2) can only cope with two failed drives I decided to try something... I byte-copied all three corrupt drives onto new drives, put the new drives back into the server/array and ran the import as before. Several weeks of 'rebuild' later the array came back online – again without data loss. RESULT!
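The byte copies themselves were nothing exotic; on FreeBSD something along these lines does the job (device names are illustrative – the important bit is conv=noerror,sync, which keeps dd going past unreadable sectors and pads them out with zeros):

    # clone a damaged disk onto a fresh one, skipping unreadable sectors
    dd if=/dev/da3 of=/dev/da16 bs=1m conv=noerror,sync
    # repeat for the other two damaged drives, then import as before
    zpool import -FfX storage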
That's not the end of it though; after all this you'd think I would be raving about its resilience. Well, each time these issues happened the critical data structures on disk had not been affected, so recovery was possible. Fast forward to March 2019: a drive died and I replaced it, then whilst it was resilvering on March 9, 2019 (just before midnight) a transformer blew up down the road, taking out all power. The UPSes did not kick in, so the generator did no good at all – the host lost power the moment the mains went down. Power was restored a short time later and the zpool was reporting it could not be imported without rolling back 5 seconds of data... which of course was not an issue, as I had been minimising writes from the moment the rebuild started.
I went to bed leaving it to rebuild…
At 6am on March 10, 2019 some idiot, drunk or on drugs, took out the power pole down the road from me, and the 11kV lines contacting the 240V lines ensured that the UPSes wouldn't save a thing. Power went down, and on its return the problems of ZFS became very, very apparent.
Here's the issue (in sorta non-technical terms)... The file system is a big database, one with lots of redundancy and checks, but it has a fundamental flaw built in. You can see this flaw in a multitude of posts where the devs state the party line with resounding coherence: "The data on disk is always correct." Well, it is and it isn't... the data is all there, and it reports correct, but if one of the critical structures is corrupt (e.g. a spacemap, in my case), the metadata (the stuff that makes sure your data is right and stays right) is deemed corrupt, and so ZFS in its wisdom pronounces the whole pool corrupt.
So let's reiterate that…
A small part of the data in the pool (drive) – just the right (or wrong, depending on your point of view) part – got partially written because of a power outage, and now the entire pool (drive) cannot be mounted. All 36,964,899 files, some 21.2 terabytes of data. In fact, according to "zpool status" there are just 3 errors in total, and examination with "zdb" suggests they are all checksum errors in "metaslab 122", caused by the spacemap corruption. So, many weeks later, I'm still trying to recover the data – I've just got myself another 36TB of drive space after trying in-place recovery, but still no luck.
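If you want to do the same sort of spelunking on your own pool, zdb is the tool – something along these lines will dump the metaslabs and their spacemaps (flags vary a little between versions, and -e is needed when the pool isn't imported):

    # summary of errors as the pool itself reports them
    zpool status -v storage
    # dump metaslab and spacemap detail for a pool that isn't imported
    zdb -e -mm storage
    # -AAA tells zdb not to abort on assertion failures while it digs
    zdb -e -AAA -mm storage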
I don't have backups, as I was moving stuff around and, as previously stated, had already thrown caution to the wind. The next step for me is to modify the code to ignore the checksum errors and see if I can 'walk the dataset' for all the files.
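For completeness: before patching the code there are a couple of less drastic knobs that are supposed to help in situations like this – a read-only rewind import, and the zfs_recover tunable (exposed on FreeBSD, as far as I can tell, as vfs.zfs.recover), which is meant to demote some otherwise-fatal metadata assertions to warnings. Whether they can cope with a corrupt spacemap is another question entirely, but for the record the sketch looks like this:

    # /boot/loader.conf – relax some fatal assertions at next boot
    vfs.zfs.recover="1"

    # then attempt a read-only rewind import so nothing is written back
    zpool import -o readonly=on -Ff storage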
I'll let you know how I get on, but with all the ZFS devs posting that there is no need for a "fsck" in ZFS because the disk is always right, I can only suggest that anyone thinking of deploying ZFS do so only if:
1/ You can make full backups, or
2/ You can afford to lose all the data
(and it should be noted that the FreeBSD devs are advocating that the root file system should be on ZFS, and it is now actually the default when installing... good luck, laptop owners on the road…!)
Edit/Update: Added links to the news articles for the large power issues, and changed the date from 19th March -> 10th March as my initial post incorrectly detailed the date.
Update [2]: I posted this blog link to the FreeBSD mailing lists here: https://lists.freebsd.org/pipermail/freebsd-stable/2019-April/090988.html and unfortunately many chose to follow the same 'ZFS is always right' line that I see elsewhere (e.g. the ZFS on Linux mailing lists)... which is part of the problem. To their credit though, a couple of the FreeBSD devs contacted me on-list with helpful suggestions (you can see these in the links) and others contacted me off-list with really helpful information. One even (off-list) pointed me at Klennet Storage Software ZFS-Recovery, which I have not had a chance to test yet (given that I need to set up a Windows 7 image on an external USB drive) – but if it delivers what it promises, it is the missing link that ZFS needs (in my opinion).