InfluxDB broke, again

It is not a long time ago that I blogged about how I experienced my first fault with InfluxDB. And now it has happened again. This time a bit different, though. And much worse.

It all started with the familiar NATS alert, so I tried rebuilding the indexes again with buildtsi and thought that would be it. I was wrong.

This time there was an invalid CRC on one data block, which caused segfaults(lol) within the InfluxDB executable. Speaking of error checking... the verify command in influx_inspect has a bug. It erroneusly reports a block as healthy, because the right counter is never incremented. But anyway... The block is faulty. What now? Where is the option to fix it, or remove the block? Removing the offending file manually doesn't help.

So in the end this made me abandon the data, and start from scratch :'( And for some reason docker-compose kept doing some weird caching with the data (even when it was a filesystem mount), even after downing and removing the old container, and emptying the old data directory, and it was an effort to get everything working again...

I should probably consider upgrading the hardware. If that is the real problem. For the lulz I also asked for a quote of InfluxDB enterprise. If I cluster it, then one node can fail and it recovers, right? Right?? But I don't expect the quote to be realistic. And even if it was, it's probably still too much. One alternative might be Apache Druid. But it, too, seems a bit too young a product.

No comments: