Technical difficulties
At the start of May, I found the server had suddenly hung, most probably due to a kernel panic. I ignored it at first, but less than two weeks later it happened again. I had already planned to upgrade the server with more RAM, so while I was doing that I ran memory tests on both the old sticks and the new ones. It was only 4 passes over 14 hours, but every pass came back clean. I then kept the server running and didn’t think more about it, until it crashed again. From then on, it crashed anywhere from once every five days to twice a day. There was no pattern to it, so I had no idea why or where the problem was. As the same thing had happened last year, I ran some more tests to find a root cause, which I first assumed was a corrupt kernel. In the end, I came to suspect the SSD.
I ordered a cheap used enterprise SSD (less than 10 days of running time) while I was still testing, and when it arrived (along with a used CPU), I swapped the current boot drive for the “new” one and reinstalled Proxmox. I had backed up the configs and kept the VM/CT data on the other drives, so it took less than an hour to get everything back to how it was before.
Now, almost two weeks later, it is still running. I’ll reboot it soon, as I need to update to the latest version, but for now I’ll keep it going for two more days to see whether this really was the last fix.