Update
For over two months now, ever since the A2000 was installed, the server has been crashing randomly. Whether the A2000 was the cause is unknown, but this is the story of how a broken system was stabilized.
It was a known fact that the A2000 may be a good SSD for its price, but also that it did not play well with Linux systems. The recommended fix was to update its firmware, which could only be done through software provided exclusively for Windows. Hence, the drive was passed through via PCIe passthrough to a Windows 10 VM in order to do this (there was no other motherboard available that supported M.2). When the VM was booted, the whole server was thrown into complete disarray, spewing errors to the console, and no input could make it stop. The server was brought to an abrupt halt and then rebooted. After a lengthy boot, an investigation was made to locate any damage that might have been caused. The only finding was that one of the HDDs in the RAID1 array had been thrown out, and the whole array had to be resynced, which takes a couple of hours. That HDD was shredded and removed from its slot, as it had already exceeded its rated lifetime, even though it appeared to be fine. Everything then worked as it should.
A communication channel was opened to Kingston in order to find out whether a firmware update existed for the A2000; there was not. The situation was then explained, but they referred any further issues to the retailer.
A couple of days later, the server stopped responding. A kernel panic had occurred, resulting in yet another reboot and array resync. The A2000 was then initialized with LVM-thin and some services were migrated to it. Everything seemed to work out fine. After everything had been migrated, the RAID1 was replaced with a ZFS mirror. The crashes kept happening, though, sometimes within a couple of hours and other times after a week, either at random or some time after running smartctl.
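For the record, the smartctl trigger amounted to nothing more than periodically polling the drive's SMART data. A minimal sketch of such a poll is shown below; the device node /dev/nvme0 and the five-minute interval are assumptions for illustration, not details from the actual setup, and smartmontools must be installed.

```python
#!/usr/bin/env python3
"""Sketch: periodically poll SMART data, the kind of query that
seemed to precede some of the crashes. Device node and interval
are assumptions, not taken from the real setup."""

import subprocess
import time

DEVICE = "/dev/nvme0"   # assumed device node for the A2000
INTERVAL = 300          # assumed poll interval: every 5 minutes


def poll_smart(device: str) -> str:
    # 'smartctl -a' dumps all SMART and health information for the
    # device; running it requires root privileges.
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    while True:
        print(poll_smart(DEVICE))
        time.sleep(INTERVAL)
```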
The old server was brought back to life and the services were migrated to it. Then began a lengthy period of monitoring the new server in order to locate the issue. First up was the RAM, which hours of testing showed was not the cause. Tests on the HDDs and the SSD were also done, but it was close to impossible to tell whether they were the issue; and even if they were, it would likely have shown up sooner.
The new server continued to crash even with no services running. Then the old server also started to crash. Unnecessary services were brought down in order to pinpoint the culprit, but in vain. Lastly, the PSUs were swapped between the servers. This time the new server went down as expected, but the old server still crashed as well.
In the meantime, the old server continued to crash as usual, at one point resulting in it being down for over a week while unattended. It also refused to boot unless a RAM stick was removed, so most probably that RAM was faulty. Once it was back up, it turned out that the crashes had something to do with the backups, so this may well be related to the same issue as on the new server. A regular backup was run on a VM, and the server crashed again. Hence, automatic VM backups were disabled, and it kept running until it was manually shut down.
Finally, before all this, the PSUs had been switched back and an RMA was filed for the PSU, the motherboard and the A2000. They replied after a week, and replacement parts were ordered and shipped within a couple of days. After everything was rebuilt, the new server was picky, refusing to boot and crashing as before, but now with a different error. The BIOS was updated and the ZFS pool was replaced with a new one, and it has not crashed since.