Task Batching
Four years ago I posted an update noting that a benchmark of PixelMap's parallel processing had been found inadequate, along with a plan to do more thorough research into optimizing parallel task processing. I have since finished the paper on a subject partly related to this problem, and found a couple of places where I could optimize the implementation as well.
Theory
- Increase the amount of work a task can do. 1
- Increase the number of tasks being worked on before asking for more work.
- Reduce interaction with the task queue.
All of the above points reduce to the same thing: the more continuous work a thread does, the less it has to communicate with the task queue, leaving the queue free for other threads. As the title suggests, all three points can be addressed with task batching: putting tasks in a list and processing them as one single task. In our case this needs to happen in two places: pushing to the task queue and popping from it. Pushing is basically a transaction 2, so I implemented it as one. To make it even faster, I added swapping mechanics 3 to reduce the overhead of committing tasks to the queue, since otherwise each task has to be pushed individually. Popping is trickier; here we only pop up to the queue size divided by the thread count (rounded up), capped at a maximum batch size 4, to avoid having one thread work for too long or steal work other threads could be doing. When little work remains in the queue this reintroduces the old overhead, but it is negligible, since it only happens over a small fraction of the whole run; it even helps at the end, re-balancing the work when the provided tasks are uneven.
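A minimal sketch of how such a queue could look, assuming a mutex-protected deque; the type and method names here are mine, not PixelMap's actual API, and the pop size is the formula from footnote 4:

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// Illustrative batched task queue; a sketch, not PixelMap's implementation.
struct TaskQueue<T> {
    tasks: Mutex<VecDeque<T>>,
    available: Condvar,
    thread_count: usize,
    max_batch: usize,
}

impl<T> TaskQueue<T> {
    /// Commit a whole batch under a single lock acquisition
    /// ("transaction"-style) instead of locking once per task.
    fn push_batch(&self, batch: &mut Vec<T>) {
        let mut tasks = self.tasks.lock().unwrap();
        tasks.extend(batch.drain(..)); // leaves the caller's buffer empty for reuse
        self.available.notify_all();
    }

    /// Pop up to ceil(queue_size / thread_count) tasks, capped at max_batch,
    /// so one thread never hoards work the others could be doing.
    /// (Shutdown handling is omitted for brevity.)
    fn pop_batch(&self) -> Vec<T> {
        let mut tasks = self.tasks.lock().unwrap();
        while tasks.is_empty() {
            tasks = self.available.wait(tasks).unwrap();
        }
        let share = (tasks.len() + self.thread_count - 1) / self.thread_count;
        let n = share.min(self.max_batch);
        tasks.drain(..n).collect()
    }
}
```

The swapping mentioned above could go one step further when the queue is empty: swap the incoming buffer with the queue's own storage, committing the whole batch without moving a single element.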
Bugs and solutions
Of course, while implementing this I found several bugs that had crept in over the past couple of years as I added more features, one of them being particularly hard to pin down. The first was the file cache I had added to speed up data processing. While it did make processing faster, it severely increased RAM usage, as each file was stored in memory in its entirety. To fix this, I made each thread clear its cache whenever it was done with a region. This was not a memory leak, just a side effect of how I used a new feature.
The second problem: while I had previously added a way to limit the number of open files, it was never thoroughly tested, so the program simply hung or got killed as we silently went over the soft limit. This was a bit trickier, as we do not want to read every file whole into memory, which would result in too much data being held at once. Instead we read only the headers, as usual, and then close the file. When we later want the whole file, we re-open it, put the data into the cache, and close it once again. This means the number of open files is at most the number of threads plus one, which is significantly lower than the usual open-file limit 5. And if CPUs keep gaining cores, most systems with that many cores would most probably raise the limit anyway.
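A sketch of the pattern, with hypothetical names and an assumed 8 KiB region header; each File is closed as soon as it goes out of scope, so a thread holds at most one handle at a time:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read};

/// Illustrative per-thread cache; names are assumptions, not PixelMap's types.
struct RegionFiles {
    cache: HashMap<String, Vec<u8>>,
}

impl RegionFiles {
    /// Read just the header, then close the file immediately (the File is
    /// dropped at the end of the function), keeping open handles scarce.
    fn read_header(path: &str) -> io::Result<[u8; 8192]> {
        let mut file = File::open(path)?;
        let mut header = [0u8; 8192]; // 8 KiB region header is an assumption
        file.read_exact(&mut header)?;
        Ok(header)
    }

    /// Re-open the file only when the full contents are needed, cache them,
    /// and close the handle again right away.
    fn load(&mut self, path: &str) -> io::Result<&[u8]> {
        if !self.cache.contains_key(path) {
            let mut data = Vec::new();
            File::open(path)?.read_to_end(&mut data)?;
            self.cache.insert(path.to_string(), data);
        }
        Ok(&self.cache[path])
    }

    /// Drop everything once the thread is done with a region, so RAM stays
    /// bounded instead of holding every file whole in memory.
    fn clear(&mut self) {
        self.cache.clear();
    }
}
```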
Furthermore, while this was not a big concern, some data was still kept in RAM, so giving the data a priority makes some files take precedence over others, resulting in less data sitting in the queue and more regions being rendered. This works even better for the map mode, as it finishes the first batch of regions before starting on the next ones, instead of first rendering all regions and then writing everything to disk at the same time. I previously thought this was the main issue, but the two problems above had to be solved first.
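One way to get that ordering, sketched with a standard min-heap keyed by batch index (illustrative only, not necessarily how PixelMap's queue is structured):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn main() {
    // Tasks keyed by (batch, region coordinates); Reverse turns the max-heap
    // into a min-heap, so batch 0 is fully drained before batch 1 is touched.
    let mut queue = BinaryHeap::new();
    queue.push(Reverse((1, (0, 0)))); // second batch
    queue.push(Reverse((0, (0, 0)))); // first batch
    queue.push(Reverse((0, (1, 0)))); // first batch

    while let Some(Reverse((batch, (x, z)))) = queue.pop() {
        println!("render batch {batch}: region ({x}, {z})");
    }
    // Prints both regions of batch 0 before the region of batch 1.
}
```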
Finally, the last bug, which took a couple of weeks and several hours of work before I realized why the program marked thousands of files as invalid: the world was made for 1.12.1, and while its chunks used the field DataVersion, some of them still used the older V field, probably because the world had been worked on since pre-1.9 6 and someone later walked around in it on 1.12.1, so only some chunks got updated. The underlying issue was that the version refactoring, which added better separation of block IDs 7 and namespace IDs 8, required manually selecting which one to use depending on the version, and I had forgotten to do that for pre-1.9 worlds. Fixing the issue was therefore just a matter of adding one operation.
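The fix boils down to a fallback lookup along these lines, where Compound and get_int are hypothetical stand-ins for the real NBT types:

```rust
use std::collections::HashMap;

/// Minimal stand-in for an NBT compound tag; PixelMap's real type differs.
struct Compound {
    ints: HashMap<String, i32>,
}

impl Compound {
    fn get_int(&self, key: &str) -> Option<i32> {
        self.ints.get(key).copied()
    }
}

/// Use DataVersion when present, otherwise fall back to the legacy "V"
/// field, so chunks last touched before 1.9 are no longer marked invalid.
fn chunk_version(level: &Compound) -> i32 {
    level
        .get_int("DataVersion")         // modern chunks (1.9+)
        .or_else(|| level.get_int("V")) // legacy pre-1.9 chunks
        .unwrap_or(0)                   // neither field: assume the oldest format
}
```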
Benchmark
- EPYC 7702P, 64 cores/128 threads, 128 GB RAM
- Ryzen 7 3800XT, VM limited to 8 cores, 8 GB RAM
I first tried rendering a single image, but that left the program sitting for over 20 minutes just writing that image, and the result was most probably faulty as well, since the file was much larger than previous renders. On my dev VM it even crashed, probably because it ran out of memory (8 GB). I therefore chose to use only the map mode while processing the 15 GB world I used last time. First I ran it several times to see whether my changes actually fixed the issue and fully utilized all available threads. Sadly, they did not. But that was because not all of the previous changes had been applied. After everything was added, all bugs fixed, and briefly tested, I ran it again and was pleased to see the glorious 12800% CPU usage, meaning all 128 threads were busy. I then went on to benchmark both the EPYC and my dev VM.
Threads | EPYC | Ryzen |
---|---|---|
8 | 273 s | 213 s |
16 | 222 s | - |
32 | 198 s | - |
64 | 143 s | - |
128 | 117 s | - |
First we need to address the elephant in the room. Even though we increased the thread count by a factor of 16, overall throughput only increased to about 2.3x, far from the ideal 16x, but still significant enough to call it a success. This is the usual story of diminishing returns when scaling in any direction. It is a known problem that you cannot fully utilize all threads in a system, mostly because of CPU scheduling, but also because the system is shared with other processes and users (in this case it already had a load of up to 12). In this specific case it could also be related to NUMA. In addition, the EPYC runs at considerably lower turbo frequencies when many cores are active. The Ryzen itself is significantly faster per core, mostly due to its much higher base frequency but also its strong turbo. Their overall performance profiles are very different, especially between single-core and multi-core processing 9.
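To put a rough number on the scaling, treating the 8-thread run as the baseline and fitting Amdahl's law (a simplification, since the baseline is itself parallel):

$$
S = \frac{T_8}{T_{128}} = \frac{273\,\mathrm{s}}{117\,\mathrm{s}} \approx 2.33,
\qquad
\frac{1}{(1 - p) + p/16} \approx 2.33 \;\Rightarrow\; p \approx 0.61
$$

In other words, only about 60% of the work done at 8 threads behaves as if it scales with thread count; the rest is eaten by I/O, scheduling, and the shared load mentioned above.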
I wondered whether RAM speed and caching would affect performance, but apparently they had very little effect. The benchmark was run multiple times in the VM, and only once did caching make a difference, a couple of seconds on the Ryzen. I would guess that using all threads on the Ryzen would land around 50% better timings. Plotting the EPYC values also gives a line close to the measurements, showing that more threads keep helping; the small bump at 32 threads, though, could indicate a NUMA issue.
It is also worth noting that the amount of RAM allocated did not grow much. In map mode we minimize RAM usage by freeing data as soon as it has been used. With 8 threads this ended up at a bit over 1 GB, and with 128 threads at just over 8 GB: the thread count grew 16x while memory only grew 8x, so per-thread memory use actually drops as more cores are put to work.
Conclusion
I am very happy with the result. I know that adding more threads will always add more overhead, but reaching this milestone lets me focus on other optimizations instead, ones that make more sense for single-core processing. But why was it so hard to notice these approaching bugs until I added task batching? The reason is the number of threads that previously did any work at all: most of them sat waiting for work, and I had never tried processing the huge world until recently. That is also why adding task batching to the thread pool itself was quite easy, while making it work with the rest of the system forced me onto a bug hunt for issues that had crept in and that I had sadly missed.
1. Already fulfilled by reducing working with chunks to working with whole regions. ↩︎
4. `min((queue_size + thread_count - 1) / thread_count, max_batch)` ↩︎
5. The default open-file limit differs between operating systems and can sometimes be changed. Windows has a default of 512, while Linux typically sits around 4096. ↩︎
7. https://minecraft.fandom.com/wiki/Java_Edition_data_values/Pre-flattening ↩︎
8. https://minecraft.fandom.com/wiki/Java_Edition_data_values ↩︎
9. https://www.cpu-monkey.com/en/compare_cpu-amd_ryzen_7_3800xt-vs-amd_epyc_7702 ↩︎