Internal Optimizations
I previously went through some of the libraries I was using heavily, and while those kinds of optimizations are basically low-hanging fruit, in most cases the gains are rather minimal. The harder-to-get improvements come from profiling your own program [1]. Luckily, I have been doing that since basically the start of this project, mostly because the Beta version lacked this kind of profiling. I wrap specific sections of the program with timers, and then accumulate several runs to get a total time. For Java Edition (JE), the sections being profiled are: lonely chunk checking (LCC), decompression, pre-parsing (checking the version), parsing, chunk rendering, region rendering, and image rendering. Do note that some of these change dramatically when switching render mode.
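To give an idea of what this looks like, below is a minimal sketch of such a section timer. The class name, section names, and reporting format are invented for this example; the actual code in PixelMap differs.

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

// Accumulates wall-clock time per named section across several runs.
class SectionTimer {
public:
    explicit SectionTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
    ~SectionTimer() {
        auto end = std::chrono::steady_clock::now();
        totals()[name_] += std::chrono::duration<double>(end - start_).count();
    }
    static void report() {
        for (const auto& [name, seconds] : totals())
            std::printf("%-16s %8.3f s\n", name.c_str(), seconds);
    }
private:
    static std::map<std::string, double>& totals() {
        static std::map<std::string, double> t;
        return t;
    }
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

void parse_region(/* ... */) {
    SectionTimer timer("parsing");  // time is accumulated on scope exit
    // ... actual work ...
}
```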
LCC does not do much work, and profiling confirms it: it takes up less than 1% of the execution time, and is therefore not worth diving into. We looked at decompression in a previous post, so let's ignore that. For rendering, most of the time goes toward PNG encoding, and rendering pipelines are rather difficult to optimize overall. For (pre-)parsing I need to explain what it is and what it does. Minecraft is well known for having its own internal format to store data, namely NBT [2], so parsing this format is crucial for reading Minecraft worlds. To understand why the pre-parse is necessary, we need to dive into how I have implemented the NBT parsing.
NBT structure
Most NBT parsers iterate through the format and generate a tree-like structure that you can traverse and optimize for quick access. The problem with such structures is that they are very big [3]. Therefore, ever since Beta, I have opted for a visitor design pattern [4]. This not only made it possible to reduce the amount of RAM used, but also allowed effectively skipping over data that is redundant for the next steps in the program. This kind of skipping is what allowed me to cut parsing time by up to a factor of 3. But because of this style of parsing, and because Minecraft puts the Version tag last in the structure, a pre-parse is necessary to determine which visitor to use. For JE, PixelMap currently has 4 uniquely identified visitors, each covering a set range of versions. Guessing which visitor to use would be both slower and make the code a hot mess, so the pre-parse makes things a bit easier. Compared with normal parsing, it takes up only 2-5% of the time.
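To make the visitor idea concrete, here is a rough sketch. The interface and names are invented for illustration and are not PixelMap's actual API; the point is that returning false from the enter callback lets the parser skip a subtree without materializing it, and that a tiny probe visitor is all the pre-parse needs (I assume the chunk's DataVersion tag here).

```cpp
#include <cstdint>
#include <string_view>

// Tag types as defined by the NBT format, in wire order.
enum class TagType : uint8_t {
    End, Byte, Short, Int, Long, Float, Double,
    ByteArray, String, List, Compound, IntArray, LongArray
};

// The parser drives a visitor instead of building a tree. Returning
// false from enter() makes the parser skip the tag's payload, so
// redundant data is never decoded or allocated.
class NbtVisitor {
public:
    virtual ~NbtVisitor() = default;
    virtual bool enter(TagType type, std::string_view name) = 0;
    virtual void value_int(std::string_view name, int32_t v) = 0;
    // ... one callback per payload type ...
};

// Pre-parse: a probe that only cares about the version tag and skips
// everything else, so the right full visitor can be chosen afterwards.
class VersionProbe : public NbtVisitor {
public:
    bool enter(TagType type, std::string_view name) override {
        return type == TagType::Compound || name == "DataVersion";
    }
    void value_int(std::string_view name, int32_t v) override {
        if (name == "DataVersion") version = v;
    }
    int32_t version = -1;
};
```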
However, despite the parsing being rather fast, it is still the slowest portion of the whole program (about 80% of the total), and therefore the part that needed optimizing. As the NBT implementation is rather convoluted and intricate, it is quite hard to profile small parts of it to find bottlenecks, forcing me to use external profiling tools. Since I already use valgrind [5] to locate memory leaks, and it comes with the profiling tool callgrind [6], that was the natural choice. The output from the tool is one or more big text files that I threw at kcachegrind [7]. From it I get a table of function calls, the total (inclusive) time spent in each, and the self time excluding all function calls made internally. It can be a bit cryptic at first, especially if the binary was compiled with optimization flags, as functions get removed or transformed depending on how they were optimized.
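For reference, a typical callgrind session looks roughly like this; the binary name is a placeholder.

```sh
# Build with symbols so function names survive optimization.
g++ -O2 -g -o pixelmap ...

# Run under callgrind; this writes a callgrind.out.<pid> file.
valgrind --tool=callgrind ./pixelmap

# Browse inclusive and self costs per function.
kcachegrind callgrind.out.<pid>
```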
Fixes
With these tables, the first thing I found was that memory allocations/deallocations were consuming 25% of the overall processing time. NBT in itself has both names (strings) and lists to keep track of, and I replaced them with string_view [8] and VectorView [9] respectively. These are used to refer directly into the raw buffer in which the NBT data resides. As the data itself is immutable, and the containers referring to it are as well, it is safe to point into it directly while parsing. These two changes reduced the overall computation time by 14% and 11% respectively, which is not surprising, as there are usually more strings than lists in NBT.
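As footnote [9] mentions, VectorView is my own crude stand-in for span, which C++17 lacks. A minimal version could look like the sketch below; the exact member set in PixelMap's implementation may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// A non-owning view over a contiguous run of elements, similar in
// spirit to C++20's std::span. Nothing is copied or allocated; the
// view simply points into the buffer the NBT data already lives in.
template <typename T>
class VectorView {
public:
    VectorView(const T* data, std::size_t size) : data_(data), size_(size) {}
    const T* begin() const { return data_; }
    const T* end() const { return data_ + size_; }
    const T& operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return size_; }
private:
    const T* data_;
    std::size_t size_;
};

// Same idea for names: wrap the length-prefixed bytes in place
// instead of copying them into a std::string.
std::string_view read_name(const uint8_t* cursor, uint16_t length) {
    return {reinterpret_cast<const char*>(cursor), length};
}
```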
The next find was the copying of data between the visitor and the chunk structure. In some cases we have to store data temporarily in the visitor in order to process it into a format that is much easier to handle in the renderer. But when handing this data over to the chunk, we did not hand over the structure, only the data inside it, which meant copying the data between two memory structures. C++11 introduced move semantics [10][11], which is basically an optimization that transfers ownership of one structure to another. Making use of this feature could potentially be a significant improvement. Unfortunately, it was not: it actually made parsing about 5% slower. But that is not the whole picture. Looking at the total time, it made the program 10% faster. What happened is that the extra work added to parsing was repaid by less time spent in rendering. My guess is that this is due to fewer cache misses, as the allocated memory may end up closer together [12][13]. Basically, we added some computation to one part of the code, and won back double that time in another part, making the overall performance better.
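A sketch of the difference, with hypothetical types standing in for the visitor and the chunk:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Chunk {
    std::vector<uint16_t> blocks;  // hypothetical payload
};

class ChunkVisitor {
public:
    // Before: copies every element into the chunk's own vector.
    void give_copy(Chunk& chunk) { chunk.blocks = blocks_; }

    // After: transfers ownership of the buffer to the chunk; no
    // element-wise copy, and no fresh allocation on the chunk side.
    void give_move(Chunk& chunk) { chunk.blocks = std::move(blocks_); }

private:
    std::vector<uint16_t> blocks_;
};
```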
The last find was reducing the number of allocations even further. In the case of NBT, I had an object that was reused, but since NBT has different tag types, it still had to change type for each iteration. However, C++17 added std::variant [14], a union type that can keep several types in one structure at the minimum possible size. It also enforces which types can be used, entirely replacing a loose void pointer with static typing. Performance-wise, this by itself does not improve a thing, but for safety it makes it hard to use types that are not available, and it prevents reading a value as anything other than the currently stored type. On top of this, due to the endianness handling in NBT, the value is stored dynamically in memory, and since the type structure now has a constant size, we only need to allocate it if it has not been allocated already. That way, we reduce the number of allocations significantly once more. The measured improvement is about 23%, which is within what was expected.
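Here is a rough sketch of the reusable-slot idea; the names are invented and the real tag set is larger than shown.

```cpp
#include <cstdint>
#include <memory>
#include <string_view>
#include <variant>

// One fixed-size slot that can hold any of the payload types. The
// variant's size is that of its largest alternative plus a
// discriminator, so the slot can be reused across iterations.
using TagValue = std::variant<int8_t, int16_t, int32_t, int64_t,
                              float, double, std::string_view>;

struct TagSlot {
    std::unique_ptr<TagValue> value;

    // Allocate lazily, and only once; later iterations overwrite the
    // value in place instead of reallocating.
    void set(TagValue v) {
        if (!value)
            value = std::make_unique<TagValue>(std::move(v));
        else
            *value = std::move(v);
    }
};

// Reading back is statically typed: std::get<int32_t>(*slot.value)
// throws std::bad_variant_access if a different type is stored.
```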
In addition, I tried tinkering with thread affinity [15] a bit, and while it did help somewhat at first, after the optimizations above it does not help at all anymore. I suspect that if I tested this on a CPU with NUMA [16], it might still help.
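For the curious, pinning a thread on Linux looks roughly like the sketch below; this is a platform-specific illustration, not necessarily what PixelMap does.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a thread to one CPU core so the scheduler does not migrate it.
// Linux/glibc-specific; other platforms have their own affinity APIs.
bool pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(t.native_handle(),
                                  sizeof(set), &set) == 0;
}
```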
Conclusion
The total improvement was about 27%, with the parsing step itself cut in half. Sometimes small changes to how you manage memory allocations can improve certain functions dramatically. There are more improvements to be made, but these were the next level of low-hanging fruit. From here I might try out scratchpad memory [17], or even replace the memory handling entirely with jemalloc [18]. However, that is certainly not necessary right now. The next move may be to look further at Bedrock Edition (BE), as that part has some flaws in its processing that make it consume more RAM than necessary; I figured this out while implementing task batching. Other than that, I feel like PixelMap is getting closer to being a program I want to release publicly, although I have a few more features to implement before it feels complete.
- [1] https://en.wikipedia.org/wiki/Profiling_(computer_programming)
- [2] NBT should by default be compressed, but that requirement has loosened up over the last couple of years.
- [8] https://en.cppreference.com/w/cpp/string/basic_string_view
- [9] As I currently use only C++17, I do not have access to span, and therefore created my own crude implementation of it. This of course creates slight friction with the STL, but no problem an adapter cannot fix.