Thread local
I managed to reduce the total time to 14 times native by making the Lua states thread_local. I will probably stop pursuing this further, since the next step would be to micro-optimize the Lua integration itself. The fact that it does not produce correct output yet could also be part of the reason why it is slower.
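A minimal sketch of the thread_local idea, not my actual code: `ScriptState`, `local_state` and `run_workers` are hypothetical names, and `ScriptState` is only a stand-in so the example builds without Lua. In the real integration it would own a `lua_State*` (created with `luaL_newstate()` and released with `lua_close()`). The point is that each render thread gets its own state on first use, so no lock is needed around script calls.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Counts how many per-thread states have been constructed so far.
std::atomic<int> states_created{0};

// Hypothetical stand-in for a per-thread Lua state. A real version would
// hold a lua_State* and close it in its destructor.
struct ScriptState {
    ScriptState() { ++states_created; }
    int calls = 0;  // number of times this thread used its state
};

// Returns the calling thread's own state. It is constructed the first time
// a given thread calls this function and destroyed when that thread exits.
ScriptState& local_state() {
    thread_local ScriptState state;
    return state;
}

// Starts `threads` workers that each use their state `calls_per_thread`
// times, then returns how many states were created during this run.
int run_workers(int threads, int calls_per_thread) {
    const int before = states_created.load();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) {
                ++local_state().calls;  // no lock: the state is private to this thread
            }
            // Each worker only ever sees its own counter.
            assert(local_state().calls == calls_per_thread);
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    return states_created.load() - before;
}
```

Calling `run_workers(4, 1000)` creates one state per worker thread, regardless of how many calls each thread makes. The cost of setting up a state is paid once per thread instead of once per call, which is the kind of saving I would expect from this change.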
It is worth noting that the Lua rendering is not completely implemented, as it currently does not work with blending. I just need to integrate ray tracing first, but that should be rather quick.
I have now implemented blending with ray tracing. After doing that, the rendering time increased by a factor of 3, and the total time at least doubled. After locating and fixing some bugs, I finally got some output, but there are still some issues. They are probably due to void pointer conversions, although it looks like it could be related to XYZ alignment, so the next step is to find and fix that bug.
The final results, after sorting out a mix-up where object pointers were being used as references and therefore “went missing”: Lua is, with blending, about 870 times slower than native C++. I believe that I can make this several times faster if I use some sort of caching or const references, but other than that, this is really bad. However, at least I get something drawn and now just need to optimize it. I suspect that reducing how much Lua is used at all may improve this significantly, or maybe less C API intervention. In any case, I will most probably need to use the optimized library that I found two weeks ago, which could replace my own implementation entirely.