@undefined I know, I said that I have a book about that and have already implemented it. My question is about optimization. GLSL and HLSL shader programs run in parallel on the GPU, which means that while one pixel is being rendered another one is being rendered too, so passing more data in one draw call is faster than issuing several draw calls (batch rendering). But the CPU is a fucking unitasking machine, so each frame you find yourself running for loops zillions of times: clearing the screen, drawing stuff with texture mapping, and scaling pixel by pixel. Imagine running 10 for loops with a million iterations each, calculating things; memset and memcpy are not options, because I have to calculate the new color of each pixel when alpha blending. That kills the frame rate. So my question is: how did 90s games use software rendering on limited hardware, when nowadays, with multi-core super CPUs, I can’t hold a reasonable frame rate with my engine? I hope someone here understands what I mean.
Software renderers in the 90s didn’t have or need SIMD. What they did have is:
- Very low resolutions.
- Very simple lighting model or no lighting model.
- 8-bit indexed color.
- In many cases, specialized rasterizers for floors and walls instead of or in addition to arbitrary polygon rasterizers.
- Very skilled programmers who specialized in optimizing software renderers.
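To make the indexed color point concrete, here is a minimal sketch of an 8-bit paletted framebuffer. The `IndexedFrame` struct and both function names are hypothetical, made up for illustration and not taken from any particular game. The idea is that every per-pixel pass touches one byte instead of four, and the palette is only applied once, when the frame is presented.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 8-bit indexed framebuffer: one byte per pixel,
// actual colors live in a 256-entry palette.
struct IndexedFrame {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;        // palette indices, row-major
    std::array<std::uint32_t, 256> palette;  // 0xAARRGGBB entries
};

// Clearing writes width*height bytes instead of width*height*4.
inline void clearFrame(IndexedFrame& f, std::uint8_t index) {
    assert(f.pixels.size() == static_cast<std::size_t>(f.width) * f.height);
    std::fill(f.pixels.begin(), f.pixels.end(), index);
}

// Expand to 32-bit color once per frame, right before presenting.
inline std::vector<std::uint32_t> expandToRgba(const IndexedFrame& f) {
    std::vector<std::uint32_t> out(f.pixels.size());
    for (std::size_t i = 0; i < f.pixels.size(); ++i)
        out[i] = f.palette[f.pixels[i]];
    return out;
}
```

On 90s hardware the expansion step did not even exist in software, since the display hardware applied the palette itself; on a modern machine it is one extra linear pass.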
First do a sanity check. Is what you are trying to achieve even viable? What is your resolution? How complex is your geometry? What kinds of rendering techniques are you using? If a 90s game could do it in software, then you can also do it in software, but it’s easy to accidentally go past the limits of what’s possible for software rendering when writing a “modern retro” game.
Next, look at your innermost loops and optimize the fuck out of them. Are there function calls? Try inlining them manually. Are there operations that can be moved out of the loop? Move them out of the loop. Are there multiplications that can be replaced with additions? Replace them with additions.
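As a rough illustration of those three questions, here is a sketch of a textured horizontal span written twice. Everything in it is hypothetical: the 64x64 texture size, the function names, and the parameter layout are my assumptions, not code from the book or from any real engine. The first version recomputes the row offsets and the texture coordinate with multiplications for every pixel; the second hoists the invariant work out of the loop and steps the coordinate with an addition in 16.16 fixed point.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64x64 texture of 8-bit texels, row-major.
constexpr int kTexSize = 64;

// Naive span: per-pixel multiplications for the frame offset,
// the texture row and the texture coordinate.
inline void drawSpanNaive(std::uint8_t* frame, int pitch, int y, int x0, int count,
                          const std::uint8_t* tex, int v, float u0, float du) {
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        int u = static_cast<int>(u0 + du * i) & (kTexSize - 1);
        frame[y * pitch + x0 + i] = tex[v * kTexSize + u];
    }
}

// Same span with loop-invariant work hoisted and the coordinate
// advanced by addition in 16.16 fixed point.
inline void drawSpanFast(std::uint8_t* frame, int pitch, int y, int x0, int count,
                         const std::uint8_t* tex, int v, float u0, float du) {
    assert(count >= 0);
    std::uint8_t* dst = frame + y * pitch + x0;   // computed once per span
    const std::uint8_t* row = tex + v * kTexSize; // computed once per span
    std::int32_t u = static_cast<std::int32_t>(u0 * 65536.0f);
    const std::int32_t step = static_cast<std::int32_t>(du * 65536.0f);
    for (int i = 0; i < count; ++i) {
        dst[i] = row[(u >> 16) & (kTexSize - 1)];
        u += step;                                // addition replaces du * i
    }
}
```

The two versions agree exactly when `u0` and `du` are representable in 16.16 fixed point; for other steps the fixed-point version can drift by a texel over a long span, which is the usual trade-off with this trick.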
Then, look at the disassembly of your innermost loops and use what you see to optimize again. Especially look for any function calls. It’s easy to miss these in the first optimization pass if the function call comes from an overloaded operator. Also look for nested loops, and anything that looks unnecessary. Modern compilers are very smart in some respects, but very stupid in others.
Optimizing software renderers is hard. Entire books have been written on that topic.
@a light breeze thanks for these tips. The book uses some assembly language to optimize vectors and matrices, writes a memset_Quad that is a little bit faster than memset, and avoids alpha blending; all of that is fine. But I’m asking for SIMD and multithreading resources so I can use newer tech and benefit from it. Besides, making a software renderer is a great learning experience, with some hard problems; once you get past them, the graphics APIs won’t seem complex at all. So I’m asking about books where I can learn SIMD and multithreading with examples close to the problems I face in software rendering, that’s all.
Please also note that for most of the 90s, consumer graphics cards could only do rasterization and some simple texture mapping in hardware (as with the Voodoo Graphics). If I’m not wrong, we had to wait for the NVIDIA GeForce 256, the successor to the Riva line, to get the first real GPUs, which were capable of doing transform and lighting in hardware. That came late in the 90s (1999, again if I’m not wrong about the date). So before this, OpenGL and Direct3D implementations ran transform and lighting in software on the CPU, with only the rasterization and texturing part done in hardware (depending on the graphics card).
Note also that in the 90s, memory frequency was of about the same order as the CPU clock.
You can of course achieve better results these days with a full software rasterizer. You can apply a matrix transform to millions of polygons in parallel, for example. You’d also need direct access to the GPU framebuffer, and a decent rasterizer. But as others said previously, you’ll have to learn and, more importantly, discover tricks to reduce the cost of each of your functions, and even more.
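Here is a minimal sketch of the "transform in parallel" idea using `std::thread`. The `Vec4` and `Mat4` types, the row-major layout, and the function names are assumptions for illustration only. Each worker gets its own contiguous slice of the vertex array, so no two threads ever write to the same element and no locking is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };  // row-major, hypothetical layout

inline Vec4 transform(const Mat4& a, const Vec4& v) {
    return {
        a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.z + a.m[0][3] * v.w,
        a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.z + a.m[1][3] * v.w,
        a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.z + a.m[2][3] * v.w,
        a.m[3][0] * v.x + a.m[3][1] * v.y + a.m[3][2] * v.z + a.m[3][3] * v.w,
    };
}

// Splits the vertex array into contiguous chunks, one per worker thread.
inline void transformAll(const Mat4& a, std::vector<Vec4>& verts, unsigned workers) {
    assert(workers > 0);
    const std::size_t n = verts.size();
    const std::size_t chunk = (n + workers - 1) / workers;  // round up
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // nothing left for the remaining workers
        pool.emplace_back([&a, &verts, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                verts[i] = transform(a, verts[i]);
        });
    }
    for (std::thread& th : pool) th.join();
}
```

Spawning threads every frame has a real cost, so an engine would more likely keep a pool of workers alive and hand them slices each frame; this sketch only shows the partitioning. The same split works for the framebuffer, with each worker owning a band of rows.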
Have a look at how Intel does software occlusion culling fully on the CPU side (this includes an optimized rasterizer). It involves a lot of AVX. The code can be found here: https://github.com/GameTechDev/MaskedOcclusionCulling. There’s even a more optimized version, but I couldn’t manage to find it.
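For a much smaller taste of SIMD than that library, here is a sketch of a constant-alpha blend, the kind of per-pixel loop mentioned earlier in the thread. It is not taken from the Intel code. It uses SSE2 intrinsics rather than AVX because SSE2 is available on every x86-64 CPU, and it falls back to the scalar loop elsewhere. The function names, the `BLEND_HAS_SSE2` macro, and the choice of an alpha range of 0 to 256 (so the divide becomes a shift by 8) are my assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define BLEND_HAS_SSE2 1
#endif

// Scalar reference: blends every 8-bit channel with a constant alpha in [0, 256].
inline void blendScalar(std::uint8_t* dst, const std::uint8_t* src,
                        std::size_t bytes, unsigned alpha) {
    assert(alpha <= 256);
    for (std::size_t i = 0; i < bytes; ++i)
        dst[i] = static_cast<std::uint8_t>((src[i] * alpha + dst[i] * (256 - alpha)) >> 8);
}

// Same math, 16 channels (four 32-bit pixels) per iteration when SSE2 is available.
inline void blendSimd(std::uint8_t* dst, const std::uint8_t* src,
                      std::size_t bytes, unsigned alpha) {
    assert(alpha <= 256);
    std::size_t i = 0;
#ifdef BLEND_HAS_SSE2
    const __m128i zero = _mm_setzero_si128();
    const __m128i a = _mm_set1_epi16(static_cast<short>(alpha));
    const __m128i ia = _mm_set1_epi16(static_cast<short>(256 - alpha));
    for (; i + 16 <= bytes; i += 16) {
        const __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        const __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + i));
        // Widen bytes to 16-bit lanes; src*a + dst*(256-a) never exceeds 255*256.
        const __m128i sLo = _mm_unpacklo_epi8(s, zero);
        const __m128i sHi = _mm_unpackhi_epi8(s, zero);
        const __m128i dLo = _mm_unpacklo_epi8(d, zero);
        const __m128i dHi = _mm_unpackhi_epi8(d, zero);
        const __m128i lo = _mm_srli_epi16(
            _mm_add_epi16(_mm_mullo_epi16(sLo, a), _mm_mullo_epi16(dLo, ia)), 8);
        const __m128i hi = _mm_srli_epi16(
            _mm_add_epi16(_mm_mullo_epi16(sHi, a), _mm_mullo_epi16(dHi, ia)), 8);
        // Narrow back to bytes and store.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_packus_epi16(lo, hi));
    }
#endif
    blendScalar(dst + i, src + i, bytes - i, alpha);  // tail, or everything without SSE2
}
```

This blends every byte with the same alpha, including the alpha channel itself. Per-pixel alpha needs an extra step that broadcasts each pixel's alpha across its four lanes, which is where the books on SIMD become useful.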
I guess a good book to start might be this one: https://www.amazon.com/Assembly-Language-Step-Step-Third/dp/0470497025/ref=zg_bs_3954_4?_encoding=UTF8&psc=1&refRID=D6WEE15D7JMRA6RC7T79
After this, you might want to read this one: https://www.amazon.com/Modern-X86-Assembly-Language-Programming/dp/1484200659/ref=sr_1_3?dchild=1&keywords=simd+avx&qid=1631617384&s=books&sr=1-3
but the CPU is a fucking unitasking machine
No, just no! There are a lot of tricks besides SIMD to convince the CPU to do multiple things for you. If you worry about performance that much, go ahead and take 1 or 2 of your CPU cores and use OpenCL to make batch processing lightning fast.
But the CPU is a fucking unitasking machine, so each frame you find yourself running for loops zillions of times: clearing the screen, drawing stuff with texture mapping, and scaling pixel by pixel. Imagine running 10 for loops with a million iterations each, calculating things; memset and memcpy are not options, because I have to calculate the new color of each pixel when alpha blending.
In the 90s you didn’t do generic texture mapping, scaling and alpha blending in the way we do nowadays. It may look that way, but games are very good at faking stuff that isn’t really there 🙂 Also in the 90s, programmers went down to assembly language level to code the really tight loops where speed mattered (which is only a tiny fraction of the code).
Computers have changed since. In the 90s, CPUs were the slow thing. Now it’s easy to make memory the slow thing by using bad memory access patterns, so your powerful processor just idles for eons waiting for data from memory.
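A small sketch of what a good and a bad access pattern look like on the same data. The function names are hypothetical. Both functions visit every pixel of a row-major image exactly once and return the same sum; the only difference is the order in which memory is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major image: pixel (x, y) lives at index y * w + x.

// Walks memory linearly, one cache line after another.
inline std::uint64_t sumRowMajor(const std::vector<std::uint32_t>& img,
                                 std::size_t w, std::size_t h) {
    assert(img.size() == w * h);
    std::uint64_t sum = 0;
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            sum += img[y * w + x];
    return sum;
}

// Same work, but each step jumps a whole row ahead (stride of w elements).
inline std::uint64_t sumColumnMajor(const std::vector<std::uint32_t>& img,
                                    std::size_t w, std::size_t h) {
    assert(img.size() == w * h);
    std::uint64_t sum = 0;
    for (std::size_t x = 0; x < w; ++x)
        for (std::size_t y = 0; y < h; ++y)
            sum += img[y * w + x];
    return sum;
}
```

Timing both on a framebuffer-sized image (for example with `std::chrono::steady_clock`) is an easy first experiment. How big the gap is depends on the image size, the cache sizes, and the compiler, so measure on your own machine instead of trusting a number from a forum post.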
Throwing more power at it with SIMD and multi-threading isn’t doing much good if your single thread is already a mess. Those power techniques start moving (much) more data to/from memory, and if memory access is the bottleneck, that only gives longer queues for getting data from memory (more eons of wasted CPU time). In addition, multi-threading adds concurrency, making it very complicated to make heads or tails of what is happening or why stuff isn’t fast.
First do the single-thread case. Learn about profiling, and understand how memory access patterns affect CPU speed. Also check that your memory access patterns are what you think they are. Make the single thread fast, then think about throwing more power at it.