[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Debug Overlay Cleanup and Render Group Performance Investigation" vod_platform=youtube id=Y2fxi_lFwE0 annotator=Miblo]
[0:30]["Today's episode might be a little bit of a potpourri"][quote 567]
[1:06][Determine to address the "Fix for profiler clipping issue on threads/frames/clocks view" issue[ref
    site="GitHub"
    page="HandmadeHero/cpp Issues"
    url="https://github.com/HandmadeHero/cpp/issues"]]
[2:07][Run the game and demo the incorrect clipping of the profiler]
[3:54][handmade_render_group.cpp: Fix GetClipRect() to use Rectangle.Min]
[5:00][Run the game to see that the profiler clipping is fixed]
[6:00][handmade_debug.cpp: Make DEBUGStart() offset the text Z]
[8:22][Run the game to see that the text drop-shadow looks correct]
[9:08][handmade_debug.cpp: Greatly increase the Z offset for the tooltip, so it draws above the profile bars]
[10:20][Run the game to see the tooltip box above the profile bars, but not its text contents, and discuss possible reasons why]
[17:53][handmade_debug_ui.cpp: Consider simplifying the text system]
[24:08][handmade_debug_ui.cpp: Enable DrawTooltips() to draw its text above everything]
[25:02][Run the game to see our tooltip text back]
[28:16][handmade_debug_interface.h: Temporarily conditionally append the function name to the UniqueFileCounterString__]
[29:19][Run the game to see those function names]
[30:42][handmade_debug.cpp: Enable DrawProfileBars() and DrawFrameBars() to put the Element->Name in their tooltip]
[32:59][Run the game to see our complete tooltips]
[34:16][Investigate why UpdateAndRenderEntities() takes so much time]
[37:10][Run the game to see that the EntityRender routine is the main culprit]
[39:45][handmade_entity.cpp: Further isolate EntityRender]
[41:01][Run the game to see that RenderVolume and RenderPieces are both slow]
[41:24][Investigate this slowness]
[43:43][Step through the assembly for the RenderPieces block]
[47:16][handmade_render_group.cpp: Rewrite GetBitmapDim() without function overloading]
[50:40][Inspect the assembly for the RenderPieces block and note that GetBitmapDim() has now been inlined, as has PushQuad()]
[57:03][Investigate why RoundReal32ToInt32() isn't being inlined][quote 568]
[59:29][Determine to excise roundf()]
[1:00:55][Consult the Intel Intrinsics Guide for rounding instructions[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:03:53][handmade_intrinsics.h: Make RoundReal32ToInt32() and RoundReal32ToUInt32() use _mm_cvtss_si32()]
[1:05:25][Run the game to see that RenderPieces now takes 0.87% of the time, and consider the importance of profiling and looking at the assembly]
[1:09:27]["Death by a thousand function calls"][quote 569]
[1:10:05][Investigate why the RenderVolume block takes a lot of time]
[1:10:55][Inspect the assembly for PushLineSegment()][quote 570]
[1:13:47][handmade_intrinsics.h: Make SquareRoot() use _mm_sqrt_ss()[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:17:49][Run the game to see that that just works, but that RenderVolume is not faster]
[1:20:05][handmade_opengl.cpp: Add some debug timing blocks in OpenGLRenderCommands()]
[1:22:30][Run the game to see that OpenGL::TexturedQuads takes the most time, and investigate why]
[1:25:51][Consider how to speed up passing the textures to the GPU]
[1:28:08][handmade_opengl.cpp: Experiment with ways to speed up the OpenGL::QuadLoop block in OpenGLRenderCommands()]
[1:35:41][Run the game to see garbage, note the speed-up of the block, but back out those changes in favour of potentially using texture arrays]
[1:38:20][Run the game to see that the cutscene does not work with the new 3D system]
[1:39:33][Calculate how large a texture array we would need for the cutscene]
[1:43:06][handmade_cutscene.cpp: Reacquaint ourselves with the cutscene code]
[1:49:29][Run the game and try to interpret how the cutscene is being drawn]
[1:50:14][handmade_render_group.cpp: Make SetCameraTransform() take the NearClipPlane and FarClipPlane]
[1:53:19][handmade_cutscene.cpp: Make RenderLayeredScene() specify the clip planes]
[1:53:47][Run the game and note that our distance based fog is knocking out some of the cutscene]
[1:54:28][handmade_render_group.cpp: Make SetCameraTransform() take Fog]
[1:56:34][Run the game to see all the pieces of the cutscene]
[1:58:03][handmade_cutscene.cpp: Make RenderLayeredScene() set the FocalLength and run the game to see it]
[2:01:45]["Where Santa Claus went, I don't know"][quote 571]
[2:01:57][Run the game and investigate where Santa Claus went]
[2:08:45][Discover that the w values passed to PushQuad() are not all 0]
[2:09:57][handmade_render_group.cpp: Make PushBitmap() only set the ZBias for upright sprites]
[2:10:24][Run the game to see that Santa Claus is back]
[2:10:30][Q&A][:speech]
[2:11:05][@desuused][What will be our way forward to reduce glDrawArrays calls? Will we look into texture atlases, megatextures, or invent our own way? Is it better to separate cutscene rendering because it's so different?]
[2:12:03][@kilo_pasztetowej][Are we forced to use SIMD registers to use those intrinsics or are there some ways to do that without those, like inline assembly or something else?]
[2:16:48][@desuused][Have you seen musl standard library? It has very good code quality. sqrtf on x86_64 is one instruction on it. It's for Linux, but parts of it are very portable[ref
    site="musl"
    url="http://www.musl-libc.org/"]]
[2:21:17][@botondar][Do you know whether the sqrt function in math.h uses processor instructions or calculates the results via some software algorithm?[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[2:23:42][Close it down][:speech]
[/video]