[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Finishing the Main SIMD Raycasting Loop" vod_platform=youtube id=VvxxX9LxR9I annotator=Miblo]
[0:01][Recap and set the stage for the day finishing SIMD optimising the :lighting][:optimisation :rendering :speech]
[0:47][Toggle on the :threading and add a b32 Hit to raycast_result for RayCast() to encode that a ray did not hit][:lighting :optimisation :rendering]
[5:57][:Run the game to find that we're running at 32ms]
[6:29][Toggle off the :threading][:lighting :rendering]
[6:49][:Run the game to see that we're running at 128ms per frame][:performance]
[7:18][Toggle on the :threading][:lighting :optimisation :rendering]
[7:26][:Run the game and consider our 32ms per frame rate][:performance]
[9:43][Excise from RayCast() the non-SIMD tRay code, and start to consider how to retire ray hits][:lighting :optimisation :rendering]
[11:42][Preserving ray hits vs traversing the spatial hierarchy, when :threading][:blackboard :geometry :lighting :rendering]
[15:57][Enable RayCast() to record ray hits for each SIMD component before traversing the spatial hierarchy][:lighting :optimisation :rendering]
[20:02][:Run the game to see that we're running at the same 32ms][:lighting :optimisation :rendering]
[20:37][Revert RayCast() to traverse the spatial hierarchy, applying the ray hit mask for each component, and streamline how this works, introducing f32_4x versions of &= and |=][:lighting :optimisation :rendering]
[26:58][:Run the game to see that that's fine][:lighting :optimisation :rendering]
[27:06][Start to streamline the tRay setting code][:lighting :optimisation :rendering]
[28:56][Fix the CloseEnough check in RayCast()][:lighting :optimisation :rendering]
[30:05][:Run the game to see not much difference][:lighting :optimisation :rendering]
[30:20][Introduce Select() to streamline the tRay setting code[ref
[1:02:20][:Run the game to see that this would put us back up to 30ms per frame, and note why][:lighting :optimisation :owl :performance :rendering :statistics]
[1:21:12][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:21:37][Toggle off the snake][:"entity system" :lighting :optimisation :rendering]
[1:21:47][:Run the game with our consistently lit scene][:lighting :optimisation :performance :rendering]
[1:22:28][Inline AccumulateSample() in ComputeLightPropagation()][:lighting :optimisation :rendering]
[1:25:37][:Run the game to see no difference, and consider further improvements][:lighting :optimisation :performance :rendering]
[1:27:06][Make SampleHemisphere() operate entirely SIMD, introducing RandomBilateral_4x() and versions of Inner(), LengthSq() and NOZ() that take v3_4x][:lighting :optimisation :prng :rendering :statistics]
[1:41:35][A few words on _mm_rsqrt_ps and _mm_sqrt_ps[ref
[1:49:24][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:49:52][Inline SampleHemisphere() in ComputeLightPropagation()][:lighting :optimisation :rendering :statistics]
[1:51:51][:Run the game at \~22ms per frame, and consider that this CPU rendered :lighting is performant enough for us][:optimisation :performance :rendering]
[1:52:52][Q&A][:speech]
[1:53:22][@printf_armin][How much % of the CPU does it drain?][:performance]
[1:56:07][@tbodt_][Q: What version control system do you use?][:vcs]
[1:56:50][@nxsy][Q: Can you look at the thread profile view when you lower the samples from 16 to 4 and don’t improve frame rate?][:performance :run]
[1:57:48][Temporarily toggle off VSync][:rendering]
[1:58:46][:Run the game and consult the profiler to determine that the pixel shader is too slow][:performance :rendering]
[2:02:17][@mallesbixie][Q: You cast the input value in the U32_4x loader to a float. Why is that?[ref
[2:19:38][Read about PBLENDVB - 'Variable Blend Packed Bytes' in the Intel 64 and IA-32 Architectures Software Developer Manuals[ref
site="Intel"
page="Intel 64 and IA-32 Architectures Software Developer Manuals"
url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html] and consult the Steam Hardware Survey[ref
site="Valve"
page="Steam Hardware & Software Survey"
url=https://store.steampowered.com/hwsurvey] for instruction set use][:isa :optimisation :research]
[2:25:23][@alexkelbo][Q: Why does the light flicker even when the cube is not moving?][:lighting :rendering]
[2:25:42][@vaualbus][Q: Could we ship with SSE4, so we have _m256 and get more :performance improvement?][:isa]
[2:25:52][@sgtrumbi][Q: What can you see in TaskManager's GPU tab?][:performance]
[2:27:25][@longboolean][Q: Do you have any tips on talking with non programmers about programming related concepts?]
[2:27:34][@tbodt_][Q: How does your profiler work? Does it hook into the compiler or something?][:profiling]
[2:27:51][@alexkelbo][Q: Could we compile one exe with SSE4 that falls back to SSE2 if it's not present on the CPU, or would we need to compile into several exe's?][:isa]