[video output=day432 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Finishing the Main SIMD Raycasting Loop" vod_platform=youtube id=VvxxX9LxR9I annotator=Miblo] [0:01][Recap and set the stage for the day finishing SIMD optimising the :lighting][:optimisation :rendering :speech] [0:47][Toggle on the :threading and add a b32 Hit to raycast_result for RayCast() to encode that a ray did not hit][:lighting :optimisation :rendering] [5:57][:Run the game to find that we're running at 32ms] [6:29][Toggle off the :threading][:lighting :rendering] [6:49][:Run the game to see that we're running at 128ms per frame][:performance] [7:18][Toggle on the :threading][:lighting :optimisation :rendering] [7:26][:Run the game and consider our 32ms per frame rate][:performance] [9:43][Excise from RayCast() the non-SIMD tRay code, and start to consider how to retire ray hits][:lighting :optimisation :rendering] [11:42][Preserving ray hits vs traversing the spatial hierarchy, when :threading][:blackboard :geometry :lighting :rendering] [15:57][Enable RayCast() to record ray hits for each SIMD component before traversing the spatial hierarchy][:lighting :optimisation :rendering] [20:02][:Run the game to see that we're running at the same 32ms][:lighting :optimisation :rendering] [20:37][Revert RayCast() to traverse the spatial hierarchy, applying the ray hit mask for each component, and streamline how this works, introducing f32_4x versions of &= and |=][:lighting :optimisation :rendering] [26:58][:Run the game to see that that's fine][:lighting :optimisation :rendering] [27:06][Start to streamline the tRay setting code][:lighting :optimisation :rendering] [28:56][Fix the CloseEnough check in RayCast()][:lighting :optimisation :rendering] [30:05][:Run the game to see not much difference][:lighting :optimisation :rendering] [30:20][Introduce Select() to streamline the tRay setting code[ref site=Intel page="Intel Intrinsics Guide" 
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:lighting :optimisation :rendering] [34:15][:Run the game to see that we are at \~28ms per frame][:lighting :optimisation :performance :rendering] [34:44][Make RayCast() set the BoxIndex and BoxSurface in SIMD using Select()][:lighting :optimisation :rendering] [42:30][:Run the game and crash in ComputeLightPropagation()][:lighting :optimisation :rendering] [44:45][Step in to GetBox() to see that our BoxIndex is busted][:lighting :optimisation :rendering :run] [46:33][Step through RayCast() to see what's happening][:lighting :optimisation :rendering :run] [49:46][Make RayCast() actually set the BoxIndex and BoxSurfaceIndex][:lighting :optimisation :owl :programming :rendering] [50:55][:Run the game with the selection happening][:lighting :optimisation :owl :performance :rendering] [51:09][Make RayCast() set the RayP in SIMD using a v3_4x version of Select()][:lighting :optimisation :owl :programming :rendering] [53:38][:Run the game to see that we are down to \~26ms][:lighting :optimisation :owl :performance :rendering] [54:22][Add a TIMED_FUNCTION() in RayCast()][:lighting :optimisation :owl :programming :rendering] [54:41][:Run the game to consult the profiler][:lighting :optimisation :owl :performance :rendering] [54:55][Add a TIMED_BLOCK() around the startup code in RayCast()][:lighting :optimisation :owl :programming :rendering] [55:35][:Run the game and consult the profiler to see that the startup cost is not high][:lighting :optimisation :owl :performance :rendering] [56:00][Perform SampleHemisphere() in SIMD][:lighting :optimisation :owl :programming :rendering :statistics] [1:01:22][:Run the game to see that we're down to 22ms per frame][:lighting :optimisation :owl :performance :rendering] [1:02:04][Temporarily make SampleHemisphere() use complete randomisation][:lighting :optimisation :owl :programming :rendering :statistics] [1:02:20][:Run the game to see that this would put us back up 
to 30ms per frame, and note why][:lighting :optimisation :owl :performance :rendering :statistics] [1:04:08][Drop the RayCount down to 4 in ComputeLightPropagation()][:lighting :optimisation :owl :programming :rendering] [1:04:25][:Run the game and unexpectedly see no speed improvement][:lighting :optimisation :owl :rendering :run] [1:05:45][Remove variable suffixes in RayCast()][:lighting :optimisation :owl :programming :rendering] [1:08:55][Consider removing the Depth loop in RayCast() and reposition the AnyTrue(Mask) test][:lighting :optimisation :rendering] [1:10:31][:Run the game and consider where to go from here][:lighting :optimisation :rendering] [1:11:25][Inspect the assembly of RayCast()][:asm :lighting :optimisation :rendering] [1:14:32][Remove the Mask tests from RayCast() entirely][:lighting :optimisation :rendering] [1:15:16][:Run the game to see no real difference][:lighting :optimisation :performance :rendering] [1:15:40][Try removing the AnyTrue(tCheck)][:lighting :optimisation :rendering] [1:15:55][:Run the game to see that that would put us up to \~25ms per frame][:lighting :optimisation :performance :rendering] [1:16:51][Compute RayP at the very end of RayCast()][:lighting :optimisation :rendering] [1:18:30][:Run the game to see no difference][:lighting :optimisation :performance :rendering] [1:18:41][Replace RayP with tRay in RayCast()][:lighting :optimisation :rendering] [1:20:40][:Run the game to see no difference][:lighting :optimisation :performance :rendering] [1:21:01][Let RayCast() break if(AllTrue(Mask))][:lighting :optimisation :rendering] [1:21:12][:Run the game to see no difference][:lighting :optimisation :performance :rendering] [1:21:37][Toggle off the snake][:"entity system" :lighting :optimisation :rendering] [1:21:47][:Run the game with our consistently lit scene][:lighting :optimisation :performance :rendering] [1:22:28][Inline AccumulateSample() in ComputeLightPropagation()][:lighting :optimisation :rendering] [1:25:37][:Run 
the game to see no difference, and consider further improvements][:lighting :optimisation :performance :rendering] [1:27:06][Make SampleHemisphere() operate entirely SIMD, introducing RandomBilateral_4x() and versions of Inner(), LengthSq() and NOZ() that take v3_4x][:lighting :optimisation :prng :rendering :statistics] [1:41:35][A few words on _mm_rsqrt_ps and _mm_sqrt_ps[ref site=Intel page="Intel Intrinsics Guide" url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:mathematics :research] [1:44:20][Rename our new NOZ() to ApproxNOZ() for SampleHemisphere() to call, and introduce ApproxInvSquareRoot() using _mm_rsqrt_ps[ref site=Intel page="Intel Intrinsics Guide" url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:optimisation :mathematics] [1:49:24][:Run the game to see no difference][:lighting :optimisation :performance :rendering] [1:49:52][Inline SampleHemisphere() in ComputeLightPropagation()][:lighting :optimisation :rendering :statistics] [1:51:51][:Run the game at \~22ms per frame, and consider that this CPU rendered :lighting is performant enough for us][:optimisation :performance :rendering] [1:52:52][Q&A][:speech] [1:53:22][@printf_armin][Q: How much % of the CPU does it drain?][:performance] [1:56:07][@tbodt_][Q: What version control system do you use?][:vcs] [1:56:50][@nxsy][Q: Can you look at the thread profile view when you lower the samples from 16 to 4 and don’t improve frame rate?][:performance :run] [1:57:48][Temporarily toggle off VSync][:rendering] [1:58:46][:Run the game and consult the profiler to determine that the pixel shader is too slow][:performance :rendering] [2:02:17][@mallesbixie][Q: You cast the input value in the U32_4x loader to a float. Why is that?[ref site=Intel page="Intel Intrinsics Guide" url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]] [2:05:32][@jamoflaw][Q: How does the mask replace an if in the SIMD instructions?] 
[2:05:46][Masking in SIMD[ref site=Intel page="Intel Intrinsics Guide" url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/][ref site=GitHub page="rygorous / ryg_rans / rans_word_sse41.h" url=https://github.com/rygorous/ryg_rans/blob/master/rans_word_sse41.h]][:blackboard :optimisation] [2:19:38][Read about PBLENDVB - 'Variable Blend Packed Bytes' in the Intel 64 and IA-32 Architectures Software Developer Manuals[ref site="Intel" page="Intel 64 and IA-32 Architectures Software Developer Manuals" url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html] and consult the Steam Hardware Survey[ref site="Valve" page="Steam Hardware & Software Survey" url=https://store.steampowered.com/hwsurvey] for instruction set use][:isa :optimisation :research] [2:25:23][@alexkelbo][Q: Why does the light flicker even when the cube is not moving?][:lighting :rendering] [2:25:42][@vaualbus][Q: Could we ship with SSE4, so we have _m256 and get more :performance improvement?][:isa] [2:25:52][@sgtrumbi][Q: What can you see in TaskManager's GPU tab?][:performance] [2:27:25][@longboolean][Q: Do you have any tips on talking with non programmers about programming related concepts?] [2:27:34][@tbodt_][Q: How does your profiler work? Does it hook into the compiler or something?][:profiling] [2:27:51][@alexkelbo][Q: Could we compile one exe with SSE4 that falls back to SSE2 if it's not present on the CPU, or would we need to compile into several exe's?][:isa] [2:28:50][Close up shop][:speech] [/video]