[video output=day432 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Finishing the Main SIMD Raycasting Loop" vod_platform=youtube id=VvxxX9LxR9I annotator=Miblo]
[0:01][Recap and set the stage for the day finishing SIMD optimising the :lighting][:optimisation :rendering :speech]
[0:47][Toggle on the :threading and add a b32 Hit to raycast_result for RayCast() to encode that a ray did not hit][:lighting :optimisation :rendering]
[5:57][:Run the game to find that we're running at 32ms]
[6:29][Toggle off the :threading][:lighting :rendering]
[6:49][:Run the game to see that we're running at 128ms per frame][:performance]
[7:18][Toggle on the :threading][:lighting :optimisation :rendering]
[7:26][:Run the game and consider our 32ms per frame rate][:performance]
[9:43][Excise from RayCast() the non-SIMD tRay code, and start to consider how to retire ray hits][:lighting :optimisation :rendering]
[11:42][Preserving ray hits vs traversing the spatial hierarchy, when :threading][:blackboard :geometry :lighting :rendering]
[15:57][Enable RayCast() to record ray hits for each SIMD component before traversing the spatial hierarchy][:lighting :optimisation :rendering]
[20:02][:Run the game to see that we're running at the same 32ms][:lighting :optimisation :rendering]
[20:37][Revert RayCast() to traverse the spatial hierarchy, applying the ray hit mask for each component, and streamline how this works, introducing f32_4x versions of &= and |=][:lighting :optimisation :rendering]
[26:58][:Run the game to see that that's fine][:lighting :optimisation :rendering]
[27:06][Start to streamline the tRay setting code][:lighting :optimisation :rendering]
[28:56][Fix the CloseEnough check in RayCast()][:lighting :optimisation :rendering]
[30:05][:Run the game to see not much difference][:lighting :optimisation :rendering]
[30:20][Introduce Select() to streamline the tRay setting code[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:lighting :optimisation :rendering]
[34:15][:Run the game to see that we are at ~28ms per frame][:lighting :optimisation :performance :rendering]
[34:44][Make RayCast() set the BoxIndex and BoxSurface in SIMD using Select()][:lighting :optimisation :rendering]
[42:30][:Run the game and crash in ComputeLightPropagation()][:lighting :optimisation :rendering]
[44:45][Step in to GetBox() to see that our BoxIndex is busted][:lighting :optimisation :rendering :run]
[46:33][Step through RayCast() to see what's happening][:lighting :optimisation :rendering :run]
[49:46][Make RayCast() actually set the BoxIndex and BoxSurfaceIndex][:lighting :optimisation :owl :programming :rendering]
[50:55][:Run the game with the selection happening][:lighting :optimisation :owl :performance :rendering]
[51:09][Make RayCast() set the RayP in SIMD using a v3_4x version of Select()][:lighting :optimisation :owl :programming :rendering]
[53:38][:Run the game to see that we are down to ~26ms][:lighting :optimisation :owl :performance :rendering]
[54:22][Add a TIMED_FUNCTION() in RayCast()][:lighting :optimisation :owl :programming :rendering]
[54:41][:Run the game to consult the profiler][:lighting :optimisation :owl :performance :rendering]
[54:55][Add a TIMED_BLOCK() around the startup code in RayCast()][:lighting :optimisation :owl :programming :rendering]
[55:35][:Run the game and consult the profiler to see that the startup cost is not high][:lighting :optimisation :owl :performance :rendering]
[56:00][Perform SampleHemisphere() in SIMD][:lighting :optimisation :owl :programming :rendering :statistics]
[1:01:22][:Run the game to see that we're down to 22ms per frame][:lighting :optimisation :owl :performance :rendering]
[1:02:04][Temporarily make SampleHemisphere() use complete randomisation][:lighting :optimisation :owl :programming :rendering :statistics]
[1:02:20][:Run the game to see that this would put us back up to 30ms per frame, and note why][:lighting :optimisation :owl :performance :rendering :statistics]
[1:04:08][Drop the RayCount down to 4 in ComputeLightPropagation()][:lighting :optimisation :owl :programming :rendering]
[1:04:25][:Run the game and unexpectedly see no speed improvement][:lighting :optimisation :owl :rendering :run]
[1:05:45][Remove variable suffixes in RayCast()][:lighting :optimisation :owl :programming :rendering]
[1:08:55][Consider removing the Depth loop in RayCast() and reposition the AnyTrue(Mask) test][:lighting :optimisation :rendering]
[1:10:31][:Run the game and consider where to go from here][:lighting :optimisation :rendering]
[1:11:25][Inspect the assembly of RayCast()][:asm :lighting :optimisation :rendering]
[1:14:32][Remove the Mask tests from RayCast() entirely][:lighting :optimisation :rendering]
[1:15:16][:Run the game to see no real difference][:lighting :optimisation :performance :rendering]
[1:15:40][Try removing the AnyTrue(tCheck)][:lighting :optimisation :rendering]
[1:15:55][:Run the game to see that that would put us up to ~25ms per frame][:lighting :optimisation :performance :rendering]
[1:16:51][Compute RayP at the very end of RayCast()][:lighting :optimisation :rendering]
[1:18:30][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:18:41][Replace RayP with tRay in RayCast()][:lighting :optimisation :rendering]
[1:20:40][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:21:01][Let RayCast() break if(AllTrue(Mask))][:lighting :optimisation :rendering]
[1:21:12][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:21:37][Toggle off the snake][:"entity system" :lighting :optimisation :rendering]
[1:21:47][:Run the game with our consistently lit scene][:lighting :optimisation :performance :rendering]
[1:22:28][Inline AccumulateSample() in ComputeLightPropagation()][:lighting :optimisation :rendering]
[1:25:37][:Run the game to see no difference, and consider further improvements][:lighting :optimisation :performance :rendering]
[1:27:06][Make SampleHemisphere() operate entirely SIMD, introducing RandomBilateral_4x() and versions of Inner(), LengthSq() and NOZ() that take v3_4x][:lighting :optimisation :prng :rendering :statistics]
[1:41:35][A few words on _mm_rsqrt_ps and _mm_sqrt_ps[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:mathematics :research]
[1:44:20][Rename our new NOZ() to ApproxNOZ() for SampleHemisphere() to call, and introduce ApproxInvSquareRoot() using _mm_rsqrt_ps[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:optimisation :mathematics]
[1:49:24][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:49:52][Inline SampleHemisphere() in ComputeLightPropagation()][:lighting :optimisation :rendering :statistics]
[1:51:51][:Run the game at ~22ms per frame, and consider that this CPU rendered :lighting is performant enough for us][:optimisation :performance :rendering]
[1:52:52][Q&A][:speech]
[1:53:22][@printf_armin][Q: How much % of the CPU does it drain?][:performance]
[1:56:07][@tbodt_][Q: What version control system do you use?][:vcs]
[1:56:50][@nxsy][Q: Can you look at the thread profile view when you lower the samples from 16 to 4 and don’t improve frame rate?][:performance :run]
[1:57:48][Temporarily toggle off VSync][:rendering]
[1:58:46][:Run the game and consult the profiler to determine that the pixel shader is too slow][:performance :rendering]
[2:02:17][@mallesbixie][Q: You cast the input value in the U32_4x loader to a float. Why is that?[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]]
[2:05:32][@jamoflaw][Q: How does the mask replace an if in the SIMD instructions?]
[2:05:46][Masking in SIMD[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/][ref
site=GitHub
page="rygorous / ryg_rans / rans_word_sse41.h"
url=https://github.com/rygorous/ryg_rans/blob/master/rans_word_sse41.h]][:blackboard :optimisation]
[2:19:38][Read about PBLENDVB - 'Variable Blend Packed Bytes' in the Intel 64 and IA-32 Architectures Software Developer Manuals[ref
site="Intel"
page="Intel 64 and IA-32 Architectures Software Developer Manuals"
url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html] and consult the Steam Hardware Survey[ref
site="Valve"
page="Steam Hardware & Software Survey"
url=https://store.steampowered.com/hwsurvey] for instruction set use][:isa :optimisation :research]
[2:25:23][@alexkelbo][Q: Why does the light flicker even when the cube is not moving?][:lighting :rendering]
[2:25:42][@vaualbus][Q: Could we ship with SSE4, so we have _m256 and get more :performance improvement?][:isa]
[2:25:52][@sgtrumbi][Q: What can you see in TaskManager's GPU tab?][:performance]
[2:27:25][@longboolean][Q: Do you have any tips on talking with non programmers about programming related concepts?]
[2:27:34][@tbodt_][Q: How does your profiler work? Does it hook into the compiler or something?][:profiling]
[2:27:51][@alexkelbo][Q: Could we compile one exe with SSE4 that falls back to SSE2 if it's not present on the CPU, or would we need to compile into several exe's?][:isa]
[2:28:50][Close up shop][:speech]
[/video]
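
Note on the masking and Select() entries above (30:20, 2:05:32, 2:05:46): the sketch below is one way to picture how a per-lane mask can stand in for an if. It is not the code written on stream. The names lane_f32, lane_mask, LessThan, SelectLanes, AnyLaneSet and KeepNearest are hypothetical, and plain four-element arrays are assumed in place of a real SIMD register so that the example builds without an intrinsics header. With SSE the same select is commonly written with _mm_and_ps, _mm_andnot_ps and _mm_or_ps, or with _mm_blendv_ps where SSE4.1 is available.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-lane types for illustration only; these are NOT the
   f32_4x / v3_4x types from the stream. Arrays stand in for a register. */
typedef struct { float E[4]; } lane_f32;
typedef struct { uint32_t E[4]; } lane_mask;

/* Reinterpret a float as its raw 32 bits (and back) without aliasing issues. */
uint32_t BitsOf(float Value)
{
    uint32_t Bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&Bits, &Value, sizeof(Bits));
    return Bits;
}

float FloatOf(uint32_t Bits)
{
    float Value;
    memcpy(&Value, &Bits, sizeof(Value));
    return Value;
}

/* Per-lane compare: all bits set where A < B, all bits clear otherwise. */
lane_mask LessThan(lane_f32 A, lane_f32 B)
{
    lane_mask Result;
    for(int I = 0; I < 4; ++I)
    {
        Result.E[I] = (A.E[I] < B.E[I]) ? 0xFFFFFFFFu : 0u;
    }
    return Result;
}

/* Branch-free select: lanes with the mask set take B, the rest keep A.
   Computed on the raw bits as (Mask & B) | (~Mask & A). */
lane_f32 SelectLanes(lane_f32 A, lane_mask Mask, lane_f32 B)
{
    lane_f32 Result;
    for(int I = 0; I < 4; ++I)
    {
        uint32_t Bits = (Mask.E[I] & BitsOf(B.E[I])) |
                        (~Mask.E[I] & BitsOf(A.E[I]));
        Result.E[I] = FloatOf(Bits);
    }
    return Result;
}

/* Nonzero if at least one lane of the mask is set. */
int AnyLaneSet(lane_mask Mask)
{
    return (Mask.E[0] | Mask.E[1] | Mask.E[2] | Mask.E[3]) != 0;
}

/* Per lane, keep whichever of the current hit distance and the candidate
   distance is smaller. The early-out skips the select when no lane improves. */
lane_f32 KeepNearest(lane_f32 tRay, lane_f32 tCandidate)
{
    lane_mask Closer = LessThan(tCandidate, tRay);
    if(!AnyLaneSet(Closer))
    {
        return tRay;
    }
    return SelectLanes(tRay, Closer, tCandidate);
}
```

In this sketch every lane evaluates both outcomes and the mask decides which one is kept, so the four rays never diverge in control flow. The only real branch left is the optional any-lane test.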