cinera_handmade.network/cmuratori/hero/code/code432.hmml

[video output=day432 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Finishing the Main SIMD Raycasting Loop" vod_platform=youtube id=VvxxX9LxR9I annotator=Miblo]
[0:01][Recap and set the stage for the day finishing SIMD optimising the :lighting][:optimisation :rendering :speech]
[0:47][Toggle on the :threading and add a b32 Hit to raycast_result for RayCast() to encode that a ray did not hit][:lighting :optimisation :rendering]
[5:57][:Run the game to find that we're running at 32ms per frame]
[6:29][Toggle off the :threading][:lighting :rendering]
[6:49][:Run the game to see that we're running at 128ms per frame][:performance]
[7:18][Toggle on the :threading][:lighting :optimisation :rendering]
[7:26][:Run the game and consider our 32ms per frame rate][:performance]
[9:43][Excise from RayCast() the non-SIMD tRay code, and start to consider how to retire ray hits][:lighting :optimisation :rendering]
[11:42][Preserving ray hits vs traversing the spatial hierarchy, when :threading][:blackboard :geometry :lighting :rendering]
[15:57][Enable RayCast() to record ray hits for each SIMD component before traversing the spatial hierarchy][:lighting :optimisation :rendering]
[20:02][:Run the game to see that we're running at the same 32ms][:lighting :optimisation :rendering]
[20:37][Revert RayCast() to traverse the spatial hierarchy, applying the ray hit mask for each component, and streamline how this works, introducing f32_4x versions of &= and |=][:lighting :optimisation :rendering]
[26:58][:Run the game to see that that's fine][:lighting :optimisation :rendering]
[27:06][Start to streamline the tRay setting code][:lighting :optimisation :rendering]
[28:56][Fix the CloseEnough check in RayCast()][:lighting :optimisation :rendering]
[30:05][:Run the game to see not much difference][:lighting :optimisation :rendering]
[30:20][Introduce Select() to streamline the tRay setting code[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:lighting :optimisation :rendering]
[34:15][:Run the game to see that we are at ~28ms per frame][:lighting :optimisation :performance :rendering]
[34:44][Make RayCast() set the BoxIndex and BoxSurface in SIMD using Select()][:lighting :optimisation :rendering]
[42:30][:Run the game and crash in ComputeLightPropagation()][:lighting :optimisation :rendering]
[44:45][Step in to GetBox() to see that our BoxIndex is busted][:lighting :optimisation :rendering :run]
[46:33][Step through RayCast() to see what's happening][:lighting :optimisation :rendering :run]
[49:46][Make RayCast() actually set the BoxIndex and BoxSurfaceIndex][:lighting :optimisation :owl :programming :rendering]
[50:55][:Run the game with the selection happening][:lighting :optimisation :owl :performance :rendering]
[51:09][Make RayCast() set the RayP in SIMD using a v3_4x version of Select()][:lighting :optimisation :owl :programming :rendering]
[53:38][:Run the game to see that we are down to ~26ms][:lighting :optimisation :owl :performance :rendering]
[54:22][Add a TIMED_FUNCTION() in RayCast()][:lighting :optimisation :owl :programming :rendering]
[54:41][:Run the game to consult the profiler][:lighting :optimisation :owl :performance :rendering]
[54:55][Add a TIMED_BLOCK() around the startup code in RayCast()][:lighting :optimisation :owl :programming :rendering]
[55:35][:Run the game and consult the profiler to see that the startup cost is not high][:lighting :optimisation :owl :performance :rendering]
[56:00][Perform SampleHemisphere() in SIMD][:lighting :optimisation :owl :programming :rendering :statistics]
[1:01:22][:Run the game to see that we're down to 22ms per frame][:lighting :optimisation :owl :performance :rendering]
[1:02:04][Temporarily make SampleHemisphere() use complete randomisation][:lighting :optimisation :owl :programming :rendering :statistics]
[1:02:20][:Run the game to see that this would put us back up to 30ms per frame, and note why][:lighting :optimisation :owl :performance :rendering :statistics]
[1:04:08][Drop the RayCount down to 4 in ComputeLightPropagation()][:lighting :optimisation :owl :programming :rendering]
[1:04:25][:Run the game and unexpectedly see no speed improvement][:lighting :optimisation :owl :rendering :run]
[1:05:45][Remove variable suffixes in RayCast()][:lighting :optimisation :owl :programming :rendering]
[1:08:55][Consider removing the Depth loop in RayCast() and reposition the AnyTrue(Mask) test][:lighting :optimisation :rendering]
[1:10:31][:Run the game and consider where to go from here][:lighting :optimisation :rendering]
[1:11:25][Inspect the assembly of RayCast()][:asm :lighting :optimisation :rendering]
[1:14:32][Remove the Mask tests from RayCast() entirely][:lighting :optimisation :rendering]
[1:15:16][:Run the game to see no real difference][:lighting :optimisation :performance :rendering]
[1:15:40][Try removing the AnyTrue(tCheck)][:lighting :optimisation :rendering]
[1:15:55][:Run the game to see that that would put us up to ~25ms per frame][:lighting :optimisation :performance :rendering]
[1:16:51][Compute RayP at the very end of RayCast()][:lighting :optimisation :rendering]
[1:18:30][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:18:41][Replace RayP with tRay in RayCast()][:lighting :optimisation :rendering]
[1:20:40][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:21:01][Let RayCast() break if(AllTrue(Mask))][:lighting :optimisation :rendering]
[1:21:12][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:21:37][Toggle off the snake][:"entity system" :lighting :optimisation :rendering]
[1:21:47][:Run the game with our consistently lit scene][:lighting :optimisation :performance :rendering]
[1:22:28][Inline AccumulateSample() in ComputeLightPropagation()][:lighting :optimisation :rendering]
[1:25:37][:Run the game to see no difference, and consider further improvements][:lighting :optimisation :performance :rendering]
[1:27:06][Make SampleHemisphere() operate entirely SIMD, introducing RandomBilateral_4x() and versions of Inner(), LengthSq() and NOZ() that take v3_4x][:lighting :optimisation :prng :rendering :statistics]
[1:41:35][A few words on _mm_rsqrt_ps and _mm_sqrt_ps[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:mathematics :research]
[1:44:20][Rename our new NOZ() to ApproxNOZ() for SampleHemisphere() to call, and introduce ApproxInvSquareRoot() using _mm_rsqrt_ps[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:optimisation :mathematics]
[1:49:24][:Run the game to see no difference][:lighting :optimisation :performance :rendering]
[1:49:52][Inline SampleHemisphere() in ComputeLightPropagation()][:lighting :optimisation :rendering :statistics]
[1:51:51][:Run the game at ~22ms per frame, and consider that this CPU rendered :lighting is performant enough for us][:optimisation :performance :rendering]
[1:52:52][Q&A][:speech]
[1:53:22][@printf_armin][Q: How much % of the CPU does it drain?][:performance]
[1:56:07][@tbodt_][Q: What version control system do you use?][:vcs]
[1:56:50][@nxsy][Q: Can you look at the thread profile view when you lower the samples from 16 to 4 and don't improve frame rate?][:performance :run]
[1:57:48][Temporarily toggle off VSync][:rendering]
[1:58:46][:Run the game and consult the profiler to determine that the pixel shader is too slow][:performance :rendering]
[2:02:17][@mallesbixie][Q: You cast the input value in the U32_4x loader to a float. Why is that?[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]]
[2:05:32][@jamoflaw][Q: How does the mask replace an if in the SIMD instructions?]
[2:05:46][Masking in SIMD[ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/][ref
site=GitHub
page="rygorous / ryg_rans / rans_word_sse41.h"
url=https://github.com/rygorous/ryg_rans/blob/master/rans_word_sse41.h]][:blackboard :optimisation]
[2:19:38][Read about PBLENDVB - 'Variable Blend Packed Bytes' in the Intel 64 and IA-32 Architectures Software Developer Manuals[ref
site="Intel"
page="Intel 64 and IA-32 Architectures Software Developer Manuals"
url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html] and consult the Steam Hardware Survey[ref
site="Valve"
page="Steam Hardware & Software Survey"
url=https://store.steampowered.com/hwsurvey] for instruction set use][:isa :optimisation :research]
[2:25:23][@alexkelbo][Q: Why does the light flicker even when the cube is not moving?][:lighting :rendering]
[2:25:42][@vaualbus][Q: Could we ship with SSE4, so we have _m256 and get more :performance improvement?][:isa]
[2:25:52][@sgtrumbi][Q: What can you see in TaskManager's GPU tab?][:performance]
[2:27:25][@longboolean][Q: Do you have any tips on talking with non-programmers about programming-related concepts?]
[2:27:34][@tbodt_][Q: How does your profiler work? Does it hook into the compiler or something?][:profiling]
[2:27:51][@alexkelbo][Q: Could we compile one exe with SSE4 that falls back to SSE2 if it's not present on the CPU, or would we need to compile into several exe's?][:isa]
[2:28:50][Close up shop][:speech]
[/video]
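
Supplementary sketch for the [30:20] Select() and [2:05:46] "Masking in SIMD" annotations: a minimal illustration of how a comparison mask can stand in for a per-ray `if` when four rays are processed at once. This is not the code written on stream. The names `select_f32x4` and `keep_closest_hit` are hypothetical, and the scalar logic being replaced (`if (t < closest) closest = t;`) is an assumed stand-in for the tRay update. Only SSE intrinsics from `<xmmintrin.h>` are used.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps, _mm_or_ps */

/* Hypothetical helper: per-lane select.
   Where a mask lane is all ones, take that lane from b; otherwise keep a. */
static __m128 select_f32x4(__m128 a, __m128 mask, __m128 b)
{
    /* _mm_andnot_ps(mask, a) computes (~mask) & a */
    return _mm_or_ps(_mm_andnot_ps(mask, a), _mm_and_ps(mask, b));
}

/* Four-lane version of the scalar update
       if (t < closest) closest = t;
   applied to four rays at once, with no branch. */
static void keep_closest_hit(const float t_candidate[4], float t_closest[4])
{
    __m128 t       = _mm_loadu_ps(t_candidate);
    __m128 closest = _mm_loadu_ps(t_closest);

    /* Each lane becomes all ones where t < closest, all zeros elsewhere */
    __m128 mask = _mm_cmplt_ps(t, closest);

    closest = select_f32x4(closest, mask, t);
    _mm_storeu_ps(t_closest, closest);
}
```

Both sides of the "if" are computed for every lane, and the mask decides which result each lane keeps. On SSE4.1 hardware, the and/andnot/or sequence could be replaced by a single blend instruction such as `_mm_blendv_ps`, which is the family of instruction discussed at [2:19:38].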