[video output=day615 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Optimized Grid Step Selection" vod_platform=youtube id=wAfhYY4GSYU annotator=Miblo] [0:02][Welcome to the stream with a plug of Handmade Seattle 2020[ref site="Handmade Seattle" page=Tickets url=https://www.handmade-seattle.com/#tickets] and thanks to [@abnercoimbre Abner]][:research] [5:25][Demo the current state of the :lighting][:run] [7:01][Explain our :lighting system's two hot zones GridRayCast() and ComputeLightPropagation()][:research] [8:52][hhlightprof total seconds elapsed: 4.534990][:lighting :performance :run] [9:39][Toggle off the DiffuseWeightMap update in ComputeLightPropagation()][:lighting] [9:46][hhlightprof total seconds elapsed: 3.599488][:lighting :performance :run] [11:36][Determine to further optimise GridRayCast()][:lighting :speech] [11:56][Try decreasing the CostMetric from 16 to 0 in GridRayCast()][:lighting] [12:17][hhlightprof total seconds elapsed: 2.211856][:lighting :performance :run] [12:33][Try increasing the CostMetric from 0 to 1 in GridRayCast()][:lighting] [12:57][hhlightprof total seconds elapsed: 2.629898][:lighting :performance :run] [13:22][Note the sensitivity of GridRayCast() to repetition][:lighting :speech] [14:36][Let GridRayCast() set the CostMetric to our default 16][:lighting] [14:55][Seek improvements to GridRayCast()][:lighting :optimisation :research] [18:28][Note the fine-grained nature of our :lighting grid][:run] [20:03][Make ProfileRun() print the spatial grid occupancy[ref site="Microsoft Docs" page="__popcnt16, __popcnt, __popcnt64" url=https://docs.microsoft.com/en-us/cpp/intrinsics/popcnt16-popcnt-popcnt64?view=vs-2019]][:lighting :simd] [31:45][Step in to ProfileRun()][:lighting :run] [32:08][Try to demo ~RemedyBG's , \[comma\] Watch window syntax, with thanks to @x13pixels][:admin] [33:57][~RemedyBG feature request: Formatters for regular variables in the Watch window][:admin] [34:33][Check 
the box occupancy values produced by ProfileRun()][:lighting :run :simd] [35:05][hhlightprof box occupancy: Low][:lighting :performance :run] [36:34][Determine to perform ComputeWalkTable() inline][:lighting :optimisation :research] [39:12][Introduce ComputeWalkTableFast(), which does not return anything, but may be used to verify our results][:lighting :optimisation] [43:03][:Run hhlightprof successfully][:lighting :optimisation] [43:12][Induce an error in ComputeWalkTableFast()][:lighting :optimisation] [43:22][:Run hhlightprof without faulting][:lighting :optimisation] [44:18][Step through ComputeWalkTableFast()][:lighting :run] [46:11][Use a hand-coded assertion in ComputeWalkTableFast()][:lighting] [47:13][:Run hhlightprof with a fault][:lighting :optimisation] [47:22][Remove our induced error from ComputeWalkTableFast()][:lighting :optimisation] [47:30][:Run hhlightprof successfully][:lighting :optimisation] [47:48][Embark on optimising ComputeWalkTableFast() in :SIMD][:lighting :optimisation] [55:07][:Run hhlightprof with a fault, due to tTerminateResult being totally wrong][:lighting :optimisation :simd] [56:02][Fix ComputeWalkTableFast() to compute At4 inside the loop][:lighting :optimisation :simd] [56:55][:Run hhlightprof successfully][:lighting :optimisation] [57:15][Optimise ComputeWalkTableFast() to compute BestDim using an HCompShuffler][:lighting :optimisation :simd] [1:02:26][:Run hhlightprof with a fault, due to dGridResult being wrong][:lighting :optimisation :simd] [1:03:17][Remove 14 and 15 from the HCompShuffler in ComputeWalkTableFast()][:lighting :optimisation :simd] [1:04:57][:Run hhlightprof with a fault, due to tBestRef and tBest differing][:lighting :optimisation :simd] [1:06:57][Consider how best to traverse the walk table][:lighting :optimisation :research] [1:10:43][Look into _mm_minpos_epu16() at the Intel Intrinsics Guide[ref site=Intel page="Intel Intrinsics Guide" 
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:lighting :optimisation :research] [1:13:14][Introduce a second HCompShufflerLow to compare the low 16-bits of values with equivalent high 16-bits][:lighting :optimisation] [1:15:14][Revert the HCompShufflerLow][:lighting :optimisation] [1:15:54][Optimise our WalkTable traversal using all four :SIMD lanes, replacing the HCompShuffler with BestTable][:lighting :optimisation] [1:31:13][:Run hhlightprof with a verification fault][:lighting :optimisation :simd] [1:34:00][Assert in ComputeWalkTableFast() that the CompMask is within bounds of the BestTable][:lighting :optimisation :simd] [1:34:47][:Run hhlightprof with a verification fault not on the BestTable bounds][:lighting :optimisation :simd] [1:36:35][Add a breakpoint in ComputeWalkTableFast() on SampleDirIndex 135][:lighting :optimisation :simd] [1:36:58][Step through ComputeWalkTableFast() on SampleDirIndex 135][:lighting :optimisation :run :simd] [1:40:55][Linguistically flip the Best checker in (the working) ComputeWalkTable()][:lighting :optimisation] [1:41:21][:Run hhlightprof with a verification fault on SampleDirIndex 256][:lighting :optimisation :simd] [1:42:18][Logically flip the sense of the Best checker in ComputeWalkTable(), and redo the BestTable in ComputeWalkTableFast() in line with the original logic][:lighting :optimisation :simd] [1:45:14][:Run hhlightprof with a verification fault right off the bat][:lighting :optimisation :simd] [1:45:24][Verify the BestTable in ComputeWalkTableFast()][:lighting :optimisation :research :simd] [1:47:08][Reacquaint ourselves with the Best picking in ComputeWalkTable()][:lighting :optimisation :run :simd] [1:47:44][Revert the sense of the Best checker in ComputeWalkTable()][:lighting :optimisation] [1:48:57][:Run hhlightprof successfully][:lighting :optimisation :simd] [1:49:26][Introduce a ShuffleTable in ComputeWalkTableFast()][:lighting :optimisation :simd] [1:51:04][:Run hhlightprof 
successfully][:lighting :optimisation :simd] [1:51:12][Optimise ComputeWalkTableFast() to pick the tBest out of the ShuffleTable][:lighting :optimisation :simd] [1:54:12][:Run hhlightprof successfully][:lighting :optimisation :simd] [1:54:20][Optimise ComputeWalkTableFast() to track tTerminate in :SIMD][:lighting :optimisation] [1:55:18][:Run hhlightprof successfully][:lighting :optimisation :simd] [1:55:22][Optimise ComputeWalkTableFast() to initialise At4 before the loop, and individually offset the four steps by the CellDim][:lighting :optimisation :simd] [2:00:05][:Run hhlightprof successfully][:lighting :optimisation :simd] [2:00:08][Optimise ComputeWalkTableFast() to offset all four steps in :SIMD, branchless, using a MaskTable][:lighting :optimisation] [2:04:40][:Run hhlightprof with a verification fault][:lighting :optimisation :simd] [2:04:55][Scrutinise our MaskTable][:lighting :optimisation :research :simd] [2:05:37][Compute a Compare for At4 in ComputeWalkTableFast()][:lighting :optimisation :simd] [2:06:05][Break in to ComputeWalkTableFast() and compare the Compare with our actual At4][:lighting :optimisation :run :simd] [2:07:04][Set At4 equal to Compare, saving off the OldAt4][:lighting :optimisation :simd] [2:07:19][:Run hhlightprof successfully][:lighting :optimisation :simd] [2:07:34][Try making ComputeWalkTableFast() offset the At4 in two steps][:lighting :optimisation :simd] [2:08:18][:Run hhlightprof successfully][:lighting :optimisation :simd] [2:08:27][Gauge the :performance of our ComputeWalkTableFast()[ref site=uops.info url=https://uops.info/table.html]][:lighting :research :simd] [2:11:48][Build in -O2] [2:12:12][:Run the game successfully][:lighting :optimisation] [2:12:26][Make ComputeWalkTable() compute InvRayD before the stepping loop, to remove a divide within it][:lighting :optimisation] [2:13:01][The :lighting looks completely different][:optimisation :run] [2:13:35][Make ComputeWalkTable() compute the InvRayD using a safe 
ratio][:lighting :optimisation] [2:15:37][The :lighting remains different][:optimisation :run] [2:16:06][Fix ComputeWalkTable() to compute InvRayD after RayD itself][:lighting :optimisation] [2:16:26][The :lighting is back to how it was][:optimisation :run] [2:16:30][Make ComputeWalkTable() compute InvRayD as normal][:lighting :optimisation] [2:16:41][The :lighting is fine][:optimisation :run] [2:16:58][Build in -Od] [2:17:14][:Run hhlightprof with a verification fault][:lighting :optimisation :simd] [2:17:22][Make ComputeWalkTableFast() also precompute InvRayD][:lighting :optimisation :simd] [2:17:50][:Run hhlightprof successfully][:lighting :optimisation :simd] [2:18:50][Q&A][:speech] [2:19:11][@centhusiast][Q: Hi [@cmuratori Casey]! I was very sick and my health condition was very bad for the last three months, and now I am fortunately back to life and to [~hero Handmade Hero]. Could you briefly say what the focus of [~hero Handmade Hero] was in the last three months? Thank you!] [2:20:16][@infinum][Q: @handmade_hero Hello [@cmuratori Casey], this question may be off-topic but it's really important for me. I know you were doing some :UI development. I saw your video on immediate mode UI. I have the only job opportunity to develop UI for mobile app but I've never done that and I need this job. So can you please give me some advice on where to find information, maybe some guides on UI development and were you using some :library or did you write everything from scratch? It would be very helpful for me] [2:23:14][@somebody_took_my_name][Q: What is the best way to debug something that only happens in optimized code?] 
[2:23:50][@mindmark42][Q: How expensive do you think the table lookups are?[ref site=uops.info url=https://uops.info/table.html][ref site=Intel page="Intel Intrinsics Guide" url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:performance] [2:28:03][@tomtetlaw][Q: Can you give a general idea of how to optimise branches out of a function?] [2:29:49][@legendarior][Q: Hello, thank you for all the videos. I am a bored CS student that aced his exams and now does not know what to do during his vacation] [2:29:58][@lucid_frost][Q: What kinds of things would you like the compiler to do to help with this table stuff (if any)?][:language] [2:30:12][@centhusiast][Q: Could you explain the compile time execution as we have in jai?][:language] [2:30:19][@billdstrong][Q: Would meowhash be suitable to create a custom UUID? It wouldn't be part of the UUID spec, but could it serve the same purpose?][:hashing] [2:30:42][End it there][:speech] [/video]