[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Preparing a Function for Optimization" vod_platform=youtube id=_vkI9BedvKA annotator=dspecht annotator=Miblo]
[1:31][Open things up and recap]
[2:48][DrawRectangleSlowly: Increase efficiency]
[3:33][Create DrawRectangleHopefullyQuickly]
[4:34][DrawRectangleHopefullyQuickly: Skip the preamble]
[5:42][Remove all unnecessary code]
[6:44][Look at what's happening]
[8:01][Make the edge testing code more explicit]
[9:49][Blackboard: See what's happening with these inner products]
[12:04][DrawRectangleHopefullyQuickly: Test U and V instead]
[13:12][Run the game]
[13:33][Make these U and V computations more efficient]
[14:40][Run the game and ensure that everything still blits fine]
[15:16][Continue pruning]
[18:02][Flatten the routine]
[19:55][Blow out v4 Blended into scalar form]
[21:18][Take a close look at the routine and precompute InvTexelA]
[23:35][Blow out v4 Dest and Texel into scalar form]
[25:30][Flatten BilinearSample and SRGBBilinearBlend]
[28:02][Assess our situation]
[28:55][Unpack and optimise the Lerps]
[33:57][Run the game and annotate the code]
[35:33][Flatten SRGB255ToLinear1]
[36:38][Flatten Unpack4x8]
[38:59][That's everything flattened]
[39:22][Note that the code is faster]
[40:58][We have a nasty problem with the unpackings]
[44:01][Blackboard: What is our "wide" strategy?]
[48:43][Set the stage for SIMD]
[50:45][Consider solidifying texture boundaries]
[51:53][Leave it for today]
[53:09][Q&A]
[53:28][@braincruser][The way the code is written now you have a very long dependency chain (between instructions). Will you break down the code to remove it?]
[56:42][@stelar7][Why did you write float instead of real32 this stream?]
[57:14][@stelar7][Why use -O2 instead of -O3 or -Ofast (possibly with -fverbose-asm)?]
[58:06][@garryjohanson][Do you ever use exclusive or operations to avoid pipeline stalls? If not, what do you use?]
[59:04][@g3rain1][Aren't those square roots pretty expensive?[ref site="Intel" page="Intrinsics Guide" url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:03:31][@andsz_][Will you make multiple SIMD backends? (SSE?/AVX/FMA versions)]
[1:04:04][@davidthomas426][You could loft some of those variables out one more loop]
[1:04:58][@waterlimon][How expensive is the float<>int conversion compared to the rest of the workload?[ref site="Intel" page="Intrinsics Guide" url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:05:40][@davidthomas426][Since xAxis and yAxis are usually perpendicular, should we special case for that? In the same vein, should we special-case for axis-aligned?]
[1:06:56][@waterlimon][Does the compiler do any automatic SSE optimization (or have option for it?)]
[1:09:01][@stelar7][sqrt_ss vs sqrt_ps vs sqrt_pd?[ref site="Intel" page="Intrinsics Guide" url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:11:56][@waterlimon][Would SSE allow doing sRGB using exponent 2.2 instead of approximating using one of 2, without a huge performance hit?]
[1:12:41][@pseudonym73][The main reason why you don't get automatic SIMD is precise exceptions. You probably need to tell the compiler that you don't need them]
[1:14:44][@waterlimon][What happens if "/arch:AVX2" switch is enabled?]
[1:15:26][Look at this AVX-512 stuff[ref site="Intel" page="Intrinsics Guide" url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:16:51][@braincruser][FMA is fused multiply add]
[1:18:48][@andsz_][Yeah, looks like different caps bits]
[1:19:23][Wrap things up]
[/video]