[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Preparing a Function for Optimization" vod_platform=youtube id=_vkI9BedvKA annotator=dspecht annotator=Miblo]
[1:31][Open things up and recap]
[2:48][DrawRectangleSlowly: Increase efficiency]
[3:33][Create DrawRectangleHopefullyQuickly]
[4:34][DrawRectangleHopefullyQuickly: Skip the preamble]
[5:42][Remove all unnecessary code]
[6:44][Look at what's happening]
[8:01][Make the edge testing code more explicit]
[9:49][Blackboard: See what's happening with these inner products]
[12:04][DrawRectangleHopefullyQuickly: Test U and V instead]
[13:12][Run the game]
[13:33][Make these U and V computations more efficient]
[14:40][Run the game and ensure that everything still blits fine]
[15:16][Continue pruning]
[18:02][Flatten the routine]
[19:55][Blow out v4 Blended into scalar form]
[21:18][Take a close look at the routine and precompute InvTexelA]
[23:35][Blow out v4 Dest and Texel into scalar form]
[25:30][Flatten BilinearSample and SRGBBilinearBlend]
[28:02][Assess our situation]
[28:55][Unpack and optimize the Lerps]
[33:57][Run the game and annotate the code]
[35:33][Flatten SRGB255ToLinear1]
[36:38][Flatten Unpack4x8]
[38:59][That's everything flattened]
[39:22][Note that the code is faster]
[40:58][We have a nasty problem with the unpackings]
[44:01][Blackboard: What is our "wide" strategy?]
[48:43][Set the stage for SIMD]
[50:45][Consider solidifying texture boundaries]
[51:53][Leave it for today]
[53:09][Q&A]
[53:28][@braincruser][The way the code is written now you have a very long dependency chain (between instructions). Will you break down the code to remove it?]
[56:42][@stelar7][Why did you write float instead of real32 this stream?]
[57:14][@stelar7][Why use -O2 instead of -O3 or -Ofast (possibly with -fverbose-asm)?]
[58:06][@garryjohanson][Do you ever use exclusive or operations to avoid pipeline stalls? If not, what do you use?]
[59:04][@g3rain1][Aren't those square roots pretty expensive?[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:03:31][@andsz_][Will you make multiple SIMD backends? (SSE?/AVX/FMA versions)]
[1:04:04][@davidthomas426][You could loft some of those variables out one more loop]
[1:04:58][@waterlimon][How expensive is the float<>int conversion compared to the rest of the workload?[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:05:40][@davidthomas426][Since xAxis and yAxis are usually perpendicular, should we special case for that? In the same vein, should we special-case for axis-aligned?]
[1:06:56][@waterlimon][Does the compiler do any automatic SSE optimization (or have an option for it)?]
[1:09:01][@stelar7][sqrt_ss vs sqrt_ps vs sqrt_pd?[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:11:56][@waterlimon][Would SSE allow doing sRGB using an exponent of 2.2 instead of approximating with an exponent of 2, without a huge performance hit?]
[1:12:41][@pseudonym73][The main reason why you don't get automatic SIMD is precise exceptions. You probably need to tell the compiler that you don't need them]
[1:14:44][@waterlimon][What happens if "/arch:AVX2" switch is enabled?]
[1:15:26][Look at this AVX-512 stuff[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[1:16:51][@braincruser][FMA is fused multiply add]
[1:18:48][@andsz_][Yeah, looks like different caps bits]
[1:19:23][Wrap things up]
[/video]