[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Packing Pixels for the Framebuffer" vod_platform=youtube id=90eSF6jLzvQ annotator=dspecht annotator=Miblo]
[2:05][Load up the code and consider optimisation]
[4:09][handmade_render_group.cpp: Comment out if(ShouldFill\[I\])]
[5:34][Blackboard: Interleaving four SIMD values]
[14:27][Blackboard: Establishing the order we need]
[15:46][handmade_render_group.cpp: Write the SIMD register names that we want to end up with]
[16:29][Internet: Intel Intrinsics Guide[ref site="Intel" page="Intrinsics Guide" url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[17:23][Blackboard: _mm_unpackhi_epi32 and _mm_unpacklo_epi32]
[19:04][Blackboard: Using these operations to generate what we need]
[24:17][handmade_render_group.cpp: Name the registers in register order]
[25:15][Internet: Double-check the parameter order of the unpack operations]
[26:22][handmade_render_group.cpp: Start to populate the registers]
[26:52][Internet: Keeping in mind how often you move between __m128 and __m128i]
[28:39][handmade_render_group.cpp: Cast the Blended values from float to int]
[29:47][Use structured art to enable us to see what's happening]
[34:47][Debugger: Watch how our art gets shuffled]
[38:40][handmade_render_group.cpp: Produce the rest of the pixel values we need]
[41:43][Convert 32-bit floating point values to 8-bit integers]
[44:07][// TODO(casey): Set the rounding to something known]
[45:08][Blackboard: Using 8 bits of these 32-bit registers]
[47:32][handmade_render_group.cpp: Bitwise OR and Shift these values]
[50:27][Blackboard: How the shift operations work]
[52:44][handmade_render_group.cpp: Implement these shifts]
[55:06][Debugger: Take a look at the Out value]
[57:33][handmade_render_group.cpp: Break out the values]
[58:22][Debugger: Inspect these values]
[58:35][handmade_render_group.cpp: Fix the test case]
[59:32][Debugger: Inspect our stuff]
[1:00:13][handmade_render_group.cpp: Write Out to Pixel]
[1:01:08][Debugger: Crash and reload]
[1:01:43][Debugger: Note that we are writing unaligned]
[1:04:22][Blackboard: Alignment]
[1:05:54][handmade_render_group.cpp: Issue _mm_storeu_si128 to cause the compiler to use an unaligned mov instruction (movdqu)]
[1:07:23][Recap and glimpse into the future]
[1:08:30][Q&A][:speech]
[1:09:59][@braincruser][Will the operations be reordered to reduce the number of ops and loads / stores?]
[1:12:01][@mmozeiko][You are calculating Out like or(or(or(r, g), b), a). Would it be better to do it like this: or(or(r, g), or(b, a)), so the first two or's are not dependent on each other?]
[1:14:57][handmade_render_group.cpp: Write it the way mmozeiko suggests]
[1:17:31][@uspred][Do you need to start with 32-bit floats? Is there further optimization that doesn't need the casting?]
[1:18:21][Blackboard: Multiplying floats vs multiplying integers]
[1:19:54][@mmozeiko][The same applies to the texture bilinear adds]
[1:20:03][handmade_render_group.cpp: Implement mmozeiko's suggestion]
[1:23:00][@flaturated][Can you compile with /O2 to compare it to last week's performance?]
[1:23:16][@brblackmer][Why did you make macros for your SIMD operations (mmSquare, etc.) vs making functions?]
[1:23:39][@quikligames][Are these intrinsics the same on other operating systems or compilers, as long as it's using Intel architecture?]
[1:24:40][@mmozeiko][Why do you say unaligned store is nasty? As far as I know, on the latest Intel CPUs (at least starting from Ivy Bridge) unaligned load / store is not very expensive anymore (<5% difference)]
[1:26:25][@plain_flavored][Is scalar access to __m128 elements still slow on Intel?]
[1:27:18][@braincruser][The processor window is 192 instructions]
[1:28:01][@gasto5][I don't understand how one optimizes by using the intrinsic OR function]
[1:28:51][@mmozeiko][_mm_cvttps_epi32 always truncates. Would that be better than messing with rounding mode?]
[1:30:45][handmade_render_group.cpp: Switch to _mm_cvttps_epi32]
[1:32:50][Wrap up][:speech]
[/video]
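For reference, here is a minimal sketch of the interleave discussed from [5:34] to [26:22]: using only unpack operations to transpose four pixel-major (AoS) registers into four channel-major (SoA) registers. The function and register names (TransposePixels, P0..P3, A/R/G/B) are illustrative, not the names used on stream, and the channel order is assumed.

    #include <emmintrin.h> // SSE2

    // Transpose four AoS registers, Pn = {An, Rn, Gn, Bn} for pixel n,
    // into SoA channel registers using only unpacklo/unpackhi.
    static void TransposePixels(__m128i P0, __m128i P1, __m128i P2, __m128i P3,
                                __m128i *A, __m128i *R, __m128i *G, __m128i *B)
    {
        __m128i T0 = _mm_unpacklo_epi32(P0, P1); // {A0, A1, R0, R1}
        __m128i T1 = _mm_unpacklo_epi32(P2, P3); // {A2, A3, R2, R3}
        __m128i T2 = _mm_unpackhi_epi32(P0, P1); // {G0, G1, B0, B1}
        __m128i T3 = _mm_unpackhi_epi32(P2, P3); // {G2, G3, B2, B3}

        *A = _mm_unpacklo_epi64(T0, T1); // {A0, A1, A2, A3}
        *R = _mm_unpackhi_epi64(T0, T1); // {R0, R1, R2, R3}
        *G = _mm_unpacklo_epi64(T2, T3); // {G0, G1, G2, G3}
        *B = _mm_unpackhi_epi64(T2, T3); // {B0, B1, B2, B3}
    }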
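And a sketch of the packing covered from [41:43] through [1:05:54], folding in mmozeiko's balanced OR tree ([1:12:01]) and the switch to _mm_cvttps_epi32 ([1:30:45]). Again, StorePixels and the register names are illustrative assumptions, and the channels are assumed to already be scaled to 0..255.

    #include <stdint.h>
    #include <emmintrin.h> // SSE2

    // Pack four pixels' float channels into 32-bit ARGB and store them.
    static void StorePixels(uint32_t *Pixel,
                            __m128 A, __m128 R, __m128 G, __m128 B)
    {
        // _mm_cvtps_epi32 rounds according to the current MXCSR rounding
        // mode; _mm_cvttps_epi32 always truncates, which sidesteps the
        // "set the rounding to something known" TODO.
        __m128i IntA = _mm_cvttps_epi32(A);
        __m128i IntR = _mm_cvttps_epi32(R);
        __m128i IntG = _mm_cvttps_epi32(G);
        __m128i IntB = _mm_cvttps_epi32(B);

        // Shift each channel into its byte of the ARGB pixel.
        __m128i Sa = _mm_slli_epi32(IntA, 24);
        __m128i Sr = _mm_slli_epi32(IntR, 16);
        __m128i Sg = _mm_slli_epi32(IntG, 8);

        // Balanced OR tree: the two inner ORs have no dependency on each
        // other, so they can issue in parallel instead of forming a
        // serial or(or(or(r, g), b), a) chain.
        __m128i Out = _mm_or_si128(_mm_or_si128(Sa, Sr),
                                   _mm_or_si128(Sg, IntB));

        // The destination row is not guaranteed 16-byte aligned, so use
        // the unaligned store (movdqu) rather than _mm_store_si128 (movdqa).
        _mm_storeu_si128((__m128i *)Pixel, Out);
    }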