[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Packing Pixels for the Framebuffer" vod_platform=youtube id=90eSF6jLzvQ annotator=dspecht annotator=Miblo]
[2:05][Load up the code and consider optimisation]
[4:09][handmade_render_group.cpp: Comment out if(ShouldFill\[I\])]
[5:34][Blackboard: Interleaving four SIMD values]
[14:27][Blackboard: Establishing the order we need]
[15:46][handmade_render_group.cpp: Write the SIMD register names that we want to end up with]
[16:29][Internet: Intel Intrinsics Guide[ref site="Intel" page="Intrinsics Guide" url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[17:23][Blackboard: _mm_unpackhi_epi32 and _mm_unpacklo_epi32]
[19:04][Blackboard: Using these operations to generate what we need]
[24:17][handmade_render_group.cpp: Name the registers in register order]
[25:15][Internet: Double-check the parameter order of the unpack operations]
[26:22][handmade_render_group.cpp: Start to populate the registers]
[26:52][Internet: Keeping in mind how often you move between __m128 and __m128i]
[28:39][handmade_render_group.cpp: Cast the Blended values from float to int]
[29:47][Use structured art to enable us to see what's happening]
[34:47][Debugger: Watch how our art gets shuffled]
[38:40][handmade_render_group.cpp: Produce the rest of the pixel values we need]
[41:43][Convert 32-bit floating point values to 8-bit integers]
[44:07][// TODO(casey): Set the rounding to something known]
[45:08][Blackboard: Using 8 bits of these 32-bit registers]
[47:32][handmade_render_group.cpp: Bitwise OR and Shift these values]
[50:27][Blackboard: How the shift operations work]
[52:44][handmade_render_group.cpp: Implement these shifts]
[55:06][Debugger: Take a look at the Out value]
[57:33][handmade_render_group.cpp: Break out the values]
[58:22][Debugger: Inspect these values]
[58:35][handmade_render_group.cpp: Fix the test case]
[59:32][Debugger: Inspect our stuff]
[1:00:13][handmade_render_group.cpp: Write Out to Pixel]
[1:01:08][Debugger: Crash and reload]
[1:01:43][Debugger: Note that we are writing unaligned]
[1:04:22][Blackboard: Alignment]
[1:05:54][handmade_render_group.cpp: Issue _mm_storeu_si128 to cause the compiler to use an unaligned mov instruction (movdqu)]
[1:07:23][Recap and glimpse into the future]
[1:08:30][Q&A][:speech]
[1:09:59][@braincruser][Will the operations be reordered to reduce the number of ops and loads / stores?]
[1:12:01][@mmozeiko][You are calculating Out like or(or(or(r, g), b), a). Would it be better to do it like this: or(or(r, g), or(b, a)), so the first two or's are not dependent on each other?]
[1:14:57][handmade_render_group.cpp: Write it the way mmozeiko suggests]
[1:17:31][@uspred][Do you need to start with 32-bit floats? Is there further optimization that doesn't need the casting?]
[1:18:21][Blackboard: Multiplying floats vs multiplying integers]
[1:19:54][@mmozeiko][The same applies to the texture bilinear adds]
[1:20:03][handmade_render_group.cpp: Implement mmozeiko's suggestion]
[1:23:00][@flaturated][Can you compile with /O2 to compare it to last week's performance?]
[1:23:16][@brblackmer][Why did you make macros for your SIMD operations (mmSquare, etc.) vs making functions?]
[1:23:39][@quikligames][Are these intrinsics the same on other operating systems or compilers, as long as it's using Intel architecture?]
[1:24:40][@mmozeiko][Why do you say unaligned store is nasty? As far as I know, on the latest Intel CPUs (at least starting from Ivy Bridge) unaligned load / store is not very expensive anymore (<5% difference)]
[1:26:25][@plain_flavored][Is scalar access to __m128 elements still slow on Intel?]
[1:27:18][@braincruser][The processor window is 192 instructions]
[1:28:01][@gasto5][I don't understand how one optimizes by using the intrinsic OR function]
[1:28:51][@mmozeiko][_mm_cvttps_epi32 always truncates. Would that be better than messing with rounding mode?]
[1:30:45][handmade_render_group.cpp: Switch to _mm_cvttps_epi32]
[1:32:50][Wrap up][:speech]
[/video]
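For reference, here is a minimal sketch of the interleave discussed from [5:34] to [26:22]: using only unpack operations to transpose four pixel-major (AoS) registers into four channel-major (SoA) registers. The function and register names (TransposePixels, P0..P3, A/R/G/B) are illustrative, not the names used on stream, and the channel order is assumed.

    #include <emmintrin.h> // SSE2

    // Transpose four AoS registers, Pn = {An, Rn, Gn, Bn} for pixel n,
    // into SoA channel registers using only unpacklo/unpackhi.
    static void TransposePixels(__m128i P0, __m128i P1, __m128i P2, __m128i P3,
                                __m128i *A, __m128i *R, __m128i *G, __m128i *B)
    {
        __m128i T0 = _mm_unpacklo_epi32(P0, P1); // {A0, A1, R0, R1}
        __m128i T1 = _mm_unpacklo_epi32(P2, P3); // {A2, A3, R2, R3}
        __m128i T2 = _mm_unpackhi_epi32(P0, P1); // {G0, G1, B0, B1}
        __m128i T3 = _mm_unpackhi_epi32(P2, P3); // {G2, G3, B2, B3}

        *A = _mm_unpacklo_epi64(T0, T1); // {A0, A1, A2, A3}
        *R = _mm_unpackhi_epi64(T0, T1); // {R0, R1, R2, R3}
        *G = _mm_unpacklo_epi64(T2, T3); // {G0, G1, G2, G3}
        *B = _mm_unpackhi_epi64(T2, T3); // {B0, B1, B2, B3}
    }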
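And a sketch of the packing covered from [41:43] through [1:05:54], folding in mmozeiko's balanced OR tree ([1:12:01]) and the switch to _mm_cvttps_epi32 ([1:30:45]). Again, StorePixels and the register names are illustrative assumptions, and the channels are assumed to already be scaled to 0..255.

    #include <stdint.h>
    #include <emmintrin.h> // SSE2

    // Pack four pixels' float channels into 32-bit ARGB and store them.
    static void StorePixels(uint32_t *Pixel,
                            __m128 A, __m128 R, __m128 G, __m128 B)
    {
        // _mm_cvtps_epi32 rounds according to the current MXCSR rounding
        // mode; _mm_cvttps_epi32 always truncates, which sidesteps the
        // "set the rounding to something known" TODO.
        __m128i IntA = _mm_cvttps_epi32(A);
        __m128i IntR = _mm_cvttps_epi32(R);
        __m128i IntG = _mm_cvttps_epi32(G);
        __m128i IntB = _mm_cvttps_epi32(B);

        // Shift each channel into its byte of the ARGB pixel.
        __m128i Sa = _mm_slli_epi32(IntA, 24);
        __m128i Sr = _mm_slli_epi32(IntR, 16);
        __m128i Sg = _mm_slli_epi32(IntG, 8);

        // Balanced OR tree: the two inner ORs have no dependency on each
        // other, so they can issue in parallel instead of forming a
        // serial or(or(or(r, g), b), a) chain.
        __m128i Out = _mm_or_si128(_mm_or_si128(Sa, Sr),
                                   _mm_or_si128(Sg, IntB));

        // The destination row is not guaranteed 16-byte aligned, so use
        // the unaligned store (movdqu) rather than _mm_store_si128 (movdqa).
        _mm_storeu_si128((__m128i *)Pixel, Out);
    }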