cinera_handmade.network/cmuratori/hero/code/code117.hmml

[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Packing Pixels for the Framebuffer" vod_platform=youtube id=90eSF6jLzvQ annotator=dspecht annotator=Miblo]
[2:05][Load up the code and consider optimisation]
[4:09][handmade_render_group.cpp: Comment out if(ShouldFill\[I\])]
[5:34][Blackboard: Interleaving four SIMD values]
[14:27][Blackboard: Establishing the order we need]
[15:46][handmade_render_group.cpp: Write the SIMD register names that we want to end up with]
[16:29][Internet: Intel Intrinsics Guide[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[17:23][Blackboard: __mm_unpackhi_epi32 and __mm_unpacklo_epi32]
[19:04][Blackboard: Using these operations to generate what we need]
[24:17][handmade_render_group.cpp: Name the registers in register order]
[25:15][Internet: Double-check the parameter order of the unpack operations]
[26:22][handmade_render_group.cpp: Start to populate the registers]
[26:52][Internet: Keeping in mind how often you move between __m128 and __m128i]
[28:39][handmade_render_group.cpp: Cast the Blended values from float to int]
[29:47][Use structured art to enable us to see what's happening]
[34:47][Debugger: Watch how our art gets shuffled]
[38:40][handmade_render_group.cpp: Produce the rest of the pixel values we need]
[41:43][Convert 32-bit floating point values to 8-bit integers]
[44:07][// TODO(casey): Set the rounding to something known]
[45:08][Blackboard: Using 8-bits of these 32-bit registers]
[47:32][handmade_render_group.cpp: Bitwise OR and Shift these values]
[50:27][Blackboard: How the shift operations work]
[52:44][handmade_render_group.cpp: Implement these shifts]
[55:06][Debugger: Take a look at the Out value]
[57:33][handmade_render_group.cpp: Break out the values]
[58:22][Debugger: Inspect these values]
[58:35][handmade_render_group.cpp: Fix the test case]
[59:32][Debugger: Inspect our stuff]
[1:00:13][handmade_render_group.cpp: Write Out to Pixel]
[1:01:08][Debugger: Crash and reload]
[1:01:43][Debugger: Note that we are writing unaligned]
[1:04:22][Blackboard: Alignment]
[1:05:54][handmade_render_group.cpp: Issue _mm_storeu_si128 to cause the compiler to use the (unaligned) mov instruction]
[1:07:23][Recap and glimpse into the future]
[1:08:30][Q&A][:speech]
[1:09:59][@braincruser][Will the operations be reordered to reduce the number of ops and load / stores?]
[1:12:01][@mmozeiko][You are calculating Out like or(or(or(r, g), b), a). Would it be better to do it like this: or(or(r, g), or(b, a)), so first two or's are not dependent on each other?]
[1:14:57][handmade_render_group.cpp: Write it the way mmozeiko suggests]
[1:17:31][@uspred][Do you need to start with 32-bit floats? Is there further optimization that doesn't need the casting?]
[1:18:21][Blackboard: Multiplying floats vs Multiplying integers]
[1:19:54][@mmozeiko][Same for texture bilinear adds together]
[1:20:03][handmade_render_group.cpp: Implement mmozeiko's suggestion]
[1:23:00][@flaturated][Can you compile /O2 to compare it to last week's performance?]
[1:23:16][@brblackmer][Why did you make macros for your SIMD operations (mmSquare, etc.) vs making functions?]
[1:23:39][@quikligames][Are these intrinsics the same on other operating systems or compilers, as long as it's using Intel architecture?]
[1:24:40][@mmozeiko][Why do you say unaligned store is nasty? As far as I know, for latest Intel CPUs (at least starting from Ivy Bridge) unaligned load / store is not very expensive anymore (<5% difference)]
[1:26:25][@plain_flavored][Is scalar access to __m128 elements still slow on Intel?]
[1:27:18][@braincruser][The processor window is 192 instructions]
[1:28:01][@gasto5][I don't understand how one optimizes by using the intrinsic or function]
[1:28:51][@mmozeiko][_mm_cvttps_epi32 always truncates. Would that be better than messing with rounding mode?]
[1:30:45][handmade_render_group.cpp: Switch to _mm_cvttps_epi32]
[1:32:50][Wrap up][:speech]
[/video]
Relocate riscy and add newly converted hero The idea here is to reduce the amount of superfluous stuff downloaded to each server running cinera 2017-12-06 22:26:13 +00:00			`[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Packing Pixels for the Framebuffer" vod_platform=youtube id=90eSF6jLzvQ annotator=dspecht annotator=Miblo]`
			`[2:05][Load up the code and consider optimisation]`
			`[4:09][handmade_render_group.cpp: Comment out if(ShouldFill\[I\])]`
			`[5:34][Blackboard: Interleaving four SIMD values]`
			`[14:27][Blackboard: Establishing the order we need]`
			`[15:46][handmade_render_group.cpp: Write the SIMD register names that we want to end up with]`
			`[16:29][Internet: Intel Intrinsics Guide[ref`
			`site="Intel"`
			`page="Intrinsics Guide"`
			`url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]`
			`[17:23][Blackboard: __mm_unpackhi_epi32 and __mm_unpacklo_epi32]`
			`[19:04][Blackboard: Using these operations to generate what we need]`
			`[24:17][handmade_render_group.cpp: Name the registers in register order]`
			`[25:15][Internet: Double-check the parameter order of the unpack operations]`
			`[26:22][handmade_render_group.cpp: Start to populate the registers]`
			`[26:52][Internet: Keeping in mind how often you move between __m128 and __m128i]`
			`[28:39][handmade_render_group.cpp: Cast the Blended values from float to int]`
			`[29:47][Use structured art to enable us to see what's happening]`
			`[34:47][Debugger: Watch how our art gets shuffled]`
			`[38:40][handmade_render_group.cpp: Produce the rest of the pixel values we need]`
			`[41:43][Convert 32-bit floating point values to 8-bit integers]`
			`[44:07][// TODO(casey): Set the rounding to something known]`
			`[45:08][Blackboard: Using 8-bits of these 32-bit registers]`
			`[47:32][handmade_render_group.cpp: Bitwise OR and Shift these values]`
			`[50:27][Blackboard: How the shift operations work]`
			`[52:44][handmade_render_group.cpp: Implement these shifts]`
			`[55:06][Debugger: Take a look at the Out value]`
			`[57:33][handmade_render_group.cpp: Break out the values]`
			`[58:22][Debugger: Inspect these values]`
			`[58:35][handmade_render_group.cpp: Fix the test case]`
			`[59:32][Debugger: Inspect our stuff]`
			`[1:00:13][handmade_render_group.cpp: Write Out to Pixel]`
			`[1:01:08][Debugger: Crash and reload]`
			`[1:01:43][Debugger: Note that we are writing unaligned]`
			`[1:04:22][Blackboard: Alignment]`
			`[1:05:54][handmade_render_group.cpp: Issue _mm_storeu_si128 to cause the compiler to use the (unaligned) mov instruction]`
			`[1:07:23][Recap and glimpse into the future]`
Fix some incorrectly converted annotations Also apply some :speech categorisation 2018-03-07 21:48:09 +00:00			`[1:08:30][Q&A][:speech]`
Relocate riscy and add newly converted hero The idea here is to reduce the amount of superfluous stuff downloaded to each server running cinera 2017-12-06 22:26:13 +00:00			`[1:09:59][@braincruser][Will the operations be reordered to reduce the number of ops and load / stores?]`
			`[1:12:01][@mmozeiko][You are calculating Out like or(or(or(r, g), b), a). Would it be better to do it like this: or(or(r, g), or(b, a)), so first two or's are not dependent on each other?]`
			`[1:14:57][handmade_render_group.cpp: Write it the way mmozeiko suggests]`
			`[1:17:31][@uspred][Do you need to start with 32-bit floats? Is there further optimization that doesn't need the casting?]`
			`[1:18:21][Blackboard: Multiplying floats vs Multiplying integers]`
			`[1:19:54][@mmozeiko][Same for texture bilinear adds together]`
			`[1:20:03][handmade_render_group.cpp: Implement mmozeiko's suggestion]`
			`[1:23:00][@flaturated][Can you compile /O2 to compare it to last week's performance?]`
			`[1:23:16][@brblackmer][Why did you make macros for your SIMD operations (mmSquare, etc.) vs making functions?]`
			`[1:23:39][@quikligames][Are these intrinsics the same on other operating systems or compilers, as long as it's using Intel architecture?]`
			`[1:24:40][@mmozeiko][Why do you say unaligned store is nasty? As far as I know, for latest Intel CPUs (at least starting from Ivy Bridge) unaligned load / store is not very expensive anymore (<5% difference)]`
			`[1:26:25][@plain_flavored][Is scalar access to __m128 elements still slow on Intel?]`
			`[1:27:18][@braincruser][The processor window is 192 instructions]`
			`[1:28:01][@gasto5][I don't understand how one optimizes by using the intrinsic or function]`
			`[1:28:51][@mmozeiko][_mm_cvttps_epi32 always truncates. Would that be better than messing with rounding mode?]`
			`[1:30:45][handmade_render_group.cpp: Switch to _mm_cvttps_epi32]`
Fix some incorrectly converted annotations Also apply some :speech categorisation 2018-03-07 21:48:09 +00:00			`[1:32:50][Wrap up][:speech]`
Relocate riscy and add newly converted hero The idea here is to reduce the amount of superfluous stuff downloaded to each server running cinera 2017-12-06 22:26:13 +00:00			`[/video]`