cinera_handmade.network/cmuratori/hero/code/code118.hmml

[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code template=code118_template.html title="Wide Unpacking and Masking" vod_platform=youtube id=-_X0UYCGaVA annotator=ChronalDragon annotator=dspecht annotator=ZedZull annotator=Miblo]
[0:00:25][Overview of optimization work]
[0:01:30][Recap where we were yesterday]
[0:01:50][Current issue: Black bars]
[0:03:20][Blackboard: Writing correct values to destination]
[0:05:35][It's ok to do all operations for all pixels]
[0:06:52][Blackboard: Another option: Combine old/new values]
[0:08:14][Blackboard: Build a mask]
[0:09:00][Masking out the invalid new values]
[0:10:50][Making sure we save the original destination]
[0:11:38][Haven't SIMD-ized the load yet, deal with OriginalDest differently]
[0:12:55][Problem with WriteMask: Haven't computed it yet!]
[0:14:00][Use cheesy set macros to set WriteMask]
[0:14:16][Handmade Hero: A Bit Garish edition]
[0:15:20][Fixing the 'problem': Mi macro for uint setting]
[0:16:00][Another thing: Fabian's rounding mode comment]
[0:16:57][Some work to do with the last for(I) loop]
[0:19:34][The explicit version of unrolling the loop]
[0:22:00][Checking we're still working: under 100 cycles now]
[0:23:10][Doing the destination the same way]
[0:23:50][Just saved more cycles moving things out]
[0:24:35][Fixing the WriteMask nonsense]
[0:25:38][SSE Comparison Operations]
[0:26:20][Blackboard: Comparisons for wide operations]
[0:29:43][Using comparisons to generate WriteMask directly]
[0:31:50][Working WriteMask with wide operations]
[0:32:10][Problem: can't get rid of if entirely...]
[0:32:40][Solution: Clamp U and V]
[0:33:40][Get rid of the if entirely!]
[0:33:54][Handmade Hero: Uniformly Stretchy Edition]
[0:34:05][Fixing the bug: U/V copypasta typo]
[0:35:05][Doing the texel fetch wide as well]
[0:37:30][Not optimizing yet, just translating to SIMD]
[0:39:45][Adjusting the texture fetch to use the wide values]
[0:40:30][Converting the fetch coord by truncating]
[0:42:00][Getting fX and fY by subtraction]
[0:43:30][All correct, under 70 cycles]
[0:44:10][No longer need to initialize the Texel values]
[0:46:00][Everything in SIMD now but texel loads]
[0:46:50][Blackboard: Unpacking the color data]
[0:48:30][Pulling out colors using masks and shifting]
[0:53:20][Blackboard: The matrix of sample reads]
[0:55:00][Packing the sample data into 4-wide registers]
[0:55:48][Some crazy emacs macro kung-fu]
[0:56:50][Doing the Texels the same way as Dest]
[0:58:05][Working texel read, and...almost 50cy/pixel]
[0:59:25][What if there's nothing in the mask?]
[1:01:19][Q&A][:speech]
[1:02:03][@grumpygiant256][Could you not just align the X coord to a 4-pixel boundary up front, and thereby use aligned loads and stores?]
[1:03:03][@garlandobloom][Are you pulling this code over into ground splats soon?]
[1:05:15][@ostrovskivlad][Is it me or after this whole SIMD conversion the cycles per pixel are much more consistent?]
[1:05:44][@ifingerbangedurcat][I have kind of missed the past few days, I'm wondering if doing CPU intrinsics exclusively for SSE2 in your game code is bad or are we targeting SSE2? For example, should we wrap everything into platform-specific files so it's easier to target other platforms?]
[1:08:35][@flyingsand][What does it mean for intrinsics that don't have a specified throughput?]
[1:08:51][@kelimion][Instead of loading the destination first, would it be faster to skip that and instead do a masked write, e.g. _mm_maskmoveu_si128?]
[1:11:56][@tobeypeters][Would it be a good idea to just use SIMD for all our math operations in all our programs?]
[1:15:36][@flyingsand][Example of an intrinsic with no throughput: _mm_cmpgt_ps]
[1:21:00][@grumpygiant][Agner Fog says the throughput is 1]
[1:22:16][@mrstone56][\[What is latency vs throughput?\]]
[1:22:46][@themarsala][What is the end goal of the optimization, trying to get below a certain threshold, or just to get everything converted?]
[1:23:54][@tobeypeters][Does size of variables and stuff matter to SIMD, like 32bit vs 64bit?]
[1:25:45][@hellotanjent][Is the SSE code doing any cache prefetch or hinting stuff yet?]
[1:27:12][@allaizn][Couldn't we use a half-float instead of floats as we don't need that much precision with only 255 discrete values?]
[1:28:50][@ttbjm][Is the normal map code going to be converted to SIMD?]
[1:29:27][End of the stream][:speech]
[/video]