cinera_handmade.network/cmuratori/hero/code/code120.hmml

[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Measuring Port Usage with IACA" vod_platform=youtube id=-c-0s6KiPSw annotator=Miblo annotator=AndrewJDR annotator=ZedZull]
[0:31][Review of last session: cycle counting code]
[1:20][Fabian: Instruction counts, even with throughput numbers included, are not accurate, because they do not properly account for the CPU's ability to overlap different ops]
[4:00][Accurate way is to write tool to simulate the CPU, as Casey did for the XB360]
[4:30][... or use the Intel Architecture Code Analyzer (IACA)]
[7:35][Overview on how to use IACA with the code]
[8:08][Marking sections with IACA_START and IACA_END]
[9:20][Modifying build.bat to include the iaca directory]
[10:09][For Linux/Unix compatibility, mind your case when including files]
[11:38][Running the IACA command line]
[13:10][Reading the IACA results]
[15:28][Trying to decipher the meanings of the letters in the IACA table]
[16:58][IACA can output graphs?]
[17:12][IACA reports max throughput of 86.60 cycles]
[17:56][There may be some more room for optimization...]
[18:30][IACA is pretty nice!]
[19:20][Adding some macros to turn IACA on/off]
[19:47][Thanks to Fabian for the suggestion]
[20:44][Fabian: bilinear and squaring don't need floating point]
[21:19][Move the sRGB->linear conversion after the bilinear]
[23:25][Bake normalization into color]
[25:25][Works fine (not much improvement)]
[27:52][Remove a number of multiply ops by keeping things in 0-255 space (no improvement)]
[36:47][Diff IACA output from the run with the removed multiplies and the one prior]
[41:30][Getting rid of 43 instructions did not improve throughput reported by IACA]
[42:46][Seems to be doing the same number of multiplies either way]
[43:09][Compiler was smart enough to do the transformations?]
[45:07][What other optimizations could we do?]
[47:19][Use _mm_mullo_epi16 / _mm_mulhi_epi16 to do the square operations wider, prior to the FP conversion?]
[49:34][Blackboard: Mask out A and G, which leaves R and B aligned to the 16-bit SIMD boundaries]
[52:53][Blackboard: _mm_mullo_epi16 vs. _mm_mulhi_epi16]
[54:36][Problem: this will square our alpha as well]
[55:19][We'll have to use another instruction to handle alpha]
[57:00][Bitshifting / masking to pull the components from their 16-bit lanes]
[58:39][Wrong result! It's Q&A, but let's try to debug first...]
[1:06:35][Issue found: Should be masking 16-bit, not 8-bit]
[1:07:00][Better, but still a strange result]
[1:07:57][How to avoid squaring the alpha?]
[1:09:18][Just pull the alpha out prior to the squaring?]
[1:09:46][... that works fine]
[1:10:10][Now let's convert everything to use the 16-bit squaring]
[1:10:59][... around 6 cycles improvement, but small visual problem with the bilinear]
[1:11:37][Found the issue: We were reading only from SampleA]
[1:11:47][Bilinear looks better, but there is still an oddity with green fringing around the hero]
[1:12:16][Found the issue, looks good, but...]
[1:12:42][... we're actually 8 cycles worse now]
[1:13:14][Why? Let's run it through IACA]
[1:13:44][Throughput bottleneck: Inter-iteration? Good question for Fabian]
[1:13:59][Total number of micro-ops has gotten smaller (350 vs. 306 vs. 283), but throughput is worse]
[1:15:56][Could use the same technique when loading the destination, but probably not a good idea]
[1:17:01][Q&A]
[1:17:30][@cubercaleb][CP stands for Critical path]
[1:18:13][@flaturated][IACA was showing Port 1 as the bottleneck, so reducing multiplies won't help]
[1:19:05][@stelar7][Inter-iteration means that run x of the loop depends on the prior run]
[1:19:56][@butwhynot1][Try hoisting out the TexturePitch/TextureMemory (several cycles improvement)]
[1:23:54][@roflraging][How do you support AVX? What about register saving through context switches?]
[1:25:07][@mmozeiko][Replace sqrt with mul/rsqrt?]
[1:31:14][Some comments on port 1 pressure from Fabian]
[1:34:19][@robotchocolatedino][How can removing the sqrt help if it's done on the multiply port, not the adder port?]
[/video]
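
The marker workflow in the 8:08 and 19:20 annotations (wrap the hot loop in IACA_START / IACA_END, then add macros to switch the markers on and off) can be sketched roughly as below. This is a minimal sketch, not the code written on stream: the switch `ANALYZE_WITH_IACA`, the wrapper macros `ANALYSIS_BEGIN` / `ANALYSIS_END`, and the kernel `sum_squares` are hypothetical names chosen for illustration. The marker names follow the annotation; the exact names to use can depend on the compiler and target, so check the `iacaMarks.h` header shipped with IACA.

```c
#include <stdint.h>

/* Hypothetical build switch: set ANALYZE_WITH_IACA=1 on the compiler
   command line only when producing a binary to feed to IACA. */
#ifndef ANALYZE_WITH_IACA
#define ANALYZE_WITH_IACA 0
#endif

#if ANALYZE_WITH_IACA
/* Header shipped with IACA; the build must add its directory to the
   include path (the 9:20 annotation does this in build.bat). */
#include <iacaMarks.h>
#define ANALYSIS_BEGIN IACA_START
#define ANALYSIS_END   IACA_END
#else
/* Markers compile away to nothing in normal builds. */
#define ANALYSIS_BEGIN
#define ANALYSIS_END
#endif

/* Placeholder kernel (an assumption, standing in for the real inner loop). */
static uint32_t sum_squares(const uint8_t *values, int count)
{
    uint32_t total = 0;
    for (int i = 0; i < count; ++i) {
        ANALYSIS_BEGIN /* start marker sits at the top of the loop body */
        total += (uint32_t)values[i] * (uint32_t)values[i];
    }
    ANALYSIS_END /* end marker sits just after the loop */
    return total;
}
```

With the switch left at 0 the function builds and runs like ordinary code; the markers only matter in the binary handed to the analyzer, which is not meant to be executed.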
Relocate riscy and add newly converted hero. The idea here is to reduce the amount of superfluous stuff downloaded to each server running cinera. 2017-12-06 22:26:13 +00:00
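
The 16-bit squaring idea from the 49:34 to 1:09:46 annotations (mask so that channels sit alone in 16-bit lanes, pull alpha out first, square with a 16-bit multiply, then extract with 16-bit masks) can be sketched as follows. This is an illustrative sketch rather than the stream's code: the 0xAARRGGBB pixel layout, the `squared_texels` struct, and the `square_texels` function are assumptions. Because 255 * 255 = 65025 fits in 16 bits, `_mm_mullo_epi16` alone holds the whole square here.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical container for the per-channel results of four pixels. */
typedef struct {
    float r[4], g[4], b[4], a[4];
} squared_texels;

/* Squares R, G and B of four packed pixels in 16-bit lanes and leaves
   alpha unsquared. Assumes each pixel is laid out as 0xAARRGGBB. */
static squared_texels square_texels(const uint32_t pixels[4])
{
    __m128i texel = _mm_loadu_si128((const __m128i *)pixels);
    __m128i mask_00ff00ff = _mm_set1_epi32(0x00FF00FF);
    __m128i mask_0000ffff = _mm_set1_epi32(0x0000FFFF);

    /* Each 32-bit pixel becomes two 16-bit lanes holding one channel each. */
    __m128i rb = _mm_and_si128(texel, mask_00ff00ff);                    /* R | B */
    __m128i ag = _mm_and_si128(_mm_srli_epi32(texel, 8), mask_00ff00ff); /* A | G */

    /* Pull alpha out before squaring so it stays linear. */
    __m128i a = _mm_srli_epi32(ag, 16);

    /* 16-bit multiply: every lane now holds channel * channel. */
    __m128i rb_sq = _mm_mullo_epi16(rb, rb);
    __m128i ag_sq = _mm_mullo_epi16(ag, ag); /* the squared alpha lane is discarded below */

    /* Extract with a 16-bit mask (an 8-bit mask would truncate the squares). */
    __m128i r_sq = _mm_srli_epi32(rb_sq, 16);
    __m128i b_sq = _mm_and_si128(rb_sq, mask_0000ffff);
    __m128i g_sq = _mm_and_si128(ag_sq, mask_0000ffff);

    squared_texels out;
    _mm_storeu_ps(out.r, _mm_cvtepi32_ps(r_sq));
    _mm_storeu_ps(out.g, _mm_cvtepi32_ps(g_sq));
    _mm_storeu_ps(out.b, _mm_cvtepi32_ps(b_sq));
    _mm_storeu_ps(out.a, _mm_cvtepi32_ps(a));
    return out;
}
```

Two 16-bit multiplies square R, G and B for four pixels, compared with one floating-point multiply per channel after conversion. As the 1:12:42 and 1:13:59 annotations note, fewer micro-ops did not translate into better measured throughput in the episode, so this shows the technique rather than a guaranteed win.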