61 lines
3.9 KiB
Plaintext
|
[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Measuring Port Usage with IACA" vod_platform=youtube id=-c-0s6KiPSw annotator=Miblo annotator=AndrewJDR annotator=ZedZull]
|
||
|
[0:31][Review of last session: cycle counting code]
|
||
|
[1:20][Fabian: Counting instructions using throughput numbers is not accurate, as it does not properly take into account the CPU's ability to overlap different ops]
|
||
|
[4:00][The accurate way is to write a tool to simulate the CPU, as Casey did for the XB360]
|
||
|
[4:30][... or use the Intel Architecture Code Analyzer (IACA)]
|
||
|
[7:35][Overview on how to use IACA with the code]
|
||
|
[8:08][Marking sections with IACA_START and IACA_END]
|
||
|
[9:20][Modifying build.bat to include the iaca directory]
|
||
|
[10:09][For Linux/Unix compatibility, mind your case when including files]
|
||
|
[11:38][Running the IACA command line]
|
||
|
[13:10][Reading the IACA results]
|
||
|
[15:28][Trying to decipher the meanings of the letters in the IACA table]
|
||
|
[16:58][IACA can output graphs?]
|
||
|
[17:12][IACA reports max throughput of 86.60 cycles]
|
||
|
[17:56][There may be some more room for optimization...]
|
||
|
[18:30][IACA is pretty nice!]
|
||
|
[19:20][Adding some macros to turn IACA on/off]
|
||
|
[19:47][Thanks to Fabian for the suggestion]
|
||
|
[20:44][Fabian: bilinear and squaring don't need floating point]
|
||
|
[21:19][Move the sRGB->linear conversion after the bilinear]
|
||
|
[23:25][Bake normalization into color]
|
||
|
[25:25][Works fine (not much improvement)]
|
||
|
[27:52][Remove a number of multiply ops by keeping things in 0-255 space (no improvement)]
|
||
|
[36:47][Diff IACA output from the run with the removed multiplies and the one prior]
|
||
|
[41:30][Getting rid of 43 instructions did not improve throughput reported by IACA]
|
||
|
[42:46][Seems to be doing the same number of multiplies either way]
|
||
|
[43:09][Compiler was smart enough to do the transformations?]
|
||
|
[45:07][What other optimizations could we do?]
|
||
|
[47:19][Use _mm_mulhi_epi16 to do the square operations wider, prior to the FP conversion?]
|
||
|
[49:34][Blackboard: Mask out A and G, which leaves R and B aligned to the 16-bit SIMD boundaries]
|
||
|
[52:53][Blackboard: _mm_mullo_epi16 vs. _mm_mulhi_epi16]
|
||
|
[54:36][Problem: this will square our alpha as well]
|
||
|
[55:19][We'll have to use another instruction to handle alpha]
|
||
|
[57:00][Bitshifting / masking to pull the components from their 16-bit lanes]
|
||
|
[58:39][Wrong result! It's Q&A, but let's try to debug first...]
|
||
|
[1:06:35][Issue found: Should be masking 16-bit, not 8-bit]
|
||
|
[1:07:00][Better, but still a strange result]
|
||
|
[1:07:57][How to avoid squaring the alpha?]
|
||
|
[1:09:18][Just pull the alpha out prior to the squaring?]
|
||
|
[1:09:46][... that works fine]
|
||
|
[1:10:10][Now let's convert everything to use the 16-bit squaring]
|
||
|
[1:10:59][... around 6 cycles improvement, but small visual problem with the bilinear]
|
||
|
[1:11:37][Found the issue: We were reading only from SampleA]
|
||
|
[1:11:47][Bilinear looks better, but there is still an oddity with green fringing around the hero]
|
||
|
[1:12:16][Found the issue, looks good, but...]
|
||
|
[1:12:42][... we're actually 8 cycles worse now]
|
||
|
[1:13:14][Why? Let's run it through IACA]
|
||
|
[1:13:44][Throughput bottleneck: Inter-iteration? Good question for Fabian]
|
||
|
[1:13:59][Total number of micro-ops has gotten smaller (350 vs. 306 vs. 283), but throughput is worse]
|
||
|
[1:15:56][Could use the same technique when loading the destination, but probably not a good idea]
|
||
|
[1:17:01][Q&A]
|
||
|
[1:17:30][@cubercaleb][CP stands for Critical path]
|
||
|
[1:18:13][@flaturated][IACA was showing Port 1 as the bottleneck, so reducing multiplies won't help]
|
||
|
[1:19:05][@stelar7][Inter-iteration means that run x of the loop depends on the prior run]
|
||
|
[1:19:56][@butwhynot1][Try hoisting out the TexturePitch/TextureMemory (several cycles improvement)]
|
||
|
[1:23:54][@roflraging][How do you support AVX? What about register saving through context switches?]
|
||
|
[1:25:07][@mmozeiko][Replace sqrt with mul/rsqrt?]
|
||
|
[1:31:14][Some comments on port 1 pressure from Fabian]
|
||
|
[1:34:19][@robotchocolatedino][How can removing the sqrt help if it's done on the multiply port, not the adder port?]
|
||
|
[/video]
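The entries at 8:08 and 19:20 mention marking a section with IACA_START and IACA_END and adding macros to turn IACA on and off. A minimal sketch of that idea follows; it is not the code written on stream. `USE_IACA`, `BEGIN_ANALYSIS`, `END_ANALYSIS` and `sum_squares` are hypothetical names invented for illustration, and the header name `iacaMarks.h` is assumed from the IACA distribution (MSVC 64-bit builds may need different marker macros).

```c
/* USE_IACA is a hypothetical build switch. Set it to 1 only when the
   IACA directory has been added to the compiler's include path. */
#ifndef USE_IACA
#define USE_IACA 0
#endif

#if USE_IACA
#include <iacaMarks.h>          /* assumed header name; mind the case */
#define BEGIN_ANALYSIS IACA_START
#define END_ANALYSIS   IACA_END
#else
/* With IACA switched off, the markers expand to nothing. */
#define BEGIN_ANALYSIS
#define END_ANALYSIS
#endif

/* Hypothetical stand-in for the inner loop being analysed. */
static unsigned sum_squares(const unsigned char *values, int count)
{
    unsigned total = 0;
    for (int i = 0; i < count; ++i) {
        BEGIN_ANALYSIS              /* start marker at the top of the loop body */
        total += (unsigned)values[i] * values[i];
    }
    END_ANALYSIS                    /* end marker just after the loop */
    return total;
}
```

Builds with the switch off behave exactly as if the markers were never there, so the marked routine can stay in the shipping code path.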
|
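The blackboard entries at 49:34 and 57:00 describe masking out A and G so that R and B sit in their own 16-bit lanes, squaring them with a 16-bit multiply, and then shifting and masking to pull the results back out. The sketch below illustrates that technique under stated assumptions: pixels are packed as 0xAARRGGBB, SSE2 is available, and the function name `square_rb` and its output layout are made up for this example. It handles only R and B; G and alpha would need separate handling, as the 54:36 to 1:09:18 entries note.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Squares the R and B channels of four packed 0xAARRGGBB pixels at once.
   r_sq and b_sq receive one 32-bit result per pixel. */
static void square_rb(const uint32_t pixels[4],
                      uint32_t r_sq[4], uint32_t b_sq[4])
{
    __m128i texel   = _mm_loadu_si128((const __m128i *)pixels);
    __m128i mask_rb = _mm_set1_epi32(0x00FF00FF);

    /* Masking out A and G leaves R in the high 16-bit lane and B in the
       low 16-bit lane of every 32-bit pixel. */
    __m128i rb = _mm_and_si128(texel, mask_rb);

    /* 16-bit multiply keeping the low 16 bits of each product.
       255 * 255 = 65025 still fits in 16 unsigned bits, so nothing is lost. */
    __m128i rb_squared = _mm_mullo_epi16(rb, rb);

    /* Shift and mask to pull each squared channel out of its 16-bit lane. */
    __m128i r = _mm_srli_epi32(rb_squared, 16);
    __m128i b = _mm_and_si128(rb_squared, _mm_set1_epi32(0x0000FFFF));

    _mm_storeu_si128((__m128i *)r_sq, r);
    _mm_storeu_si128((__m128i *)b_sq, b);
}
```

Note that the final mask is 16 bits wide, not 8, since a squared channel occupies the whole lane (compare the 1:06:35 entry).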