[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Measuring Port Usage with IACA" vod_platform=youtube id=-c-0s6KiPSw annotator=Miblo annotator=AndrewJDR annotator=ZedZull]
[0:31][Review of last session: cycle counting code]
[1:20][Fabian: Counting instructions, even with throughput numbers, is not accurate because it does not properly account for the CPU's ability to overlap different ops]
[4:00][The accurate way is to write a tool that simulates the CPU, as Casey did for the XB360]
[4:30][... or use the Intel Architecture Code Analyzer (IACA)]
[7:35][Overview on how to use IACA with the code]
[8:08][Marking sections with IACA_START and IACA_END]
[9:20][Modifying build.bat to include the iaca directory]
[10:09][For Linux/Unix compatibility, mind your case when including files]
[11:38][Running the IACA command line]
[13:10][Reading the IACA results]
[15:28][Trying to decipher the meanings of the letters in the IACA table]
[16:58][IACA can output graphs?]
[17:12][IACA reports max throughput of 86.60 cycles]
[17:56][There may be some more room for optimization...]
[18:30][IACA is pretty nice!]
[19:20][Adding some macros to turn IACA on/off]
[19:47][Thanks to Fabian for the suggestion]
[20:44][Fabian: bilinear and squaring don't need floating point]
[21:19][Move the sRGB->linear conversion after the bilinear]
[23:25][Bake normalization into color]
[25:25][Works fine (not much improvement)]
[27:52][Remove a number of multiply ops by keeping things in 0-255 space (no improvement)]
[36:47][Diff IACA output from the run with the removed multiplies and the one prior]
[41:30][Getting rid of 43 instructions did not improve throughput reported by IACA]
[42:46][Seems to be doing the same number of multiplies either way]
[43:09][Compiler was smart enough to do the transformations?]
[45:07][What other optimizations could we do?]
[47:19][Use _mm_mullo_epi16 / _mm_mulhi_epi16 to do the squaring at a wider SIMD width prior to the FP conversion?]
[49:34][Blackboard: Mask out A and G, which leaves R and B aligned to the 16-bit SIMD boundaries]
[52:53][Blackboard: _mm_mullo_epi16 vs. _mm_mulhi_epi16]
[54:36][Problem: This will square our alpha as well]
[55:19][We'll have to use another instruction to handle alpha]
[57:00][Bitshifting / masking to pull the components from their 16-bit lanes]
[58:39][Wrong result! It's Q&A, but let's try to debug first...]
[1:06:35][Issue found: Should be masking 16-bit, not 8-bit]
[1:07:00][Better, but still a strange result]
[1:07:57][How to avoid squaring the alpha?]
[1:09:18][Just pull the alpha out prior to the squaring?]
[1:09:46][... that works fine]
[1:10:10][Now let's convert everything to use the 16-bit squaring]
[1:10:59][... around 6 cycles improvement, but small visual problem with the bilinear]
[1:11:37][Found the issue: We were reading only from SampleA]
[1:11:47][Bilinear looks better, but there is still an oddity with green fringing around the hero]
[1:12:16][Found the issue, looks good, but...]
[1:12:42][... we're actually 8 cycles worse now]
[1:13:14][Why? Let's run it through IACA]
[1:13:44][Throughput bottleneck: Inter-iteration? Good question for Fabian]
[1:13:59][Total number of micro-ops has gotten smaller (350 vs. 306 vs. 283), but throughput is worse]
[1:15:56][Could use the same technique when loading the destination, but probably not a good idea]