[video output=day120 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Measuring Port Usage with IACA" vod_platform=youtube id=-c-0s6KiPSw annotator=Miblo annotator=AndrewJDR annotator=ZedZull]
[0:31][Review of last session: cycle counting code]
[1:20][Fabian: Instruction counts, even with throughput numbers, are not accurate, because they do not account for the CPU's ability to overlap different ops]
[4:00][The accurate way is to write a tool to simulate the CPU, as Casey did for the XB360]
[4:30][... or use the Intel Architecture Code Analyzer (IACA)]
[7:35][Overview of how to use IACA with the code]
[8:08][Marking sections with IACA_START and IACA_END]
[9:20][Modifying build.bat to include the iaca directory]
[10:09][For Linux/Unix compatibility, mind your case when including files]
[11:38][Running the IACA command line]
[13:10][Reading the IACA results]
[15:28][Trying to decipher the meanings of the letters in the IACA table]
[16:58][IACA can output graphs?]
[17:12][IACA reports max throughput of 86.60 cycles]
[17:56][There may be some more room for optimization...]
[18:30][IACA is pretty nice!]
[19:20][Adding some macros to turn IACA on/off]
[19:47][Thanks to Fabian for the suggestion]
[20:44][Fabian: bilinear and squaring don't need floating point]
[21:19][Move the sRGB->linear conversion after the bilinear]
[23:25][Bake normalization into color]
[25:25][Works fine (not much improvement)]
[27:52][Remove a number of multiply ops by keeping things in 0-255 space (no improvement)]
[36:47][Diff IACA output from the run with the removed multiplies and the one prior]
[41:30][Getting rid of 43 instructions did not improve throughput reported by IACA]
[42:46][Seems to be doing the same number of multiplies either way]
[43:09][Compiler was smart enough to do the transformations?]
[45:07][What other optimizations could we do?]
[47:19][Use _mm_mullo_epi16 / _mm_mulhi_epi16 to do the square operations wider, prior to the FP conversion?]
[49:34][Blackboard: Mask out A and G, which leaves R and B aligned to the 16-bit SIMD boundaries]
[52:53][Blackboard: _mm_mullo_epi16 vs. _mm_mulhi_epi16]
[54:36][Problem: this will square our alpha as well]
[55:19][We'll have to use another instruction to handle alpha]
[57:00][Bitshifting / masking to pull the components from their 16-bit lanes]
[58:39][Wrong result! It's Q&A time, but let's try to debug first...]
[1:06:35][Issue found: Should be masking 16-bit, not 8-bit]
[1:07:00][Better, but still a strange result]
[1:07:57][How to avoid squaring the alpha?]
[1:09:18][Just pull the alpha out prior to the squaring?]
[1:09:46][... that works fine]
[1:10:10][Now let's convert everything to use the 16-bit squaring]
[1:10:59][... around 6 cycles of improvement, but a small visual problem with the bilinear]
[1:11:37][Found the issue: We were reading only from SampleA]
[1:11:47][Bilinear looks better, but still an oddity with green fringing around the hero]
[1:12:16][Found the issue, looks good, but...]
[1:12:42][... we're actually 8 cycles worse now]
[1:13:14][Why? Let's run it through IACA]
[1:13:44][Throughput bottleneck: Inter-iteration? Good question for Fabian]
[1:13:59][Total number of micro-ops has gotten smaller (350 vs. 306 vs. 283), but throughput is worse]
[1:15:56][Could use the same technique when loading the destination, but probably not a good idea]
[1:17:01][Q&A][:speech]
[1:17:30][@cubercaleb][CP stands for critical path]
[1:18:13][@flaturated][IACA was showing Port 1 as the bottleneck, so reducing multiplies won't help]
[1:19:05][@stelar7][Inter-iteration means that run x of the loop depends on the prior run]
[1:19:56][@butwhynot1][Try hoisting out the TexturePitch/TextureMemory (several cycles of improvement)]
[1:23:54][@roflraging][How do you support AVX? What about register saving through context switches?]
[1:25:07][@mmozeiko][Replace sqrt with mul/rsqrt?]
[1:31:14][Some comments on port 1 pressure from Fabian]
[1:34:19][@robotchocolatedino][How can removing the sqrt help if it's done on the multiply port, not the adder port?]
[/video]