[2:33][Prep work for getting pre-optimization vs post-optimization cycle counts]
[3:43][Add cycle counting to DrawRectangleSlowly]
[4:41][... ~350 vs ~50 cycles per pixel!]
[5:17][How long *should* it take to fill each pixel? Let's count up all the intrinsics and their throughputs...]
[7:10][... How can we automate this counting process?]
[7:58][Answer: Override the intrinsics with macros that add to some counter variables]
[8:47][Oops, there's still some SIMDizing left to do here...]
[9:30][Use _mm_add_ps to increment PixelPx by 4 instead of scalar adds (2-3 cycles better)]
[11:55][dx and dy can be baked into PixelPx and PixelPy (2 cycles better)]
[13:08][Should we loft PixelPx and PixelPy axis multiply/add calculation out of the inner loop?]
[13:59][Maybe loft just the multiplies but not the add? Hmm...]
[14:20][... try lofting the multiplications. (1-2 cycles worse)]
[15:50][Note: Texture fetches can't be done in SIMD]
[16:52][Fabian on why _mm_maskmoveu_si128 is so slow. Don't use it! It bypasses the cache.]
[18:15][Adding a #define for each intrinsic to count operations (_mm_add_ps, _mm_mul_ps, etc)]
[21:45][Start setting up the intrinsic #defines to count operations]
[23:45][Preprocessor cleverness that handles the fact that intrinsics often take other intrinsics as params]
[27:34][Define load/store to nothing]
[28:39][Mini-rant about the compiler not doing instruction/intrinsic instrumentation automatically]
[31:46][We've got counts!]
[32:15][Double check that counts make sense]
[33:27][Multiply counts by throughputs to get a total cycle estimate]
[35:27][_mm_castps_si128 latency is difficult to know.]
[35:52][Looking up the processor core type in Windows]
[36:52][_mm_and_ps and other bitwise ops have a throughput of 1/3 cycle on Nehalem]
[40:28][Use a macro to sum up the throughput*count products to get a rough total cycle estimate]
[42:55][Well, isn't that fancy: the measured cycle count is lower than the theoretical estimate. Instructions are likely executing on multiple ALUs per cycle]
[45:40][How many execution units are in a Nehalem core?]
[48:17][... Two?]
[49:12][On the limitations of executing multiple instructions per clock]
[51:25][We're quite close to the max theoretical throughput.]
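
Note on the counting technique from 7:58 to 40:28: the sketch below is one way to override intrinsics with counting macros and then weight the counts by throughput. It is not the code written in the episode. The counter names, `MultiplyAddFirstLane`, `CheckCounts`, and `EstimateCycles` are hypothetical, and the throughput values are left as parameters because they have to be looked up for the actual target core. It assumes an x86/x64 compiler where `_mm_add_ps` and `_mm_mul_ps` are ordinary functions rather than macros.

```c
#include <assert.h>
#include <emmintrin.h> // SSE/SSE2 intrinsics (x86/x64 only)

// Hypothetical counters, one per instrumented intrinsic.
static unsigned long long Count_mm_add_ps;
static unsigned long long Count_mm_mul_ps;

// A function-like macro is not re-expanded inside its own expansion, so the
// inner name still refers to the real intrinsic. Nested intrinsics passed as
// arguments are expanded (and therefore counted) as well.
#define _mm_add_ps(A, B) (++Count_mm_add_ps, _mm_add_ps(A, B))
#define _mm_mul_ps(A, B) (++Count_mm_mul_ps, _mm_mul_ps(A, B))

// Stand-in for an inner loop body: computes X*Y + Z on four lanes and
// returns the first lane.
static float
MultiplyAddFirstLane(float X, float Y, float Z)
{
    __m128 Result = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(X), _mm_set1_ps(Y)),
                               _mm_set1_ps(Z));
    return _mm_cvtss_f32(Result);
}

// Double check that the counts match what reading the code by hand predicts.
static void
CheckCounts(unsigned long long ExpectedAdds, unsigned long long ExpectedMuls)
{
    assert(Count_mm_add_ps == ExpectedAdds);
    assert(Count_mm_mul_ps == ExpectedMuls);
}

// Sum of count * throughput (cycles per instruction). The throughput values
// are supplied by the caller; they are not hard-coded for any specific core.
static double
EstimateCycles(double AddThroughput, double MulThroughput)
{
    return (double)Count_mm_add_ps*AddThroughput +
           (double)Count_mm_mul_ps*MulThroughput;
}
```

Calling `MultiplyAddFirstLane` once and then `EstimateCycles` gives a rough lower bound for a single execution unit. Since a real core can issue several instructions per cycle, a measured count below this estimate is not necessarily an error.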