[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Counting Intrinsics" vod_platform=youtube id=NPDL1OENYio annotator=AndrewJDR annotator=ZedZull]
[0:10][Lesson: Das keyboards are horrible]
[1:36][Recap of last episode and today's agenda]
[2:33][Prep work for getting pre-optimization vs post-optimization cycle counts]
[3:43][Add cycle counting to DrawRectangleSlowly]
[4:41][... ~350 vs ~50 cycles per pixel!]
[5:17][How long *should* it take to fill each pixel? Let's count up all the intrinsics and their throughputs...]
[7:10][... How can we automate this counting process?]
[7:58][Answer: Override the intrinsics with macros that add to some counter variables]
[8:47][Oops, there's still some SIMDizing left to do here...]
[9:30][Use _mm_add_ps to increment PixelPx by 4 instead of scalar adds (2-3 cycles better)]
[11:55][dx and dy can be baked into PixelPx and PixelPy (2 cycles better)]
[13:08][Should we loft PixelPx and PixelPy axis multiply/add calculation out of the inner loop?]
[13:59][Maybe loft just the multiplies but not the add? Hmm...]
[14:20][... try lofting the multiplications. (1-2 cycles worse)]
[15:50][Note: Texture fetches can't be done in SIMD]
[16:52][Fabian on why _mm_maskmoveu_si128 is so slow. Don't use it! It bypasses the cache.]
[18:15][Adding a #define for each intrinsic to count operations (_mm_add_ps, _mm_mul_ps, etc)]
[21:45][Start setting up the intrinsic #defines to count operations]
[23:45][Preprocessor cleverness that handles the fact that intrinsics often take other intrinsics as params]
[27:34][Define load/store to nothing]
[28:39][Mini-rant about the compiler not doing instruction/intrinsic instrumentation automatically]
[31:46][We've got counts!]
[32:15][Double check that counts make sense]
[33:27][Multiply counts by throughputs to get a total latency estimate]
[35:27][_mm_castps_si128 latency is difficult to know.]
[35:52][Looking up the processor core type in Windows]
[36:52][_mm_and_ps and bitwise ops are 1/3 cycle on Nehalem]
[40:28][Use a macro to sum up the latency*counts to get a rough throughput total]
[42:55][Well, isn't that fancy: Measured throughput is lower than the theoretical best throughput. Instructions are likely executing on multiple ALUs per cycle]
[45:40][How many units are in a Nehalem core?]
[48:17][... Two?]
[49:12][On the limitations of executing multiple instructions per clock]
[51:25][We're quite close to the max theoretical throughput.]
[52:19][Memory latency probably isn't hurting performance]
[52:47][Make an #if toggle for the intrinsic measurement code]
[53:58][How much is gamma (sqrt) costing us?]
[56:30][A troubling visual artifact appears around our hero...]
[57:47][Aha! An issue with the linear/sRGB code]
[1:00:28][Gamma is costing only ~6 cycles]
[1:01:05][This is a reasonably optimized pixel loop]
[1:01:32][Agenda for next session: Optimize outside/around the pixel loop.]
[1:01:56][Q&A][:speech]
[1:02:09][@stelar7][Is this what you were looking for?]
[1:03:16][Nehalem diagram: Only one FPU?]
[1:05:52][@grumpygiant256][Worth timing the load/stores with no ALU ops to see how much we're memory bound?]
[1:12:46][@thesizik][You counted _mm_and_ps wrong.]
[1:13:35][@ieee754][Are you doing pre-multiplied alpha? (Yes)]
[1:13:38][@tenbroya][Could you run the game with task manager open?]
[1:16:17][@jayp2][Will this game only work for your specific processor?]
[1:16:43][@toppstv][Are you going to update the yellow background textures?]
[1:17:32][@braincruser][The texture fetch should be an L1 cache fetch.]
[1:18:10][@0xwid][In an alternate universe where nobody cares for art, do you think optimization would still be a focus for developers?]
[1:19:34][@miblo][Any idea why my cores get maxed out when running [~hero Handmade Hero] with the XCB platform layer?]
[1:20:33][@robotchocolatedino][Why wasn't there a greater speed increase after removing gamma correction?]
[1:22:42][@marumoto][How will we split up the drawing onto multiple cores?]
[1:22:56][@dingernalt2][What's the floating head?]
[1:23:14][@nothings2][Question about _mm_sqrt_ps and common subexpression elimination]
[1:24:14][@thesizik][What's that drum-like background noise?]
[1:24:37][@jayp2][Do you see all the questions?]
[1:25:07][@thevaber][Can rdtsc be inaccurate with CPUs that vary their cycle rate?]
[1:26:23][@cubercaleb][How does the CPU do things ahead of time if things are supposed to be done in order?]
[1:29:34][@ttbjm][Do you expect a 16x speedup from multi-threading?]
[1:29:59][@gasto5][How do you select the instruction set for optimizing?]
[1:32:55][@nothings2][Aren't the Unity hardware survey results pretty different than the Steam ones?]
[1:34:01][@captainkraft][What are the gains you get by writing your own software renderer vs using SDL, GPUs, etc?]
[1:35:24][@jayp2][Can a processor work through different types of calculations in a single cycle?]
[1:37:16][@ca2dev][What kinds of things can be delegated to the GPU?]
[/video]