[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Counting Intrinsics" vod_platform=youtube id=NPDL1OENYio annotator=AndrewJDR annotator=ZedZull]
[0:10][Lesson: Das keyboards are horrible]
[1:36][Recap of last episode and today's agenda]
[2:33][Prep work for getting pre-optimization vs post-optimization cycle counts]
[3:43][Add cycle counting to DrawRectangleSlowly]
[4:41][... ~350 vs ~50 cycles per pixel!]
[5:17][How long *should* it take to fill each pixel? Let's count up all the intrinsics and their throughputs...]
[7:10][... How can we automate this counting process?]
[7:58][Answer: Override the intrinsics with macros that add to some counter variables]
[8:47][Oops, there's still some SIMDizing left to do here...]
[9:30][Use _mm_add_ps to increment PixelPx by 4 instead of scalar adds (2-3 cycles better)]
[11:55][dx and dy can be baked into PixelPx and PixelPy (2 cycles better)]
[13:08][Should we loft PixelPx and PixelPy axis multiply/add calculation out of the inner loop?]
[13:59][Maybe loft just the multiplies but not the add? Hmm...]
[14:20][... try lofting the multiplications. (1-2 cycles worse)]
[15:50][Note: Texture fetches can't be done in SIMD]
[16:52][Fabian on why _mm_maskmoveu_si128 is so slow. Don't use it! It bypasses the cache.]
[18:15][Adding a #define for each intrinsic to count operations (_mm_add_ps, _mm_mul_ps, etc)]
[21:45][Start setting up the intrinsic #defines to count operations]
[23:45][Preprocessor cleverness that handles the fact that intrinsics often take other intrinsics as params]
[27:34][Define load/store to nothing]
[28:39][Mini-rant about the compiler not doing instruction/intrinsic instrumentation automatically]
[31:46][We've got counts!]
[32:15][Double check that counts make sense]
[33:27][Multiply counts by throughputs to get total latency estimate]
[35:27][_mm_castps_si128 latency is difficult to know.]
[35:52][Looking up the processor core type in Windows]
[36:52][_mm_and_ps and bitwise ops are 1/3 cycle on Nehalem]
[40:28][Use a macro to sum up the latency*counts to get a rough throughput total]
[42:55][Well, isn't that fancy: measured throughput is lower than the theoretical best throughput. Instructions are likely executing on multiple ALUs per cycle]
[45:40][How many units are in a Nehalem core?]
[48:17][... Two?]
[49:12][On the limitations of executing multiple instructions per clock]
[51:25][We're quite close to the max theoretical throughput.]
[52:19][Memory latency probably isn't hurting performance]
[52:47][Make an #if toggle for the intrinsic measurement code]
[53:58][How much is gamma (sqrt) costing us?]
[56:30][A troubling visual artifact appears around our hero...]
[57:47][Aha! An issue with the linear/sRGB code]
[1:00:28][Gamma is costing only ~6 cycles]
[1:01:05][This is a reasonably optimized pixel loop]
[1:01:32][Agenda for next session: Optimize outside/around the pixel loop.]
[1:01:56][Q&A][:speech]
[1:02:09][@stelar7][Is this what you were looking for?]
[1:03:16][Nehalem diagram: Only one FPU?]
[1:05:52][@grumpygiant256][Worth timing the load/stores with no ALU ops to see how much we're memory bound?]
[1:12:46][@thesizik][You counted _mm_and_ps wrong.]
[1:13:35][@ieee754][Are you doing pre-multiplied alpha? (Yes)]
[1:13:38][@tenbroya][Could you run the game with task manager open?]
[1:16:17][@jayp2][Will this game only work for your specific processor?]
[1:16:43][@toppstv][Are you going to update the yellow background textures?]
[1:17:32][@braincruser][The texture fetch should be an L1 cache fetch.]
[1:18:10][@0xwid][In an alternate universe where nobody cares for art, do you think optimization would still be a focus for developers?]
[1:19:34][@miblo][Any idea why my cores get maxed out when running [~hero Handmade Hero] with the XCB platform layer?]
[1:20:33][@robotchocolatedino][Why wasn't there a greater speed increase after removing gamma correction?]
[1:22:42][@marumoto][How will we split up the drawing onto multiple cores?]
[1:22:56][@dingernalt2][What's the floating head?]
[1:23:14][@nothings2][Question about _mm_sqrt_ps and common subexpression elimination]
[1:24:14][@thesizik][What's that drum-like background noise?]
[1:24:37][@jayp2][Do you see all the questions?]
[1:25:07][@thevaber][Can rdtsc be inaccurate with CPUs that vary their cycle rate?]
[1:26:23][@cubercaleb][How does the CPU do things ahead of time if things are supposed to be done in order?]
[1:29:34][@ttbjm][Do you expect a 16x speedup from multi-threading?]
[1:29:59][@gasto5][How do you select the instruction set for optimizing?]
[1:32:55][@nothings2][Aren't the Unity hardware survey results pretty different than the Steam ones?]
[1:34:01][@captainkraft][What are the gains you get by writing your own software renderer vs using SDL, GPUs, etc?]
[1:35:24][@jayp2][Can a processor work through different types of calculations in a single cycle?]
[1:37:16][@ca2dev][What kinds of things can be delegated to the GPU?]
[/video]