cinera_handmade.network/cmuratori/hero/code/code119.hmml

[video output=day119 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Counting Intrinsics" vod_platform=youtube id=NPDL1OENYio annotator=AndrewJDR annotator=ZedZull]
[0:10][Lesson: Das keyboards are horrible]
[1:36][Recap of last episode and today's agenda]
[2:33][Prep work for getting pre-optimization vs post-optimization cycle counts]
[3:43][Add cycle counting to DrawRectangleSlowly]
[4:41][... ~350 vs ~50 cycles per pixel!]
[5:17][How long *should* it take to fill each pixel? Let's count up all the intrinsics and their throughputs...]
[7:10][... How can we automate this counting process?]
[7:58][Answer: Override the intrinsics with macros that add to some counter variables]
[8:47][Oops, there's still some SIMDizing left to do here...]
[9:30][Use _mm_add_ps to increment PixelPx by 4 instead of scalar adds (2-3 cycles better)]
[11:55][dx and dy can be baked into PixelPx and PixelPy (2 cycles better)]
[13:08][Should we loft PixelPx and PixelPy axis multiply/add calculation out of the inner loop?]
[13:59][Maybe loft just the multiplies but not the add? Hmm...]
[14:20][... try lofting the multiplications. (1-2 cycles worse)]
[15:50][Note: Texture fetches can't be done in SIMD]
[16:52][Fabian on why _mm_maskmoveu_si128 is so slow. Don't use it! It bypasses the cache.]
[18:15][Adding a #define for each intrinsic to count operations (_mm_add_ps, _mm_mul_ps, etc)]
[21:45][Start setting up the intrinsic #defines to count operations]
[23:45][Preprocessor cleverness that handles the fact that intrinsics often take other intrinsics as params]
[27:34][Define load/store to nothing]
[28:39][Mini-rant about the compiler not doing instruction/intrinsic instrumentation automatically]
[31:46][We've got counts!]
[32:15][Double check that counts make sense]
[33:27][Multiply counts by throughputs to get total latency estimate]
[35:27][_mm_castps_si128 latency is difficult to know.]
[35:52][Looking up the processor core type in Windows]
[36:52][_mm_and_ps and bitwise ops are 1/3 cycle on Nehalem]
[40:28][Use a macro to sum up the latency*counts to get a rough throughput total]
[42:55][Well, isn't that fancy: Measured throughput is lower than the theoretical best throughput. Instructions are likely executing on multiple ALUs per cycle]
[45:40][How many units are in a Nehalem core?]
[48:17][... Two?]
[49:12][On the limitations of executing multiple instructions per clock]
[51:25][We're quite close to the max theoretical throughput.]
[52:19][Memory latency probably isn't hurting performance]
[52:47][Make an #if toggle for the intrinsic measurement code]
[53:58][How much is gamma (sqrt) costing us?]
[56:30][A troubling visual artifact appears around our hero...]
[57:47][Aha! An issue with the linear/sRGB code]
[1:00:28][Gamma is costing only ~6 cycles]
[1:01:05][This is a reasonably optimized pixel loop]
[1:01:32][Agenda for next session: Optimize outside/around the pixel loop.]
[1:01:56][Q&A][:speech]
[1:02:09][@stelar7][Is this what you were looking for?]
[1:03:16][Nehalem diagram: Only one FPU?]
[1:05:52][@grumpygiant256][Worth timing the load/stores with no ALU ops to see how much we're memory bound?]
[1:12:46][@thesizik][You counted _mm_and_ps wrong.]
[1:13:35][@ieee754][Are you doing pre-multiplied alpha? (Yes)]
[1:13:38][@tenbroya][Could you run the game with task manager open?]
[1:16:17][@jayp2][Will this game only work for your specific processor?]
[1:16:43][@toppstv][Are you going to update the yellow background textures?]
[1:17:32][@braincruser][The texture fetch should be an L1 cache fetch.]
[1:18:10][@0xwid][In an alternate universe where nobody cares for art, do you think optimization would still be a focus for developers?]
[1:19:34][@miblo][Any idea why my cores get maxed out when running [~hero Handmade Hero] with the XCB platform layer?]
[1:20:33][@robotchocolatedino][Why wasn't there a greater speed increase after removing gamma correction?]
[1:22:42][@marumoto][How will we split up the drawing onto multiple cores?]
[1:22:56][@dingernalt2][What's the floating head?]
[1:23:14][@nothings2][Question about _mm_sqrt_ps and common subexpression elimination]
[1:24:14][@thesizik][What's that drum-like background noise?]
[1:24:37][@jayp2][Do you see all the questions?]
[1:25:07][@thevaber][Can rdtsc be inaccurate with CPUs that vary their cycle rate?]
[1:26:23][@cubercaleb][How does the CPU do things ahead of time if things are supposed to be done in order?]
[1:29:34][@ttbjm][Do you expect a 16x speedup from multi-threading?]
[1:29:59][@gasto5][How do you select the instruction set for optimizing?]
[1:32:55][@nothings2][Aren't the Unity hardware survey results pretty different than the Steam ones?]
[1:34:01][@captainkraft][What are the gains you get by writing your own software renderer vs using SDL, GPUs, etc?]
[1:35:24][@jayp2][Can a processor work through different types of calculations in a single cycle?]
[1:37:16][@ca2dev][What kinds of things can be delegated to the GPU?]
[/video]
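
The counting technique indexed at 7:58, 23:45, 40:28 and 52:47 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code written on stream: the `intrinsic_counts` struct, the `COUNT_INTRINSICS` toggle, `ShadeFourPixels` and `EstimateCycles` are all hypothetical names, and only three intrinsics are instrumented. It relies on two standard preprocessor rules: a macro's own name is not re-expanded inside its replacement, and macro arguments are expanded before substitution, so intrinsics nested as parameters get counted as well. It does not cover the load/store handling mentioned at 27:34.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE: __m128, _mm_add_ps, _mm_mul_ps, _mm_sqrt_ps */

/* Hypothetical counters, one per instrumented intrinsic. */
typedef struct {
    unsigned add_ps;
    unsigned mul_ps;
    unsigned sqrt_ps;
} intrinsic_counts;

static intrinsic_counts Counts;

/* Hypothetical toggle: set to 0 to compile the plain intrinsics. */
#define COUNT_INTRINSICS 1

#if COUNT_INTRINSICS
/* Each macro bumps its counter, then calls the real intrinsic of the same
   name. The inner name is not expanded again, so it resolves to the header's
   function. Arguments are macro-expanded first, so nested calls also count. */
#define _mm_add_ps(a, b) (++Counts.add_ps, _mm_add_ps((a), (b)))
#define _mm_mul_ps(a, b) (++Counts.mul_ps, _mm_mul_ps((a), (b)))
#define _mm_sqrt_ps(a)   (++Counts.sqrt_ps, _mm_sqrt_ps((a)))
#endif

/* Hypothetical stand-in for a pixel-loop body: sqrt(a*b + c*c) per lane. */
static __m128 ShadeFourPixels(__m128 a, __m128 b, __m128 c)
{
    return _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(a, b), _mm_mul_ps(c, c)));
}

/* Sum count * reciprocal throughput, assuming one instruction issues at a
   time. The throughput values are supplied by the caller; none are assumed. */
static double EstimateCycles(const intrinsic_counts *C,
                             double AddThroughput,
                             double MulThroughput,
                             double SqrtThroughput)
{
    assert(C != 0);
    return C->add_ps * AddThroughput +
           C->mul_ps * MulThroughput +
           C->sqrt_ps * SqrtThroughput;
}
```

Calling `ShadeFourPixels` once leaves `Counts` at one add, two multiplies and one square root. The throughput arguments to `EstimateCycles` have to be looked up for the target core. As the 42:55 annotation notes, a serial sum like this can come out higher than the measured cycle count, because several instructions may execute per cycle on different units.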