[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Converting Math Operations to SIMD" vod_platform=youtube id=1CVmlnhgT3g annotator=Miblo annotator=dspecht] [1:23][Recap yesterday's work] [2:46][build.bat: Switch to -O2] [4:22][Think about doing the TestPixel TIMED_BLOCK over a wider range] [5:20][handmade_render_group.cpp: Move the timer around the for loops] [5:50][Debugger: See that there are two loops that are more or less the same] [6:26][handmade_platform.h: Number these DebugCycleCounters] [6:49][handmade_render_group.cpp: Rename TestPixel to ProcessPixel and remove TIMED_BLOCK around DrawRectangleSlowly] [7:35][Debugger: Look at the DEBUG CYCLE COUNTS] [8:12][handmade_render_group.cpp: Introduce END_TIMED_BLOCK_COUNTED] [9:36][Debugger: See that the ProcessPixel count is now more accurate \[243cy/h\]] [10:34][handmade_render_group.cpp: Write this in SIMD] [16:35][Run and see that it's still producing the correct result] [16:47][build.bat: Switch to -Od] [17:27][Debugger: Inspect TexelAr] [21:28][handmade_render_group.cpp: Continue transforming these Texel computations into SIMD] [29:21][Run and note that we're running just fine \[575cy/h\]] [29:46][handmade_render_group.cpp: Continue making these wide] [37:14][Compile and see if we made any mistakes \[557cy/h\]] [37:31][handmade_render_group.cpp: Do the rest of this wide, except for the Clamp] [40:39][Intel Intrinsics Guide: _mm_sqrt_ps[ref site="Intel" page="Intrinsics Guide" url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]] [41:11][handmade_render_group.cpp: Do _mm_sqrt_ps and continue converting to SIMD] [43:39][Run and note that we are blitting correctly \[427cy/h\]] [43:54][Debugger: Look at what Clamp01 does] [47:17][Intel Intrinsics Guide: _mm_min_ps and _mm_max_ps[ref site="Intel" page="Intrinsics Guide" url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]] [48:45][handmade_render_group.cpp: Do the Clamps wide \[179cy/h\]]
[50:02][Run and note that the game is already running faster] [50:47][Reflect on the straightforwardness of this work] [51:54][Consider what's left to convert to SIMD] [52:46][handmade_render_group.cpp: Do PixelP wide] [54:16][Run and note how fast it's running \[124cy/h\]] [56:18][Debugger: Investigate what the compiler is doing with those 50 cycles] [1:02:54][handmade_render_group.cpp: Finish doing the SIMD here] [1:07:32][Run and note that we're creeping forwards \[121cy/h\]] [1:08:06][Recap and glimpse into the future of doing the Loads and Repack in SIMD] [1:11:08][Q&A][:speech] [1:11:32][@kknewkles][How do you cover multiple CPU technologies intrinsic-wise? Preprocessor switches on dedicated intrinsics for each? Also, whom to read on ASM? I'm thinking Mike Abrash?] [1:13:09][@houb_][We have come from 385 cycles to ~123. Does something like the 80%-20% rule apply? Do you think we will get down to 50 cycles?] [1:15:22][@maexono][The way we use mmSquare, does it calculate the argument twice?] [1:15:41][Debugger: Determine if the compiler is doing common subexpression elimination for these multiplies] [1:21:11][Deep, concentrated investigation][quote 86] [1:25:54][Look at how fast the game's running] [1:26:19][@cvaucher][Where do OpenCL and other GPGPU frameworks fit into optimization? It seems like if something is SIMD-able, it could just be done wider on a GPU. Are there workloads that are better suited to the CPU and SIMD?] [1:29:06][@garlandobloom][We have optimizations still on?] [1:29:19][@gasto5][Why are there optimizing options in the compiler if one will end up typing SIMD functions?] [1:31:01][@quylthulg][Do you know of the _mm_setr_ps intrinsic (and _pd etc) - note the r in setr? It loads the values in reverse order, i.e. in the order that is more intuitive] [1:31:38][@garlandobloom][When do you think we will thread the renderer?] 
[1:31:57][@goodoldmalk][Possibly misguided question, is there a way to overload operators to use SIMD instructions instead?] [1:32:45][@digitaldomovoi][Is padding and alignment still something you have to concern yourself with? I remember doing SIMD in the mid 2000s, and SIMD was essentially worthless (much of the time) if your data wasn't aligned] [1:33:43][@digitaldomovoi][Addendum: By "concern yourself", I mean, is it something the compiler now handles more autonomously when you "engage" SIMD] [1:34:15][@kil4h][Will you generate asm for NEON (if you port to arm of course)? GCC seems to be pretty bad at generating correct code with intrinsics (from my experience on Android)] [1:35:03][@culver_fly][How would you know if doing something will speed up the code? Especially when it's a fairly large change to the codebase and when time is limited, I find myself reluctant to perform such optimizations in fear of introducing bugs] [1:36:46][@miblo][What do you think you'll next want to convert to SIMD, in case I want to practise over the weekend?] [1:38:52][@flaturated][Can you compile it -Od and show how SIMD has helped there?] [1:39:32][@kknewkles][Would it be a good exercise (albeit a large one) to study a simple CPU and write some soft for it? Arduino or something ancient? I wanted to learn coding for GBA for a while] [1:41:04][@kknewkles][Let's rephrase: what CPU would you advise to study that would be simple enough yet representative enough of the general stuff you should know about when working with CPUs?][quote 87] [1:42:52][@theitchyninja][How long have you been working on this and when do you think you will finish?] [1:43:29][@gasto5][Are you going to optimize gameplay code as well?] [1:43:45][@houb_][Have you heard of the JayStation2 Project from Jaymin Kessler, working with the Raspberry Pi 2 B+?] [1:44:03][Close things down with a recap of the week's optimisation work][:speech] [1:48:03][Shout out to the mods][:speech] [/video]