[video output=day116 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Converting Math Operations to SIMD" vod_platform=youtube id=1CVmlnhgT3g annotator=Miblo annotator=dspecht]
[1:23][Recap yesterday's work]
[2:46][build.bat: Switch to -O2]
[4:22][Think about doing the TestPixel TIMED_BLOCK over a wider range]
[5:20][handmade_render_group.cpp: Move the timer around the for loops]
[5:50][Debugger: See that there are two loops that are more or less the same]
[6:26][handmade_platform.h: Number these DebugCycleCounters]
[6:49][handmade_render_group.cpp: Rename TestPixel to ProcessPixel and remove TIMED_BLOCK around DrawRectangleSlowly]
[7:35][Debugger: Look at the DEBUG CYCLE COUNTS]
[8:12][handmade_render_group.cpp: Introduce END_TIMED_BLOCK_COUNTED]
[9:36][Debugger: See that the ProcessPixel count is now more accurate \[243cy/h\]]
[10:34][handmade_render_group.cpp: Write this in SIMD]
[16:35][Run and see that it's still producing the correct result]
[16:47][build.bat: Switch to -Od]
[17:27][Debugger: Inspect TexelAr]
[21:28][handmade_render_group.cpp: Continue transforming these Texel computations into SIMD]
[29:21][Run and note that we're running just fine \[575cy/h\]]
[29:46][handmade_render_group.cpp: Continue making these wide]
[37:14][Compile and see if we made any mistakes \[557cy/h\]]
[37:31][handmade_render_group.cpp: Do the rest of this wide, except for the Clamp]
[40:39][Intel Intrinsics Guide: _mm_sqrt_ps[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[41:11][handmade_render_group.cpp: Do _mm_sqrt_ps and continue converting to SIMD]
[43:39][Run and note that we are blitting correctly \[427cy/h\]]
[43:54][Debugger: Look at what Clamp01 does]
[47:17][Intel Intrinsics Guide: _mm_min_ps and _mm_max_ps[ref
    site="Intel"
    page="Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[48:45][handmade_render_group.cpp: Do the Clamps wide \[179cy/h\]]
[50:02][Run and note that the game is already running faster]
[50:47][Reflect on the straightforwardness of this work]
[51:54][Consider what's left to convert to SIMD]
[52:46][handmade_render_group.cpp: Do PixelP wide]
[54:16][Run and note how fast it's running \[124cy/h\]]
[56:18][Debugger: Investigate what the compiler is doing with those 50 cycles]
[1:02:54][handmade_render_group.cpp: Finish doing the SIMD here]
[1:07:32][Run and note that we're creeping forwards \[121cy/h\]]
[1:08:06][Recap and glimpse into the future of doing the Loads and Repack in SIMD]
[1:11:08][Q&A][:speech]
[1:11:32][@kknewkles][How do you cover multiple CPU technologies intrinsic-wise? Preprocessor switches on dedicated intrinsics for each? Also, whom to read on ASM? I'm thinking Mike Abrash?]
[1:13:09][@houb_][We have come from 385 cycles to ~123. Does something like the 80%-20% rule apply? Do you think we will get down to 50 cycles?]
[1:15:22][@maexono][The way we use mmSquare, does it calculate the argument twice?]
[1:15:41][Debugger: Determine if the compiler is doing common subexpression elimination for these multiplies]
[1:21:11][Deep, concentrated investigation][quote 86]
[1:25:54][Look at how fast the game's running]
[1:26:19][@cvaucher][Where do OpenCL and other GPGPU frameworks fit into optimization? It seems like if something is SIMD-able, it could just be done wider on a GPU. Are there workloads that are better suited to the CPU and SIMD?]
[1:29:06][@garlandobloom][We have optimizations still on?]
[1:29:19][@gasto5][Why are there optimizing options in the compiler if one will end up typing SIMD functions?]
[1:31:01][@quylthulg][Do you know of the _mm_setr_ps intrinsic (and _pd etc) - note the r in setr? It loads the values in reverse order, i.e. in the order that is more intuitive]
[1:31:38][@garlandobloom][When do you think we will thread the renderer?]
[1:31:57][@goodoldmalk][Possibly misguided question, is there a way to overload operators to use SIMD instructions instead?]
[1:32:45][@digitaldomovoi][Is padding and alignment still something you have to concern yourself with? I remember doing SIMD in the mid 2000s, and SIMD was essentially worthless (much of the time) if your data wasn't aligned]
[1:33:43][@digitaldomovoi][Addendum: By "concern yourself", I mean, is it something the compiler now handles more autonomously when you "engage" SIMD]
[1:34:15][@kil4h][Will you generate asm for NEON (if you port to arm of course)? GCC seems to be pretty bad at generating correct code with intrinsics (from my experience on Android)]
[1:35:03][@culver_fly][How would you know if doing something will speed up the code? Especially when it's a fairly large change to the codebase and when time is limited, I find myself reluctant to perform such optimizations in fear of introducing bugs]
[1:36:46][@miblo][What do you think you'll next want to convert to SIMD, in case I want to practise over the weekend?]
[1:38:52][@flaturated][Can you compile it -Od and show how SIMD has helped there?]
[1:39:32][@kknewkles][Would it be a good exercise (albeit a large one) to study a simple CPU and write some software for it? Arduino or something ancient? I wanted to learn coding for GBA for a while]
[1:41:04][@kknewkles][Let's rephrase: what CPU would you advise to study that would be simple enough yet representative enough of the general stuff you should know about when working with CPUs?][quote 87]
[1:42:52][@theitchyninja][How long have you been working on this and when do you think you will finish?]
[1:43:29][@gasto5][Are you going to optimize gameplay code as well?]
[1:43:45][@houb_][Have you heard of the JayStation2 Project from Jaymin Kessler, working with the Raspberry Pi 2 B+?]
[1:44:03][Close things down with a recap of the week's optimisation work][:speech]
[1:48:03][Shout out to the mods][:speech]
[/video]