[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="SIMD Basics" vod_platform=youtube id=YnnTb0AQgYM annotator=Miblo annotator=dspecht]
[0:59][Open us up and see where we're at in terms of performance]
[1:52][Blackboard: SIMD on x64]
[6:34][Blackboard: How do we use SIMD?]
[8:16][Blackboard: CPU vs GPU framebuffers]
[14:10][Blackboard: "SOA" vs "AOS"]
[18:22][Blackboard: How this stuff actually works]
[22:13][Blackboard: Strided loading on NEON]
[25:22][build.bat: Turn off compiler optimisations]
[25:46][Internet: Intel Intrinsics Guide[ref
    site="Intel"
    page="Intel Intrinsics Guide"
    url="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"]]
[28:54][handmade_render_group.cpp: Initialise some __m128 registers and use some SIMD intrinsics to operate on them 4-wide]
[31:07][Debugger: Go to disassembly and look at the SIMD registers]
[36:02][handmade_render_group.cpp: Set four different values in the registers]
[36:36][Debugger: See those different values in the registers and note the order in which they are loaded]
[39:47][handmade_render_group.cpp: Turn Square functions into multiplies]
[41:04][Fix the loop to work on pixels in batches of 4]
[46:20][Run the game and note that we are overwriting our boundary]
[47:15][handmade_render_group.cpp: Temporarily clip the buffers]
[48:19][Separate the memory loading stuff from the computations]
[50:58][Declare the arrays before the loop]
[57:20][Debugger: Run and investigate the error]
[58:18][handmade_render_group.cpp: Correctly test ShouldFill\[I\]]
[58:50][Run and note that we're (almost) back to where we started]
[59:42][handmade_render_group.cpp: Walk through the routine]
[1:01:53][Load in the Pixels from the right place]
[1:02:33][Run, note that we're back to some semblance of good, and glimpse into the future]
[1:04:06][Q&A]
[1:04:30][@thesizik][Would it be faster to unpack pixels using a union of an int32 with a struct of 4 int8's, instead of doing 4 shifts and masks per pixel?]
[1:05:15][@houb_][Why don't we go: Y<2 and X<2 and go through in blocks, instead of a line?]
[1:07:44][@culver_fly][Is it better if we calculate if the pixel should be filled and queue it up and only do the calculations once we hit 4 of them?]
[1:10:45][@hmh_bot][Casey was using a Das Keyboard 4, but it broke, so he is currently using an unknown keyboard he had lying around]
[1:11:30][@hguleryuz][Sorry, maybe this is off-topic: Would it be correct to say anyone coding in Java, by default, is not making use of any of the SIMD stuff, or do you think the JIT compiler is smart enough to make use of it in certain circumstances, maybe with some analysis of the bytecode?]
[1:12:29][@guit4rfreak][How often do you optimize for cache misses vs optimizing with SIMD? I got the impression that cache misses are by far the most important things to look out for]
[1:14:40][@culver_fly][Please send my best regards to Jeff]
[1:14:52][@sharlock93][Schedule-wise, how many more weeks until you are done with optimization of the renderer?]
[1:15:01][@ray_caster][Will you be covering Morton order texture swizzling?]
[1:16:54][@dr_fubar][Possibly a noob Q: Have you ever run into problems with floating point arithmetic, and what are some good approaches to avoiding those problems?[ref
    author="Forman S. Acton"
    title="Real Computing Made Real"
    isbn=9780691036632][ref
    author="Forman S. Acton"
    title="Numerical Methods that Work"
    isbn=9780883854501]]
[1:21:37][starchypancakes \[...\] Casey said SSE2 was standard, I guess I'll start there[ref
    site="Valve"
    page="Steam Hardware & Software Survey"
    url="https://store.steampowered.com/hwsurvey"]]
[1:24:06][@houb_][Is there a way to track how memory gets stored to cache?[ref
    site="Intel"
    page="Intel Performance Counter Monitor"
    url="https://software.intel.com/en-us/articles/intel-performance-counter-monitor"]]
[1:28:01][@hguleryuz][Off-topic: Do you know if JAI will have extensions / a method for using SIMD?]
[1:28:50][@xaitra][How much do you need to think about the intrinsic instructions while programming, or does the compiler usually take care of that? Is this the big difference between using GNU and Intel compiler, for example?]
[1:30:37][@ray_caster][I think he's essentially asking how proficient compilers are at automatically emitting SIMD instructions[ref
    site="LLVM"
    page="Auto-Vectorization in LLVM"
    url="http://llvm.org/docs/Vectorizers.html"]]
[1:33:54][@rooctag][Do you have to take the instruction cache into account? Or is it large enough?]
[1:34:39][@goodoldmalk][How does intrinsics and parallel processing work together? Does each CPU have registers to do intrinsics? If so, could we increase X-fold the number of pixel rendering in our code if we computed in parallel?]
[1:35:33][Wrap things up]
[/video]