[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=ray title="Replacing rand() and Preparing for SIMD" vod_platform=youtube id=xBBEkn1x7So annotator=Miblo]
[0:06][Recap and set the stage for the day][:speech]
[1:38][Note that we're building in optimised mode][:speech]
[2:15][:Run and see our output image]
[3:39][ray.cpp: Walk through the code][:speech]
[5:23][Consider two areas of :optimisation: 1) Bounding Volume Hierarchy][:speech]
[6:57][2) Using better :math operations][:optimisation :speech]
[7:42][Step into RenderTile() and inspect the :asm, noting down routines to improve]
[15:51][Check out PCG, A Family of Better Random Number Generators[ref
    site="PCG, A Family of Better Random Number Generators"
    url=http://www.pcg-random.org/] with a recommendation to read the full paper[ref
    author="Melissa E. O’Neill"
    title="PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation"
    url=http://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf]][:prng :research]
[24:13][Check out the x86 SSE2 shift-left instructions[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :research]
[27:59][Read 6.3 - Specific Implementations[ref
    author="Melissa E. O’Neill"
    title="PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation"
    url=http://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf] and the Xorshift wiki article[ref
        site=Wikipedia
        page=Xorshift
        url=https://en.wikipedia.org/wiki/Xorshift]][:prng :research]
[31:15][Introduce XOrShift32() from Wikipedia[ref
        site=Wikipedia
        page=Xorshift
        url=https://en.wikipedia.org/wiki/Xorshift] with a check into doing this in 64-bit[ref
    author="Melissa E. O’Neill"
    title="PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation"
    url=http://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf]]
[37:45][:Run our program to get a benchmark :timing]
[38:59][Replace rand() with our new XOrShift32(), packing Entropy in the work_order struct][:optimisation :prng]
[46:39][:Run to see no obvious problems with our output, and note our dramatically improved :performance]
[48:47][Step into the code and inspect the :asm to see a lot of mulss instructions]
[51:30][Introduce CastSampleRays() to do some of the work of RenderTile()][:lighting :rendering]
[58:21][:Run to see that we lose some speed]
[59:12][Make RenderTile() only use a random_series in its inner loop][:lighting :rendering]
[1:00:07][:Run to see that that's a little bit better]
[1:01:07][Rename cast_result to cast_state which contains both the input and output data][:lighting :rendering]
[1:08:12][:Run to see some busted imagery]
[1:08:58][Fix RenderTile() to correctly fill out the cast_state State][:lighting :rendering]
[1:12:20][:Run to see that that helps]
[1:13:05][Consider how to perform this ray casting wide][:lighting :optimisation :rendering :speech]
[1:18:04][Transform CastSampleRays() to handle the notion of operating wide][:lighting :optimisation :rendering :speech]
[1:19:06][:Run to see that it runs roughly four times faster, and that the image now contains tile-boundary artifacts]
[1:21:08][Temporarily revert RandomUnilateral() to use rand()][:prng]
[1:21:38][:Run to see no artifacts, and note that the XOrShift32() needs improving]
[1:22:46][Sketch in the code to enable CastSampleRays() to operate wide][:lighting :optimisation :rendering]
[1:33:17][Describe our current situation][:speech]
[1:34:11][Set up CastSampleRays() to let all rays in all lanes finish][:lighting :optimisation :rendering]
[1:38:21][Consider how to track the materials wide][:lighting :optimisation :rendering :speech]
[1:40:08][Set up CastSampleRays() to track the materials wide and collate all the computations][:lighting :optimisation :rendering]
[1:52:29][Create ray_lane.h to #define the lanes, and introduce RandomBilateralLane(), various permutations of ConditionalAssign(), a Max(), MaskIsZeroed() and versions of HorizontalAdd()][:optimisation :prng]
[2:03:39][:Run and see totally busted imagery]
[2:04:23][Build in debug mode and on one core]
[2:05:40][Step into CastSampleRays() and inspect its values]
[2:05:56][Make CastSampleRays() set FilmX and FilmY to their centres][:lighting :rendering]
[2:07:14][Step into CastSampleRays() and see that the State->Series and Order->Entropy are both 0]
[2:08:36][Make CastSampleRays() offset the Entropy and use different random series per ray][:prng]
[2:09:27][Step into CastSampleRays() and note that the ConditionalAssign() is wrong]
[2:10:44][Make ConditionalAssign() zero the Mask if there is nothing set in it]
[2:11:20][Step into ConditionalAssign() to see that that is better]
[2:11:41][:Run to see how the picture looks]
[2:13:24][View the image][:run]
[2:13:49][Reduce the RayCount and increase the CoreCount][:lighting :rendering]
[2:14:49][Investigate the summation][:lighting :rendering]
[2:17:53][Make CastSampleRays() correctly set the LaneMask][:lighting :rendering]
[2:18:35][:Run and see a more correct image]
[2:18:52][Switch back to the optimised version, with more RaysPerPixel]
[2:19:09][:Run to see that we're darker]
[2:20:13][Correctly set the LaneWidth][:lighting :rendering]
[2:21:20][:Run and see that the images are basically indistinguishable]
[2:22:12][Set up to support a constrained set of LANE_WIDTH values][:optimisation]
[2:30:05][:Run to see that XOrShift32() is actually fine]
[2:31:45][Do LANE_WIDTH==8 too][:optimisation]
[2:32:43][Q&A][:speech]
[2:33:46][@yurasniper][Q: How would one implement something like bloom effect in a raytracer?][:lighting :rendering]
[2:39:46][:Run our program to capture its :performance statistics]
[2:42:07][@macielda][Q: Is the Halton 2,3 sequence a good way to generate sample positions? I've heard about some people using it. It is a low discrepancy series][:prng]
[2:43:11][Rename our image and stat files][:admin]
[2:44:30][@vaualbus][Q: When you learn this way of doing SIMD? I remember in [~hero Handmade Hero] when we had optimized the renderer we use __m128 every way][:optimisation]
[2:46:05][@macielda][Q: What is your take on AA methods? I'm currently looking for one for my game. I see The Witness has MSAA option only (no FXAA, TXAA and friends)?][:rendering]
[2:46:31][@longboolean][Q: Are there any machines with :hardware RNG that just puts random values into a register with one instruction?[ref
    site=Wikipedia
    page=RdRand
    url=https://en.wikipedia.org/wiki/RdRand]][:prng]
[2:48:44][@pseudonym73][Q: G'day, long time no stream. Low-discrepancy sequences do exhibit blue noise behaviours if you do them right, but their main advantage is that you can access the quasi-random streams in an arbitrary order. Not really relevant yet. Also, you can do better than 2,3 Halton][:prng]
[2:49:37][@macielda][Q: Do shader languages expose things like "Conditional Assign"?][:language]
[2:51:14][Ensure that everything is in good shape][:admin]
[2:52:14][Shut down][:speech]
[/video]