[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=ray title="Replacing rand() and Preparing for SIMD" vod_platform=youtube id=xBBEkn1x7So annotator=Miblo] [0:06][Recap and set the stage for the day][:speech] [1:38][Note that we're building in optimised mode][:speech] [2:15][:Run and see our output image] [3:39][ray.cpp: Walk through the code][:speech] [5:23][Consider two areas of :optimisation: 1) Bounding Volume Hierarchy][:speech] [6:57][2) Using better :math operations][:optimisation :speech] [7:42][Step into RenderTile() and inspect the :asm, noting down routines to improve] [15:51][Check out PCG, A Family of Better Random Number Generators[ref site="PCG, A Family of Better Random Number Generators" url=http://www.pcg-random.org/] with a recommendation to read the full paper[ref author="Melissa E. O’Neill" title="PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation" url=http://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf]][:prng :research] [24:13][Check out the x86 SSE2 shift-left instructions[ref site=Intel page="Intel Intrinsics Guide" url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :research] [27:59][Read 6.3 - Specific Implementations[ref author="Melissa E. O’Neill" title="PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation" url=http://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf] and the Xorshift wiki article[ref site=Wikipedia page=Xorshift url=https://en.wikipedia.org/wiki/Xorshift]][:prng :research] [31:15][Introduce XOrShift32() from Wikipedia[ref site=Wikipedia page=Xorshift url=https://en.wikipedia.org/wiki/Xorshift] with a check into doing this in a 64-bit[ref author="Melissa E. 
O’Neill" title="PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation" url=http://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf]] [37:45][:Run our program to get a benchmark :timing] [38:59][Replace rand() with our new XOrShift32(), packing Entropy in the work_order struct][:optimisation :prng] [46:39][:Run to see no obvious problems with our output, and note our dramatically improved :performance] [48:47][Step into the code and inspect the :asm to see a lot of mulss instructions] [51:30][Introduce CastSampleRays() to do some of the work of RenderTile()][:lighting :rendering] [58:21][:Run to see that we lose some speed] [59:12][Make RenderTile() only use a random_series in its inner loop][:lighting :rendering] [1:00:07][:Run to see that that's a little bit better] [1:01:07][Rename cast_result to cast_state which contains both the input and output data][:lighting :rendering] [1:08:12][:Run to see some busted imagery] [1:08:58][Fix RenderTile() to correctly fill out the cast_state State][:lighting :rendering] [1:12:20][:Run to see that that helps] [1:13:05][Consider how to perform this ray casting wide][:lighting :optimisation :rendering :speech] [1:18:04][Transform CastSampleRays() to handle the notion of operating wide][:lighting :optimisation :rendering :speech] [1:19:06][:Run to see that it runs roughly four times faster, and that the image now contains tile-boundary artifacts] [1:21:08][Temporarily revert RandomUnilateral() to use rand()][:prng] [1:21:38][:Run to see no artifacts, and note that the XOrShift32() needs improving] [1:22:46][Sketch in the code to enable CastSampleRays() to operate wide][:lighting :optimisation :rendering] [1:33:17][Describe our current situation][:speech] [1:34:11][Set up CastSampleRays() to let all rays in all lanes finish][:lighting :optimisation :rendering] [1:38:21][Consider how to track the materials wide][:lighting :optimisation :rendering :speech] [1:40:08][Set up CastSampleRays() to
track the materials wide and collate all the computations][:lighting :optimisation :rendering] [1:52:29][Create ray_lane.h to #define the lanes, and introduce RandomBilateralLane(), various permutations of ConditionalAssign(), a Max(), MaskIsZeroed() and versions of HorizontalAdd()][:optimisation :prng] [2:03:39][:Run and see totally busted imagery] [2:04:23][Build in debug mode and on one core] [2:05:40][Step in to CastSampleRays() and inspect its values] [2:05:56][Make CastSampleRays() set FilmX and FilmY to their centres][:lighting :rendering] [2:07:14][Step in to CastSampleRays() and see that the State->Series and Order->Entropy are both 0] [2:08:36][Make CastSampleRays() offset the Entropy and use different random series per ray][:prng] [2:09:27][Step in to CastSampleRays() and note that the ConditionalAssign() is wrong] [2:10:44][Make ConditionalAssign() zero the Mask if there is nothing set in it] [2:11:20][Step in to ConditionalAssign() to see that that is better] [2:11:41][:Run to see how the picture looks] [2:13:24][View the image][:run] [2:13:49][Reduce the RayCount and increase the CoreCount][:lighting :rendering] [2:14:49][Investigate the summation][:lighting :rendering] [2:17:53][Make CastSampleRays() correctly set the LaneMask][:lighting :rendering] [2:18:35][:Run and see a more correct image] [2:18:52][Switch back to the optimised version, with more RaysPerPixel] [2:19:09][:Run to see that we're darker] [2:20:13][Correctly set the LaneWidth][:lighting :rendering] [2:21:20][:Run and see that the images are basically indistinguishable] [2:22:12][Set up to support a constrained set of LANE_WIDTH values][:optimisation] [2:30:05][:Run to see that XOrShift32() is actually fine] [2:31:45][Do LANE_WIDTH==8 too][:optimisation] [2:32:43][Q&A] [2:33:46][@yurasniper][Q: How would one implement something like bloom effect in a raytracer?][:lighting :rendering] [2:39:46][:Run our program to capture its :performance statistics] [2:42:07][@macielda][Q: Is the 
Halton 2,3 sequence a good way to generate sample positions? I've heard about some people using it. It is a low discrepancy series][:prng] [2:43:11][Rename our image and stat files][:admin] [2:44:30][@vaualbus][Q: When you learn this way of doing SIMD? I remember in [~hero Handmade Hero] when we had optimized the renderer we use __m128 every way][:optimisation] [2:46:05][@macielda][Q: What is your take on AA methods? I'm currently looking for one for my game. I see The Witness has MSAA option only (no FXAA, TXAA and friends)?][:rendering] [2:46:31][@longboolean][Q: Are there any machines with :hardware RNG that just puts random values into a register with one instruction?[ref site=Wikipedia page=RdRand url=https://en.wikipedia.org/wiki/RdRand]][:prng] [2:48:44][@pseudonym73][Q: G'day, long time no stream. Low-discrepancy sequences do exhibit blue noise behaviours if you do them right, but their main advantage is that you can access the quasi-random streams in an arbitrary order. Not really relevant yet. Also, you can do better than 2,3 Halton][:prng] [2:49:37][@macielda][Q: Do shader languages expose things like "Conditional Assign"?][:language] [2:51:14][Ensure that everything is in good shape][:admin] [2:52:14][Shut down][:speech] [/video]