parent d599e075c2
commit 9d1f1ae18d
|  | @@ -1,4 +1,4 @@ | |||
| -[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=ray title="Replacing rand() and Preparing for SIMD" vod_platform=youtube id=dpvrPYdTkPw annotator=Miblo] | ||||
| +[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=ray title="Replacing rand() and Preparing for SIMD" vod_platform=youtube id=xBBEkn1x7So annotator=Miblo] | ||||
| [0:06][Recap and set the stage for the day][:speech] | ||||
| [1:38][Note that we're building in optimised mode][:speech] | ||||
| [2:15][:Run and see our output image] | ||||
|  |  | |||
|  | @@ -0,0 +1,144 @@ | |||
| [video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=ray title="Optimizing with SSE2 and AVX2" vod_platform=youtube id=dpvrPYdTkPw annotator=Miblo] | ||||
| [0:02][Recap and set the stage for the day][:speech] | ||||
| [1:26][:Run the program to show the current picture] | ||||
| [4:17][Begin to implement the LANE_WIDTH == 4 versions for our various functions / operators[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :lighting :optimisation :rendering] | ||||
| [13:57][Describe the _mm_xor_si128 instruction[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :research] | ||||
| [16:52][Implement a full set of lane width-agnostic operators[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :math :optimisation] | ||||
| [42:33][Fix up CastSampleRays() to convert everything to the correct lane width][:optimisation][quote 608] | ||||
| [45:35][Introduce LaneV3FromV3() and continue fixing up CastSampleRays()][:optimisation] | ||||
| [48:45][Implement the various lane_v3 functions / operators[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :math :optimisation] | ||||
| [1:03:14][Implement scalar comparison operators[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :math :optimisation] | ||||
| [1:12:16][Introduce AndNot() using _mm_andnot_si128 for ConditionalAssign() to use[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :optimisation] | ||||
| [1:23:00][Continue to implement our scalar functions][:optimisation] | ||||
| [1:32:28][Double-check C's specification for comparison operators[ref | ||||
|     site=cppreference.com | ||||
|     page="Comparison operators" | ||||
|     url=http://en.cppreference.com/w/c/language/operator_comparison]][:research] | ||||
| [1:34:00][Continue to fix up compile errors] | ||||
| [1:34:52][Implement scalar loading of materials using _mm_setr_ps[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :optimisation] | ||||
| [1:50:15][Continue to fix up compile errors[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :optimisation] | ||||
| [1:59:59][Implement multiple permutations of MaskIsZeroed() and HorizontalAdd()[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :optimisation] | ||||
| [2:06:20][Make RenderTile() pack the sRGB colour inline and initialise everything in scalar][:optimisation] | ||||
| [2:11:39][Introduce Extract0() for RenderTile() to call][:optimisation] | ||||
| [2:14:17][Change Materials, Planes and Spheres to be initialiser lists] | ||||
| [2:18:47][Make the Entropy stuff work properly[ref | ||||
|     site=MSDN | ||||
|     page="x64 (amd64) Intrinsics List" | ||||
|     url=https://msdn.microsoft.com/en-us/library/hh977022.aspx]][:optimisation] | ||||
| [2:21:42][:Run the program to see totally bogus results] | ||||
| [2:22:18][Print out the lane width and flip LANE_WIDTH back to 1 so we can get that working again][:optimisation :profiling] | ||||
| [2:32:07][:Run the program in 1-wide lanes to see that this no longer works] | ||||
| [2:34:23][Step through CastSampleRays() and inspect its values] | ||||
| [2:43:07][Make the operator& for lane_v3 zero out the mask if needed] | ||||
| [2:44:29][:Run the program...] | ||||
| [2:45:19][Bump the CPUCount back up] | ||||
| [2:45:27][:Run the program to see what's going on] | ||||
| [2:46:58][Increase the LANE_WIDTH to 4] | ||||
| [2:47:19][:Run the program to see a bizarre picture] | ||||
| [2:48:26][Switch back to the slow mode and step through CastSampleRays() to inspect its values] | ||||
| [2:54:12][Fix ConditionalAssign() to cast rather than convert] | ||||
| [2:54:54][Step back through CastSampleRays() to see more expected values] | ||||
| [2:56:20][:Run our program to see a better image] | ||||
| [2:58:02][Compare our GatherF32_() functions] | ||||
| [2:59:52][Step into CastSampleRays() and inspect the material values] | ||||
| [3:03:16][Scrutinise our operator!= for lane_u32] | ||||
| [3:04:39][Fix our operator!= for lane_u32 to use _mm_set1_epi32(0xFFFFFFFF) rather than _mm_setzero_si128()][:isa] | ||||
| [3:05:55][Step in to CastSampleRays() to see that our lane mask is set properly] | ||||
| [3:06:57][:Run our program to see that we're now only a little bit wrong] | ||||
| [3:08:08][Read through our scalar code for any obvious mistakes] | ||||
| [3:18:07][:Run our program on 1 lane, to compare our image with the 4 lane version] | ||||
| [3:20:51][Rename Scatter to Specular and try to force all Specular values to 1] | ||||
| [3:23:52][:Run our program to see what that looks like] | ||||
| [3:25:53][Revert those specular values and investigate whether the PureBounce, RandomBounce and RayDirection are being computed correctly] | ||||
| [3:28:49][Step in to the lane_v3 Lerp() to see what it produces][:math] | ||||
| [3:32:42][Check the normalisation of RayDirection] | ||||
| [3:33:37][Step through RandomBilateral()] | ||||
| [3:35:01][Step into LaneF32FromU32() and double-check what it is computing] | ||||
| [3:36:48][Make LaneU32FromU32 cast its incoming u32 to an int when passing it to _mm_set1_epi32][quote 601] | ||||
| [3:38:43][Step back in to RandomUnilateral() to see possibly more expected results] | ||||
| [3:40:06][Assert in RandomUnilateral() that Result < 0.6f] | ||||
| [3:41:06][:Run the game and don't hit that assert, to determine that RandomUnilateral() is not producing the full range of values from 0 to 1] | ||||
| [3:43:00][Make RandomUnilateral() shift down its terms by 1][quote 602] | ||||
| [3:44:29][:Run and hit our assertion in RandomUnilateral()] | ||||
| [3:44:34][Remove that assert and :run the game to see a reasonable result] | ||||
| [3:46:17][:Run the program at full quality and compare our images][quote 603] | ||||
| [3:49:06][Step in to CastSampleRays() to see that we do break out properly] | ||||
| [3:49:20][Cast significantly fewer rays per pixel to determine that we are not over-casting] | ||||
| [3:52:21][Step in to CastSampleRays() and inspect the :asm] | ||||
| [3:55:07][Make CastSampleRays() count up the LoopsComputed for us to print out][:profiling] | ||||
| [4:00:35][:Run our program and inspect its statistics to see a mere 10.61% wasted bounces] | ||||
| [4:01:40][Q&A] | ||||
| [4:02:40][@thecodedragon][You didn't replace &= and |= with the correct operator inside the function] | ||||
| [4:03:03][@popcorn0x90][Q: Is your beard fake? It grew pretty fast] | ||||
| [4:03:11][@Kelimion][@cmuratori: Not just day 3, but also 2017-11-19 (for the image filename)] | ||||
| [4:03:38][@pragmascrypt][Q: 64 samples looked very smooth. Could you compare 64 samples with 4 wide to 64 samples 1 wide?] | ||||
| [4:03:59][@chrysos42][Q: Due to floating point precision, is there a significant difference between generating a random float by dividing 32 random bits by the max 32 bit integer vs dividing 24 random bits by the max 24 bit integer?][:prng] | ||||
| [4:05:34][@the_lyribolical_coach_b][Q: You said in a much earlier stream that using operator overloads for SIMD could confuse the compiler, preferring to use macros. Why the change?] | ||||
| [4:06:53][@pragmascrypt][Q: I was thinking maybe it does more samples than it should with 4 wide, so by comparing 64 samples 1 wide with 64 samples 4 wide maybe it would look different] | ||||
| [4:07:12][:Run the program on 1 lane and 64 RaysPerPixel and compare the images] | ||||
| [4:09:23][@groggeh][Q: Nine women can't grow a baby any faster; would smaller packing potentially be better? Is the packing taking too much time? Just spit-balling] | ||||
| [4:10:24][Enable CastSampleRays() to early-out as often as possible][:optimisation] | ||||
| [4:14:40][:Run the program to see that it is now twice as fast] | ||||
| [4:16:00][Consider avoiding gathering for rays that haven't hit][:optimisation] | ||||
| [4:16:59][Explicitly establish that the LaneMask is not zeroed before setting the Attenuation, Bounces and RayDirection][:optimisation] | ||||
| [4:17:56][:Run the program to see another speedup] | ||||
| [4:19:02][Pull out the lane width-specific code to their own .h files, introducing 8-wide versions for everything[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :optimisation] | ||||
| [4:22:17][Check out the _CMP* defines in immintrin.h[ref | ||||
|     site="Intel Developer Zone" | ||||
|     page="_mm_cmp_ps, _mm256_cmp_ps" | ||||
|     url=https://software.intel.com/en-us/node/524077]][quote 604] | ||||
| [4:27:08][Learn what "ordered" means in the context of these _CMP* defines[ref | ||||
|     site="Intel Developer Zone" | ||||
|     page="Compare Intrinsics for Floating Point Vectors" | ||||
|     url=https://software.intel.com/en-us/node/694431]][:research] | ||||
| [4:28:22][Continue to implement the 8-wide versions of our functions / operators[ref | ||||
|     site=Intel | ||||
|     page="Intel Intrinsics Guide" | ||||
|     url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:isa :optimisation] | ||||
| [4:36:07][:Run the program in 8-wide lanes and crash immediately] | ||||
| [4:38:21][Inspect the :asm for RenderTile() to see that we are failing on the vunpcklps call, and investigate if it is an alignment issue] | ||||
| [4:42:03][Search the Intel 64 and IA-32 Architectures Software Developer Manuals for vunpcklps[ref | ||||
|     site="Intel" | ||||
|     page="Intel 64 and IA-32 Architectures Software Developer Manuals" | ||||
|     url="https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html"]][:research] | ||||
| [4:44:33][Pass -arch:AVX2 on the build line to prevent the vunpcklps call from using bcst[ref | ||||
|     site=MSDN | ||||
|     page="x64 (amd64) Intrinsics List" | ||||
|     url=https://msdn.microsoft.com/en-us/library/jj620901.aspx]] | ||||
| [4:46:26][:Run our program in 8-wide lanes to see that we are slower, more wasteful and darker] | ||||
| [4:47:18][Fix our 8-wide HorizontalAdd()][:optimisation] | ||||
| [4:48:10][:Run our program to see that we are much better, and save off our images and statistics] | ||||
| [4:52:28][That's about it for today] | ||||
| [4:53:20][@butwhynot1][Q: Do AVX512 now] | ||||
| [4:54:04][That's it] | ||||
| [/video] | ||||