cinera_handmade.network/cmuratori/hero/code/code121.hmml

[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Rendering in Tiles (Marathon)" vod_platform=youtube id=kZlPYka1T0g annotator=Miblo annotator=Kelimion]
[1:36][Reintroducing the Intel Architecture Code Analyzer]
[10:46][A long time ago, in RAD's source tree]
[13:00][blowtard, an analytical tool for the Xbox 360's PowerPC Tri-Core Xenon written by Casey]
[22:04][How IACA's output differs from Casey's stats in blowtard]
[30:56][Looking at how to get our cycle count down]
[32:21][Manually unroll the Fetch / Sample loop]
[36:30][Group by Sample]
[37:27][Use _mm_setr_ps as suggested by Fabian a long time ago]
[42:14][Taking a look at the total throughput count]
[43:18][Casey needs some more soya \[sic\] milk]
[44:17][Could we do a load once, and grab out the two values that we needed?]
[45:48][Day 121 Blackboard: Explanation of possible texel loading optimisation]
[50:32][Figuring out how the compiler is loading the texel data]
[1:00:18][This is fine, then]
[1:01:01][We multiply by TexturePitch and sizeof(uint32) four-wide manually, which is stupid]
[1:02:06][Shift up FetchX_4x by 2, rather than multiply by sizeof(uint32)]
[1:03:40][Premultiply FetchY_4x by TexturePitch_4x]
[1:04:07][Give the compiler the wide stuff so that it can see it as wide]
[1:11:21][_mm_mul_epi32 does not do integer * integer]
[1:13:43][Port pressure (we're back to InterIteration)]
[1:17:46][Blackboard: Hyperthreading]
[1:27:22][Blackboard: Designing how to break up the renderer for multithreading to ease pressure on the caches]
[1:32:22][Blackboard: Divide the frame buffer into chunks that are sized appropriately for the cache]
[1:39:55][Blackboard: The plan for setting up the renderer]
[1:40:47][Implementation of interleaved scanlines, in readiness for hyperthreading]
[1:46:36][Blackboard: The logic of interleaved scanlines]
[1:52:37][Updating compiler directives for folks who use LLVM]
[1:55:20][Implementation of frame buffer divisions, in readiness for multi-core processing]
[2:05:30][Go to Disassembly of DrawRectangleSlowly in order to diagnose bogus cycle count]
[2:10:04][Frame buffer divisions, continued]
[2:20:50][Introduce GetClampedRectArea]
[2:22:12][Problematic thing: Our convention for rectangles before was that they did not include their final value]
[2:27:33][Fix the cycle counter for DrawRectangleSlowly again]
[2:29:42][A shortcut didn't work out. (!quote 297 + !quote 298)]
[2:30:56][Loft FillRect above the loop]
[2:36:34][Introduce PixelPxRow in order to keep PixelPx as a wide value rather than having to set it each time]
[2:39:50][Check IACA for performance difference and revert to setting PixelPx each time through the loop]
[2:43:28][Shuffle calculations around to figure out how the performance is affected, for good or ill]
[2:51:17][Blackboard: Thinking about that alignment problem]
[2:55:58][Align MinX and MaxX]
[3:00:18][Microsoft Visual Studio 2013 has stopped working]
[3:02:03][Dancing trees]
[3:03:03][Change our loads and stores to no longer be unaligned]
[3:04:05][Assess performance difference and revert back to the unaligned load and store instructions]
[3:05:12][Make sure that we actually always fill the real clip region and not write outside the clip region]
[3:07:10][Blackboard: Our options for filling the pixels]
[3:09:12][Implementation of alignment to the ending edge]
[3:16:48][Clip the leading edge]
[3:19:41][Blackboard: ClipMask]
[3:21:33][Try setting StartupClipMask by using _mm_srli_si128]
[3:22:28][// TODO(casey): This is stupid.]
[3:26:10][Early-out the FillRect tests]
[3:30:01][Start passing ClipRect through to DrawRectangleQuickly]
[3:35:35][Moment of realisation, with introduction of the InvertedInfinityRectangle]
[3:37:48][Temporarily adjust ClipRect in order to avoid a crash]
[3:39:24][Introduce TiledRenderGroupToOutput outside of the timer]
[3:43:57][Update DrawRectangle to take the clipping information]
[3:47:18][Update DrawRectangle{,Quickly} to use the Even / Odd information]
[3:49:20][Break the screen up into pieces and render them separately]
[3:54:34][Stretch your legs, Casey]
[3:56:28][We can finally end the stream]
[3:57:07][Q&A]
[3:57:12][@rygorous][a) your top and right clip is off-by-1!]
[3:59:54][@mmozeiko][_mm_mullo_epi32 is SSE4 intrinsic]
[4:04:57][@mmozeiko][Will you revert yesterday changes where you changed bilinear pixel unpacking code from float mul to int mul? It was faster with float mul.]
[4:05:24][@an0nymal][How many more marathon streams will we have? I thoroughly enjoyed the 4+ hours today.]
[4:05:47][@quikligames][You should give a big thanks to @Rygorous for sticking around and trying to give you tips knowing full well that you wouldn't see them in chat]
[4:06:07][@mmozeiko][would it be better to have tile sizes always divisible by 4 horizontally (or even 16 to be cache aligned), then there will be no need to deal with alignment and masking?]
[4:07:07][@rygorous][(clip) one too few pixels. look at the edge of the screen.]
[4:09:20][@rygorous][just pretty sure I saw glitchiness/off-by-1-pixel stuff near the edges but it might've been the video encoding]
[4:11:08][@mmozeiko][(tile size %4) - not masking for textures, but ClipMask variable]
[4:13:30][@abnercoimbre][Q: holy crap. our 1st marathon.]
[4:13:50][Time for Casey to go to bed, with closing remarks]
[/video]