[video output=day121 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Rendering in Tiles (Marathon)" vod_platform=youtube id=kZlPYka1T0g annotator=Miblo annotator=Kelimion]
[1:36][Reintroducing the Intel Architecture Code Analyzer]
[10:46][A long time ago, in RAD's source tree]
[13:00][blowtard, an analytical tool for the Xbox 360's PowerPC Tri-Core Xenon written by Casey]
[22:04][How IACA's output differs from Casey's stats in blowtard]
[30:56][Looking at how to get our cycle count down]
[32:21][Manually unroll the Fetch / Sample loop]
[36:30][Group by Sample]
[37:27][Use _mm_setr_ps as suggested by Fabian a long time ago]
[42:14][Taking a look at the total throughput count]
[43:18][Casey needs some more soya \[sic\] milk]
[44:17][Could we do a load once, and grab out the two values that we needed?]
[45:48][Day 121 Blackboard: Explanation of possible texel loading optimisation]
[50:32][Figuring out how the compiler is loading the texel data]
[1:00:18][This is fine, then]
[1:01:01][We multiply by TexturePitch and sizeof(uint32) four-wide manually, which is stupid]
[1:02:06][Shift up FetchX_4x by 2, rather than multiply by sizeof(uint32)]
[1:03:40][Premultiply FetchY_4x by TexturePitch_4x]
[1:04:07][Give the compiler the wide stuff so that it can see it as wide]
[1:11:21][_mm_mul_epi32 does not do integer * integer]
[1:13:43][Port pressure (we're back to InterIteration)]
[1:17:46][Blackboard: Hyperthreading]
[1:27:22][Blackboard: Designing how to break up the renderer for multithreading to ease pressure on the caches]
[1:32:22][Blackboard: Divide the frame buffer into chunks that are sized appropriately for the cache]
[1:39:55][Blackboard: The plan for setting up the renderer]
[1:40:47][Implementation of interleaved scanlines, in readiness for hyperthreading]
[1:46:36][Blackboard: The logic of interleaved scanlines]
[1:52:37][Updating compiler directives for folks who use LLVM]
[1:55:20][Implementation of frame buffer divisions, in readiness for multi-core processing]
[2:05:30][Go to Disassembly of DrawRectangleSlowly in order to diagnose bogus cycle count]
[2:10:04][Frame buffer divisions, continued]
[2:20:50][Introduce GetClampedRectArea]
[2:22:12][Problematic thing: Our convention for rectangles before was that they did not include their final value]
[2:27:33][Fix the cycle counter for DrawRectangleSlowly again]
[2:29:42][A shortcut didn't work out. (!quote 297 + !quote 298)]
[2:30:56][Loft FillRect above the loop]
[2:36:34][Introduce PixelPxRow in order to keep PixelPx as a wide value rather than having to set it each time]
[2:39:50][Check IACA for performance difference and revert to setting PixelPx each time through the loop]
[2:43:28][Shuffle calculations around to figure out how the performance is affected, for good or ill]
[2:51:17][Blackboard: Thinking about that alignment problem]
[2:55:58][Align MinX and MaxX]
[3:00:18][Microsoft Visual Studio 2013 has stopped working]
[3:02:03][Dancing trees]
[3:03:03][Change our loads and stores to no longer be unaligned]
[3:04:05][Assess performance difference and revert back to the unaligned load and store instructions]
[3:05:12][Make sure that we actually always fill the real clip region and not write outside the clip region]
[3:07:10][Blackboard: Our options for filling the pixels]
[3:09:12][Implementation of alignment to the ending edge]
[3:16:48][Clip the leading edge]
[3:19:41][Blackboard: ClipMask]
[3:21:33][Try setting StartupClipMask by using _mm_srli_si128]
[3:22:28][// TODO(casey): This is stupid.]
[3:26:10][Early-out the FillRect tests]
[3:30:01][Start passing ClipRect through to DrawRectangleQuickly]
[3:35:35][Moment of realisation, with introduction of the InvertedInfinityRectangle]
[3:37:48][Temporarily adjust ClipRect in order to avoid a crash]
[3:39:24][Introduce TiledRenderGroupToOutput outside of the timer]
[3:43:57][Update DrawRectangle to take the clipping information]
[3:47:18][Update DrawRectangle{,Quickly} to use the Even / Odd information]
[3:49:20][Break the screen up into pieces and render them separately]
[3:54:34][Stretch your legs, Casey]
[3:56:28][We can finally end the stream]
[3:57:07][Q&A][:speech]
[3:57:12][@rygorous][a) your top and right clip is off-by-1!]
[3:59:54][@mmozeiko][_mm_mullo_epi32 is SSE4 intrinsic]
[4:04:57][@mmozeiko][Will you revert yesterday changes where you changed bilinear pixel unpacking code from float mul to int mul? It was faster with float mul.]
[4:05:24][@an0nymal][How many more marathon streams will we have? I thoroughly enjoyed the 4+ hours today.]
[4:05:47][@quikligames][You should give a big thanks to @Rygorous for sticking around and trying to give you tips knowing full well that you wouldn't see them in chat]
[4:06:07][@mmozeiko][would it be better to have tile sizes always divisible by 4 horizontally (or even 16 to be cache aligned), then there will be no need to deal with alignment and masking?]
[4:07:07][@rygorous][(clip) one too few pixels. look at the edge of the screen.]
[4:09:20][@rygorous][just pretty sure I saw glitchiness/off-by-1-pixel stuff near the edges but it might've been the video encoding]
[4:11:08][@mmozeiko][(tile size %4) - not masking for textures, but ClipMask variable]
[4:13:30][@abnercoimbre][Q: holy crap. our 1st marathon.]
[4:13:50][Time for Casey to go to bed, with closing remarks]
[/video]