[video output=day599 member=cmuratori stream_platform=twitch stream_username=handmade_hero project=code title="Implementing the Grid Raycast Postamble" vod_platform=youtube id=HN5IP9q4pWE annotator=Miblo]
[0:00][Welcome to the stream with a plug of Slipways[ref
    site=SlipWays
    url=https://slipways.net/]][:speech]
[0:48][Begin to recap the new grid-based raycasting][:lighting :speech]
[1:42][Toggle on LIGHTING_USE_FOUR_RAYS (i.e. the tree-based raycaster)][:lighting]
[1:57][Demo the :lighting with the determination to reduce the :sampling noise and speed it up][:run]
[2:56][Toggle off LIGHTING_USE_FOUR_RAYS mentioning our refusal to use Apple hardware][:lighting]
[3:27][GridRayCast() work: 1) Implement the routine correctly][:lighting :research]
[3:50][GridRayCast() setup work: 2) Produce the ray-direction lookup tables][:lighting :research]
[4:01][GridRayCast() setup work: 3) Grid up our geometry][:lighting :research]
[4:09][Embark on implementing GridRayCast() correctly][:lighting]
[7:22][Delete archived videos and disable Storage Sense on the streaming machine][:admin]
[9:10][Implement the leaf picking in GridRayCast(), introducing lighting_spatial_grid_node and lighting_spatial_grid_leaf][:"data structure" :lighting]
[12:48][Reflect on our grid-based raycaster, comparing it with the tree-based one][:"data structure" :lighting :research]
[14:32][:SIMD inefficiency considerations, including packing on demand][:"data structure" :lighting :research]
[16:52][Make a note to do a single-leaf spatial structure version][:"data structure" :lighting]
[17:30][Continued :SIMD inefficiency considerations when we have between 1 and 3 pieces of data to work with][:"data structure" :lighting :research]
[19:02][Prepare to implement the tRay picking in GridRayCast() using _mm_minpos_epu16()[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/][ref
    site=uops.info
    page="PHMINPOSUW (XMM, XMM)"
    url=https://uops.info/html-instr/PHMINPOSUW_XMM_XMM.html]][:"data structure" :lighting :research :simd]
[24:25][Implement the tRay picking in GridRayCast() using _mm_minpos_epu16()[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:"data structure" :lighting :simd]
[32:00][Recall platform-specific bugginess of _mm_set1_epi32()[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/][ref
    site="Compiler Explorer"
    url=https://godbolt.org][ref
    site=GitHub
    page="cmuratori / meow_hash"
    url=https://github.com/cmuratori/meow_hash/]][:research :simd]
[39:21][Try _mm_set1_epi32() in GridRayCast()][:"data structure" :lighting :simd]
[41:28][Consider a special floating-point comparison circuit to be unnecessary][:hardware :isa :research]
[44:40][@charlesbukowski][I can confirm that I am in fact not an IEEE expert][:hardware :isa]
[44:44][Continue to implement versions of tRay picking in GridRayCast(), with and without _mm_cvtepi32_ps()[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:"data structure" :lighting :simd]
[52:57][Continue to consider special comparison of the top 16 bits of a floating-point value to be unnecessary][:hardware :isa :research]
[53:57][Write the HCompShuffler in GridRayCast()[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:"data structure" :lighting :simd]
[1:01:00][Learn that _mm_minpos_epu16(), given duplicated input, is documented to return the first matching value[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:simd :research]
[1:03:13][Write the ShuffleTable in GridRayCast()[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:"data structure" :lighting :simd]
[1:06:22][Fix compile errors in GridRayCast()][:"data structure" :lighting :simd]
[1:06:41][Check the assembly of _mm_set1_epi32() in Compiler Explorer[ref
    site="Compiler Explorer"
    url=https://godbolt.org] to find that the compiler generates the full broadcast table, rather than our desired broadcast instruction][:asm :research :simd]
[1:08:45][Manually write out the full ShuffleTable in GridRayCast(), to abandon _mm_set1_epi32()][:"data structure" :lighting :simd]
[1:11:29][Reflect on the efficiency of our tRay picking][:"data structure" :lighting :research :simd]
[1:12:34][Move on to the ray hit extraction][:"data structure" :lighting :research :simd]
[1:15:01][Correctly implement the ray hit extraction in GridRayCast()[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:"data structure" :lighting :simd]
[1:21:27][Introduce PShufB(), with a :rant on C++ vs assembly][:asm :"data structure" :language :lighting :simd]
[1:25:34][Introduce Extract0() for GridRayCast() to use[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:"data structure" :lighting :simd]
[1:28:17][Continue to implement the ray hit extraction in GridRayCast(), making it set the TransferPPS in a passed-in value][:"data structure" :lighting :simd]
[1:58:26][Change Extract0() to use _mm_cvtss_f32(), and introduce Extract1() and Extract2(), using _mm_extract_ps()[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:"data structure" :lighting :simd]
[2:01:48][Get GridRayCast() in a compilable state][:"data structure" :lighting :simd]
[2:04:17][Q&A][:speech]
[2:04:34][@x1bzzr][Q: [@naysayer88 Jon] said you had opinions on signed / unsigned. He said you said "maybe it's the wrong complication to have". Do you know what he was referring to and, if so, can you elaborate?][:language]
[2:07:08][@sholofly][Q: Are you familiar with Home Assistant?]
[2:07:19][@hubco][Q: Are there alternatives to inline assembly other than intrinsics?][:asm]
[2:08:02][@bogez57][Q: Hey [@cmuratori Casey], on episode 300-ish you mentioned that OpenGL 4.5 is much closer to a good graphics :API than something like Vulkan. I'm wondering if other developers you know agree with you on this and, if so, then why would something like Vulkan get support? Who exactly makes the design / adoption decisions for these kinds of technologies? Do developers have any say?]
[2:10:04][@mindmark42][Q: Why is it slow for the CPU to do horizontal :SIMD operations?]
[2:10:39]["Slow" → What does this mean?][:blackboard :performance]
[2:12:09][Horizontal Operations = High Latency][:blackboard :performance :simd]
[2:15:35][_mm_minpos_epu16() and _mm_hadd_epi32() as horizontal operations[ref
    site=Intel
    page="Intel Intrinsics Guide"
    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:performance :research :simd]
[2:17:52][@hubco][Q: Wouldn't you use Vulkan for cross platform?][:api]
[2:19:29][Close everything down with a plug of the Meow the Infinite printed comic Kickstarter[ref
    site=Kickstarter
    page="Meow the Infinite: Book One"
    url=https://www.kickstarter.com/projects/annarettberg/meow-the-infinite-book-one] and a further mention of Slipways[ref
    site=SlipWays
    url=https://slipways.net/]][:speech]
[/video]