From 8a82432abb1536ad546e599ab236561ea4068a9d Mon Sep 17 00:00:00 2001
From: Miblo <admin@miblo.net>
Date: Mon, 4 Jul 2022 19:20:42 +0100
Subject: [PATCH] Index hero/chat020

---
 cmuratori/hero/chat/chat020.hmml | 238 +++++++++++++++++++++++++++++++
 1 file changed, 238 insertions(+)
 create mode 100644 cmuratori/hero/chat/chat020.hmml

diff --git a/cmuratori/hero/chat/chat020.hmml b/cmuratori/hero/chat/chat020.hmml
new file mode 100644
index 0000000..cafaad2
--- /dev/null
+++ b/cmuratori/hero/chat/chat020.hmml
@@ -0,0 +1,238 @@
+[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=chat medium=research title="Assembly Analysis and Front-end Register Clears" vod_platform=youtube id=R5tBY9Zyw6o annotator=Miblo]
+[0:03][Welcome to the chat][:speech]
+[1:32][Advocate ZII (Zero Is Initialisation)[ref
+    site=Imgur
+    page="Non zero'd and zero'd ASM"
+    url=https://imgur.com/a/xeX8GMk]][:language]
+[8:38][Describe Jesse Meyer's ZII experiment[ref
+    site=Imgur
+    page="Non zero'd and zero'd ASM"
+    url=https://imgur.com/a/xeX8GMk]][:language]
+[10:04][DOS vs Linux :memory mapping, page faults and :profiling]
+[18:38][:Memory mapping and :profiling: 1) Hunt for minimum]
+[22:04][:Memory mapping and :profiling: 2) Statistical breakdown, ignoring outliers]
+[24:52][General advice on :profiling CPU :performance]
+[25:42][Create xorclear.cpp][:programming :language :memory]
+[27:24][Set up our xorclear experiment in Compiler Explorer[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :memory]
+[28:55][Initially, msvc seems to generate better code than clang[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[32:17][Walk through the xorclear code in conjunction with the clang-generated assembly[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[39:14][Macro-ops subject to fusion (cmp and jne)[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[44:30][Memory Execution Units and Scalar Arithmetic Units][:hardware]
+[46:36][Port usage of ADD (R64, I8)[ref
+    site="uops.info"
+    page="ADD (R64, I8)"
+    url=https://uops.info/html-instr/ADD_R64_I8.html]][:hardware :performance]
+[49:50][Port usage of CMP (R64, I32)[ref
+    site="uops.info"
+    page="CMP (R64, I32)"
+    url=https://uops.info/html-instr/CMP_R64_I32.html]][:hardware :performance]
+[51:27][Writing Identity into the Matrices array using mov, movaps and movups instructions[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[58:29][xorps[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language]
+[1:00:28][Does Clang do anything more than -O3?][:language :speech]
+[1:01:06][@chronic_quagga][-mavx2?][:language]
+[1:01:21][Loading and writing zeros[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:04:40][Horrible code: 1) Superfluous zero writes[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:05:29][Try moving the Identity and Zero matrix_4x4 outside of main()[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:06:10][Move Identity and Zero matrix_4x4 back inside main()[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:06:22][Horrible code: 1) Superfluous zero writes (cont.)[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:06:52][Horrible code: 2) Using seven instructions to move 64 bytes[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:08:27][Hunt uops for mov[ref
+    site="uops.info"
+    url=https://uops.info/]][:hardware :performance]
+[1:11:03][MOVUPS (M128, XMM)[ref
+    site="uops.info"
+    page="MOVUPS (M128, XMM)"
+    url=https://uops.info/html-instr/MOVUPS_M128_XMM.html]][:hardware :performance]
+[1:14:56][Check the Intel 64 and IA-32 Architectures Software Developer Manual for MOV[ref
+    site="Intel"
+    page="Intel 64 and IA-32 Architectures Software Developer Manuals"
+    url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html]][:hardware :performance]
+[1:15:49][MOVQ (M64, XMM)[ref
+    site="uops.info"
+    page="MOVQ (M64, XMM)"
+    url=https://uops.info/html-instr/MOVQ_M64_XMM.html]][:hardware :performance]
+[1:16:08][MOV permutations[ref
+    site="Intel"
+    page="Intel 64 and IA-32 Architectures Software Developer Manuals"
+    url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html]][:hardware :performance]
+[1:16:50][Port usage of mov, movaps and movups[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:hardware :performance]
+[1:17:20][@jaege8][Next page?]
+[1:17:59][MOV (M32, I32)[ref
+    site="uops.info"
+    page="MOV (M32, I32)"
+    url=https://uops.info/html-instr/MOV_M32_I32.html]][:hardware :performance]
+[1:19:33][Horrible code: 2) Using seven instructions to move 64 bytes (cont.)[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:20:01][Hand-write and -read 128-bit rows using _mm_setr_ps() and _mm_storeu_ps()[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9][ref
+    site=Intel
+    page="Intel Intrinsics Guide"
+    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:asm :language :memory]
+[1:22:36][The clang-generated code is now better, with one loop unroll[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:25:42][@oldganon][O3 didn't help here][:language :performance]
+[1:25:56][Thoughts on explicitly writing out intrinsics][:language :performance]
+[1:26:54][Walk through the xorclear code in conjunction with the msvc-generated assembly[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:28:09][Hunt uops for rep[ref
+    site="uops.info"
+    url=https://uops.info/]][:hardware :performance]
+[1:29:09][MOVSB_REPE[ref
+    site="uops.info"
+    page="MOVSB_REPE"
+    url=https://uops.info/html-instr/MOVSB_REPE.html]][:hardware :performance]
+[1:30:59][Determine to try a dependent clear]
+[1:32:15][@dragoonx6][@handmade_hero Try something like -O3 -march=skylake -ffast-math][:language]
+[1:32:39][Temporarily try moving the Identity and Zero matrix_4x4 outside of main()[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:33:50][@daniel_collin_][You can leave it inside and set it to static][:language]
+[1:35:16][Introduce a conditional clear in xorclear[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:39:14][Compare clang vs msvc on our conditional clear[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:44:14][@sainst0][Does it change if you give it -mtune=znver2?][:language]
+[1:44:47][Clang often outputs slow code in general, but faster code when the source is intrinsics-heavy][:language :performance]
+[1:45:34][Walk through the msvc-generated code for our conditional clear[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:46:22][Why clearing to zero is free[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :hardware :language :memory :performance]
+[1:49:05][Non-free zero-clearing: 1) When frontend-bound][:hardware :performance]
+[1:51:45][@peterfors][Skylake's memory subsystem is in charge of the loads and store requests and ordering. Since Haswell, it's possible to sustain two memory reads (on ports 2 and 3) and one memory write (on port 4) each cycle][:hardware :performance]
+[1:52:29][Non-free zero-clearing: 2) Code size, alignment differences][:hardware :performance]
+[1:53:22][Our movaps and xorps operations are free][:hardware :performance]
+[1:53:58][Try declaring the rows uninitialised, only conditionally setting to zero[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:54:45][Our code introduced an extra jmp[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:56:20][Always initialise to zero[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:57:07][Replace the branch with a masked blend[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:02:22][msvc doesn't bother to blend with 0[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:04:41][Fill the second column with 1s[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:04:56][msvc doesn't bother to do the full blend on each row[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:05:30][Make each row different[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:05:51][Our instructions will overlap[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory :performance]
+[2:07:09][Q&A][:speech]
+[2:07:26][@jessem3y3r][@handmade_hero Hi [@cmuratori Casey]. Jesse from twitter here. Thank you so much for taking the time to explain and demonstrate this on [~hero Handmade Hero]! Deeply appreciated!]
+[2:08:37][@somebody_took_my_name][If you take a look at different add / sub ops with immediates, you'll see nice tricks with the lea instruction][:asm]
+[2:09:56][@centhusiast][Q: I compiled the code with icc, Intel's compiler, with O2 and it takes 3.5 seconds to run it. Is this really bad?][:performance]
+[2:10:09][:Memory bandwidth will be the bottleneck][:performance :speech]
+[2:13:35][@vodonikhs][Q: Someone has mentioned that older Clang versions generate better code. Could it be because of Heartbleed mitigation?][:language :performance]
+[2:13:46][Try rolling back to older clang versions[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:14:24][@i_am_seabass][Q: I read that mixing SSE2 and AVX2 will incur a :performance penalty. How would you handle optimizing code, if you want to support AVX2, but also SSE for older systems? Would you just have separate builds for each?]
+[2:16:29][Isolating architecture-dependent code][:language :speech]
+[2:18:34][@mindmark42][Q: Could you show what gcc does?][:language]
+[2:18:43][GCC uses all scalar mov instructions[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:19:18][@vodonikhs][Q: Try Clang 6][:language]
+[2:19:24][Clang 6 still looks bad[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:19:58][gcc -O3 generates the correct code[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:20:26][@skincell3][Q: Could you provide the twitter conversation link that you are responding to, for the YouTube video?]
+[2:21:03][@maliusarth][Q: You haven't tried latest clang with O3, did you?][:language]
+[2:21:10][Latest clang with -O3 generates bad code[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:21:13][Compilers should produce reliable code without the need for switches, optimisation passes, etc.][:language :performance]
+[2:23:23][@sir_klausi][@handmade_hero Is it possible those extra jumps clang generates are a spectre mitigation?][:language]
+[2:24:30][@drmaruq][Spectre mitigation is on :hardware level? Why would clang mess up the exe?][:language]
+[2:25:05][Share the godbolt link[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:25:33][Close it down with a plug of Star Code Galaxy[ref
+    site="Star Code Galaxy"
+    url=https://starcodegalaxy.com/]][:speech]
+[/video]