From 8a82432abb1536ad546e599ab236561ea4068a9d Mon Sep 17 00:00:00 2001
From: Miblo
Date: Mon, 4 Jul 2022 19:20:42 +0100
Subject: [PATCH] Index hero/chat020

---
 cmuratori/hero/chat/chat020.hmml | 238 +++++++++++++++++++++++++++++++
 1 file changed, 238 insertions(+)
 create mode 100644 cmuratori/hero/chat/chat020.hmml

diff --git a/cmuratori/hero/chat/chat020.hmml b/cmuratori/hero/chat/chat020.hmml
new file mode 100644
index 0000000..cafaad2
--- /dev/null
+++ b/cmuratori/hero/chat/chat020.hmml
@@ -0,0 +1,238 @@
+[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=chat medium=research title="Assembly Analysis and Front-end Register Clears" vod_platform=youtube id=R5tBY9Zyw6o annotator=Miblo]
+[0:03][Welcome to the chat][:speech]
+[1:32][Advocate ZII (Zero Is Initialisation)[ref
+    site=Imgur
+    page="Non zero'd and zero'd ASM"
+    url=https://imgur.com/a/xeX8GMk]][:language]
+[8:38][Describe Jesse Meyer's ZII experiment[ref
+    site=Imgur
+    page="Non zero'd and zero'd ASM"
+    url=https://imgur.com/a/xeX8GMk]][:language]
+[10:04][DOS vs Linux :memory mapping, page faults and :profiling]
+[18:38][:Memory mapping and :profiling: 1) Hunt for minimum]
+[22:04][:Memory mapping and :profiling: 2) Statistical breakdown, ignoring outliers]
+[24:52][General advice on :profiling CPU :performance]
+[25:42][Create xorclear.cpp][:programming :language :memory]
+[27:24][Set up our xorclear experiment in Compiler Explorer[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :memory]
+[28:55][Initially, msvc seems to generate better code than clang[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[32:17][Walk through the xorclear code in conjunction with the clang-generated assembly[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[39:14][Macro-ops subject to fusion (cmp and jne)[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[44:30][Memory Execution Units and Scalar Arithmetic Units][:hardware]
+[46:36][Port usage of ADD (R64, I8)[ref
+    site="uops.info"
+    page="ADD (R64, I8)"
+    url=https://uops.info/html-instr/ADD_R64_I8.html]][:hardware :performance]
+[49:50][Port usage of CMP (R64, I32)[ref
+    site="uops.info"
+    page="CMP (R64, I32)"
+    url=https://uops.info/html-instr/CMP_R64_I32.html]][:hardware :performance]
+[51:27][Writing Identity into the Matrices array using mov, movaps and movups instructions[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[58:29][xorps[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language]
+[1:00:28][Does Clang do anything more than -O3?][:language :speech]
+[1:01:06][@chronic_quagga][-mavx2?][:language]
+[1:01:21][Loading and writing zeros[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:04:40][Horrible code: 1) Superfluous zero writes[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:05:29][Try moving the Identity and Zero matrix_4x4 outside of main()[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:06:10][Move Identity and Zero matrix_4x4 back inside main()[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:06:22][Horrible code: 1) Superfluous zero writes (cont.)[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:06:52][Horrible code: 2) Using seven instructions to move 64 bytes[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:08:27][Hunt uops for mov[ref
+    site="uops.info"
+    url=https://uops.info/]][:hardware :performance]
+[1:11:03][MOVUPS (M128, XMM)[ref
+    site="uops.info"
+    page="MOVUPS (M128, XMM)"
+    url=https://uops.info/html-instr/MOVUPS_M128_XMM.html]][:hardware :performance]
+[1:14:56][Check the Intel 64 and IA-32 Architectures Software Developer Manual for MOV[ref
+    site="Intel"
+    page="Intel 64 and IA-32 Architectures Software Developer Manuals"
+    url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html]][:hardware :performance]
+[1:15:49][MOVQ (M64, XMM)[ref
+    site="uops.info"
+    page="MOVQ (M64, XMM)"
+    url=https://uops.info/html-instr/MOVQ_M64_XMM.html]][:hardware :performance]
+[1:16:08][MOV permutations[ref
+    site="Intel"
+    page="Intel 64 and IA-32 Architectures Software Developer Manuals"
+    url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html]][:hardware :performance]
+[1:16:50][Port usage of mov, movaps and movups[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:hardware :performance]
+[1:17:20][@jaege8][Next page?]
+[1:17:59][MOV (M32, I32)[ref
+    site="uops.info"
+    page="MOV (M32, I32)"
+    url=https://uops.info/html-instr/MOV_M32_I32.html]][:hardware :performance]
+[1:19:33][Horrible code: 2) Using seven instructions to move 64 bytes (cont.)[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:20:01][Hand-write and -read 128-bit rows using _mm_setr_ps() and _mm_storeu_ps()[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9][ref
+    site=Intel
+    page="Intel Intrinsics Guide"
+    url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:asm :language :memory]
+[1:22:36][The clang-generated code is now better, with one loop unroll[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:25:42][@oldganon][O3 didn't help here][:language :performance]
+[1:25:56][Thoughts on explicitly writing out intrinsics][:language :performance]
+[1:26:54][Walk through the xorclear code in conjunction with the msvc-generated assembly[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:28:09][Hunt uops for rep[ref
+    site="uops.info"
+    url=https://uops.info/]][:hardware :performance]
+[1:29:09][MOVSB_REPE[ref
+    site="uops.info"
+    page="MOVSB_REPE"
+    url=https://uops.info/html-instr/MOVSB_REPE.html]][:hardware :performance]
+[1:30:59][Determine to try a dependent clear]
+[1:32:15][@dragoonx6][@handmade_hero Try something like -O3 -march=skylake -ffast-math][:language]
+[1:32:39][Temporarily try moving the Identity and Zero matrix_4x4 outside of main()[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:33:50][@daniel_collin_][You can leave it inside and set it to static][:language]
+[1:35:16][Introduce a conditional clear in xorclear[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:39:14][Compare clang vs msvc on our conditional clear[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:44:14][@sainst0][Does it change if you give it -mtune=znver2?][:language]
+[1:44:47][Clang often outputs slow code, but faster intrinsics-heavy code][:language :performance]
+[1:45:34][Walk through the msvc-generated code for our conditional clear[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:46:22][Why clearing to zero is free[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :hardware :language :memory :performance]
+[1:49:05][Non-free zero-clearing: 1) When frontend-bound][:hardware :performance]
+[1:51:45][@peterfors][Skylake's memory subsystem is in charge of the loads and store requests and ordering. Since Haswell, it's possible to sustain two memory reads (on ports 2 and 3) and one memory write (on port 4) each cycle][:hardware :performance]
+[1:52:29][Non-free zero-clearing: 2) Code size, alignment differences][:hardware :performance]
+[1:53:22][Our movaps and xorps operations are free][:hardware :performance]
+[1:53:58][Try declaring the rows uninitialised, only conditionally setting to zero[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:54:45][Our code introduced an extra jmp[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:56:20][Always initialise to zero[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[1:57:07][Replace the branch with a masked blend[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:02:22][msvc doesn't bother to blend with 0[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:04:41][Fill the second column with 1s[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:04:56][msvc doesn't bother to do the full blend on each row[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:05:30][Make each row different[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
+[2:05:51][Our instructions will overlap[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:asm :language :memory :performance]
+[2:07:09][Q&A][:speech]
+[2:07:26][@jessem3y3r][@handmade_hero Hi [@cmuratori Casey]. Jesse from twitter here. Thank you so much for taking the time to explain and demonstrate this on [~hero Handmade Hero]! Deeply appreciated!]
+[2:08:37][@somebody_took_my_name][If you take a look at different add / sub ops with immediates, you'll see nice tricks with the lea instruction][:asm]
+[2:09:56][@centhusiast][Q: I compiled the code with icc, Intel's compiler, with O2 and it takes 3.5 seconds to run it. Is this really bad?][:performance]
+[2:10:09][:Memory bandwidth will be the bottleneck][:performance :speech]
+[2:13:35][@vodonikhs][Q: Someone has mentioned that older Clang versions generate better code. Could it be because of Heartbleed mitigation?][:language :performance]
+[2:13:46][Try rolling back to older clang versions[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:14:24][@i_am_seabass][Q: I read that mixing SSE2 and AVX2 will incur a :performance penalty. How would you handle optimizing code, if you want to support AVX2, but also SSE for older systems? Would you just have separate builds for each?]
+[2:16:29][Isolating architecture-dependent code][:language :speech]
+[2:18:34][@mindmark42][Q: Could you show what gcc does?][:language]
+[2:18:43][GCC uses all scalar mov instructions[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:19:18][@vodonikhs][Q: Try Clang 6][:language]
+[2:19:24][Clang 6 still looks bad[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:19:58][gcc -O3 generates the correct code[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:20:26][@skincell3][Q: Could you provide the twitter conversation link that you are responding to, for the YouTube video?]
+[2:21:03][@maliusarth][Q: You haven't tried latest clang with O3, did you?][:language]
+[2:21:10][Latest clang with -O3 generates bad code[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:21:13][Compilers should produce reliable code without the need for switches, optimisation passes, etc.][:language :performance]
+[2:23:23][@sir_klausi][@handmade_hero Is it possible those extra jumps clang generates are a spectre mitigation?][:language]
+[2:24:30][@drmaruq][Spectre mitigation is on :hardware level? Why would clang mess up the exe?][:language]
+[2:25:05][Share the godbolt link[ref
+    site="Compiler Explorer"
+    page="xorclear"
+    url=https://godbolt.org/z/v36WE9]][:language :performance]
+[2:25:33][Close it down with a plug of Star Code Galaxy[ref
+    site="Star Code Galaxy"
+    url=https://starcodegalaxy.com/]][:speech]
+[/video]