cinera_handmade.network/cmuratori/hero/chat/chat020.hmml

[video member=cmuratori stream_platform=twitch stream_username=handmade_hero project=chat medium=research title="Assembly Analysis and Front-end Register Clears" vod_platform=youtube id=R5tBY9Zyw6o annotator=Miblo]
[0:03][Welcome to the chat][:speech]
[1:32][Advocate ZII (Zero Is Initialisation)[ref
site=Imgur
page="Non zero'd and zero'd ASM"
url=https://imgur.com/a/xeX8GMk]][:language]
[8:38][Describe Jesse Meyer's ZII experiment[ref
site=Imgur
page="Non zero'd and zero'd ASM"
url=https://imgur.com/a/xeX8GMk]][:language]
[10:04][DOS vs Linux :memory mapping, page faults and :profiling]
[18:38][:Memory mapping and :profiling: 1) Hunt for minimum]
[22:04][:Memory mapping and :profiling: 2) Statistical breakdown, ignoring outliers]
[24:52][General advice on :profiling CPU :performance]
[25:42][Create xorclear.cpp][:programming :language :memory]
[27:24][Set up our xorclear experiment in Compiler Explorer[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:language :memory]
[28:55][Initially, msvc seems to generate better code than clang[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[32:17][Walk through the xorclear code in conjunction with the clang-generated assembly[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[39:14][Macro-ops subject to fusion (cmp and jne)[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[44:30][Memory Execution Units and Scalar Arithmetic Units][:hardware]
[46:36][Port usage of ADD (R64, I8)[ref
site="uops.info"
page="ADD (R64, I8)"
url=https://uops.info/html-instr/ADD_R64_I8.html]][:hardware :performance]
[49:50][Port usage of CMP (R64, I32)[ref
site="uops.info"
page="CMP (R64, I32)"
url=https://uops.info/html-instr/CMP_R64_I32.html]][:hardware :performance]
[51:27][Writing Identity into the Matrices array using mov, movaps and movups instructions[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[58:29][xorps[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language]
[1:00:28][Does Clang do anything more than -O3?][:language :speech]
[1:01:06][@chronic_quagga][-mavx2?][:language]
[1:01:21][Loading and writing zeros[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:04:40][Horrible code: 1) Superfluous zero writes[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:05:29][Try moving the Identity and Zero matrix_4x4 outside of main()[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:06:10][Move Identity and Zero matrix_4x4 back inside main()[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:06:22][Horrible code: 1) Superfluous zero writes (cont.)[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:06:52][Horrible code: 2) Using seven instructions to move 64 bytes[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:08:27][Hunt uops for mov[ref
site="uops.info"
url=https://uops.info/]][:hardware :performance]
[1:11:03][MOVUPS (M128, XMM)[ref
site="uops.info"
page="MOVUPS (M128, XMM)"
url=https://uops.info/html-instr/MOVUPS_M128_XMM.html]][:hardware :performance]
[1:14:56][Check the Intel 64 and IA-32 Architectures Software Developer's Manual for MOV[ref
site="Intel"
page="Intel 64 and IA-32 Architectures Software Developer Manuals"
url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html]][:hardware :performance]
[1:15:49][MOVQ (M64, XMM)[ref
site="uops.info"
page="MOVQ (M64, XMM)"
url=https://uops.info/html-instr/MOVQ_M64_XMM.html]][:hardware :performance]
[1:16:08][MOV permutations[ref
site="Intel"
page="Intel 64 and IA-32 Architectures Software Developer Manuals"
url=https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html]][:hardware :performance]
[1:16:50][Port usage of mov, movaps and movups[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:hardware :performance]
[1:17:20][@jaege8][Next page?]
[1:17:59][MOV (M32, I32)[ref
site="uops.info"
page="MOV (M32, I32)"
url=https://uops.info/html-instr/MOV_M32_I32.html]][:hardware :performance]
[1:19:33][Horrible code: 2) Using seven instructions to move 64 bytes (cont.)[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:20:01][Hand-write and -read 128-bit rows using _mm_setr_ps() and _mm_storeu_ps()[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9][ref
site=Intel
page="Intel Intrinsics Guide"
url=https://software.intel.com/sites/landingpage/IntrinsicsGuide/]][:asm :language :memory]
[1:22:36][The clang-generated code is now better, with one loop unroll[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:25:42][@oldganon][O3 didn't help here][:language :performance]
[1:25:56][Thoughts on explicitly writing out intrinsics][:language :performance]
[1:26:54][Walk through the xorclear code in conjunction with the msvc-generated assembly[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:28:09][Hunt uops for rep[ref
site="uops.info"
url=https://uops.info/]][:hardware :performance]
[1:29:09][MOVSB_REPE[ref
site="uops.info"
page="MOVSB_REPE"
url=https://uops.info/html-instr/MOVSB_REPE.html]][:hardware :performance]
[1:30:59][Determine to try a dependent clear]
[1:32:15][@dragoonx6][@handmade_hero Try something like -O3 -march=skylake -ffast-math][:language]
[1:32:39][Temporarily try moving the Identity and Zero matrix_4x4 outside of main()[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:33:50][@daniel_collin_][You can leave it inside and set it to static][:language]
[1:35:16][Introduce a conditional clear in xorclear[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:39:14][Compare clang vs msvc on our conditional clear[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:44:14][@sainst0][Does it change if you give it -mtune=znver2?][:language]
[1:44:47][Clang often generates slow code from plain source, but faster code from intrinsics-heavy source][:language :performance]
[1:45:34][Walk through the msvc-generated code for our conditional clear[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:46:22][Why clearing to zero is free[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :hardware :language :memory :performance]
[1:49:05][Non-free zero-clearing: 1) When frontend-bound][:hardware :performance]
[1:51:45][@peterfors][Skylake's memory subsystem is in charge of the load and store requests and ordering. Since Haswell, it's possible to sustain two memory reads (on ports 2 and 3) and one memory write (on port 4) each cycle][:hardware :performance]
[1:52:29][Non-free zero-clearing: 2) Code size, alignment differences][:hardware :performance]
[1:53:22][Our movaps and xorps operations are free][:hardware :performance]
[1:53:58][Try declaring the rows uninitialised, only conditionally setting to zero[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:54:45][Our code introduced an extra jmp[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:56:20][Always initialise to zero[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[1:57:07][Replace the branch with a masked blend[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[2:02:22][msvc doesn't bother to blend with 0[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[2:04:41][Fill the second column with 1s[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[2:04:56][msvc doesn't bother to do the full blend on each row[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[2:05:30][Make each row different[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory]
[2:05:51][Our instructions will overlap[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:asm :language :memory :performance]
[2:07:09][Q&A][:speech]
[2:07:26][@jessem3y3r][@handmade_hero Hi [@cmuratori Casey]. Jesse from twitter here. Thank you so much for taking the time to explain and demonstrate this on [~hero Handmade Hero]! Deeply appreciated!]
[2:08:37][@somebody_took_my_name][If you take a look at different add / sub ops with immediates, you'll see nice tricks with the lea instruction][:asm]
[2:09:56][@centhusiast][Q: I compiled the code with icc, Intel's compiler, with O2 and it takes 3.5 seconds to run it. Is this really bad?][:performance]
[2:10:09][:Memory bandwidth will be the bottleneck][:performance :speech]
[2:13:35][@vodonikhs][Q: Someone has mentioned that older Clang versions generate better code. Could it be because of Heartbleed mitigation?][:language :performance]
[2:13:46][Try rolling back to older clang versions[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:language :performance]
[2:14:24][@i_am_seabass][Q: I read that mixing SSE2 and AVX2 will incur a :performance penalty. How would you handle optimizing code, if you want to support AVX2, but also SSE for older systems? Would you just have separate builds for each?]
[2:16:29][Isolating architecture-dependent code][:language :speech]
[2:18:34][@mindmark42][Q: Could you show what gcc does?][:language]
[2:18:43][GCC uses all scalar mov instructions[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:language :performance]
[2:19:18][@vodonikhs][Q: Try Clang 6][:language]
[2:19:24][Clang 6 still looks bad[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:language :performance]
[2:19:58][gcc -O3 generates the correct code[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:language :performance]
[2:20:26][@skincell3][Q: Could you provide the twitter conversation link that you are responding to, for the YouTube video?]
[2:21:03][@maliusarth][Q: You haven't tried latest clang with O3, did you?][:language]
[2:21:10][Latest clang with -O3 generates bad code[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:language :performance]
[2:21:13][Compilers should produce reliable code without the need for switches, optimisation passes, etc.][:language :performance]
[2:23:23][@sir_klausi][@handmade_hero Is it possible those extra jumps clang generates are a spectre mitigation?][:language]
[2:24:30][@drmaruq][Spectre mitigation is on :hardware level? Why would clang mess up the exe?][:language]
[2:25:05][Share the godbolt link[ref
site="Compiler Explorer"
page="xorclear"
url=https://godbolt.org/z/v36WE9]][:language :performance]
[2:25:33][Close it down with a plug of Star Code Galaxy[ref
site="Star Code Galaxy"
url=https://starcodegalaxy.com/]][:speech]
[/video]