[1:09:59][@braincruser][Will the operations be reordered to reduce the number of ops and load / stores?]
[1:12:01][@mmozeiko][You are calculating Out like or(or(or(r, g), b), a). Would it be better to do it like this: or(or(r, g), or(b, a)), so the first two or's are not dependent on each other?]
[1:14:57][handmade_render_group.cpp: Write it the way mmozeiko suggests]
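A minimal sketch of the restructuring under discussion. The channel names (R, G, B, A) follow the stream's pixel-packing code, but the snippet itself is illustrative rather than the actual handmade_render_group.cpp code:

```cpp
#include <emmintrin.h>

// Chained version: each OR waits on the previous one (a 3-deep dependency chain).
static __m128i PackChained(__m128i R, __m128i G, __m128i B, __m128i A)
{
    return _mm_or_si128(_mm_or_si128(_mm_or_si128(R, G), B), A);
}

// Paired version (mmozeiko's suggestion): or(R, G) and or(B, A) are independent,
// so they can issue in parallel and the chain is only 2 deep.
static __m128i PackPaired(__m128i R, __m128i G, __m128i B, __m128i A)
{
    return _mm_or_si128(_mm_or_si128(R, G), _mm_or_si128(B, A));
}
```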
[1:17:31][@uspred][Do you need to start with 32-bit floats? Is there further optimization that doesn't need the casting?]
[1:18:21][Blackboard: Multiplying floats vs Multiplying integers]
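An illustrative sketch of the blackboard point, assuming the usual 8-bit fixed-point framing (not code from the stream): an integer "multiply of two fractions" needs an extra shift to bring the product back into range, whereas a float multiply is renormalized by the hardware.

```cpp
#include <stdint.h>

// Float multiply: the exponent handles the scaling, so one multiply is enough.
static float MulFloat(float A, float B)
{
    return A * B;
}

// 8-bit fixed point (255 treated as 1.0): the product lands in a 16-bit range
// and has to be shifted back down, with extra care for precision and rounding.
static uint32_t MulFixed8(uint32_t A, uint32_t B)
{
    return (A * B) >> 8;
}
```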
[1:19:54][@mmozeiko][Same idea for the texture bilinear adds: pair them together]
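A sketch of the same pairing applied to the bilinear filter's adds; the texel and weight names are placeholders, not the actual variables in handmade_render_group.cpp:

```cpp
#include <xmmintrin.h>

// Chained: (((A + B) + C) + D), each add waits on the one before it.
static __m128 BilinearChained(__m128 TexelA, __m128 TexelB, __m128 TexelC, __m128 TexelD,
                              __m128 WA, __m128 WB, __m128 WC, __m128 WD)
{
    return _mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(WA, TexelA),
                                            _mm_mul_ps(WB, TexelB)),
                                 _mm_mul_ps(WC, TexelC)),
                      _mm_mul_ps(WD, TexelD));
}

// Paired: (A + B) and (C + D) are independent, shortening the dependency chain.
static __m128 BilinearPaired(__m128 TexelA, __m128 TexelB, __m128 TexelC, __m128 TexelD,
                             __m128 WA, __m128 WB, __m128 WC, __m128 WD)
{
    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(WA, TexelA), _mm_mul_ps(WB, TexelB)),
                      _mm_add_ps(_mm_mul_ps(WC, TexelC), _mm_mul_ps(WD, TexelD)));
}
```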
[1:23:00][@flaturated][Can you compile with /O2 to compare it to last week's performance?]
[1:23:16][@brblackmer][Why did you make macros for your SIMD operations (mmSquare, etc.) vs making functions?]
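For context, a sketch contrasting the two forms. The macro body shown matches the mmSquare style used on stream; the function alternative is only for comparison, and one common motivation for the macro is that it expands in place even in an unoptimized (debug) build, where a function might remain a real call:

```cpp
#include <xmmintrin.h>

// Macro form: always expands in place, no reliance on the compiler inlining it.
#define mmSquare(a) _mm_mul_ps(a, a)

// Function form: identical code in an optimized build, but a debug build
// may leave an actual call at every use site.
inline __m128 mmSquareFn(__m128 A)
{
    return _mm_mul_ps(A, A);
}
```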
[1:23:39][@quikligames][Are these intrinsics the same on other operating systems or compilers, as long as it's using Intel architecture?]
[1:24:40][@mmozeiko][Why do you say unaligned store is nasty? As far as I know, for the latest Intel CPUs (at least starting from Ivy Bridge) unaligned load / store is not very expensive anymore (<5% difference)]
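A small sketch of the two store flavors being compared (illustrative, not the stream's code):

```cpp
#include <xmmintrin.h>

// Aligned store: Dest must be 16-byte aligned, otherwise the CPU faults.
void StoreAligned(float *Dest, __m128 Value)
{
    _mm_store_ps(Dest, Value);
}

// Unaligned store: works for any address. On older CPUs this carried a real
// penalty; on newer ones (Ivy Bridge and later, per the question) the cost is
// small unless the store happens to split a cache line.
void StoreUnaligned(float *Dest, __m128 Value)
{
    _mm_storeu_ps(Dest, Value);
}
```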
[1:26:25][@plain_flavored][Is scalar access to __m128 elements still slow on Intel?]
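A hedged illustration of what "scalar access" means here; the helper names are made up for the example:

```cpp
#include <xmmintrin.h>

// Lane 0 is cheap to read: it already sits in the scalar position.
float Lane0(__m128 A)
{
    return _mm_cvtss_f32(A);
}

// Reading an arbitrary lane typically goes through a shuffle or a spill to
// memory, which is the part that can hurt inside a tight pixel loop.
float LaneI(__m128 A, int I)
{
    float Temp[4];
    _mm_storeu_ps(Temp, A);
    return Temp[I];
}
```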
[1:27:18][@braincruser][The processor's reorder window is 192 instructions]
[1:28:01][@gasto5][I don't understand how one optimizes by using the intrinsic or() function]
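A sketch of the general answer, assuming the question is about the wide OR: the intrinsic maps to one instruction that does the work for four lanes at once, instead of four separate scalar operations.

```cpp
#include <emmintrin.h>
#include <stdint.h>

// Scalar: four separate ORs, one per pixel.
void OrScalar(uint32_t *Pixels, uint32_t Mask)
{
    for(int I = 0; I < 4; ++I)
    {
        Pixels[I] |= Mask;
    }
}

// SIMD: a single instruction ORs all four pixels at once.
void OrWide(uint32_t *Pixels, uint32_t Mask)
{
    __m128i Value = _mm_loadu_si128((__m128i *)Pixels);
    __m128i WideMask = _mm_set1_epi32(Mask);
    _mm_storeu_si128((__m128i *)Pixels, _mm_or_si128(Value, WideMask));
}
```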
[1:28:51][@mmozeiko][_mm_cvttps_epi32 always truncates. Would that be better than messing with rounding mode?]
[1:30:45][handmade_render_group.cpp: Switch to _mm_cvttps_epi32]
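A minimal sketch of the difference between the two conversion intrinsics (illustrative, not the stream's code):

```cpp
#include <emmintrin.h>

// _mm_cvtps_epi32 converts using the current MXCSR rounding mode
// (round-to-nearest by default), so forcing truncation with it means
// changing and restoring global rounding state.
__m128i ConvertRounded(__m128 Value)
{
    return _mm_cvtps_epi32(Value);
}

// _mm_cvttps_epi32 (note the extra 't') always truncates toward zero,
// so no rounding-mode fiddling is needed.
__m128i ConvertTruncated(__m128 Value)
{
    return _mm_cvttps_epi32(Value);
}
```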