Masking the write:

In SIMD, doing operations "4-wide" means that one wide (packed) operation processes four pixels at once. The operation costs the same whether one, two, three, or all four of those pixels are ones we actually care about; the count only matters when it comes to reading and writing memory.

The way we make sure we only write the pixels we're meaningfully operating on is by masking out the ones we aren't. Instead of doing a conditional check on every loop iteration, we build a mask with every bit set to 1 in the lanes whose pixels we'll keep, and every bit set to 0 in the lanes whose pixels we'll throw out. If we're operating on four pixels at once and two of them hang off the edge, the mask might look like:

[0x00000000,0x00000000,0xFFFFFFFF,0xFFFFFFFF]
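
As a rough sketch of how such a mask could be built, here is a portable C model in which a struct of four 32-bit values stands in for one SIMD register. The names wide_u32 and make_write_mask are made up for illustration, and the sketch assumes lane 0 is the leftmost pixel, so its lane order may not match the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Four 32-bit lanes, standing in for one 128-bit SIMD register (hypothetical type). */
typedef struct { uint32_t lane[4]; } wide_u32;

/* Hypothetical helper: all 1 bits in the first valid_count lanes, all 0 bits
   in the rest. Assumes lane 0 is the leftmost pixel, so the lanes hanging off
   the edge are the highest-numbered ones. */
static wide_u32 make_write_mask(int valid_count)
{
    assert(valid_count >= 0 && valid_count <= 4);

    wide_u32 mask;
    for (int i = 0; i < 4; ++i) {
        mask.lane[i] = (i < valid_count) ? 0xFFFFFFFFu : 0x00000000u;
    }
    return mask;
}
```

The per-lane comparison here happens once, when the mask is built, rather than on every pixel operation, which is the point of using a mask in the first place.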

By doing a bitwise AND with the pixel data we generate, we can mask out the values that are invalid, since the zeroes in the mask will knock out any bits set in our data. Likewise, the 1s will ensure any values we want to keep will remain in place.
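
A minimal sketch of that AND step, again modeling the four lanes with a plain C struct. The wide_u32 type and wide_and function are hypothetical names for illustration; on real hardware a single packed AND would handle all four lanes at once.

```c
#include <stdint.h>

/* Four 32-bit lanes, standing in for one 128-bit SIMD register (hypothetical type). */
typedef struct { uint32_t lane[4]; } wide_u32;

/* Lane-wise bitwise AND. A lane ANDed with 0x00000000 becomes 0;
   a lane ANDed with 0xFFFFFFFF comes through unchanged. */
static wide_u32 wide_and(wide_u32 a, wide_u32 b)
{
    wide_u32 out;
    for (int i = 0; i < 4; ++i) {
        out.lane[i] = a.lane[i] & b.lane[i];
    }
    return out;
}
```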

We still need to leave the destination as it was in the lanes we masked out. The easiest way to do that is to read what the destination held before the write, and use those values wherever we knocked out values in our data. So we generate an inverted mask that might look something like:

[0xFFFFFFFF,0xFFFFFFFF,0x00000000,0x00000000]
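
The inverted mask is just the bitwise NOT of the write mask, lane by lane. A small sketch, with wide_u32 and invert_mask as hypothetical names:

```c
#include <stdint.h>

/* Four 32-bit lanes, standing in for one 128-bit SIMD register (hypothetical type). */
typedef struct { uint32_t lane[4]; } wide_u32;

/* Flip every bit of every lane: kept lanes become 0s, discarded lanes become 1s. */
static wide_u32 invert_mask(wide_u32 mask)
{
    wide_u32 out;
    for (int i = 0; i < 4; ++i) {
        out.lane[i] = (uint32_t)~mask.lane[i];
    }
    return out;
}
```

Every lane is all 1s in exactly one of the two masks, which is what lets the two masked results be merged safely later.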

Using the same AND technique with the inverted mask, we can pull out the destination values that should remain unchanged. Then we combine them, using a bitwise OR, with the valid pixel values we kept with the first mask. Since every lane is all 0s in one of the two sets of values, the OR just copies each lane from whichever set holds it, with no interference.
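
Putting the whole thing together, here is one way the masked write could be sketched in portable C. The names wide_u32 and masked_combine are hypothetical. With SSE2 intrinsics, the three steps would likely map to _mm_and_si128, _mm_andnot_si128 (which computes NOT of its first argument ANDed with its second), and _mm_or_si128, though that mapping is an assumption about how the real code is written.

```c
#include <stdint.h>

/* Four 32-bit lanes, standing in for one 128-bit SIMD register (hypothetical type). */
typedef struct { uint32_t lane[4]; } wide_u32;

/* Take new pixels where the mask is all 1s and old destination values where it is all 0s. */
static wide_u32 masked_combine(wide_u32 pixels, wide_u32 dest, wide_u32 mask)
{
    wide_u32 out;
    for (int i = 0; i < 4; ++i) {
        uint32_t keep_new = pixels.lane[i] & mask.lane[i];          /* valid new pixels    */
        uint32_t keep_old = dest.lane[i] & (uint32_t)~mask.lane[i]; /* untouched old data  */
        out.lane[i] = keep_new | keep_old;                          /* never both non-zero */
    }
    return out;
}
```

The result is what gets written back to the destination, so the lanes hanging off the edge end up holding exactly what they held before.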