I'm not sure I understand the requirements (whether you need transparency or not), but the fastest sprite code is generated code.
That is, you analyse the sprite and generate a draw routine specific to it, rather than using 'generic code' that just reads from a buffer and treats all bytes the same.
Your sprite generator should recognize individual cases like 'all bits are set, so I can move the data rather than OR it in'. On a more technical level, it should keep values in registers and reuse them wherever the code needs the same pattern/bits/data more than once.
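To make that a bit more concrete, here is a rough sketch of such a generator. It is only an illustration under a few assumptions of mine: the target is a 68000, the sprite is a flat list of 16-bit (mask, data) words where a mask bit of 1 means an opaque pixel, bitplane interleaving and scanline stride are ignored, and the generated routine may trash d0-d7. The names plan_ops and emit are made up for this example, and the min_uses threshold is a guess you would tune by counting cycles.

```python
from collections import Counter


def plan_ops(words):
    """Turn (mask, data) word pairs into (mnemonic, value, destination) tuples.

    mask bit 1 = opaque pixel. Fully transparent words produce no code,
    fully opaque words become a plain move, the rest become and/or.
    """
    ops = []
    for index, (mask, data) in enumerate(words):
        if mask == 0x0000:
            continue  # nothing to draw for this word
        dest = f"{index * 2}(a0)" if index else "(a0)"
        data &= mask  # ignore data bits outside the mask
        if mask == 0xFFFF:
            ops.append(("move.w", data, dest))
        else:
            ops.append(("and.w", ~mask & 0xFFFF, dest))  # punch the hole
            if data:
                ops.append(("or.w", data, dest))  # fill in the pixels
    return ops


def emit(ops, max_regs=8, min_uses=2):
    """Emit assembly text, loading repeated constants into d0..d7 once."""
    counts = Counter(value for _, value, _ in ops)
    repeated = [v for v, n in counts.most_common(max_regs) if n >= min_uses]
    regs = {v: f"d{i}" for i, v in enumerate(repeated)}
    lines = [f"    move.w  #${v:04X},{r}" for v, r in regs.items()]
    for op, value, dest in ops:
        src = regs.get(value, f"#${value:04X}")
        lines.append(f"    {op:<7} {src},{dest}")
    lines.append("    rts")
    return lines


if __name__ == "__main__":
    sprite = [(0xFFFF, 0x1234), (0x0000, 0x0000),
              (0x0FF0, 0x0550), (0xFFFF, 0x1234)]
    print("\n".join(emit(plan_ops(sprite))))
```

For the example sprite, the transparent word produces no code at all, the two identical opaque words share one register load, and only the partially covered word needs the and/or pair.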
Similarly, it could take advantage of writing to memory sequentially, and so on.
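As a sketch of what exploiting sequential writes could look like (again just an illustration with the same assumptions: 68000 target, flat list of 16-bit (mask, data) words with mask bit 1 = opaque, no bitplane interleaving or scanline stride; emit_sequential is a made-up name), the generator below writes through (a0)+, merges two adjacent fully opaque words into one long-word move, and steps over transparent gaps with a single lea. Whether this beats displacement addressing depends on the sprite, so a real generator would cost both variants.

```python
def emit_sequential(words):
    """Emit 68000 assembly text that walks the destination with (a0)+."""
    lines = []
    skip = 0  # bytes of transparent words passed since the last write
    i = 0
    while i < len(words):
        mask, data = words[i]
        data &= mask  # ignore data bits outside the mask
        if mask == 0x0000:
            skip += 2  # transparent word: just remember to step over it
            i += 1
            continue
        if skip:
            lines.append(f"    lea     {skip}(a0),a0")
            skip = 0
        next_opaque = i + 1 < len(words) and words[i + 1][0] == 0xFFFF
        if mask == 0xFFFF and next_opaque:
            # Two opaque words in a row: one long write (big-endian order).
            combined = (data << 16) | (words[i + 1][1] & 0xFFFF)
            lines.append(f"    move.l  #${combined:08X},(a0)+")
            i += 2
            continue
        if mask == 0xFFFF:
            lines.append(f"    move.w  #${data:04X},(a0)+")
        elif data:
            lines.append(f"    and.w   #${~mask & 0xFFFF:04X},(a0)")
            lines.append(f"    or.w    #${data:04X},(a0)+")
        else:
            lines.append(f"    and.w   #${~mask & 0xFFFF:04X},(a0)+")
        i += 1
    lines.append("    rts")
    return lines


if __name__ == "__main__":
    sprite = [(0xFFFF, 0x1111), (0xFFFF, 0x2222), (0x0000, 0x0000),
              (0x0000, 0x0000), (0x00FF, 0x0033), (0xFF00, 0x0000)]
    print("\n".join(emit_sequential(sprite)))
```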
I wrote an article that touches on at least part of what I mentioned above:
https://smfx.st/?sprites (sorry, the certificate is invalid...).
If you need hands-on help, I'd gladly give you a hand. Or we can have a little contest to see who can write the faster routine, and how fast we can go.

Kind regards
Wietze