C - WebSocket data unmasking / multi-byte XOR


The WebSocket spec defines unmasking data as

    j = i MOD 4
    transformed-octet-i = original-octet-i XOR masking-key-octet-j

where the mask is 4 bytes long and the unmasking has to be applied per byte.
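For reference, the per-byte definition above can be written as a plain loop. This is only a minimal sketch; the function name `ws_unmask_scalar` and its parameters are made up for illustration and do not come from the spec.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time reference: data[i] ^= key[i MOD 4].
   XOR is its own inverse, so the same routine masks and unmasks.
   (ws_unmask_scalar is a hypothetical name.) */
static void ws_unmask_scalar(uint8_t *data, size_t n, const uint8_t key[4])
{
    for (size_t i = 0; i < n; i++)
        data[i] ^= key[i & 3];   /* i & 3 is i % 4 for unsigned i */
}
```

Any faster variant should produce exactly the same bytes as this loop, which makes it handy as a test oracle.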

Is there a way to do this more efficiently than looping over the bytes?

The server running the code can be assumed to have a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can use asm if necessary.

I tried to look up the solution myself, but was unable to figure out whether there is an appropriate instruction in any of the dozens of SSE1-5/AVX/(whatever extension; I've lost track of the many over the years).

Thank you very much!

Edit: after rereading the spec a couple of times, it seems it's just XORing the data bytes with the mask bytes, which I can do 8 bytes at a time until the last few bytes. The question is still open, though, as I think there may still be a way to optimize this using SSE or the like (maybe processing exactly 16 bytes at a time? letting the processor do the loop? ...)
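The 8-bytes-at-a-time idea from the edit might look roughly like the sketch below. The name `ws_unmask_u64` is hypothetical, and it assumes the buffer starts at payload offset 0 so the key phase lines up. `memcpy` is used for the word loads and stores to avoid alignment and aliasing problems; compilers typically turn these into single moves.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR 8 bytes per iteration, then finish the last few bytes one by one.
   (ws_unmask_u64 is a hypothetical name.) */
static void ws_unmask_u64(uint8_t *data, size_t n, const uint8_t key[4])
{
    uint8_t rep[8];
    uint64_t k64;
    memcpy(rep, key, 4);          /* key repeated twice, in memory order, */
    memcpy(rep + 4, key, 4);      /* so this works on any endianness      */
    memcpy(&k64, rep, 8);

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, data + i, 8);  /* unaligned-safe load  */
        w ^= k64;
        memcpy(data + i, &w, 8);  /* unaligned-safe store */
    }
    for (; i < n; i++)            /* tail: i is a multiple of 8 here, */
        data[i] ^= key[i & 3];    /* so the key phase still matches   */
}
```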

Yes, you can XOR 16 bytes in one instruction using SSE2, or 32 bytes at a time with AVX2 (Haswell and later).

SSE2:

    #include <stdint.h>
    #include <emmintrin.h>                          // SSE2 intrinsics

    __m128i v, v_mask;                              // v_mask = 4-byte key repeated 4 times
    uint8_t *buff;                                  // buffer - must be 16 byte aligned

    for (int i = 0; i < n; i += 16)                 // note: n must be a multiple of 16
    {
        v = _mm_load_si128((__m128i *)&buff[i]);    // load 16 bytes
        v = _mm_xor_si128(v, v_mask);               // XOR with mask
        _mm_store_si128((__m128i *)&buff[i], v);    // store 16 masked bytes (returns void)
    }

AVX2:

    #include <stdint.h>
    #include <immintrin.h>                            // AVX2 intrinsics

    __m256i w, w_mask;                                // w_mask = 4-byte key repeated 8 times
    uint8_t *buff;                                    // buffer - must be 32 byte aligned
                                                      // for the aligned load/store used here

    for (int i = 0; i < n; i += 32)                   // note: n must be a multiple of 32
    {
        w = _mm256_load_si256((__m256i *)&buff[i]);   // load 32 bytes
        w = _mm256_xor_si256(w, w_mask);              // XOR with mask
        _mm256_store_si256((__m256i *)&buff[i], w);   // store 32 masked bytes (returns void)
    }
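The loops above assume an aligned buffer and a length that is a multiple of the vector width, which a real payload rarely satisfies. Below is one possible way to lift both restrictions in the SSE2 case, using unaligned loads/stores and a scalar tail. This is a sketch, not a tuned implementation: `ws_unmask_simd` is a hypothetical name, the buffer is assumed to start at payload offset 0, and the `__SSE2__` guard simply falls back to the byte loop when SSE2 is not enabled at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>            /* SSE2 intrinsics */
#endif

/* (ws_unmask_simd is a hypothetical name.) */
static void ws_unmask_simd(uint8_t *data, size_t n, const uint8_t key[4])
{
    size_t i = 0;
#if defined(__SSE2__)
    uint32_t k32;
    memcpy(&k32, key, 4);                              /* key bytes in memory order */
    const __m128i vmask = _mm_set1_epi32((int)k32);    /* key repeated 4 times      */

    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i)); /* unaligned load  */
        v = _mm_xor_si128(v, vmask);                              /* XOR with mask   */
        _mm_storeu_si128((__m128i *)(data + i), v);               /* unaligned store */
    }
#endif
    for (; i < n; i++)            /* remaining 0..15 bytes; i is a multiple */
        data[i] ^= key[i & 3];    /* of 16 here, so the key phase matches   */
}
```

The same structure should carry over to AVX2 by swapping in `_mm256_loadu_si256`, `_mm256_xor_si256`, `_mm256_storeu_si256` and `_mm256_set1_epi32` with a step of 32, compiled with AVX2 enabled.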
