The WebSocket spec defines unmasking data as

    j = i MOD 4
    transformed-octet-i = original-octet-i XOR masking-key-octet-j

where the mask is 4 bytes long and unmasking has to be applied per byte.
Is there a way to do this more efficiently than just looping over the bytes?
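For reference, the per-byte loop being asked about can be sketched roughly as follows. This is only a minimal sketch of the spec's definition; the function name `ws_unmask_bytewise` and its parameters are made up for illustration and do not come from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time unmasking, following the spec's definition:
 *   transformed-octet-i = original-octet-i XOR masking-key-octet-(i MOD 4)
 * ws_unmask_bytewise is a hypothetical name used only for this sketch. */
static void ws_unmask_bytewise(uint8_t *data, size_t len, const uint8_t key[4])
{
    for (size_t i = 0; i < len; i++)
        data[i] ^= key[i % 4];   /* i & 3 would work equally well */
}
```

Since XOR is its own inverse, the same loop both masks and unmasks a payload.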
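The server running the code can be assumed to have a Haswell CPU, and the OS is Linux with a kernel > 3.2, so SSE etc. are all present. The coding is done in C, but I can use asm as well if necessary.
I'd have tried to find the solution myself, but I was unable to figure out whether there is an appropriate instruction in any of the dozens of SSE1-5/AVX/(whatever extension - I've lost track of the many over the years).
Thank you very much!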
Edit: After rereading the spec a couple of times, it seems it's just XOR'ing the data bytes with the mask bytes, which I can do 8 bytes at a time until the last few bytes. The question is still open, as I think there could still be a way to optimize this using SSE or the like (maybe processing 16 bytes at a time? letting the processor handle the loop? ...)
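The 8-bytes-at-a-time idea from the edit could look something like the sketch below. It assumes the payload starts at key offset 0, and the name `ws_unmask_u64` is hypothetical. `memcpy` is used for the 64-bit loads and stores so the buffer does not need any particular alignment; compilers typically turn these into plain moves.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR 8 bytes per iteration with the 4-byte key repeated twice,
 * then finish the last few bytes one at a time.
 * ws_unmask_u64 is a hypothetical name used only for this sketch. */
static void ws_unmask_u64(uint8_t *data, size_t len, const uint8_t key[4])
{
    uint8_t key8[8];
    uint64_t mask64;
    size_t i = 0;

    /* Build the 8-byte mask in memory order, so endianness does not matter. */
    memcpy(key8, key, 4);
    memcpy(key8 + 4, key, 4);
    memcpy(&mask64, key8, 8);

    for (; i + 8 <= len; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, data + i, 8);   /* unaligned-safe load  */
        chunk ^= mask64;
        memcpy(data + i, &chunk, 8);   /* unaligned-safe store */
    }

    /* i is a multiple of 8 here, so i % 4 still selects the right key byte. */
    for (; i < len; i++)
        data[i] ^= key[i % 4];
}
```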
Yes, you can XOR 16 bytes in a single instruction using SSE2, or 32 bytes at a time with AVX2 (Haswell and later).
SSE2:

    #include <emmintrin.h>                         // SSE2 intrinsics
    #include <stdint.h>

    __m128i v, v_mask;                             // v_mask: the 4-byte key repeated 4 times
    uint8_t *buff;                                 // buffer - must be 16 byte aligned

    for (int i = 0; i < n; i += 16)                // note that n must be a multiple of 16
    {
        v = _mm_load_si128((__m128i *)&buff[i]);   // load 16 bytes
        v = _mm_xor_si128(v, v_mask);              // XOR with mask
        _mm_store_si128((__m128i *)&buff[i], v);   // store 16 masked bytes (returns void)
    }

AVX2:

    #include <immintrin.h>                            // AVX2 intrinsics
    #include <stdint.h>

    __m256i w, w_mask;                                // w_mask: the 4-byte key repeated 8 times
    uint8_t *buff;                                    // buffer - must be 32 byte aligned for
                                                      // the aligned load/store used here

    for (int i = 0; i < n; i += 32)                   // note that n must be a multiple of 32
    {
        w = _mm256_load_si256((__m256i *)&buff[i]);   // load 32 bytes
        w = _mm256_xor_si256(w, w_mask);              // XOR with mask
        _mm256_store_si256((__m256i *)&buff[i], w);   // store 32 masked bytes (returns void)
    }
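The loops above assume an aligned buffer and a length that is an exact multiple of the vector size, which a real WebSocket payload will rarely satisfy. A more complete SSE2 variant might look like the sketch below. It is an illustration under a few assumptions rather than a definitive implementation: the function name `ws_unmask_sse2` is hypothetical, the payload is assumed to start at key offset 0, and unaligned loads and stores (`_mm_loadu_si128` / `_mm_storeu_si128`) are used so the buffer needs no particular alignment. The `__SSE2__` guard simply falls back to the byte loop when SSE2 is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Unmask a payload of arbitrary length and alignment.
 * ws_unmask_sse2 is a hypothetical name used only for this sketch. */
static void ws_unmask_sse2(uint8_t *data, size_t len, const uint8_t key[4])
{
    size_t i = 0;

#if defined(__SSE2__)
    /* Copy the key bytes into a 32-bit value in memory order, then
     * broadcast it to all four 32-bit lanes of the vector. */
    int32_t key32;
    memcpy(&key32, key, 4);
    const __m128i v_mask = _mm_set1_epi32(key32);

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i)); /* load 16 bytes   */
        v = _mm_xor_si128(v, v_mask);                             /* XOR with mask   */
        _mm_storeu_si128((__m128i *)(data + i), v);               /* store 16 bytes  */
    }
#endif

    /* Tail (or everything, without SSE2): i is a multiple of 16 here,
     * so i % 4 still selects the right key byte. */
    for (; i < len; i++)
        data[i] ^= key[i % 4];
}
```

An AVX2 version would follow the same pattern with `_mm256_set1_epi32`, `_mm256_loadu_si256` and `_mm256_storeu_si256`, compiled with AVX2 enabled (for example `-mavx2` on GCC or Clang).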