SIMD notes

✍ WebAssembly SIMD in Ostinato.

Examining WebAssembly performance in different browsers on various platforms suggests that, at the moment, a SIMD-enabled version does not exhibit any distinct speed improvement.

In the 'precise' mode the matrix multiplication routine cxMtx::mul is called about 1500 times per frame, so presumably even a small improvement in its performance would result in a noticeable speed-up. This observation applies to the game-specific workload only, i.e. the matrices are multiplied often, but they are the small 4x4 matrices typical in this context. The situation could be quite different for significantly bigger matrices.
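For orientation, here is a minimal sketch of the kind of 4x4 single-precision multiply discussed above. It is not the actual cxMtx::mul code: the Mtx4 type and the mtx4_* functions are hypothetical names made up for this illustration, and a row-major layout is assumed. The product is written so that the innermost loop walks over four adjacent floats, which is exactly the amount of data one 128-bit SIMD register holds. Whether a compiler actually turns that loop into vector instructions depends on the compiler and its flags.

```cpp
#include <cassert>

// Hypothetical stand-in for a 4x4 float matrix (row-major).
// This is NOT Ostinato's cxMtx, just an illustration.
struct Mtx4 {
    float m[4][4];
};

// Returns the identity matrix.
Mtx4 mtx4_identity() {
    Mtx4 r;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            r.m[i][j] = (i == j) ? 1.0f : 0.0f;
        }
    }
    return r;
}

// Computes a * b. Each output row is accumulated as a linear
// combination of the rows of b, so the innermost loop over j
// updates four contiguous floats at a time.
Mtx4 mtx4_mul(const Mtx4& a, const Mtx4& b) {
    Mtx4 out;
    for (int i = 0; i < 4; ++i) {
        float row[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
        for (int k = 0; k < 4; ++k) {
            const float s = a.m[i][k];
            for (int j = 0; j < 4; ++j) {
                row[j] += s * b.m[k][j];
            }
        }
        for (int j = 0; j < 4; ++j) {
            out.m[i][j] = row[j];
        }
    }
    return out;
}

// Small self-check: multiplying by the identity leaves the matrix unchanged.
bool mtx4_self_check() {
    Mtx4 a;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            a.m[i][j] = float(i * 4 + j + 1);
        }
    }
    const Mtx4 r = mtx4_mul(a, mtx4_identity());
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            if (r.m[i][j] != a.m[i][j]) return false;
        }
    }
    return true;
}
```

A routine of this shape does 64 multiplications and 48 to 64 additions per call, so at roughly 1500 calls per frame the total arithmetic is modest, which is consistent with the small differences reported in these notes.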

The difference between the native versions compiled with and without SIMD instructions enabled is marginal, but a tiny improvement is present.

Nevertheless, it is interesting to look at the code clang generates for this demo and to compare the 'vectorized' version with the 'scalar' one.

First we need to produce a web version in which the WebAssembly code is not embedded into the page.

This is how to do it with SIMD vectorization enabled:


			./build.sh wasm-0 -g -msimd128

Without vectorization enabled:


			./build.sh wasm-0 -g

Extract the listing with:


			$EMSDK/upstream/bin/wasm-dis bin/ostinato.wasm > ostinato_wasm.txt

Native build with SIMD code generation enabled:


			CXX=clang++ ./build.sh -O3 -march=native

And without SIMD code generation:


			CXX=clang++ ./build.sh -O3 -march=native -fno-tree-vectorize -fno-tree-slp-vectorize

To extract the listing:


			objdump -CD -j .text bin/prog/ostinato > ostinato_native.txt

Here are some additional build options that are worth exploring.