When compiling with GCC, the option `-fopt-info-vec-all` gives you information about the vectorization of the code. In this case, GCC reports
edit: with intel compiler using `-qopt-report=5 -qopt-report-phase=vec -qopt-report-file=stdout`
    // the for block
    <source>:10:24: note: ===== analyze_loop_nest =====
    <source>:10:24: note: === vect_analyze_loop_form ===
    <source>:10:24: note: not vectorized: control flow in loop.
    <source>:10:24: note: bad loop form.
    <source>:5:6: note: vectorized 0 loops in function.

    // the if block inside the for block
    <source>:11:9: note: got vectype for stmt: _4 = *_3; const vector(16) int
    <source>:11:9: note: got vectype for stmt: _8 = *_7; const vector(16) int
    <source>:11:9: note: === vect_analyze_data_ref_accesses ===
    <source>:11:9: note: not vectorized: no grouped stores in basic block.
    <source>:11:9: note: ===vect_slp_analyze_bb===
    <source>:11:9: note: ===vect_slp_analyze_bb===
    <source>:11:9: note: === vect_analyze_data_refs ===
    <source>:11:9: note: not vectorized: not enough data-refs in basic block.
    Begin optimization report for: is_sorted(const int32_t *, size_t)

    Report from: Vector optimizations [vec]

    LOOP BEGIN at <source>(12,5)
    remark #15324: loop was not vectorized: unsigned types for induction
    variable and/or for lower/upper iteration bounds make loop uncountable
    LOOP END
I think the reduction in the number of branches inside the loop is a significant factor here, i.e. the if statement is executed once per 4, 8 or 16 elements instead of once per element. Branches invoke branch-prediction logic, which is non-trivial, so even if they don't slow execution they may increase power consumption.
On this basis, another way of approaching the problem is to loop over the whole array and return the result at the end, but that of course takes N/2 iterations on average, whereas the nested if can exit early. A compromise might be to loop over short sub-spans of the array and do an early-exit test at the end of each sub-span.

A good sub-span length for scalar code might be around 16, partly because we hit the law of diminishing returns for longer spans.
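A rough sketch of that compromise (the function name and the chunk length of 16 are my assumptions, not from the article): scan each fixed-size chunk branchlessly, then do a single early-exit test per chunk.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sub-span variant: branchless within each chunk,
// one early-exit test per chunk.
bool is_sorted_subspan(const int32_t* a, size_t n) {
    const size_t span = 16;  // assumed sub-span length
    size_t i = 1;
    for (; i + span <= n; i += span) {
        int ok = 1;
        for (size_t j = i; j < i + span; ++j)
            ok &= (a[j - 1] <= a[j]);  // accumulate, no branch here
        if (!ok) return false;         // early exit once per chunk
    }
    for (; i < n; ++i)                 // scalar tail
        if (a[i - 1] > a[i]) return false;
    return true;
}
```

On an unsorted input this still bails out within one chunk of the first out-of-order pair, so the worst-case extra work is bounded by the span length.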
Also I think is_sorted_asc() or is_sorted_ascending() might be a better name if this were for a function in a general purpose library.
It seems that the early exit (the return false) in the middle of the loop prevents the compiler from vectorizing it. The compiler can't know whether part of the memory is inaccessible, so speculatively reading memory the loop might never have reached is illegal.
If you introduce a result variable and set it during the loop, the compiler can vectorize the loop. At least icc does, but I didn't play with the compiler settings of the other compilers too much.
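A minimal sketch of that result-variable version (the function name is mine; the source only describes the idea):

```cpp
#include <cstddef>
#include <cstdint>

// Accumulate the comparison into a flag instead of returning early,
// so the loop body carries no control flow the vectorizer must preserve.
bool is_sorted_flag(const int32_t* a, size_t n) {
    bool sorted = true;
    for (size_t i = 1; i < n; ++i)
        sorted &= (a[i - 1] <= a[i]);
    return sorted;
}
```

The trade-off is exactly the one discussed above: the loop always scans all N elements, but the branch-free body is what lets icc vectorize it.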
Also, compilers can still exit the loop early, and at least MSVC does, because an out-of-bounds read (a segfault) is UB and can be "optimized away".
FWIW HeroicKatora gives an improved version on Reddit: https://www.reddit.com/r/cpp/comments/8bkaj3/is_sorted_using...
It uses a somewhat more arcane (but not crazy) method for determining sortedness: https://godbolt.org/g/MKN9HP - clang and icc seem to have no problem vectorising the inner loop.
gcc manages it too, but emits a huge amount of code. And setting -Os seems to stop it from vectorising the code. Shame.
edit: replaced with more correct version..
It would be interesting to see how the SIMD versions perform for small arrays. I suspect the naive version is better in this case, and this could be the reason why compilers do not convert the code to SIMD instructions...
Generic code yields better results than SSE/AVX optimized ones. I wonder why that could be.
i += 7;
Wouldn't this cause a sizable performance hit due to being misaligned most of the time?
Another possible implementation which is log^2(n) https://en.wikipedia.org/wiki/Bitonic_sorter
I always miss multithreaded benchmarks when SSE/AVX instructions are used. AFAIK AVX processing units are oversubscribed: there are fewer of them than CPU cores.
I can imagine that running AVX is_sorted (or any other AVX procedure) in multiple threads would be actually slower than running non-vectorized procedure.
Of course, that's my purely anecdotal opinion.
This assumes unaligned access is cheap.
Unfortunately SSE doesn't speed up huge data sets much. It's only fast when everything is in the cache.