The JVM has a number of cool features that enable you to efficiently drop down into C (FFI) yourself: these tricks aren't just for built-ins.
A quick overview:
- First there was JNI. JNI means that you write method stubs with the "native" keyword. Then you run javah, which gives you some C glue code that you eventually need to compile. This is very fast, but it's annoying because now you need a tool chain everywhere. The Python equivalent of this is roughly writing CPython extensions.
- People thought JNI was annoying, so Sun developed JNA. JNA lets you bind a library directly: all the magic comes with the JVM, and you can just dlopen something and call some syms. This works fine, but it's very slow. The Python equivalent of this is roughly ctypes.
- Most recently, there's JNR and jnr-ffi. They do a very clever trick: you use JNI to get to libffi, and then you use libffi to call everything else performantly. You get roughly the performance of JNI, with the convenience of JNR. The Python equivalent of this is roughly cffi.
JNR is way more usable than I thought it would be. I develop caesium, a Clojure libsodium binding, using jnr-ffi and a pile of macros. I gave a talk about this at Clojure Conj (recording , slides ) if you're interested.
To be fair: this code uses intrinsics, which means that it's implemented differently than the three methods shown above, so it's still slightly different. It's just not different in a way that's meaningful to you unless you're working on the JVM itself :)
One thing I learned about pcmpxstrx is that it's surprisingly slow: latency of 10-11 cycles and reciprocal throughput of 3-5 cycles on Haswell according to Agner's tables, depending on the precise instruction variant. The instructions are also limited in the ALU ports they can use. Since AVX2 has made SIMD on x86 fairly flexible, it can sometimes not be worth using the string comparison instructions if simpler instructions suffice: even a slightly longer sequence of simpler SIMD instructions sometimes beats a single string compare.
The SSE 4.2 string comparison instructions still have their uses, but it's always worth testing alternate instruction sequences when optimizing code that might use them.
Go uses the same technique, or a Duff's device when AVX2 is unsupported.
I am convinced that similar tricks are employed by almost every language runtime or standard library (GNU libc also does this, etc.)
This guy's blog is awesome. A lot of in-depth information, original research, nice and funny writing style. I hope he finds some time to write / speak more soon.
I find it a bit sad that, instead of optimising the existing REP CMPS instruction to do vectorised compares like they did with REP MOVS/STOS and block copies/writes, Intel introduced another even more complex instruction that itself requires a bunch of additional support code to use. I certainly don't think it's a "good use of CISC".
Another excellent series on JVM perf/internals https://shipilev.net/jvm-anatomy-park/
> But did you know there is also a secret second implementation? String.compareTo is one of a few methods that is important enough to also get a special hand-rolled assembly version.
Is there a list somewhere of all the functions that have had such special treatment?
I think it could be fairly profitable performance wise to write a specialized version of compareTo (simple memcmp) for cases where the result is directly compared to 0, equal or not equal case.
Or to have compareEqualityTo() version.
Current version supports greater and less as well, which is of course useful in many data structures and sorting, but might not represent bulk of the call sites. This feature does not come for free.
This alternative version (memcmp alias) could be written to run faster.
Only benchmarking could tell whether or not it's ultimately worth it.
> The code that generates this, MacroAssembler::string_compare in macroAssembler_x86.cpp is well-documented for the curious.
I was expecting that this would be some C++ code, or maybe inline assembly, but it turns out it's just a bunch of macros that are a thin layer over assembly instructions. Is this done for portability, to abstract over differing instruction names on different architectures?
What a lovely ready; I enjoy learning about the intricate details of the JVM. Thank you for sharing!
tl;dr: The string comparison intrinsic in the JVM uses a vectorised string comparison instruction.
(vpcmpestri (of the pcmpxstrx family) isn't an especially crazy instruction to use for string comparison. That's what it's designed for.)
Click here to discover that crazy instruction! C++ programmers HATE IT!!
(I am just ironizing about the title, the content is actually great!)
UTF-16 is one of these ill-fated developments that curse some languages & platforms (WinNT incl Win10, WinAPI32, Java, Flash, JS, Python 3) to his day.
compareTo uses 0x19, which means doing the “equal each” (aka string comparison) operation across 8 unsigned words (thanks UTF-16!) with a negated result. This monster of an instruction takes in 4 registers of input:
It's great to see the disassembly. Would be curious to see the assembly performance-tested against other similar implementations (UTF-16 compatible.)
Btw, frik, you are shadow banned. I sense it is for the frigid opinions. Personally, I see them as "ok" as long as you provide technical reasons for your opinions and are polite enough.