The authors of Keccak / SHA-3 came up with Sakura, a hash tree construction that's provably as collision-resistant as the underlying hash function and very flexible.
If you're designing a new system using hash trees, you better use Sakura trees or have a good explanation for why not.
Edit: I also wrote some demo code for this attack when arguing with one of the IPFS guys about why they really shouldn't use Bittorrent BEP 30-style Merkle Trees. In order to get security with Bittorrent-style trees, the [length, root] pair really needs to be your cryptographic identifier, and you need to make sure they're always properly handled as a pair. There are just too many caveats to usage and we have provably secure alternatives.
As a former LimeWire developer, it makes me sad that Gnutella's TigerTrees avoided this vulnerability long before bittorrent's BEP 30 was published, and yet Bittorrent got it wrong. It was a well-known vulnerability, and was covered as part of my interview at LimeWire.
This is a well known edge case of Merkle trees; my undergrad security course covered this as an aside and it is prominent on the wikipedia page.
For those not familiar with the specific implementation issue, this post should be a fun read.
I don't know what's more concerning, that pymerkletools advertises itself for use with Bitcoin or that none of the contributors read the wiki page :|
This was interesting, but when you think about it, this isn't really a flaw as much as something inherent to the way it was designed.
Imagine you had some combinations of functions f, g, and h such that f(g(h(x))) = y. Obviously you could calculate that as h(x) -> g(h(x)) -> f(g(h(x)) = y, but then of course knowing h(x), or g(h(x)) would enable you to find y as well. So of course, due to the recursive nature of it, picking any set of inputs that were the outputs of a previous call would give you the same output.
That argument doesn't exactly fit multiple inputs, but the idea is the same.
I find this article and discussion unsettling. A large number of people made non-typesafe merkle trees (including bittorrent, apparently), and then were surprised that passing it incorrect types cause it to produce incorrect output.
I don’t see how this is a vulnerability in the merkle tree algorithm. It just seems like yet another case of “python libraries contain bugs that are common in idiomatic python”.
I guess since so many people have independently implemented the same mistake, the writeup is useful.
I think I might be missing the point...is the attack just to pretend an intermediate hash was a leaf?
If you're providing a merkle branch to prove that a document you've hashed is in the set, then I don't see how this helps you at all.
My preferred fix is including the total size of data in the root hash, since it's often convenient to know the total file size before starting a download.
Another solution to this attack can be:
hashing root with tree depth as last step.
I think the Wikipedia reference doesn't do it justice - the article alone is already ~1500 words.
Sorry, I couldn't help myself