-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use hash function to generate stable table types #943
Conversation
Given the fact that the chosen hash function actually affects correctness (not just performance as in a hash table), I am pretty uneasy about not directly controlling the hash function we use. It appears that the libstdc++ implementation of
This alone means that
The answer is simple: use a hash function implementation that we directly control and whose stability we can therefore guarantee. There are many good free non-cryptographic hash functions now, but I don't think you can go wrong with either MurmurHash3 or xxHash. OTOH, for this application collision-resistance is critical, and there are likely no downsides to just using a cryptographic hash, so unless there are performance requirements that make it infeasible, I think we might as well just use truncated SipHash or SHA2 or SHA3. Better safe than sorry. [0]libstdc++ [1]libc++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my long comment, I think we need to explicitly specify the hash function, and if performance allows it should be crypto-quality.
Any reason we need a crypto-quality hash? Or a hash in general? I would say any deterministic mapping function over the fully qualified symbol name should work... |
Of course any 1:1 function from symbol names to integers would work, but the universe of symbol names under some reasonable length has cardinality >> 2^32, so if you don't want to materialize this mapping explicitly and manage it in a centralized fashion (e.g. via an autoincrement assignment of integers to symbol names, maintained in some master database), then I don't see an alternative to a hash function (and also constraints on the number of hashes you generate, as we've discussed). You don't strictly need a crypto-quality hash function for this, just one with sufficient (empirically demonstrated) collision resistance. But IMO we may as well use a (truncated) crypto hash function if our performance requirements allow it, for guaranteed rather than "good enough" collision resistance. |
1414be4
to
e6021c3
Compare
I have switched to our own murmur3 hash. I don't think any of the crypto hash properties are needed here, and I certainly don't want to take a dependency on any crypto library or maintain a crypto function myself for a temporary solution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as long as the smart ones are OK with it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with copying a fixed reference implementation into our codebase, but I think it should be relocated to a common source location and library. I agree that MurmurHash3 should be sufficient for our requirements.
Just noted a couple of typos otherwise.
@@ -783,6 +781,83 @@ reference_offset_t ddl_executor_t::find_available_offset(gaia::common::gaia_id_t | |||
find_parent_available_offset(table.outgoing_relationships())); | |||
} | |||
|
|||
// Adapted from the public domain murmur3 hash implementation at: | |||
// https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp | |||
uint32_t murmurhash3_x86_32(const void* key, int len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put this in a separate file under gaia_internal/common
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
ASSERT_PRECONDITION(db_name.length() <= std::numeric_limits<int>::max(), "The DB name is too long."); | ||
ASSERT_PRECONDITION(table_name.length() <= std::numeric_limits<int>::max(), "The table name is too long."); | ||
|
||
return murmurhash3_x86_32(table_name.data(), static_cast<int>(table_name.length())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm concerned about unaligned reads in this variant, because 1) that's undefined behavior (see e.g. https://discourse.julialang.org/t/problem-with-issue-murmurhash3-has-undefined-behavior/9807), and 2) it could cause hardware exceptions on certain ARM CPUs (which has apparently happened with MurmurHash), since UB means that alignment fixup code won't be generated (I don't know details well). How about MurmurHashAligned2?
(Re: MurmurHash3, the comment at https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L52 seems to indicate that getblock32()
/getblock64()
are intended to be extensibility points for aligning reads, but we'd have to implement that ourselves, although that would be trivial with memcpy
. Also, those may not be sufficient: aappleby/smhasher#74.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I just noticed that you already fixed the unaligned reads and writes in MurmurHash3_x86_32
, nice job!
for (int i = -nblocks; i; i++) | ||
{ | ||
uint32_t k1; | ||
std::memcpy(&k1, (blocks + i), sizeof(k1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, NVM, looks like you already handled unaligned reads, nice!
h1 *= 0xc2b2ae35; // NOLINT(cppcoreguidelines-avoid-magic-numbers) | ||
h1 ^= h1 >> 16; // NOLINT(cppcoreguidelines-avoid-magic-numbers) | ||
|
||
return h1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and you handled the unaligned write here too, nice!
@@ -783,6 +781,83 @@ reference_offset_t ddl_executor_t::find_available_offset(gaia::common::gaia_id_t | |||
find_parent_available_offset(table.outgoing_relationships())); | |||
} | |||
|
|||
// Adapted from the public domain murmur3 hash implementation at: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just copy that implementation in its own file so we can reuse it elsewhere instead of having it embedded in this cpp file? Copying the implementation instead of adapting it would also make it easier to review the implementation, because we could just ensure that the copy is correct instead of walking through the adaptations to make sure that it doesn't break the original implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is adapted. I need to make a few adjustments. For example, handling unaligned reads as Tobin noted below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if the adaptation is needed, why not move it into its own file, so it can be reused elsewhere? Do we expect to replace this implementation with something else in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to gaia_internal/common
e6021c3
to
04dc4f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I missed a few comments from the previous PR (#943). This change addresses the remaining comments.
See: https://gaiaplatform.atlassian.net/browse/GAIAPLAT-1389 for details on this workitem. Some unit tests are updated due to the change of type generation algorithms. We can no longer assume type 1 or 2 always exists.
I missed a few comments from the previous PR (#943). This change addresses the remaining comments.
See: https://gaiaplatform.atlassian.net/browse/GAIAPLAT-1389 for details on this workitem.
Some unit tests are updated due to the change of type generation algorithms. We can no longer assume type 1 or 2 always exists.