Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hash function to generate stable table types #943

Merged
merged 7 commits into from
Sep 23, 2021

Conversation

chuan
Copy link
Contributor

@chuan chuan commented Sep 22, 2021

See: https://gaiaplatform.atlassian.net/browse/GAIAPLAT-1389 for details on this workitem.

Some unit tests are updated due to the change of type generation algorithms. We can no longer assume type 1 or 2 always exists.

@senderista
Copy link
Contributor

Given the fact that the chosen hash function actually affects correctness (not just performance as in a hash table), I am pretty uneasy about not directly controlling the hash function we use. It appears that the libstdc++ implementation of std::hash[0] uses MurmurHashUnaligned2 and libc++[1] uses MurmurHash2 or CityHash, all of which are state-of-the-art algorithms, but there are other issues which make using std::hash infeasible for this application:

  1. There is no guarantee of hash stability between program executions. To quote the standard (https://en.cppreference.com/w/cpp/utility/hash):

Hash functions are only required to produce the same result for the same input within a single execution of a program; this allows salted hashes that prevent collision denial-of-service attacks.

This alone means that std::hash cannot be used in any application that requires stable mappings from inputs to outputs. (And in practice, you can expect unstable behavior, to thwart hash flooding attacks on hash tables.)

  1. The other issue is that any STL implementation is free to change its std::hash implementation in any release, because by 1) it has no stability guarantees even between program executions using the same version of an STL implementation, so a fortiori it has no obligation to keep the hash function implementation the same between releases.

The answer is simple: use a hash function implementation that we directly control and whose stability we can therefore guarantee. There are many good free non-cryptographic hash functions now, but I don't think you can go wrong with either MurmurHash3 or xxHash. OTOH, for this application collision-resistance is critical, and there are likely no downsides to just using a cryptographic hash, so unless there are performance requirements that make it infeasible, I think we might as well just use truncated SipHash or SHA2 or SHA3. Better safe than sorry.

[0]libstdc++ std::hash implementation:
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/libsupc%2B%2B/hash_bytes.cc

[1]libc++ std::hash implementation:
https://github.com/llvm/llvm-project/blob/bcaa01752f5e45b9d1647aef16d74c748192f8a5/libcxx/include/__string#L1155

Copy link
Contributor

@senderista senderista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my long comment, I think we need to explicitly specify the hash function, and if performance allows it should be crypto-quality.

@yiwen-wong
Copy link

Per my long comment, I think we need to explicitly specify the hash function, and if performance allows it should be crypto-quality.

Any reason we need a crypto-quality hash? Or a hash in general? I would say any deterministic mapping function over the fully qualified symbol name should work...

@senderista
Copy link
Contributor

Per my long comment, I think we need to explicitly specify the hash function, and if performance allows it should be crypto-quality.

Any reason we need a crypto-quality hash? Or a hash in general? I would say any deterministic mapping function over the fully qualified symbol name should work...

Of course any 1:1 function from symbol names to integers would work, but the universe of symbol names under some reasonable length has cardinality >> 2^32, so if you don't want to materialize this mapping explicitly and manage it in a centralized fashion (e.g. via an autoincrement assignment of integers to symbol names, maintained in some master database), then I don't see an alternative to a hash function (and also constraints on the number of hashes you generate, as we've discussed). You don't strictly need a crypto-quality hash function for this, just one with sufficient (empirically demonstrated) collision resistance. But IMO we may as well use a (truncated) crypto hash function if our performance requirements allow it, for guaranteed rather than "good enough" collision resistance.

@chuan chuan force-pushed the chuan/table-type-hash-id branch from 1414be4 to e6021c3 Compare September 22, 2021 19:37
@chuan chuan requested a review from senderista September 22, 2021 19:38
@chuan
Copy link
Contributor Author

chuan commented Sep 22, 2021

I have switched to our own murmur3 hash. I don't think any of the crypto hash properties are needed here, and I certainly don't want to take a dependency on any crypto library or maintain a crypto function myself for a temporary solution.

Copy link
Contributor

@simone-gaia simone-gaia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as long as the smart ones are OK with it :)

Copy link
Contributor

@senderista senderista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with copying a fixed reference implementation into our codebase, but I think it should be relocated to a common source location and library. I agree that MurmurHash3 should be sufficient for our requirements.

Just noted a couple of typos otherwise.

@@ -783,6 +781,83 @@ reference_offset_t ddl_executor_t::find_available_offset(gaia::common::gaia_id_t
find_parent_available_offset(table.outgoing_relationships()));
}

// Adapted from the public domain murmur3 hash implementation at:
// https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
uint32_t murmurhash3_x86_32(const void* key, int len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this in a separate file under gaia_internal/common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved.

production/catalog/src/ddl_executor.cpp Outdated Show resolved Hide resolved
production/system/tests/test_recovery.cpp Outdated Show resolved Hide resolved
production/catalog/gaiac/src/main.cpp Outdated Show resolved Hide resolved
production/catalog/parser/src/lexer.ll Outdated Show resolved Hide resolved
production/catalog/src/ddl_executor.cpp Show resolved Hide resolved
ASSERT_PRECONDITION(db_name.length() <= std::numeric_limits<int>::max(), "The DB name is too long.");
ASSERT_PRECONDITION(table_name.length() <= std::numeric_limits<int>::max(), "The table name is too long.");

return murmurhash3_x86_32(table_name.data(), static_cast<int>(table_name.length()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm concerned about unaligned reads in this variant, because 1) that's undefined behavior (see e.g. https://discourse.julialang.org/t/problem-with-issue-murmurhash3-has-undefined-behavior/9807), and 2) it could cause hardware exceptions on certain ARM CPUs (which has apparently happened with MurmurHash), since UB means that alignment fixup code won't be generated (I don't know details well). How about MurmurHashAligned2?

(Re: MurmurHash3, the comment at https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L52 seems to indicate that getblock32()/getblock64() are intended to be extensibility points for aligning reads, but we'd have to implement that ourselves, although that would be trivial with memcpy. Also, those may not be sufficient: aappleby/smhasher#74.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just noticed that you already fixed the unaligned reads and writes in MurmurHash3_x86_32, nice job!

production/db/core/tests/test_relationships.cpp Outdated Show resolved Hide resolved
for (int i = -nblocks; i; i++)
{
uint32_t k1;
std::memcpy(&k1, (blocks + i), sizeof(k1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, NVM, looks like you already handled unaligned reads, nice!

h1 *= 0xc2b2ae35; // NOLINT(cppcoreguidelines-avoid-magic-numbers)
h1 ^= h1 >> 16; // NOLINT(cppcoreguidelines-avoid-magic-numbers)

return h1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and you handled the unaligned write here too, nice!

@@ -783,6 +781,83 @@ reference_offset_t ddl_executor_t::find_available_offset(gaia::common::gaia_id_t
find_parent_available_offset(table.outgoing_relationships()));
}

// Adapted from the public domain murmur3 hash implementation at:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just copy that implementation in its own file so we can reuse it elsewhere instead of having it embedded in this cpp file? Copying the implementation instead of adapting it would also make it easier to review the implementation, because we could just ensure that the copy is correct instead of walking through the adaptations to make sure that it doesn't break the original implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is adapted. I need to make a few adjustments. For example, handling unaligned reads as Tobin noted below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the adaptation is needed, why not move it into its own file, so it can be reused elsewhere? Do we expect to replace this implementation with something else in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to gaia_internal/common

@chuan chuan force-pushed the chuan/table-type-hash-id branch from e6021c3 to 04dc4f3 Compare September 23, 2021 00:20
Copy link
Contributor

@senderista senderista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chuan chuan merged commit e92afdc into master Sep 23, 2021
@chuan chuan deleted the chuan/table-type-hash-id branch September 23, 2021 00:59
chuan added a commit that referenced this pull request Sep 23, 2021
I missed a few comments from the previous PR (#943). This change addresses the remaining comments.
chuan added a commit that referenced this pull request Sep 23, 2021
See: https://gaiaplatform.atlassian.net/browse/GAIAPLAT-1389 for details on this workitem.

Some unit tests are updated due to the change of type generation algorithms. We can no longer assume type 1 or 2 always exists.
chuan added a commit that referenced this pull request Sep 23, 2021
I missed a few comments from the previous PR (#943). This change addresses the remaining comments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants