-
Notifications
You must be signed in to change notification settings - Fork 473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Second Hash Function? #78
Comments
You should put entries with same hash into some list, no matter which hash function is used, unless it is perfect hash function (PHF, which is basically possible on fixed set only).
the hash table represented as flat JSON could look like this: {
1: ["AAA", "BBB"],
2: ["CCC", "DDD"]
3: "EEE"
} Because the probability of hash collision is very small (depending on the data-set), you'd have few elements in every list (mostly 1 only). The second hash function cannot avoid the collisions on variable dataset, it would just decrease the probability depending on distribution algorithm of both functions for variable set (and it'd be larger as you may assume, see Birthday Paradox). |
@sebres But linear probing is faster and easier to implement than chaining... I don't know how to find best(or proper) 2nd hash function. Or, how to probe for handling collisions... |
The case is - if your set, you want to put into table, is variable, you cannot really avoid collision without verifying of whole set or else without mathematical proof for combination of both hash functions. Still worse the combination of both functions would be probably far from ideal in sense of common hashing. Also note the "quality" of set is sometimes more important as the quantity, so if you'd do a probation over hashing of binary and text (ascii) data sets, you'd catch totally different results (probably would catch a collision earlier by hashing of the text blocks). Here are some tests of one 32-bits hash function, as you can see the results are sometimes far from 232/2 (216), just because of the data-set format. See similar question #58 about collision properties (note the creation of mathematical proof for two function is more complex).
You can implement your hash table in the way you wanted do that (without serialization inside the bucket, so without grouping the entries in the list), but with data compassion in CreateHashEntry function in case if hash value already exists, so if both keys are different you have found a collision. |
Does it only counts collisions or handle them as well?
If I want to build my own hash table with MurmurHash, how should I handle collisions...?
Thank you,
The text was updated successfully, but these errors were encountered: