type_2_scd_generic_upsert does not handle NULL values properly #121
Comments
This is the other issue mentioned in #120
@dgcaron thank you for flagging this. I am not very familiar with the topic — could you explain the potential performance boost and how this would fix the null value issues?
I believe the issue arises here: https://github.com/MrPowers/mack/blob/main/mack/__init__.py#L103
If the base.{attr} value is initially null on the first write, the SCD2 isn't built properly on subsequent writes. This has to do with the way SQL and PySpark handle null values in comparisons. The performance boost should come from joining on a precalculated hash (which is also persisted in the SCD2 table) on both sides instead of on all columns of interest: you don't have to evaluate each column on both sides to check for changes. I'll add a test case that shows the issue in the upcoming days.
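The comparison problem described above can be illustrated without Spark at all. The sketch below is a minimal pure-Python simulation of SQL's three-valued logic (it is not mack's code): a `<>` comparison involving NULL evaluates to NULL rather than true, so a NULL-to-value change is never flagged.

```python
def sql_neq(a, b):
    # In SQL and Spark, any comparison involving NULL yields NULL
    # (modeled here as None), not True -- so "base.attr <> updates.attr"
    # never fires when either side is NULL, and the changed row
    # slips through the change detection.
    if a is None or b is None:
        return None
    return a != b

# NULL vs. a real value: the inequality does NOT report a change
print(sql_neq(None, "new value"))  # None, not True
# ordinary values behave as expected
print(sql_neq("old", "new"))       # True
```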
@dgcaron understood. Precalculating a hash column and comparing only that is far less expensive than comparing each column, and it would solve the Spark null problem — although that might also be solvable through eqNullSafe (not sure though).
Yes, I guess eqNullSafe should solve the null issue too, but I am not sure how to implement it properly with an expression string. It would allow for a non-breaking change, though. If that is the preference, then I can take a look some time soon.
@dgcaron Maybe we could do that as a first step. We could still implement an SCD2 with a hash later :)
@dgcaron - just a friendly ping on this one. Would love to get this fix added! Thank you! |
The current way the type_2_scd_generic_upsert function checks for changes in a row involves evaluating each column, and this does not yield the expected result. A possible performance improvement, which would also cater for NULL values, is to add a hash column calculated from the contents of the columns in the table (except the SCD2 system columns).
Some background: https://datacadamia.com/dit/owb/scd2_hash
Besides some supporting code, the most interesting changes would be:
- a UDF to calculate the hash
- addition of the hash column to the update set
- staging changes based on the hash of the columns instead of a column-by-column comparison
- merging using the hash column
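The hashing and staging steps could look something like the following pure-Python sketch. It is an illustration only, not mack's implementation: the `pkey`/`attrs` parameter names and the `<NULL>` sentinel are assumptions, and in the real function the hash would be computed by a Spark UDF and the comparison done inside the merge.

```python
import hashlib

def row_hash(row, attrs):
    # Hash only the tracked attributes. A NULL sentinel keeps None and
    # the empty string from colliding -- NULL handling is exactly the
    # case that breaks the column-by-column comparison today.
    parts = ["<NULL>" if row.get(a) is None else str(row[a]) for a in attrs]
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()

def stage_changes(base_rows, update_rows, pkey, attrs):
    # One hash comparison per row replaces N column comparisons, and
    # NULLs are handled by the sentinel instead of SQL three-valued logic.
    # New keys (no current base row) are staged as well.
    current = {r[pkey]: row_hash(r, attrs) for r in base_rows}
    return [u for u in update_rows
            if current.get(u[pkey]) != row_hash(u, attrs)]

base = [{"pkey": 1, "attr1": None}, {"pkey": 2, "attr1": "a"}]
updates = [{"pkey": 1, "attr1": "x"},   # NULL -> value: now detected
           {"pkey": 2, "attr1": "a"},   # unchanged: skipped
           {"pkey": 3, "attr1": None}]  # new key: staged
staged = stage_changes(base, updates, "pkey", ["attr1"])
# staged contains the rows for pkey 1 and 3
```

Persisting the hash in the SCD2 table (and in the merge condition) is what delivers the performance win: the join evaluates one precomputed column per side instead of every attribute.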
This is a breaking (not backwards-compatible) change to the current behavior of the function, so you could consider making it a separate function.