Hi Team and Community,
I've been exploring the potential for Splink to replace an existing process that assigns golden records. At present, this process consists of deterministic rules alongside a clerical team that resolves conflicts when the rules match an input record to multiple golden records.
The base table is 100 million rows, of which 22 million have been processed and assigned to the golden record table, which has 18 million rows.
Moving forward, I'd like to perform a scheduled incremental `link_only` comparison between an input set (left) and the golden record table (right). I've been referencing previous discussions such as #1814, #1502, #2464, and #2354. The left table can vary from 30k to 2 million rows, and I wouldn't be concerned with deduplication within it. The existing process identifies roughly 10k new entries for the right table on a daily basis, with the rest identified as matches to existing golden records. The data quality of the two tables is very similar.
I'm thinking of implementing something similar to what is mentioned in #2464, where the right table is a 'spine' that gets extended with the input set based on the highest probability between a row in the left and right tables. A clerical review process would handle cases where multiple candidates have high `match_probability` values within a certain threshold of each other, and records with low match probabilities would become new entries in the right table.
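For concreteness, the setup I have in mind looks roughly like the sketch below. This assumes the Splink 4 API, and the column names, comparisons, and blocking rules are placeholders rather than my real config:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# Placeholder comparisons and blocking rules -- the real model would use our actual columns
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("postcode"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("surname", "dob"),
        block_on("postcode"),
    ],
)

# input_batch = today's left table, golden_records = the spine (right table)
linker = Linker([input_batch, golden_records], settings, db_api=DuckDBAPI())

# ... model training (or loading a saved model) would happen here ...

# Score candidate pairs; records with no high-probability match become new golden
# records, and ambiguous cases (several similarly high probabilities) go to clerical review
predictions = linker.inference.predict(threshold_match_probability=0.5)
df_scores = predictions.as_pandas_dataframe()
```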
Would I be correct in thinking that re-training the `link_only` model for every iteration is necessary to account for variability in the left table (i.e. entirely novel records, matching records, or some mixture of both)?
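If per-iteration re-training is the way to go, I'd expect each run to repeat something like the following (again only a sketch, with placeholder deterministic rules, blocking columns, and an assumed recall figure):

```python
# Prior (lambda): estimated from deterministic rules we already trust,
# with an assumed recall for those rules
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname", "dob")],
    recall=0.7,
)

# u probabilities from random sampling
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)

# m probabilities via expectation maximisation, blocking on a different column per pass
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("surname"))
```

My worry is that when the left table is mostly novel records (or mostly matches), these estimates could swing noticeably from run to run.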
Is there any downside to whittling down the left table? For example, anti-joining the left table against the right table on the deterministic keys (i.e. a left outer join keeping only the unmatched rows), then running `link_only` with that smaller input table, which could still contain dupes that haven't been linked to the golden record. Zooming out a bit, my question seems to be, "What happens if an entirely novel input set contains no matches to the right table?"
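In case it helps to show what I mean by whittling, something like this (DuckDB only because that's the backend we'd use with Splink; `natural_key` is a stand-in for whatever deterministic keys we trust):

```python
import duckdb

# input_batch and golden_records are pandas DataFrames in scope;
# DuckDB can query them directly by variable name
whittled_left = duckdb.sql("""
    SELECT l.*
    FROM input_batch AS l
    WHERE NOT EXISTS (
        SELECT 1
        FROM golden_records AS g
        WHERE g.natural_key = l.natural_key  -- placeholder deterministic key
    )
""").df()
```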
My suspicion is that a `link_only` model would be the wrong approach for this scenario, and that I'd be better off using `find_matches_to_new_records` as in #1502 (Link types when comparing new records to a pre-existing golden record). In the scenario where I'd like to use `find_matches_to_new_records`, would this look something like the following? (I've added a rough sketch after the list.)
1. Dedupe the base table and save off the model.
2. Subsequent runs would then re-use this model and the right table.
   - Is it sensible to use parameters generated from `dedupe_only` on the base table for a `find_matches_to_new_records` prediction between the left and right tables?
   - I'm a bit hesitant here, as #1814 (Choosing a dataset for training a link_only model) makes me think that the m and lambda parameters could lead my predictions astray. I also don't have a nice case where I can say definitively that a record should exist in the golden record table, though I think it's reasonable that the m parameters could be re-used given similar data quality.
3. Feed in the input table and determine match probabilities.
4. Re-generate the model from the base table on a monthly/yearly basis to account for changes in the incoming data.
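To make steps 1-3 concrete, I'm picturing something like the sketch below. It assumes the Splink 4 API, placeholder column names and file paths, and that a saved model JSON can simply be passed back into `Linker` as the settings:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# --- Step 1: dedupe the base table and save off the model ---
dedupe_settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("postcode"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("surname", "dob"),
        block_on("postcode"),
    ],
)

linker = Linker(base_table, dedupe_settings, db_api=DuckDBAPI())
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname", "dob")], recall=0.7
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
linker.misc.save_model_to_json("golden_record_model.json", overwrite=True)

# --- Steps 2 and 3: each scheduled run re-uses the saved model and the golden record table ---
linker = Linker(golden_records, "golden_record_model.json", db_api=DuckDBAPI())
matches = linker.inference.find_matches_to_new_records(input_batch)
df_matches = matches.as_pandas_dataframe()
```

Step 4 would then just be re-running the first half of this on a schedule.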
Any insights here would be greatly appreciated. Thank you!