Hi Team and Community,
I've been exploring the potential for Splink to replace an existing process that assigns golden records. At present, this process consists of deterministic rules alongside a clerical team that resolves conflicts when the rules match an input record to multiple golden records.
The base table is 100 million rows, of which 22 million have been processed and assigned to the golden record table, which has 18 million rows.
Moving forward, I'd like to perform a scheduled incremental `link_only` comparison between an input set (left) and the golden record table (right). I've been referencing previous discussions such as #1814, #1502, #2464, and #2354. The left table can vary from 30k to 2 million rows, and I wouldn't be concerned with deduplication within it. The existing process identifies roughly 10k new entries for the right table on a daily basis, with the rest identified as matches to existing golden records. The data quality of the two tables is very similar.
I'm thinking of implementing something similar to what is mentioned in #2464, where the right table is a 'spine' that gets extended with the input set based on the highest probability between a row in the left and right tables. A clerical review process would handle cases where multiple candidates have high `match_probability` values within a certain threshold of each other, and records with low match probabilities would become new entries in the right table.
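For concreteness, the setup I have in mind looks roughly like the sketch below. This assumes the Splink 4 API, and the column names, comparisons, and blocking rules are placeholders rather than my real config:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# Placeholder comparisons and blocking rules -- the real model would use our actual columns
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("postcode"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("surname", "dob"),
        block_on("postcode"),
    ],
)

# input_batch = today's left table, golden_records = the spine (right table)
linker = Linker([input_batch, golden_records], settings, db_api=DuckDBAPI())

# ... model training (or loading a saved model) would happen here ...

# Score candidate pairs; records with no high-probability match become new golden
# records, and ambiguous cases (several similarly high probabilities) go to clerical review
predictions = linker.inference.predict(threshold_match_probability=0.5)
df_scores = predictions.as_pandas_dataframe()
```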
Would I be correct in thinking that re-training the `link_only` model for every iteration is necessary to account for variability in the left table (i.e. entirely novel records, matching records, or some mixture of both)?
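If per-iteration re-training is the way to go, I'd expect each run to repeat something like the following (again only a sketch, with placeholder deterministic rules, blocking columns, and an assumed recall figure):

```python
# Prior (lambda): estimated from deterministic rules we already trust,
# with an assumed recall for those rules
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname", "dob")],
    recall=0.7,
)

# u probabilities from random sampling
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)

# m probabilities via expectation maximisation, blocking on a different column per pass
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("surname"))
```

My worry is that when the left table is mostly novel records (or mostly matches), these estimates could swing noticeably from run to run.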
Is there any downside to whittling down the left table? For example, anti-joining the left table against the right table on the deterministic keys (i.e. a left outer join keeping only the unmatched rows), then running `link_only` with that smaller input table, which could still contain dupes that haven't been linked to the golden record. Zooming out a bit, my question seems to be, "What happens if an entirely novel input set contains no matches to the right table?"
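In case it helps to show what I mean by whittling, something like this (DuckDB only because that's the backend we'd use with Splink; `natural_key` is a stand-in for whatever deterministic keys we trust):

```python
import duckdb

# input_batch and golden_records are pandas DataFrames in scope;
# DuckDB can query them directly by variable name
whittled_left = duckdb.sql("""
    SELECT l.*
    FROM input_batch AS l
    WHERE NOT EXISTS (
        SELECT 1
        FROM golden_records AS g
        WHERE g.natural_key = l.natural_key  -- placeholder deterministic key
    )
""").df()
```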
My suspicion is that a `link_only` model would be the wrong approach for this scenario, and that I'd be better off using `find_matches_to_new_records` as in #1502 (Link types when comparing new records to a pre-existing golden record). In the scenario where I'd like to use `find_matches_to_new_records`, would this look something like the following? (I've added a rough sketch after the list.)
1. Dedupe the base table and save off the model.
2. Subsequent runs would then re-use this model and the right table.
   - Is it sensible to use parameters generated from `dedupe_only` on the base table for a `find_matches_to_new_records` prediction between the left and right tables?
   - I'm a bit hesitant here, as #1814 (Choosing a dataset for training a link_only model) makes me think that the m and lambda parameters could lead my predictions astray. I also don't have a nice case where I can say definitively that a record should exist in the golden record table, though I think it's reasonable that the m parameters could be re-used given similar data quality.
3. Feed in the input table and determine match probabilities.
4. Re-generate the model from the base table on a monthly/yearly basis to account for changes in the incoming data.
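To make steps 1-3 concrete, I'm picturing something like the sketch below. It assumes the Splink 4 API, placeholder column names and file paths, and that a saved model JSON can simply be passed back into `Linker` as the settings:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# --- Step 1: dedupe the base table and save off the model ---
dedupe_settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("postcode"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("surname", "dob"),
        block_on("postcode"),
    ],
)

linker = Linker(base_table, dedupe_settings, db_api=DuckDBAPI())
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname", "dob")], recall=0.7
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
linker.misc.save_model_to_json("golden_record_model.json", overwrite=True)

# --- Steps 2 and 3: each scheduled run re-uses the saved model and the golden record table ---
linker = Linker(golden_records, "golden_record_model.json", db_api=DuckDBAPI())
matches = linker.inference.find_matches_to_new_records(input_batch)
df_matches = matches.as_pandas_dataframe()
```

Step 4 would then just be re-running the first half of this on a schedule.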
Any insights here would be greatly appreciated. Thank you!