(WIP) Feature/postgres similarity functions #2224
Conversation
Co-authored-by: Tom Hepworth <[email protected]>
Co-authored-by: Zoe Slade <[email protected]>
…threshold Minor bug in filtering predict table
…ettings_validation_documentation Update documentation on settings validation in response to code changes
Remove reference to github action that will not come to be
Co-authored-by: Zoe Slade <[email protected]>
Fixing spurious error messages with Databricks enable_splink
Fix Splink 4 blog post link
Make spellcheck work cross-platform
…rt comparisons
@vfrank66 - thanks so much for this. It's all looking good except one thing. I've added tests for the new functions. The jaro and jaro winkler tests pass, but the damerau_levenshtein ones do not. I have used the same tests from duckdb here. I'm no expert, but I had a little look at the wiki description of the algorithm here. Do you think it's possible that your implementation is the version without transpositions, whereas the duckdb one is the one with transpositions, and that's the source of the test failure? Do you have any thoughts on how to proceed? For what it's worth, I'm happy to merge the jaro/jaro winkler code as it is - so if you'd prefer to do that and think about this later, let me know.
```sql
IF (i > 1 AND j > 1
    AND SUBSTRING(s1 FROM i FOR 1) = SUBSTRING(s2 FROM j - 1 FOR 1)
    AND SUBSTRING(s1 FROM i - 1 FOR 1) = SUBSTRING(s2 FROM j FOR 1)) THEN
    d[i + 1][j + 1] := LEAST(
        d[i + 1][j + 1],
        d[i - 1][j - 1] + cost -- transposition
```
transpositions
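The distinction raised above can be illustrated with a short Python sketch. The quoted recursion matches the "optimal string alignment" (restricted) variant, which only considers a swap of two adjacent characters; full Damerau-Levenshtein (which appears to be what duckdb implements) also counts edits around transposed characters, so the two can disagree. The classic pair `("CA", "ABC")` separates them: OSA gives 3, full Damerau-Levenshtein gives 2. Both functions below follow the standard textbook formulations, not the PR's SQL or duckdb's C++ - they are illustrative, not a drop-in fix.

```python
def osa(s1: str, s2: str) -> int:
    """Optimal string alignment: Levenshtein plus a single adjacent swap."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # restricted transposition: adjacent characters swapped once
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[m][n]


def damerau_levenshtein(s1: str, s2: str) -> int:
    """Full Damerau-Levenshtein: transposed characters may be edited in between."""
    m, n = len(s1), len(s2)
    maxdist = m + n
    # d is offset by one extra row/column to hold the maxdist sentinel
    d = [[0] * (n + 2) for _ in range(m + 2)]
    d[0][0] = maxdist
    for i in range(m + 1):
        d[i + 1][0] = maxdist
        d[i + 1][1] = i
    for j in range(n + 1):
        d[0][j + 1] = maxdist
        d[1][j + 1] = j
    da = {}  # last row where each character of s1 was seen
    for i in range(1, m + 1):
        db = 0  # last column in this row where characters matched
        for j in range(1, n + 1):
            k = da.get(s2[j - 1], 0)
            l = db
            if s1[i - 1] == s2[j - 1]:
                cost = 0
                db = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                          # substitution
                d[i + 1][j] + 1,                         # insertion
                d[i][j + 1] + 1,                         # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1)  # transposition
            )
        da[s1[i - 1]] = i
    return d[m + 1][n + 1]


print(osa("CA", "ABC"))                   # 3
print(damerau_levenshtein("CA", "ABC"))   # 2
```

If the PR's plpgsql implements the OSA recursion while duckdb's tests were generated against the full algorithm, cases like this would explain the failing damerau_levenshtein tests while jaro/jaro_winkler pass untouched.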
@RobinL sorry for the late reply. I need to start by saying I need some time to make some changes. In Postgres, the functions I created in SQL were abysmal in performance. It would be ideal to run off a compiled .c extension, which is available. I fully intend to switch out my logic for the pg_similarity extension, but I have not done so yet; pg_similarity has levenshtein but not damerau_levenshtein. I had terrible cost and performance on AWS RDS, so I switched to duckdb with great performance. I do have transpositions, although they are handled differently. I think the main difference is this algorithm: https://github.com/duckdb/duckdb/blob/0be3e7b43680f0bfd851f8788581aaaf4bf8cd3f/src/core_functions/scalar/string/damerau_levenshtein.cpp#L7, but I have no idea how to implement this. Since I do not have time to review and implement this change right this moment, I asked Claude 3.5 Sonnet, for better or worse. Here is the original: https://gist.github.com/vfrank66/3f80eb2ee3a2fbaae3e790085ad57075 and here is the revised: https://gist.github.com/vfrank66/7dce95e64548fa9ce213652ab5fb30ae
Thanks @vfrank66 - happy just to leave this open for a bit in case you have time. That's very useful info re: performance. I meant to say, actually - I did some investigation recently into the postgres extension for DuckDB, which I think is a very promising way forward. Copying and pasting a message from our internal Slack which may be of interest: I'm giving the duckdb postgres extension a whirl for CPR work. I was keen to understand what it actually does. It seems pretty suitable for use in Splink as the best way of executing linkage against a postgres backend. In a nutshell, when you run lengthy/complex duckdb SQL against a postgres table, it:
We can see this at work e.g. with this code:
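The original snippet wasn't preserved in this thread. A minimal sketch of the pattern under discussion (base data in Postgres, heavy comparison SQL executed by duckdb, results written back) might look like the following. The connection string, table names (`persons`, `similarity_scores`), and join key are all hypothetical placeholders; the extension install also needs network access on first run.

```python
import duckdb

con = duckdb.connect()

# Install and load the duckdb postgres extension
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Hypothetical connection string - adjust for your own database
con.execute(
    "ATTACH 'dbname=splink_db host=localhost user=me' AS pg (TYPE POSTGRES)"
)

# The similarity computation runs inside duckdb; the base rows are
# scanned from postgres via the extension
candidates = con.sql(
    """
    SELECT
        l.unique_id AS unique_id_l,
        r.unique_id AS unique_id_r,
        jaro_winkler_similarity(l.first_name, r.first_name) AS name_sim
    FROM pg.persons AS l
    JOIN pg.persons AS r
      ON l.postcode = r.postcode AND l.unique_id < r.unique_id
    """
).df()

# Persist the results back to postgres for ACID-compliant downstream use;
# duckdb can read the local dataframe by name via replacement scans
con.execute("CREATE TABLE pg.similarity_scores AS SELECT * FROM candidates")
```

This keeps Postgres as the durable store while avoiding plpgsql UDFs entirely, which is the trade-off discussed in the rest of the thread.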
Oh yay, that is great - I will make time to look into this Postgres extension next week; it may be exactly what I need. Due to Postgres function performance on millions of comparisons, I wanted to store data in Postgres, calculate in duckdb, and store back to Postgres, because of ACID compliance across multiple predicting applications. I do understand DuckDB in-memory is ACID, but that would only apply to a single process.
I should have a gist soon, in another week, of my completed work; this is just an update. This works great, and if you are okay with client-side processing with duckdb but persisting to postgres, it is much better than splink.postgres in terms of performance. I am no longer going to use the splink.postgres implementation, because duckdb is so much faster. Currently the only problem I have is that this does not work
Thanks for the update. Yeah, I also experienced trouble/unexpected results with the duckdb postgres extension. Specifically, doing a `select * from table limit 1` was, to my surprise, fetching the whole table back to duckdb and then applying the limit. Hopefully this will be improved going forwards. The pages-per-task tip is a good one; I wasn't aware of that option. Overall it's very useful to get feedback re: 'postgres native' vs 'postgres via duckdb' - I've been wondering for a while whether we should start recommending the duckdb approach but have never had time to test it on a real workload.
This worked on the predicate filter
@vfrank66 I'm just going through and cleaning up open PRs. What do you think is best to do with this? I think the best options are either to:
I just saw this too: this allows duckdb workloads to be run within postgres. It feels like, between this and the postgres extension for duckdb, the recommended approach should be to use one of these options rather than UDFs in postgres. Do you agree?
Yes, I agree. This should be dropped. I would even recommend dropping Postgres support, personally. Anyone who wishes to use it would not get all the similarity functions, and even if they were added, the performance would not work for any developer running over 10 million total comparisons, based on my short experience. Although I do understand some people have bigger database servers than they do application servers, so maybe it should be left alone.
Thanks - yeah, agreed it's a bit of a niche use case. Thanks anyway for your work on this and for letting us know how it's performed; very useful.
See #2199 for the original PR.
https://github.com/duckdb/duckdb/blob/0be3e7b43680f0bfd851f8788581aaaf4bf8cd3f/src/core_functions/scalar/string/damerau_levenshtein.cpp#L4
https://github.com/duckdb/duckdb/blob/main/test/sql/function/string/test_damerau_levenshtein.test
Also
https://iamberke.com/post/2012/04/10/Damerau-Levenshtein-distance-in-SQL
Here's the example code I've been using: