
(WIP) Feature/postgres similarity functions #2224

Closed
wants to merge 30 commits

Conversation

RobinL
Member

@RobinL RobinL commented Jul 2, 2024

See #2199 for the original PR

https://github.com/duckdb/duckdb/blob/0be3e7b43680f0bfd851f8788581aaaf4bf8cd3f/src/core_functions/scalar/string/damerau_levenshtein.cpp#L4
https://github.com/duckdb/duckdb/blob/main/test/sql/function/string/test_damerau_levenshtein.test

Also
https://iamberke.com/post/2012/04/10/Damerau-Levenshtein-distance-in-SQL

Here's the example code I've been using:

example

```
import os
from uuid import uuid4

from sqlalchemy import create_engine, text

import splink.postgres.comparison_level_library as cll
import splink.postgres.comparison_library as cl
from splink.datasets import splink_datasets
from splink.postgres.blocking_rule_library import block_on
from splink.postgres.linker import PostgresLinker



def get_pg_credentials():
    return {
        "user": os.environ.get("SPLINKTEST_PG_USER", "splinkognito"),
        "password": os.environ.get("SPLINKTEST_PG_PASSWORD", "splink123!"),
        "host": os.environ.get("SPLINKTEST_PG_HOST", "localhost"),
        "port": os.environ.get("SPLINKTEST_PG_PORT", "5432"),
        "db": os.environ.get("SPLINKTEST_PG_DB", "splink_db"),
    }


# Create the engine
creds = get_pg_credentials()
engine = create_engine(
    f"postgresql+psycopg2://{creds['user']}:{creds['password']}"
    f"@{creds['host']}:{creds['port']}/{creds['db']}"
)


cl_settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                cll.null_level("first_name"),
                cll.exact_match_level("first_name"),
                cll.jaro_winkler_level("first_name", 0.8),
                cll.else_level(),
            ],
            "comparison_description": "Exact match vs. anything else",
        },
        {
            "output_column_name": "surname",
            "comparison_levels": [
                cll.null_level("surname"),
                cll.exact_match_level("surname"),
                cll.damerau_levenshtein_level("surname", 2),
                cll.else_level(),
            ],
            "comparison_description": "Exact match vs. anything else",
        },
        cl.damerau_levenshtein_at_thresholds("dob", [2, 1]),
        cl.jaro_at_thresholds("email", [0.9]),
    ],
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}


df = splink_datasets.fake_1000
linker = PostgresLinker(
    df,
    cl_settings,
    engine=engine,
)
linker.estimate_u_using_random_sampling(1e4)
```

@RobinL RobinL changed the title Feature/postgres similarity functions (WIP) Feature/postgres similarity functions Jul 2, 2024
@RobinL
Member Author

RobinL commented Jul 2, 2024

@vfrank66 - thanks so much for this. It's all looking good except one thing. I've added tests for the new functions.

The jaro and jaro winkler tests pass, but the damerau_levenshtein ones do not.

I have used the same tests from duckdb here

I'm no expert but I had a little look at the wiki description of the algo here.

Do you think it's possible that your implementation is the version without transpositions, whereas the duckdb one is the one with transpositions, and that's the source of the test failure?

Do you have any thoughts on how to proceed?

For what it's worth, I'm happy to merge the jaro/jaro winkler code as it is - so if you'd prefer to do that, and think about this later, let me know

```
IF (i > 1 AND j > 1 AND SUBSTRING(s1 FROM i FOR 1) = SUBSTRING(s2 FROM j - 1 FOR 1) AND SUBSTRING(s1 FROM i - 1 FOR 1) = SUBSTRING(s2 FROM j FOR 1)) THEN
    d[i + 1][j + 1] := LEAST(
        d[i + 1][j + 1],
        d[i - 1][j - 1] + cost -- transposition
```

transpositions
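For context, an editorial sketch (not code from the PR): the recurrence quoted above looks like the restricted "optimal string alignment" variant, where a transposition of adjacent characters can't be combined with further edits on the same substring, whereas the DuckDB implementation linked above appears to follow the unrestricted Damerau-Levenshtein definition, which can report smaller distances. A minimal Python illustration of the restricted variant, with a string pair where the two definitions disagree:

```python
def osa_distance(s1: str, s2: str) -> int:
    # Restricted (optimal string alignment) Damerau-Levenshtein:
    # adjacent transpositions are allowed, but no substring is edited twice.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]


# "ca" -> "abc": the restricted variant gives 3, while unrestricted
# Damerau-Levenshtein gives 2 (transpose "ca" -> "ac", then insert "b"),
# so tests written against one definition can fail against the other.
print(osa_distance("ca", "abc"))  # 3
```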

@vfrank66

@RobinL sorry for the late reply. I need to start by saying I need some time to make some changes. In Postgres, the functions I created in SQL were abysmal in performance. It would be ideal to run off a compiled C extension, which is available. I fully intend to switch out my logic for the pg_similarity extension, but I have not done so yet. The pg_similarity extension has levenshtein but not damerau_levenshtein.

I had terrible cost and performance on AWS RDS, so I switched to duckdb with great performance.

I do have transpositions, although they are handled differently. I think the main difference is this algorithm: https://github.com/duckdb/duckdb/blob/0be3e7b43680f0bfd851f8788581aaaf4bf8cd3f/src/core_functions/scalar/string/damerau_levenshtein.cpp#L7, but I have no idea how to implement it.

Since I do not have time to review and implement this change right this moment, I asked claude-sonnet 3.5, for better or worse. Here is the original: https://gist.github.com/vfrank66/3f80eb2ee3a2fbaae3e790085ad57075 and here is the revised version: https://gist.github.com/vfrank66/7dce95e64548fa9ce213652ab5fb30ae

@RobinL
Member Author

RobinL commented Jul 10, 2024

Thanks @vfrank66 - happy just to leave this open for a bit in case you have time.

That's very useful info re: performance. I meant to say actually - I did some investigation recently into the postgres extension for DuckDB, which I think is a very promising way forward. Copying and pasting a message from our internal Slack which may be of interest:


I’m giving the duckdb postgres extension a whirl for CPR work. I was keen to understand what it actually does.

It seems pretty suitable for use in Splink as the best way of executing linkage against a postgres backend.

In a nutshell, when you run lengthy/complex duckdb SQL against a postgres table, it:

  1. Pulls the minimum amount of data from postgres required to execute the query, correctly preserving typing.
    1.b They’ve written optimised code to parallelise the read, to transfer data as efficiently as possible
  2. Once data is received by duckdb (into memory), it effectively becomes a duckdb table, and all further processing is done using DuckDB

We can see this at work e.g. with this code:

```
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

con.execute(
    f"ATTACH '{postgres_connection_string_preprod}' AS postgres_db (TYPE POSTGRES)"
)
con.execute("SET pg_debug_show_queries=true")

sql = """
select jaro_winkler_similarity(l.first_name, r.first_name) as first_name_sim
from postgres_db.personrecordservice.person l
cross join postgres_db.personrecordservice.person r
limit 10
"""
con.execute(sql).df()
```

where we see this is the only work being done by postgres:
```
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ

COPY (SELECT "first_name" FROM "personrecordservice"."person" WHERE ctid BETWEEN '(0,0)'::tid AND '(4294967295,0)'::tid) TO STDOUT (FORMAT binary);

COPY (SELECT "first_name" FROM "personrecordservice"."person" WHERE ctid BETWEEN '(0,0)'::tid AND '(4294967295,0)'::tid) TO STDOUT (FORMAT binary);

COMMIT
```

Though it does potentially look like there’s duplication there in the double read.

I think you might be able to use it with the DuckDB linker as is by doing a `create table df as (select * from postgres` and then passing the string "df" to the DuckDB linker as the data argument, with the connection set to the same duckdb connection that's been set up to connect through to postgres
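
For illustration, a rough (untested) sketch of that pattern; the connection string is a placeholder, `personrecordservice.person` is just the example table from the snippet above, and the settings would need to use the duckdb comparison libraries rather than the postgres ones:

```python
import duckdb

import splink.duckdb.comparison_library as cl_duckdb
from splink.duckdb.linker import DuckDBLinker

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Placeholder connection string - substitute real credentials
con.execute(
    "ATTACH 'host=localhost dbname=splink_db user=splinkognito' AS postgres_db (TYPE POSTGRES)"
)

# Materialise the postgres table as an ordinary DuckDB table up front
con.execute("CREATE TABLE df AS SELECT * FROM postgres_db.personrecordservice.person")

# Minimal duckdb-dialect settings, just to show the shape
duckdb_settings = {
    "link_type": "dedupe_only",
    "comparisons": [cl_duckdb.exact_match("first_name")],
}

# Pass the DuckDB table name (a string) and the same connection to the linker
linker = DuckDBLinker("df", duckdb_settings, connection=con)
```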

@vfrank66

Oh yay, that is great. I will make time to look into this Postgres extension next week; that may be exactly what I need.

Due to Postgres function performance on millions of comparisons, I wanted to store data in Postgres, calculate in duckdb, and store back to Postgres, for ACID compliance across multiple predicting applications. I do understand DuckDB in-memory is ACID, but that would only apply to a single process.
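
As a rough sketch of that round trip (untested; the connection string and table names are illustrative, and it assumes the postgres extension's support for writing to the attached database):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute(
    "ATTACH 'host=localhost dbname=splink_db user=splinkognito' AS postgres_db (TYPE POSTGRES)"
)

# Do the expensive comparison work in DuckDB (illustrative query)
con.execute("""
    CREATE TABLE first_name_similarities AS
    SELECT
        l.unique_id AS unique_id_l,
        r.unique_id AS unique_id_r,
        jaro_winkler_similarity(l.first_name, r.first_name) AS first_name_sim
    FROM postgres_db.personrecordservice.person AS l
    CROSS JOIN postgres_db.personrecordservice.person AS r
""")

# Persist the results back into Postgres through the attached catalog
con.execute(
    "CREATE TABLE postgres_db.personrecordservice.first_name_similarities AS "
    "SELECT * FROM first_name_similarities"
)
```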

@vfrank66

I should have a gist of my completed work soon, in another week; this is just an update.

This works great, and if you are okay with client-side processing with duckdb but persisting to postgres, this is much better than splink.postgres in terms of performance. I am no longer going to use the splink.postgres implementation, because duckdb is so much faster.

Currently the only problem I have is that `con.execute("SET pg_experimental_filter_pushdown=true")` does not work; instead it is copying all the data. Which is fine for me because I can do the predicate pushdown with psycopg beforehand. And one must set `con.execute("SET pg_pages_per_task=250000;")`, otherwise the default batch size is 1_000.

@RobinL
Member Author

RobinL commented Jul 18, 2024

Thanks for the update. Yeah, I also experienced trouble/unexpected results with the duckdb postgres extension. Specifically, doing a `select * from table limit 1` was, to my surprise, fetching the whole table back to duckdb and then doing the limit. Hopefully this will be improved going forwards. The pages-per-task tip is a good one, I wasn't aware of that option.

Overall it's very useful to get feedback re 'postgres native' vs 'postgres via duckdb' - I've been wondering for a while whether we should start recommending the duckdb approach but have never had time to test it on a real workload.

@vfrank66

This worked for the predicate filter:

con.execute("SET pg_use_ctid_scan=false")     
con.execute("SET pg_experimental_filter_pushdown=true")
 
con.execute(
        """
SELECT sanitized_gender, tf_sanitized_gender
FROM db.public."__splink__df_tf_sanitized_gender"
WHERE sanitized_gender = 'MALE'
"""
    ).df()
 
Produces a correct predicate filter:
 

```
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ

COPY (SELECT "sanitized_gender", "tf_sanitized_gender" FROM "public"."__splink__df_tf_sanitized_gender" WHERE ctid BETWEEN '(0,0)'::tid AND '(4294967295,0)'::tid AND ("sanitized_gender" = 'MALE' AND "sanitized_gender" IS NOT NULL)) TO STDOUT (FORMAT binary);
```

@RobinL
Member Author

RobinL commented Aug 19, 2024

@vfrank66 I'm just going through and cleaning up open PRs. What do you think is best to do with this? I think the best options are either to:
(1) Merge the jaro and jaro winkler UDFs
(2) Close this and recommend to users that they'd be better off using duckdb with the postgres extension

@RobinL
Member Author

RobinL commented Aug 21, 2024

I just saw this too:
https://www.theregister.com/2024/08/20/postgresql_duckdb_extension/
which references the pg_duckdb extension here.

This allows duckdb workloads to be run within postgres. Feels like between this and the postgres extension for duckdb the recommended approach should be to use one of these options rather than UDFs in postgres. Do you agree?

@vfrank66

Yes I agree. This should be dropped.

I would even recommend dropping Postgres support, personally. Anyone who wishes to use it would not get all the similarity functions, and even if they were added, the performance would be an issue for any developer running over 10 million total comparisons, based on my short experience.

Although I do understand some people have bigger database servers than they do application servers, so maybe it should be left alone.

@RobinL
Member Author

RobinL commented Aug 21, 2024

Thanks - yeah, agree it's a bit of a niche use case. Thanks anyway for your work on this and letting us know about how it's performed, very useful

@RobinL RobinL closed this Aug 21, 2024
@RobinL RobinL mentioned this pull request Aug 22, 2024