
(WIP) Feature/postgres similarity functions #2224

Closed
wants to merge 30 commits

Conversation

RobinL
Member

@RobinL RobinL commented Jul 2, 2024

See #2199 for the original PR

https://github.com/duckdb/duckdb/blob/0be3e7b43680f0bfd851f8788581aaaf4bf8cd3f/src/core_functions/scalar/string/damerau_levenshtein.cpp#L4
https://github.com/duckdb/duckdb/blob/main/test/sql/function/string/test_damerau_levenshtein.test

Also
https://iamberke.com/post/2012/04/10/Damerau-Levenshtein-distance-in-SQL

Here's the example code I've been using:

example

```
import os
from uuid import uuid4

from sqlalchemy import create_engine, text

import splink.postgres.comparison_level_library as cll
import splink.postgres.comparison_library as cl
from splink.datasets import splink_datasets
from splink.postgres.blocking_rule_library import block_on
from splink.postgres.linker import PostgresLinker



def get_pg_credentials():
    return {
        "user": os.environ.get("SPLINKTEST_PG_USER", "splinkognito"),
        "password": os.environ.get("SPLINKTEST_PG_PASSWORD", "splink123!"),
        "host": os.environ.get("SPLINKTEST_PG_HOST", "localhost"),
        "port": os.environ.get("SPLINKTEST_PG_PORT", "5432"),
        "db": os.environ.get("SPLINKTEST_PG_DB", "splink_db"),
    }


# Create the engine
creds = get_pg_credentials()
engine = create_engine(
    f"postgresql+psycopg2://{creds['user']}:{creds['password']}"
    f"@{creds['host']}:{creds['port']}/{creds['db']}"
)


cl_settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                cll.null_level("first_name"),
                cll.exact_match_level("first_name"),
                cll.jaro_winkler_level("first_name", 0.8),
                cll.else_level(),
            ],
            "comparison_description": "Exact match vs. anything else",
        },
        {
            "output_column_name": "surname",
            "comparison_levels": [
                cll.null_level("surname"),
                cll.exact_match_level("surname"),
                cll.damerau_levenshtein_level("surname", 2),
                cll.else_level(),
            ],
            "comparison_description": "Exact match vs. anything else",
        },
        cl.damerau_levenshtein_at_thresholds("dob", [2, 1]),
        cl.jaro_at_thresholds("email", [0.9]),
    ],
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}


df = splink_datasets.fake_1000
linker = PostgresLinker(
    df,
    cl_settings,
    engine=engine,
)
linker.estimate_u_using_random_sampling(1e4)
```

@RobinL RobinL changed the title Feature/postgres similarity functions (WIP) Feature/postgres similarity functions Jul 2, 2024
@RobinL
Member Author

RobinL commented Jul 2, 2024

@vfrank66 - thanks so much for this. It's all looking good except one thing. I've added tests for the new functions.

The jaro and jaro winkler tests pass, but the damerau_levenshtein ones do not.

I have used the same tests from duckdb here

I'm no expert but I had a little look at the wiki description of the algo here.

Do you think it's possible that your implementation is the version without transpositions, whereas the duckdb one is the one with transpositions, and that's the source of the test failure?

Do you have any thoughts on how to proceed?

For what it's worth, I'm happy to merge the jaro/jaro winkler code as it is - so if you'd prefer to do that, and think about this later, let me know

```
IF (i > 1 AND j > 1 AND SUBSTRING(s1 FROM i FOR 1) = SUBSTRING(s2 FROM j - 1 FOR 1) AND SUBSTRING(s1 FROM i - 1 FOR 1) = SUBSTRING(s2 FROM j FOR 1)) THEN
    d[i + 1][j + 1] := LEAST(
        d[i + 1][j + 1],
        d[i - 1][j - 1] + cost -- transposition
```

transpositions
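For context, an editorial sketch (not code from the PR): the recurrence quoted above looks like the restricted "optimal string alignment" variant, where a transposition of adjacent characters can't be combined with further edits on the same substring, whereas the DuckDB implementation linked above appears to follow the unrestricted Damerau-Levenshtein definition, which can report smaller distances. A minimal Python illustration of the restricted variant, with a string pair where the two definitions disagree:

```python
def osa_distance(s1: str, s2: str) -> int:
    # Restricted (optimal string alignment) Damerau-Levenshtein:
    # adjacent transpositions are allowed, but no substring is edited twice.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]


# "ca" -> "abc": the restricted variant gives 3, while unrestricted
# Damerau-Levenshtein gives 2 (transpose "ca" -> "ac", then insert "b"),
# so tests written against one definition can fail against the other.
print(osa_distance("ca", "abc"))  # 3
```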

@vfrank66

@RobinL sorry for the late reply. I need to start by saying I need some time to make some changes. In Postgres, the functions I created in SQL were abysmal in performance. It would be ideal to run off a compiled C extension, which is available. I fully intend to switch out my logic for the pg_similarity extension, but I have not done so yet. The pg_similarity extension has levenshtein but not damerau_levenshtein.

I had terrible cost and performance on AWS RDS, so I switched to duckdb with great performance.

I do have transpositions, although they are handled differently. I think the main difference is this algorithm: https://github.com/duckdb/duckdb/blob/0be3e7b43680f0bfd851f8788581aaaf4bf8cd3f/src/core_functions/scalar/string/damerau_levenshtein.cpp#L7, but I have no idea how to implement it.

Since I do not have time to review and implement this change right this moment, I asked claude-sonnet 3.5, for better or worse. Here is the original: https://gist.github.com/vfrank66/3f80eb2ee3a2fbaae3e790085ad57075 and here is the revised version: https://gist.github.com/vfrank66/7dce95e64548fa9ce213652ab5fb30ae

@RobinL
Member Author

RobinL commented Jul 10, 2024

Thanks @vfrank66 - happy just to leave this open for a bit in case you have time.

That's very useful info re: performance. I meant to say actually - I did some investigation recently into the postgres extension for DuckDB, which I think is a very promising way forward. Copying and pasting a message from our internal Slack which may be of interest:


I’m giving the duckdb postgres extension a whirl for CPR work. I was keen to understand what it actually does.

It seems pretty suitable for use in Splink as the best way of executing linkage against a postgres backend.

In a nutshell, when you run lengthy/complex duckdb SQL against a postgres table, it:

  1. Pulls the minimum amount of data from postgres required to execute the query, correctly preserving typing.
    1.b They’ve written optimised code to parallelise the read, to transfer data as efficiently as possible
  2. Once data is received by duckdb (into memory), it effectively becomes a duckdb table, and all further processing is done using DuckDB

We can see this at work e.g. with this code:

```
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

con.execute(
    f"ATTACH '{postgres_connection_string_preprod}' AS postgres_db (TYPE POSTGRES)"
)
con.execute("SET pg_debug_show_queries=true")

sql = """
select jaro_winkler_similarity(l.first_name, r.first_name) as first_name_sim
from postgres_db.personrecordservice.person l
cross join postgres_db.personrecordservice.person r
limit 10
"""
con.execute(sql).df()
```

where we see this is the only work being done by postgres:
```
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ

COPY (SELECT "first_name" FROM "personrecordservice"."person" WHERE ctid BETWEEN '(0,0)'::tid AND '(4294967295,0)'::tid) TO STDOUT (FORMAT binary);

COPY (SELECT "first_name" FROM "personrecordservice"."person" WHERE ctid BETWEEN '(0,0)'::tid AND '(4294967295,0)'::tid) TO STDOUT (FORMAT binary);

COMMIT
```

Though it does potentially look like there’s duplication there in the double read.

I think you might be able to use it with the DuckDB linker as is by doing a `create table df as (select * from postgres` and then passing the string "df" to the DuckDB linker as the data argument, with the connection set to the same duckdb connection that's been set up to connect through to postgres
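
For illustration, a rough (untested) sketch of that pattern; the connection string is a placeholder, `personrecordservice.person` is just the example table from the snippet above, and the settings would need to use the duckdb comparison libraries rather than the postgres ones:

```python
import duckdb

import splink.duckdb.comparison_library as cl_duckdb
from splink.duckdb.linker import DuckDBLinker

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Placeholder connection string - substitute real credentials
con.execute(
    "ATTACH 'host=localhost dbname=splink_db user=splinkognito' AS postgres_db (TYPE POSTGRES)"
)

# Materialise the postgres table as an ordinary DuckDB table up front
con.execute("CREATE TABLE df AS SELECT * FROM postgres_db.personrecordservice.person")

# Minimal duckdb-dialect settings, just to show the shape
duckdb_settings = {
    "link_type": "dedupe_only",
    "comparisons": [cl_duckdb.exact_match("first_name")],
}

# Pass the DuckDB table name (a string) and the same connection to the linker
linker = DuckDBLinker("df", duckdb_settings, connection=con)
```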

@vfrank66

Oh yay, that is great. I will make time to look into this Postgres extension next week; that may be exactly what I need.

Due to Postgres function performance on millions of comparisons, I wanted to store data in Postgres, calculate in duckdb, and store back to Postgres, for ACID compliance across multiple predicting applications. I do understand DuckDB in-memory is ACID, but that would only apply to a single process.
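
As a rough sketch of that round trip (untested; the connection string and table names are illustrative, and it assumes the postgres extension's support for writing to the attached database):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute(
    "ATTACH 'host=localhost dbname=splink_db user=splinkognito' AS postgres_db (TYPE POSTGRES)"
)

# Do the expensive comparison work in DuckDB (illustrative query)
con.execute("""
    CREATE TABLE first_name_similarities AS
    SELECT
        l.unique_id AS unique_id_l,
        r.unique_id AS unique_id_r,
        jaro_winkler_similarity(l.first_name, r.first_name) AS first_name_sim
    FROM postgres_db.personrecordservice.person AS l
    CROSS JOIN postgres_db.personrecordservice.person AS r
""")

# Persist the results back into Postgres through the attached catalog
con.execute(
    "CREATE TABLE postgres_db.personrecordservice.first_name_similarities AS "
    "SELECT * FROM first_name_similarities"
)
```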

@vfrank66

I should have a gist of my completed work soon, in another week; this is just an update.

This works great, and if you are okay with client-side processing with duckdb but persisting to postgres, this is much better than splink.postgres in terms of performance. I am no longer going to use the splink.postgres implementation, because duckdb is so much faster.

Currently the only problem I have is that `con.execute("SET pg_experimental_filter_pushdown=true")` does not work; instead it is copying all the data. Which is fine for me because I can do the predicate pushdown with psycopg beforehand. And one must set `con.execute("SET pg_pages_per_task=250000;")`, otherwise the default batch size is 1_000.

@RobinL
Member Author

RobinL commented Jul 18, 2024

Thanks for the update. Yeah, I also experienced trouble/unexpected results with the duckdb postgres extension. Specifically, doing a `select * from table limit 1` was, to my surprise, fetching the whole table back to duckdb and then doing the limit. Hopefully this will be improved going forwards. The pages-per-task tip is a good one, I wasn't aware of that option.

Overall it's very useful to get feedback re 'postgres native' vs 'postgres via duckdb' - I've been wondering for a while whether we should start recommending the duckdb approach but have never had time to test it on a real workload.

@vfrank66

This worked for the predicate filter:

con.execute("SET pg_use_ctid_scan=false")     
con.execute("SET pg_experimental_filter_pushdown=true")
 
con.execute(
        """
SELECT sanitized_gender, tf_sanitized_gender
FROM db.public."__splink__df_tf_sanitized_gender"
WHERE sanitized_gender = 'MALE'
"""
    ).df()
 
Produces a correct predicate filter:
 

```
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ

COPY (SELECT "sanitized_gender", "tf_sanitized_gender" FROM "public"."__splink__df_tf_sanitized_gender" WHERE ctid BETWEEN '(0,0)'::tid AND '(4294967295,0)'::tid AND ("sanitized_gender" = 'MALE' AND "sanitized_gender" IS NOT NULL)) TO STDOUT (FORMAT binary);
```

@RobinL
Member Author

RobinL commented Aug 19, 2024

@vfrank66 I'm just going through and cleaning up open PRs. What do you think is best to do with this? I think the best options are either to:
(1) Merge the jaro and jaro winkler UDFs
(2) Close this and recommend to users that they'd be better off using duckdb with the postgres extension

@RobinL
Member Author

RobinL commented Aug 21, 2024

I just saw this too:
https://www.theregister.com/2024/08/20/postgresql_duckdb_extension/
which references the pg_duckdb extension here.

This allows duckdb workloads to be run within postgres. Feels like between this and the postgres extension for duckdb the recommended approach should be to use one of these options rather than UDFs in postgres. Do you agree?

@vfrank66

Yes I agree. This should be dropped.

I would even recommend dropping Postgres support, personally. Anyone who wishes to use it would not get all the similarity functions, and even if they were added, the performance would be an issue for any developer running over 10 million total comparisons, based on my short experience.

Although I do understand some people have bigger database servers than they do application servers, so maybe it should be left alone.

@RobinL
Member Author

RobinL commented Aug 21, 2024

Thanks - yeah, agree it's a bit of a niche use case. Thanks anyway for your work on this and letting us know about how it's performed, very useful

@RobinL RobinL closed this Aug 21, 2024
@RobinL RobinL mentioned this pull request Aug 22, 2024