FuzzyMatch

The inspiration for this work is that SSIS jobs that use fuzzy matching cannot be setup to run through SQL Server jobs unless you are paying for the enterprise edition of SQL Server. Thus, I set out to create a quick fuzzy match/lookup structure within SQL Server.

The BK-Tree data structure, created in 1973 by W.A.Burkhard and R.M.Keller(hence the name BK), is useful for finding similar words to a dictionary of words, without having to compare with every word in the dictionary.

This project is a t-sql implementation of that data structure, which includes an insert algorithm that creates the BK-tree structure in T-SQL using adjacency list for the table structure storage and a search algorithm that parses the tree to find similar words.

For more information regarding the BK-tree data structure, see: https://dl.acm.org/doi/10.1145/362003.362025

Use a dictionary of words or other strings that you would like to fuzzy match.

For this example I used https://github.com/zeisler/scrabble/blob/master/db/dictionary.csv

Order of SQL for Setup:

F_Levenshtein.sql - altered version of https://blog.softwx.net/2014/12/optimizing-levenshtein-algorithm-in-tsql.html
BK_Tree.sql
BK_Tree_Triggers.sql
P_BK_Tree_Search.sql

To initialize the tree, you can run a normal insert call(the insert trigger will take care of the P_BK_Tree_Insert calls) such as:

insert into BK_Tree (word) select word from Dictionary

BK-Tree insert computes in O(N) time - took 3 mins 36 seconds for the 172k dictionary. Interesting bit of information: This insert was ~2 hours prior to adding a breadth-first index to BK_Tree table BEFORE the insert is run. This is unusual because most of the times, adding a index to a table before doing a bulk insert will cause the insert to take longer. The key to this one running faster is that part of the insert is to calculate the parent node id for each row, 1 at a time. Calculating this id makes use of the index, hence the speed up when having the index already on the table.

BK-Tree search computes in O(log N) time Some example outputs of P_BK_Tree_Search on the Scrabble Words Dictionary of words and the runtimes:

One last note: To remove a word from the tree, run: UPDATE BK_Tree SET active = 0 WHERE word = 'word' Trying to delete a row entirely from the table which is akin to deleting a node from the tree will cause the structure of the tree to be ruined as all child nodes of that node will no longer be reachable by the search function.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
BK_Tree.sql		BK_Tree.sql
BK_Tree_Triggers.sql		BK_Tree_Triggers.sql
F_Levenshtein.sql		F_Levenshtein.sql
LICENSE		LICENSE
P_BK_Tree_Search.sql		P_BK_Tree_Search.sql
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FuzzyMatch

About

Releases

Packages

Languages

License

sfelman/FuzzyMatch

Folders and files

Latest commit

History

Repository files navigation

FuzzyMatch

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages