Q: Database size limit for duplicate removal? #65
Hi, I do not think it is a problem with the specs of your machine, but rather that you are trying to conduct the merge without any blocking. From your code, I read that you only have three fields to match, and the first two refer to names. My first recommendation would be to use blocking.

Quick question: have the names been separated into components? For example, first name, middle name, and last name?

Keep us posted!

Ted
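A minimal sketch of the kind of blocking recommended here, using fastLink's blockData() on the file against itself; the data frame df and the column birth_year are illustrative placeholders, not details from the original post:

```r
library(fastLink)

## Deduplication setting: block the file against itself on an exact field.
## "df" and "birth_year" are placeholder names, not taken from the post.
blocks <- blockData(df, df, varnames = "birth_year")

## blockData() returns one element per block, each holding the row indices
## of dfA and dfB that fall into that block.
names(blocks)                               # "block.1", "block.2", ...
df_block1 <- df[blocks$block.1$dfA.inds, ]  # rows belonging to the first block
```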
Thanks for the quick reply, @tedenamorado!

Gui
Hi Gui,

Yes, I think parsing may help. Imagine if you have in one file the name:

When working with dates, I usually transform them into the number of days to a specific date in the future and then divide that by 365.25 to get a transformation of the date onto a yearly unit scale.

Such a transformation allows you to incorporate all the components of a date into one number, which you can then compare as a numeric variable.

All my best,

Ted
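A small sketch of the date transformation Ted describes, assuming a Date column named dob; the column name and the reference date are illustrative:

```r
## Turn a date of birth into one number on a yearly scale: the number of days
## to a fixed date in the future, divided by 365.25.
ref_date <- as.Date("2030-01-01")
df$dob_years <- as.numeric(ref_date - as.Date(df$dob)) / 365.25
```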
Hi Ted,

I'm giving your strategy a try and had a follow-up question. Blocking by year of birth results in a large number of blocks. How should I go about running deduplication on each block and merging the results back into a single, deduplicated data frame? I know this is a general R question and I can probably come up with a method, but maybe you already have an efficient routine developed? I found that the approach in #63 (comment) may do what I need, after replacing the general fastLink call with a deduplication call. Do you agree?

Thanks,
Hi Gui,

If you are running this on just one computer, the approach you mention would work. For deduplication, I wrote a short primer on how to do it; you can access it here: https://www.dropbox.com/s/jhob3yz0594bo34/Enamorado_Census.pdf?raw=1

If you have access to a computer cluster, then each block becomes a job and you can distribute the blocks.

Hope this helps!

Ted
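A rough sketch of the single-machine, block-by-block approach discussed above. The variable names, matching fields, and the use of getMatches() on fastLink's deduplication output are assumptions, not the poster's actual code:

```r
library(fastLink)

## "blocks" is assumed to come from blockData(df, df, ...); the varnames are
## placeholders for the three fields mentioned in the thread.
dedup_list <- vector("list", length(blocks))

for (i in seq_along(blocks)) {
  block_df <- df[blocks[[i]]$dfA.inds, ]

  ## Passing the same data frame as dfA and dfB makes fastLink run in its
  ## deduplication mode.
  fl_out <- fastLink(
    dfA = block_df, dfB = block_df,
    varnames         = c("first_name", "last_name", "dob"),
    stringdist.match = c("first_name", "last_name")
  )

  ## Collect each block's matched/deduplicated output.
  dedup_list[[i]] <- getMatches(block_df, block_df, fl_out)
}

## Combine the per-block results into a single data frame.
dedup_df <- do.call(rbind, dedup_list)
```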
Hi Ted,

Since I do not have access to a cluster for this job, I am using a simple loop to process each block serially. Runtime seems within reason, but I am running into memory limitations. I initially used 6 CPUs; now I am down to 2 to see if the memory bottleneck goes away. Any tips on that would be appreciated.

Best,
Hi Gui,

If you are running into memory limitations, it can be one of two things:

If the problem is 2, then you can solve it by saving the matched datasets to disk at each iteration of the loop and removing those objects after saving (e.g., with rm(), followed by gc() to release the memory).

If the tricks above do not help, I think the next step is to check which blocks are too large and subset those one step further (if possible).

Keep us posted!

Ted
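One way the save-then-clean-up idea could look inside the loop; the file names, object names, and matching fields are illustrative and follow the earlier sketches:

```r
for (i in seq_along(blocks)) {
  block_df <- df[blocks[[i]]$dfA.inds, ]
  fl_out   <- fastLink(dfA = block_df, dfB = block_df,
                       varnames         = c("first_name", "last_name", "dob"),
                       stringdist.match = c("first_name", "last_name"))
  matched  <- getMatches(block_df, block_df, fl_out)

  ## Save each block's result to disk instead of keeping it in memory ...
  saveRDS(matched, file = sprintf("dedup_block_%03d.rds", i))

  ## ... then drop the large objects and ask R to return the memory.
  rm(block_df, fl_out, matched)
  gc()
}

## Later, read the saved pieces back and combine them.
files    <- sprintf("dedup_block_%03d.rds", seq_along(blocks))
dedup_df <- do.call(rbind, lapply(files, readRDS))
```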
Hi Gui,

How many blocks do you have? Is the problem always with the same block, or does it vary? Do you use a Windows OS? If so, I also recommend a restart (not a shutdown) to free up all RAM before the deduplication. It typically is not necessary, but I have seen memory-related problems sometimes disappear after a restart.

Anders
Hi @tedenamorado and @aalexandersson,

Thank you very much for the advice.

I will implement @tedenamorado's cleanup strategy of saving each block's results to disk and removing the objects between iterations.

Thanks again \o/
That is a relatively large number of blocks compared with what I usually use. Therefore, I am curious: could the issue be the opposite? That is, is the number of observations in the smallest block smaller than the number of blocks? It would have generated an error for a record linkage (on 2 datasets), but I am not sure how fastLink handles this issue for deduplication.

What is the runtime? Do you get an error message? How many observations does the smallest block have?
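A quick way to inspect block sizes from a blockData() result and answer these questions; the blocks object follows the earlier sketches and is an assumption:

```r
## Size of every block produced by blockData().
block_length <- sapply(blocks, function(b) length(b$dfA.inds))

length(block_length)   # how many blocks there are
min(block_length)      # observations in the smallest block
summary(block_length)  # overall distribution of block sizes
```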
Dear @aalexandersson,
```
> summary(block_length)
         Min.       1st Qu.        Median          Mean       3rd Qu.          Max.
    1.0000000   391.5000000  8204.0000000 14745.1300813 31667.5000000 43764.0000000
```
Yes, I think it will help to remove the blocks that contain fewer observations than the number of blocks. For record linkage, when using this code for a block with fewer observations than the number of blocks, I get this error message:

I suspect that
Also, blocking with enough observations per block will not by itself solve any scalability issue. Therefore, you may want to first work with a much smaller deduplication dataset, for example 18,000 records (1% of your 1.8M), to test that your code works as intended. Then, once you trust your code, you can scale up to the full dataset.
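For the small-scale test suggested here, a 1% sample could be drawn along these lines (df is assumed to be the full 1.8M-record data frame):

```r
set.seed(2024)                                   # reproducible test sample
df_test <- df[sample(nrow(df), size = 18000), ]  # ~1% of 1.8M records
```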
Hi Gui,

Another idea that could work for those records with a birth year before 1940 is to pack them into just one cluster.

Keep us posted on how it goes!

Ted
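One simple way to implement that idea is to collapse all pre-1940 birth years into a single blocking value before calling blockData(); the column name is an assumption:

```r
## Pool the sparse early birth years into one blocking category.
df$birth_year_block <- ifelse(df$birth_year < 1940, 1939, df$birth_year)
blocks <- blockData(df, df, varnames = "birth_year_block")
```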
Hi,
I am trying to perform deduplication on a database with 1.8M records. The analysis has been running for ~10 days on an 8-core machine with 32 GB of RAM. Do you believe this task can be achieved on such a machine, or do I need a bigger server?
My command is as follows:
Best,
Gui