nameReweight NA issue #54

Open
EmericA570 opened this issue Jul 15, 2021 · 1 comment

EmericA570 commented Jul 15, 2021

Hello everyone,

Nice work with the package. It works well for me.

I just have a few questions about reweighting posterior probabilities. After using nameReweight, or just fastLink with nameReweight and firstname.field, I only get NA in zeta.name, and I don't understand why. I looked at the function, and it seems to be caused by this line: 'matches.names.A$zeta.j.names[matches.names.A[,ind] != 2] <- NA'. But I don't understand what it does.

I would also need to reweight using more than one field. I have already made some modifications, but I wanted to know whether there was a reason you didn't implement it.

In fact, I realized that I'm not really sure how to use the nameReweight function. Could you explain it to me?

Best,

Emeric

@tedenamorado
Collaborator

Hi @AuriantEmeric,

I hope all is well. Sorry for the late reply.

The name reweight function takes the empirical distribution of names and reweights matches according to name frequency. As a result, matches on common names are down-weighted, while matches on infrequent names are up-weighted.
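
To make the idea of an empirical name distribution concrete, here is a minimal base R sketch. The vector `first_names` is made-up illustration data, and the snippet only tabulates relative frequencies; it does not reproduce the adjustment that fastLink applies internally.

```r
# Hypothetical toy data: first names observed in one data set
first_names <- c("maria", "maria", "maria", "john", "john", "zelda")

# Empirical distribution: share of records carrying each name
name_freq <- prop.table(table(first_names))

# Sort from most to least common. Agreement on a name near the top is
# weaker evidence of a true match than agreement on a name near the bottom.
name_freq <- sort(name_freq, decreasing = TRUE)
print(name_freq)
```

In this toy example a match on "zelda" would be treated as more informative than a match on "maria", which is the intuition behind the reweighting.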

Our code, as it currently stands, can reweight probabilities based on one field. For example:

## Load the package and data
library(fastLink)
data(samplematch)

## The fastLink function only allows you to reweight one field at a time
matches.out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  reweight.names = TRUE,
  firstname.field = c("firstname")
)

## You can also reweight by last name
matches.out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  reweight.names = TRUE,
  firstname.field = c("lastname")
)

Now, to reweight by two fields, you would need to make further assumptions about the joint prevalence of first names and last names. For example, if you were to assume that first and last names are independent, you could simply multiply the matching probabilities adjusted for first name frequency by their counterparts adjusted for last name frequency.
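
A minimal sketch of that multiplication in base R, under the independence assumption just described. The helper `combine_reweighted()`, the column name `prob`, and the numbers are all hypothetical and not part of fastLink; each input is assumed to hold one row per matched pair, with the pair's row indices and its probability after reweighting on a single name field.

```r
# Hypothetical helper (not part of fastLink): combine two sets of
# reweighted match probabilities under an independence assumption.
# Each input is assumed to be a data frame with columns
#   inds.a, inds.b : row indices of the matched pair in dfA and dfB
#   prob           : match probability after reweighting on one name field
combine_reweighted <- function(by_first, by_last) {
  # Keep only the pairs that appear in both inputs
  both <- merge(by_first, by_last, by = c("inds.a", "inds.b"),
                suffixes = c(".first", ".last"))
  # Independence assumption: multiply the two adjusted probabilities.
  # An NA in either input propagates to the product.
  both$prob.combined <- both$prob.first * both$prob.last
  both
}

# Made-up numbers for illustration only
by_first <- data.frame(inds.a = c(1, 2, 3), inds.b = c(4, 5, 6),
                       prob = c(0.9, 0.8, 0.5))
by_last <- data.frame(inds.a = c(1, 2), inds.b = c(4, 5),
                      prob = c(0.5, 1.0))
combined <- combine_reweighted(by_first, by_last)
print(combined)
```

Pairs that appear in only one of the two inputs are dropped by the merge, so whether that is appropriate depends on how the two sets of matches were produced.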

If anything is unclear, please do not hesitate to reach out.

All my best,

Ted
