Just a possible improvement that I've used in recent blog posts; a function which matches team names from another dataset onto engsoccerdata team names.
For each team name in a vector, it finds the most similar string in the `name_other` column of the `teamnames` dataframe (using the `levenshteinSim` function from the `RecordLinkage` package) and returns the corresponding `name`.
Works well for me so far but untested with non-England teams.
#---------------------------------------------------------------------------
# matchTeamnames()
#---------------------------------------------------------------------------
# Matches a vector of team names with names used by 'teamnames' dataframe in
# engsoccerdata package
#---------------------------------------------------------------------------
# * Takes a vector of team names and returns a vector of the same length
#   holding the matching 'name' from 'teamnames'
# * 'min_dist' is the lowest similarity (0 to 1) accepted as a match; if
#   every candidate for a team scores below it, returns NA
# * Returns a vector by default; if 'checkResults' is TRUE, returns a
#   dataframe of old names, best matches and their similarity scores for
#   purposes of validation
#---------------------------------------------------------------------------
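matchTeamnames <- function(teams, min_dist = 0.1, checkResults = FALSE) {
  # library() stops with an error if a package is missing; require() only warns
  library(engsoccerdata)
  library(RecordLinkage)

  teams <- as.character(teams)
  candidates <- as.character(teamnames$name_other)

  # Look up the best match once per unique input name
  old_new_df <- lapply(unique(teams), function(x) {
    # levenshteinSim() gives a similarity score: higher means closer
    similarity <- levenshteinSim(x, candidates)
    best_sim <- max(similarity, na.rm = TRUE)
    # Threshold on similarity: below min_dist, report no match
    new_name <- if (best_sim >= min_dist) {
      as.character(teamnames$name[which.max(similarity)])
    } else {
      NA_character_
    }
    data.frame(old_name = x, new_name = new_name, distance = best_sim,
               stringsAsFactors = FALSE)
  })
  # Base R row-binding, so neither plyr nor dplyr is needed
  old_new_df <- do.call(rbind, old_new_df)

  if (checkResults) {
    return(old_new_df)
  }
  # Map the matches back onto the original (possibly repeated) input
  old_new_df$new_name[match(teams, old_new_df$old_name)]
}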
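To show what the similarity score and the threshold are doing, here is a rough, package-free sketch. It is not the function above: `lookup` is a made-up stand-in for `teamnames`, and `lev_sim`, `best_match` and `min_sim` are hypothetical names. It also assumes that `levenshteinSim` is one minus the edit distance divided by the length of the longer string, rebuilt here with base R's `adist`.

```r
# Toy lookup table standing in for engsoccerdata's teamnames (hypothetical rows)
lookup <- data.frame(
  name       = c("Manchester United", "Manchester City", "Tottenham Hotspur"),
  name_other = c("Man Utd", "Man City", "Spurs"),
  stringsAsFactors = FALSE
)

# Levenshtein similarity: 1 - edit distance / length of the longer string.
# Assumed to mirror levenshteinSim(); built on base R's utils::adist().
lev_sim <- function(x, candidates) {
  d <- as.vector(utils::adist(x, candidates))
  1 - d / pmax(nchar(x), nchar(candidates))
}

# Return the 'name' whose 'name_other' is most similar to x,
# or NA if even the best candidate scores below min_sim.
best_match <- function(x, min_sim = 0.1) {
  sim <- lev_sim(x, lookup$name_other)
  best <- which.max(sim)
  if (sim[best] >= min_sim) lookup$name[best] else NA_character_
}

best_match("Man United")              # closest alias is "Man Utd"
best_match("Arsenal", min_sim = 0.5)  # nothing clears the threshold
```

With the default threshold of 0.1 almost any string finds some match, so running with `checkResults = TRUE` and eyeballing the low-scoring rows seems worthwhile before trusting the output.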