duplicate game_id + play_id? #9
Thanks for finding more stuff! Is this in the newer data folder (not legacy-data)? Some of these are errors in the underlying data (i.e., the NFL repeating play IDs), but I will look into this.
yep yep, newest data! downloaded it the other night. i think the "best" news for the duped data is that there weren't visible discrepancies between the rows that i saw. i ran a standard deviation check on duped rows and all had SD == 0 for the numerical columns. doesn't solve the riddle though -- if there were some systematic difference then maybe it'd be due to some sort of join somewhere in the preprocessing pipeline. but that doesn't seem to be the case.
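A check like the one described can be sketched in pandas; the tiny frame below is a toy stand-in for the real play-by-play data, and the column names beyond `game_id` and `play_id` are illustrative:

```python
import pandas as pd

# Toy stand-in for the play-by-play data; in practice this would be
# loaded from the nflfastR-data files.
pbp = pd.DataFrame({
    "game_id": ["g1", "g1", "g1", "g2"],
    "play_id": [1, 1, 2, 1],
    "yards_gained": [5.0, 5.0, 7.0, 3.0],
    "epa": [0.2, 0.2, -0.1, 0.4],
})

# Rows that share a (game_id, play_id) key with at least one other row.
dupes = pbp[pbp.duplicated(["game_id", "play_id"], keep=False)]

# Within each duplicated key, the standard deviation of every numeric
# column should be 0 if the repeated rows are exact copies.
sd = dupes.groupby(["game_id", "play_id"]).std(numeric_only=True)
print((sd == 0).all().all())  # True when duplicates carry identical values
```

If the duplicates came from a bad join, at least some numeric columns would differ within a duplicated key and the check would print False.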
There was actually an issue with joining in the pipeline that got fixed a couple days ago (I was doing something dumb when joining CP to the full data), so if you have an older version of the data, re-downloading it would fix this. I'm actually re-scraping the data and saving .csv.gz and .parquet files to make it easier for non-R users to try things out, so once I close the other issue you might want to re-download and see if the dups are still there.
hm okay. i downloaded the data Saturday at 3:15pm, so unsure if that's pre- or post-fix. will re-download when you push up the new stuff.
Okay I THINK we're good, but please let me know if you find something to the contrary. `game_id`, `qtr`, `half_seconds_remaining`, and `desc` should uniquely identify plays (`play_id` has some errors from the NFL where IDs are duplicated, especially in older seasons):

```r
pbp <- purrr::map_df(1999:2019, function(x) {
  readRDS(
    url(
      glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
    )
  )
})

dupes <- pbp %>% janitor::get_dupes(game_id, qtr, half_seconds_remaining, desc)
#> No duplicate combinations found of: game_id, qtr, half_seconds_remaining, desc
```
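For non-R users, the same uniqueness check might be sketched in pandas (the inline frame is a toy stand-in for one of the real CSV files):

```python
import pandas as pd

# Stand-in for a season of play-by-play data; the real files are the
# play_by_play_{year} CSVs in the nflfastR-data repository.
pbp = pd.DataFrame({
    "game_id": ["2019_01_A_B"] * 3,
    "qtr": [1, 1, 2],
    "half_seconds_remaining": [1800, 1795, 900],
    "desc": ["kickoff", "run for 5 yards", "pass incomplete"],
})

key = ["game_id", "qtr", "half_seconds_remaining", "desc"]
dupes = pbp[pbp.duplicated(key, keep=False)]

if dupes.empty:
    print("No duplicate combinations found of:", ", ".join(key))
```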
I'm going to guess it's a similar issue to what MLB has with its play IDs. The game-level ID is unique, but for some infrastructure reason the play IDs can be duped, and the broad reason is processing on separate servers. Which... seems near impossible to incur duped IDs ever, but alas. Thanks Ben!
I am using data cloned on Aug 13, 2020. I use `(gameid, drive, playid)` as a composite key in my PostgreSQL database. I found the following duplicate when I loaded the 1999-2019 CSV data into my database:

```
Key (gameid, drive, playid)=(2000_03_PIT_CLE, 18, 2767) already exists.
```

I have addressed this in my database, but I thought it important to flag for this project. The play records (`desc` etc.) are unique, but the `playid` is not.
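A pre-load check for collisions on that composite key can be sketched in pandas before inserting into PostgreSQL. The key columns come from the comment above; the row values here are toy examples, not the real plays from that game:

```python
import pandas as pd

# Toy rows illustrating a composite-key collision like the one reported:
# same (game_id, drive, play_id), different play descriptions.
pbp = pd.DataFrame({
    "game_id": ["2000_03_PIT_CLE", "2000_03_PIT_CLE", "2000_03_PIT_CLE"],
    "drive":   [18, 18, 19],
    "play_id": [2767, 2767, 2800],
    "desc":    ["pass incomplete", "penalty declined", "run for 2 yards"],
})

key = ["game_id", "drive", "play_id"]

# Rows whose key collides with another row; these would violate a
# PRIMARY KEY or UNIQUE constraint on (game_id, drive, play_id).
collisions = pbp[pbp.duplicated(key, keep=False)]
print(collisions[key].drop_duplicates())
```

Running a check like this before the load surfaces every offending key at once, instead of failing on the first constraint violation.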
Yep those are errors on the data we've been given. I'll un-close this just in case other people come here with the same question, but this isn't something that we can fix. |
About 3% of the data has duplicate `game_id` + `play_id` combos, and I'm unsure exactly why. It seems like you can run a straight-up `drop_duplicates()` on the full rows and most drop out. If you subset to `game_id` and `play_id`, then that drops more rows.

Specifically, it looks like it is mostly on passing plays, and it seems pretty evenly distributed across all seasons (1300-2000 dupes per season). I can't seem to find any particular pattern in the `desc`.

Perhaps this is a known problem already in a different repo. Anyway, hope that helps!
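The two-stage deduplication described above can be sketched as follows (the frame is a toy stand-in for the real data):

```python
import pandas as pd

pbp = pd.DataFrame({
    "game_id": ["g1", "g1", "g1", "g2"],
    "play_id": [10, 10, 10, 20],
    "desc": ["pass short left", "pass short left",
             "pass deep right", "run up the middle"],
})

# Dropping fully identical rows removes most duplicates...
full_dedup = pbp.drop_duplicates()

# ...while subsetting to the (game_id, play_id) key drops even more,
# because some repeated play_ids carry different descriptions.
key_dedup = pbp.drop_duplicates(subset=["game_id", "play_id"])

print(len(pbp), len(full_dedup), len(key_dedup))  # prints: 4 3 2
```

The gap between the two counts is the interesting part: keys that repeat with *different* row contents cannot be explained as simple exact-copy duplication.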