
duplicate game_id + play_id? #9

Open
nickwan opened this issue Jun 8, 2020 · 8 comments
Labels: wontfix (This will not be worked on)

Comments

nickwan commented Jun 8, 2020

About 3% of the data has duplicate game_id + play_id combos and I'm not sure exactly why. Running a straight drop_duplicates() on the full rows removes most of them; subsetting to just game_id and play_id drops even more rows.

Specifically, it looks like it's mostly passing plays, and it's spread pretty evenly across all seasons (roughly 1300-2000 dupes per season). I can't find any particular pattern in the desc column.

Perhaps this is a known problem already in a different repo. Anyway, hope that helps!
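
A rough pandas sketch of the check described above (the file path is illustrative, not part of the repo):

import pandas as pd

# Illustrative path: any CSV export of the combined play-by-play data works here.
pbp = pd.read_csv("play_by_play_1999_2019.csv", low_memory=False)

# Dropping exact duplicate rows removes most of the repeats...
exact_dupes = len(pbp) - len(pbp.drop_duplicates())

# ...while deduplicating on game_id + play_id alone removes even more rows,
# so some repeated ID combos differ in at least one other column.
id_dupes = len(pbp) - len(pbp.drop_duplicates(subset=["game_id", "play_id"]))

print("exact duplicate rows:", exact_dupes)
print("duplicate game_id + play_id combos:", id_dupes)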

guga31bb (Member) commented Jun 8, 2020

Thanks for finding more stuff!

Is this in the newer data folder (not legacy-data)? Some of these are errors in the underlying data (i.e., the NFL repeating play IDs), but I'll look into this.

nickwan (Author) commented Jun 8, 2020

Yep, newest data! Downloaded it the other night. I think the "best" news for the duped data is that there weren't any visible discrepancies between the rows that I saw: I ran a standard deviation check on the duped rows and all of them had SD == 0 for the numerical columns. That doesn't solve the riddle, though -- if there were some systematic difference, it might point to a join somewhere in the preprocessing pipeline, but that doesn't seem to be the case.
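
A rough pandas sketch of that SD check (assuming pbp is the combined data loaded as in the earlier sketch; not necessarily the exact code that was run):

import pandas as pd

# Keep only rows whose game_id + play_id combo appears more than once.
duped = pbp[pbp.duplicated(subset=["game_id", "play_id"], keep=False)]

# Within each duplicated combo, the standard deviation of every numeric column
# should be 0 (or NaN where a column is entirely missing) if the repeated rows
# really are identical on the numeric side.
numeric_cols = [c for c in duped.select_dtypes(include="number").columns
                if c not in ("game_id", "play_id")]
sd_check = duped.groupby(["game_id", "play_id"])[numeric_cols].std()

print((sd_check.fillna(0) == 0).all().all())  # True -> no numeric discrepancies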

guga31bb (Member) commented Jun 8, 2020

There was actually an issue with joining in the pipeline that got fixed a couple of days ago (I was doing something dumb when joining CP to the full data), so if you have an older version of the data, re-downloading it would fix this. I'm also re-scraping the data and saving csv.gz and .parquet files to make it easier for non-R users to try things out, so once I close the other issue you might want to re-download and see if the dupes are still there.
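
For non-R users, something like the following should work once the refreshed files are up (the parquet file name below is an assumption, since the exact layout isn't spelled out in this thread):

import pandas as pd

# Hypothetical file name/path -- treat it as an illustration of reading the
# re-scraped parquet output directly with pandas (requires pyarrow; reading
# from a URL also needs fsspec).
url = ("https://raw.githubusercontent.com/guga31bb/nflfastR-data/"
       "master/data/play_by_play_2019.parquet")
pbp_2019 = pd.read_parquet(url)

# Re-check for duplicated game_id + play_id combos in the refreshed data.
print(pbp_2019.duplicated(subset=["game_id", "play_id"]).sum())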

nickwan (Author) commented Jun 8, 2020

Hm, okay. I downloaded the data Saturday at 3:15pm, so I'm not sure whether that's pre- or post-fix. I'll re-download when you push up the new stuff.

guga31bb (Member) commented Jun 8, 2020

Okay, I THINK we're good, but please let me know if you find something to the contrary. game_id, qtr, half_seconds_remaining, and desc should uniquely identify plays (play_id has some errors from the NFL where IDs are duplicated, especially in older seasons):

library(dplyr)   # for the %>% pipe

pbp <- purrr::map_df(1999:2019, function(x) {
  readRDS(
    url(
      glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
    )
  )
})

dupes <- pbp %>% janitor::get_dupes(game_id, qtr, half_seconds_remaining, desc)
#> No duplicate combinations found of: game_id, qtr, half_seconds_remaining, desc
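
A rough pandas equivalent of the get_dupes() check, for anyone working outside R (assuming pbp holds the combined 1999-2019 play-by-play data):

import pandas as pd

# pbp: DataFrame with the combined 1999-2019 play-by-play data, e.g. built
# from the csv.gz files mentioned above.
key = ["game_id", "qtr", "half_seconds_remaining", "desc"]
dupes = pbp[pbp.duplicated(subset=key, keep=False)]

if dupes.empty:
    print("No duplicate combinations found of:", ", ".join(key))
else:
    print(dupes.sort_values(key))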

nickwan (Author) commented Jun 8, 2020

I'm going to guess it's a similar issue to the one MLB has with their play IDs. The game-level ID is unique, but for some infrastructure reason the play IDs can be duplicated, and the broad reason is processing on separate servers. Which... seems like it should be nearly impossible to ever incur duped IDs, but alas.

Thanks Ben!

guga31bb closed this as completed Jun 8, 2020
MargareeMan commented Aug 15, 2020

@guga31bb and @nickwan:

I am using data cloned on Aug 13 2020. I use (gameid, drive, playid) as a composite key in my PostgreSQL database. I found the following duplicates when I loaded the 1999-2019 csv data into my database:

Key (gameid, drive, playid)=(2000_03_PIT_CLE, 18, 2767) already exists.
Key (gameid, drive, playid)=(2000_03_PIT_CLE, 18, 2768) already exists.
Key (gameid, drive, playid)=(2000_06_WAS_PHI, 12, 1825) already exists.
Key (gameid, drive, playid)=(2000_06_WAS_PHI, 12, 1825) already exists.
Key (gameid, drive, playid)=(2000_11_OAK_DEN, 15, 2323) already exists.
Key (gameid, drive, playid)=(2000_11_OAK_DEN, 15, 2323) already exists.
Key (gameid, drive, playid)=(2002_05_KC_NYJ, 8, 1020) already exists.
Key (gameid, drive, playid)=(2005_04_SEA_WAS, 13, 2861) already exists.
Key (gameid, drive, playid)=(2005_04_SEA_WAS, 13, 2861) already exists.
Key (gameid, drive, playid)=(2006_10_TB_CAR, 1, 103) already exists.
Key (gameid, drive, playid)=(2006_10_TB_CAR, 1, 103) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.

I have addressed this in my database, but I thought it was important to flag for this project. The play records (desc etc.) are unique, but the playid is not.
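
A minimal pandas sketch of the kind of workaround this implies on the loading side (the csv path is illustrative, and this isn't necessarily how it was handled here):

import pandas as pd

# Illustrative path; column names follow the play-by-play files.
pbp = pd.read_csv("play_by_play_1999_2019.csv", low_memory=False)

# Option 1: drop repeated (game_id, drive, play_id) rows before inserting,
# mirroring a PRIMARY KEY on that composite.
deduped = pbp.drop_duplicates(subset=["game_id", "drive", "play_id"])

# Option 2: key on the combination noted earlier in the thread, which is
# reported to be unique in the source data.
unique_key = ["game_id", "qtr", "half_seconds_remaining", "desc"]
print(pbp.duplicated(subset=unique_key).sum())  # expected: 0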

guga31bb (Member) commented

Yep, those are errors in the data we've been given. I'll un-close this just in case other people come here with the same question, but this isn't something that we can fix.

guga31bb reopened this Aug 15, 2020
tanho63 added the wontfix label Sep 9, 2022