
duplicate game_id + play_id? #9

Open
nickwan opened this issue Jun 8, 2020 · 8 comments
Labels: wontfix (This will not be worked on)

Comments

nickwan commented Jun 8, 2020

About 3% of the data has duplicate game_id + play_id combos and I'm not sure exactly why. Running a straight drop_duplicates() on the full rows removes most of them; subsetting to just game_id and play_id drops even more rows.

Specifically, it looks like it's mostly passing plays, and it's spread pretty evenly across all seasons (roughly 1300-2000 dupes per season). I can't find any particular pattern in the desc column.

Perhaps this is a known problem already in a different repo. Anyway, hope that helps!
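
A rough pandas sketch of the check described above (the file path is illustrative, not part of the repo):

import pandas as pd

# Illustrative path: any CSV export of the combined play-by-play data works here.
pbp = pd.read_csv("play_by_play_1999_2019.csv", low_memory=False)

# Dropping exact duplicate rows removes most of the repeats...
exact_dupes = len(pbp) - len(pbp.drop_duplicates())

# ...while deduplicating on game_id + play_id alone removes even more rows,
# so some repeated ID combos differ in at least one other column.
id_dupes = len(pbp) - len(pbp.drop_duplicates(subset=["game_id", "play_id"]))

print("exact duplicate rows:", exact_dupes)
print("duplicate game_id + play_id combos:", id_dupes)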

guga31bb (Member) commented Jun 8, 2020

Thanks for finding more stuff!

Is this in the newer data folder (not legacy-data)? Some of these are errors in the underlying data (i.e., the NFL repeating play IDs), but I'll look into this.

nickwan (Author) commented Jun 8, 2020

Yep, newest data! Downloaded it the other night. I think the "best" news for the duped data is that there weren't any visible discrepancies between the rows that I saw: I ran a standard deviation check on the duped rows and all of them had SD == 0 for the numerical columns. That doesn't solve the riddle, though -- if there were some systematic difference, it might point to a join somewhere in the preprocessing pipeline, but that doesn't seem to be the case.
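
A rough pandas sketch of that SD check (assuming pbp is the combined data loaded as in the earlier sketch; not necessarily the exact code that was run):

import pandas as pd

# Keep only rows whose game_id + play_id combo appears more than once.
duped = pbp[pbp.duplicated(subset=["game_id", "play_id"], keep=False)]

# Within each duplicated combo, the standard deviation of every numeric column
# should be 0 (or NaN where a column is entirely missing) if the repeated rows
# really are identical on the numeric side.
numeric_cols = [c for c in duped.select_dtypes(include="number").columns
                if c not in ("game_id", "play_id")]
sd_check = duped.groupby(["game_id", "play_id"])[numeric_cols].std()

print((sd_check.fillna(0) == 0).all().all())  # True -> no numeric discrepancies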

guga31bb (Member) commented Jun 8, 2020

There was actually an issue with joining in the pipeline that got fixed a couple of days ago (I was doing something dumb when joining CP to the full data), so if you have an older version of the data, re-downloading it would fix this. I'm also re-scraping the data and saving csv.gz and .parquet files to make it easier for non-R users to try things out, so once I close the other issue you might want to re-download and see if the dupes are still there.
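
For non-R users, something like the following should work once the refreshed files are up (the parquet file name below is an assumption, since the exact layout isn't spelled out in this thread):

import pandas as pd

# Hypothetical file name/path -- treat it as an illustration of reading the
# re-scraped parquet output directly with pandas (requires pyarrow; reading
# from a URL also needs fsspec).
url = ("https://raw.githubusercontent.com/guga31bb/nflfastR-data/"
       "master/data/play_by_play_2019.parquet")
pbp_2019 = pd.read_parquet(url)

# Re-check for duplicated game_id + play_id combos in the refreshed data.
print(pbp_2019.duplicated(subset=["game_id", "play_id"]).sum())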

nickwan (Author) commented Jun 8, 2020

Hm, okay. I downloaded the data Saturday at 3:15pm, so I'm not sure whether that's pre- or post-fix. I'll re-download when you push up the new stuff.

guga31bb (Member) commented Jun 8, 2020

Okay, I THINK we're good, but please let me know if you find something to the contrary. game_id, qtr, half_seconds_remaining, and desc should uniquely identify plays (play_id has some errors from the NFL where IDs are duplicated, especially in older seasons):

library(dplyr)   # for the %>% pipe

pbp <- purrr::map_df(1999:2019, function(x) {
  readRDS(
    url(
      glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
    )
  )
})

dupes <- pbp %>% janitor::get_dupes(game_id, qtr, half_seconds_remaining, desc)
#> No duplicate combinations found of: game_id, qtr, half_seconds_remaining, desc
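
A rough pandas equivalent of the get_dupes() check, for anyone working outside R (assuming pbp holds the combined 1999-2019 play-by-play data):

import pandas as pd

# pbp: DataFrame with the combined 1999-2019 play-by-play data, e.g. built
# from the csv.gz files mentioned above.
key = ["game_id", "qtr", "half_seconds_remaining", "desc"]
dupes = pbp[pbp.duplicated(subset=key, keep=False)]

if dupes.empty:
    print("No duplicate combinations found of:", ", ".join(key))
else:
    print(dupes.sort_values(key))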

nickwan (Author) commented Jun 8, 2020

I'm going to guess it's a similar issue to the one MLB has with their play IDs. The game-level ID is unique, but for some infrastructure reason the play IDs can be duplicated, and the broad reason is processing on separate servers. Which... seems like it should be nearly impossible to ever incur duped IDs, but alas.

Thanks Ben!

guga31bb closed this as completed Jun 8, 2020
MargareeMan commented Aug 15, 2020

@guga31bb and @nickwan:

I am using data cloned on Aug 13 2020. I use (gameid, drive, playid) as a composite key in my PostgreSQL database. I found the following duplicates when I loaded the 1999-2019 csv data into my database:

Key (gameid, drive, playid)=(2000_03_PIT_CLE, 18, 2767) already exists.
Key (gameid, drive, playid)=(2000_03_PIT_CLE, 18, 2768) already exists.
Key (gameid, drive, playid)=(2000_06_WAS_PHI, 12, 1825) already exists.
Key (gameid, drive, playid)=(2000_06_WAS_PHI, 12, 1825) already exists.
Key (gameid, drive, playid)=(2000_11_OAK_DEN, 15, 2323) already exists.
Key (gameid, drive, playid)=(2000_11_OAK_DEN, 15, 2323) already exists.
Key (gameid, drive, playid)=(2002_05_KC_NYJ, 8, 1020) already exists.
Key (gameid, drive, playid)=(2005_04_SEA_WAS, 13, 2861) already exists.
Key (gameid, drive, playid)=(2005_04_SEA_WAS, 13, 2861) already exists.
Key (gameid, drive, playid)=(2006_10_TB_CAR, 1, 103) already exists.
Key (gameid, drive, playid)=(2006_10_TB_CAR, 1, 103) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.
Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.

I have addressed this in my database, but I thought it was important to flag for this project. The play records (desc etc.) are unique, but the playid is not.
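
A minimal pandas sketch of the kind of workaround this implies on the loading side (the csv path is illustrative, and this isn't necessarily how it was handled here):

import pandas as pd

# Illustrative path; column names follow the play-by-play files.
pbp = pd.read_csv("play_by_play_1999_2019.csv", low_memory=False)

# Option 1: drop repeated (game_id, drive, play_id) rows before inserting,
# mirroring a PRIMARY KEY on that composite.
deduped = pbp.drop_duplicates(subset=["game_id", "drive", "play_id"])

# Option 2: key on the combination noted earlier in the thread, which is
# reported to be unique in the source data.
unique_key = ["game_id", "qtr", "half_seconds_remaining", "desc"]
print(pbp.duplicated(subset=unique_key).sum())  # expected: 0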

guga31bb (Member) commented

Yep, those are errors in the data we've been given. I'll un-close this just in case other people come here with the same question, but this isn't something that we can fix.

guga31bb reopened this Aug 15, 2020
tanho63 added the wontfix label Sep 9, 2022