This R package provides a function to easily build panel data from PSID raw data.
Warning: the wealth-supplement setup has changed on the PSID system. Wealth variables are now part of the family files for waves from 1999 onwards. The `wealth=TRUE` option has therefore been removed from the package. See this issue for more details.
The Panel Study of Income Dynamics is a publicly available dataset.
- you can use the data center to build simple datasets
- not workable for larger datasets
- some variables don't show up (although you know they exist)
- the ftp interface gets slower the more periods you are looking at
- the click and scroll exercise of selecting the right variables in each period is extremely error prone.
- merging the data manually is non-trivial.
This package attempts to help with the task of building a panel dataset. The user downloads ASCII data from the PSID server directly into R, without the need for any other software like Stata or SAS. To build the panel, the user must then specify the variable names in each wave of the questionnaire in a data.frame `fam.vars`, as well as the variables from the individual index in `ind.vars`. The helper function `getNamesPSID` helps find the different variable names across waves; see the examples below.
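To make the expected shape of `fam.vars` and `ind.vars` concrete, here is a minimal sketch in base R. The variable codes are copied from the example tables further down in this README; restricting to three waves and two or three columns is purely for illustration, and nothing is downloaded here.

```r
# One row per wave, one column per variable you want in the panel.
# Cell values are the PSID variable codes for that wave.
fam.vars <- data.frame(
  year   = c(1968, 1969, 1970),
  hours  = c("V47", "V465", "V1138"),
  faminc = c("V81", "V529", "V1514"),
  stringsAsFactors = FALSE
)

ind.vars <- data.frame(
  year = c(1968, 1969, 1970),
  age  = c("ER30004", "ER30023", "ER30046"),
  educ = c("ER30010", NA, "ER30052"),  # NA: no code listed for this wave
  stringsAsFactors = FALSE
)

str(fam.vars)
str(ind.vars)
```

A missing code is entered as `NA` in the corresponding cell, so every column keeps one entry per wave.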
We first present a real-world example that builds a full 1968-2017 panel. Then we show some tests.
- You want a `data.table` with the following columns: `PID`, `year`, `income`, `wage`, `age`, `educ` and some more variables.
- You went to the PSID variable search to look up the relevant variable names in each year in either the individual-level or family-level datasets.
- You created a list of those variables as I did in `inst/psid-lists` of this package.
- You noted that there is NO EDUCATION variable in the individual index file in 1968 and 1969.
  - Instead of the variable name for `EDUC` in 1968 and 1969 you want to put `NA`.
- You noted that there is NO HOURLY WAGE variable in the family index file in 1993.
  - Instead of the variable name for `HOURLY WAGE` in 1993 you want to put `NA`.
# Build panel with income, wage, age, education and several other variables
# [this is the body of the function build.psid()]
library(psidR)
library(data.table)
r = system.file(package="psidR")
f = fread(file.path(r,"psid-lists","famvars.txt"))
i = fread(file.path(r,"psid-lists","indvars.txt"))
> i
dataset year variable label name
1: PSID Individual Data by Years 1968 ER30019 INDIVIDUAL WEIGHT 68 weight
2: PSID Individual Data by Years 1969 ER30042 INDIVIDUAL WEIGHT 69 weight
3: PSID Individual Data by Years 1970 ER30066 INDIVIDUAL WEIGHT 70 weight
4: PSID Individual Data by Years 1971 ER30090 INDIVIDUAL WEIGHT 71 weight
5: PSID Individual Data by Years 1972 ER30116 INDIVIDUAL WEIGHT 72 weight
---
143: PSID Individual Data Index 2009 ER34020 HIGHEST GRADE FINISHED educ
144: PSID Individual Data Index 2011 ER34119 HIGHEST GRADE FINISHED educ
145: PSID Individual Data Index 2013 ER34230 HIGHEST GRADE FINISHED educ
146: PSID Individual Data Index 2015 ER34349 HIGHEST GRADE FINISHED educ
147: PSID Individual Data Index 2017 ER34548 HIGHEST GRADE FINISHED educ
> f
dataset year variable label name
1: PSID Main Family Data 1968 V47 HD ANN HRS WORKED LAST YR hours
2: PSID Main Family Data 1969 V465 HD ANN HRS WORKED LAST YR hours
3: PSID Main Family Data 1970 V1138 HD ANN HRS WORKED LAST YR hours
4: PSID Main Family Data 1971 V1839 HD ANN HRS WORKED LAST YR hours
5: PSID Main Family Data 1972 V2439 HD ANN HRS WORKED LAST YR hours
---
609: PSID Family-level 2009 ER42139 A52 LIKELIHOOD OF MOVING likelihood_move
610: PSID Family-level 2011 ER47447 A52 LIKELIHOOD OF MOVING likelihood_move
611: PSID Family-level 2013 ER53147 A52 LIKELIHOOD OF MOVING likelihood_move
612: PSID Family-level 2015 ER60162 A52 LIKELIHOOD OF MOVING likelihood_move
613: PSID Family-level 2017 ER66163 A52 LIKELIHOOD OF MOVING likelihood_move
# alternatively, use `getNamesPSID`:
# cwf <- read.xlsx("http://psidonline.isr.umich.edu/help/xyr/psid.xlsx")
# Suppose you know the name of the variable in a certain year, and it is
# "ER17013". Then get the corresponding name in another year with
# getNamesPSID("ER17013", cwf, years = 2001) # 2001 only
# getNamesPSID("ER17013", cwf, years = 2003) # 2003
# getNamesPSID("ER17013", cwf, years = NULL) # all years
# getNamesPSID("ER17013", cwf, years = c(2005, 2007, 2009)) # some years
# next, bring into required shape:
i = dcast(i[,list(year,name,variable)],year~name, value.var = "variable")
f = dcast(f[,list(year,name,variable)],year~name, value.var = "variable")
> head(i)
year age educ empstat weight
1: 1968 ER30004 ER30010 <NA> ER30019
2: 1969 ER30023 <NA> <NA> ER30042 # NOTICE THE NA for educ HERE!!
3: 1970 ER30046 ER30052 <NA> ER30066
4: 1971 ER30070 ER30076 <NA> ER30090
5: 1972 ER30094 ER30100 <NA> ER30116
6: 1973 ER30120 ER30126 <NA> ER30137
> head(f)
year age_youngest_child debt empstat_ faminc hours hvalue ...
1: 1968 V120 <NA> V196 V81 V47 V5 ...
2: 1969 V1013 <NA> V639 V529 V465 V449 ...
3: 1970 V1243 <NA> V1278 V1514 V1138 V1122 ...
4: 1971 V1946 <NA> V1983 V2226 V1839 V1823 ...
5: 1972 V2546 <NA> V2581 V2852 V2439 V2423 ...
6: 1973 V3099 <NA> V3114 V3256 V3027 V3021 ...
# call the builder function
d = build.panel(datadir=datadr,fam.vars=f,ind.vars=i, heads.only = TRUE,sample="SRC",design="all")
# d contains your panel
saveRDS(d, file = "~/psid.Rds")  # read back later with readRDS("~/psid.Rds")
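The reshaping step above (`dcast` from a long variable list to one row per year) is what produces the `NA` entries for missing variables. A small sketch on a toy table, assuming only that `data.table` is installed; the codes are taken from the printed `head(i)` output, and the object names `long` and `wide` are made up for this illustration:

```r
library(data.table)

# Toy long-format list in the same layout as indvars.txt
# (one row per year/variable pair). No educ row for 1969.
long <- data.table(
  year     = c(1968, 1968, 1969, 1970, 1970),
  variable = c("ER30004", "ER30010", "ER30023", "ER30046", "ER30052"),
  name     = c("age", "educ", "age", "age", "educ")
)

# One row per year, one column per name.
# Year/name pairs absent from `long` are filled with NA.
wide <- dcast(long, year ~ name, value.var = "variable")
print(wide)
```

Because a missing year/name pair becomes `NA` automatically, leaving a variable out of the list for a given year has the same effect as entering `NA` by hand.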
Here are some tests:
# one year test, no ind file
# call function `small.test.noind()`
# get var names from cross walk
cwf = openxlsx::read.xlsx(system.file(package="psidR","psid-lists","psid.xlsx"))
head_age_var_name <- getNamesPSID("ER17013", cwf, years=c(2003))
# create family vars data.frame
famvars = data.frame(year=c(2003),age=head_age_var_name)
# call function
build.panel(fam.vars=famvars,datadir=dd)
# one year test, ind file
# call function `small.test.ind()`
cwf = openxlsx::read.xlsx(system.file(package="psidR","psid-lists","psid.xlsx"))
head_age_var_name <- getNamesPSID("ER17013", cwf, years=c(2003))
educ = getNamesPSID("ER30323",cwf,years=2003)
famvars = data.frame(year=c(2003),age=head_age_var_name)
indvars = data.frame(year=c(2003),educ=educ)
build.panel(fam.vars=famvars,ind.vars=indvars,datadir=dd)
# three year test, ind file
# call function `medium.test.ind()`
cwf = openxlsx::read.xlsx(system.file(package="psidR","psid-lists","psid.xlsx"))
head_age_var_name <- getNamesPSID("ER17013", cwf, years=c(2003,2005,2007))
educ = getNamesPSID("ER30323",cwf,years=c(2003,2005,2007))
famvars = data.frame(year=c(2003,2005,2007),age=head_age_var_name)
indvars = data.frame(year=c(2003,2005,2007),educ=educ)
build.panel(fam.vars=famvars,ind.vars=indvars,datadir=dd)
# etc for
medium.test.noind()
# example output:
INFO [2018-10-10 10:58:23] Will download missing datasets now
INFO [2018-10-10 10:58:23] will download family files: 2003, 2005, 2007
This can take several hours/days to download.
want to go ahead? give me 'yes' or 'no'.yes
please enter your PSID username: *******
please enter your PSID password: *******
INFO [2018-10-10 10:58:46] downloading file ~/psid/FAM2003ER
INFO [2018-10-10 10:58:50] now reading and processing SAS file ~/psid/FAM2003ER into R
INFO [2018-10-10 11:07:02] downloading file ~/psid/FAM2005ER
INFO [2018-10-10 11:07:05] now reading and processing SAS file ~/psid/FAM2005ER into R
INFO [2018-10-10 11:14:44] downloading file ~/psid/FAM2007ER
INFO [2018-10-10 11:14:48] now reading and processing SAS file ~/psid/FAM2007ER into R
INFO [2018-10-10 11:28:25] finished downloading files to ~/psid/
INFO [2018-10-10 11:28:25] continuing now to build the dataset
INFO [2018-10-10 11:28:25] psidR: Loading Family data from .rda files
INFO [2018-10-10 11:28:34] psidR: loaded individual file: ~/psid/IND2015ER.rda
INFO [2018-10-10 11:28:34] psidR: total memory load in MB: 1252
INFO [2018-10-10 11:28:34]
INFO [2018-10-10 11:28:34] psidR: currently working on data for year 2003
INFO [2018-10-10 11:28:36]
INFO [2018-10-10 11:28:36] psidR: currently working on data for year 2005
INFO [2018-10-10 11:28:37]
INFO [2018-10-10 11:28:37] psidR: currently working on data for year 2007
INFO [2018-10-10 11:28:39] balanced design reduces sample from 97377 to 89571
INFO [2018-10-10 11:28:39] End of build.panel
> x
age interview ID1968 pernum sequence relation.head pid year
1: 92 1 848 2 1 10 848002 2003
2: 64 2 1173 1 1 10 1173001 2003
3: 48 3 1866 32 2 30 1866032 2003
4: 48 3 1866 171 1 10 1866171 2003
5: 48 3 1866 175 0 0 1866175 2003
---
89567: 49 8332 6069 4 2 20 6069004 2007
89568: 49 8332 6069 30 0 0 6069030 2007
89569: 49 8332 6069 171 3 33 6069171 2007
89570: 49 8332 6069 173 1 10 6069173 2007
89571: 49 8332 6069 174 0 0 6069174 2007
# etc for
medium.test.ind.NA()
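The example output above suggests two things that are easy to check by hand: the printed `pid` values are consistent with `ID1968 * 1000 + pernum`, and a balanced design keeps only persons observed in every wave. The sketch below illustrates both on toy rows. It is an illustration of the idea and not necessarily how `build.panel` implements it; the object names `n_waves` and `balanced` are invented here.

```r
library(data.table)

# Toy rows; ID1968 and pernum values are taken from the printed output.
x <- data.table(
  ID1968 = c(848, 848, 1173, 1866),
  pernum = c(2, 2, 1, 32),
  year   = c(2003, 2005, 2003, 2003)
)

# Person identifier, matching the pid column shown above
# (e.g. ID1968 = 848, pernum = 2 gives 848002).
x[, pid := ID1968 * 1000 + pernum]

# Keep only persons who appear in every wave present in the table.
n_waves  <- uniqueN(x$year)
balanced <- x[, if (.N == n_waves) .SD, by = pid]
print(balanced)
```

In this toy table only the person with `ID1968 = 848` is observed in both waves, so the balanced subset drops the other two persons.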
The package is on CRAN, so just type
install.packages('psidR')
Alternatively, to get the up-to-date version from this repository,
install.packages('devtools')
devtools::install_github("floswald/psidR")
The main function in the package is `build.panel`, and it has a reproducible example which you can look at by typing
require(psidR)
example(build.panel)
The PSID has a wealth of add-on datasets. Once you have a panel, those are easy to merge on. The panel will have a variable `interview`, which is the identifier in the supplemental dataset.
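As a rough sketch of such a merge, assuming `data.table` is installed: the supplement table, its `extra` column and all values below are invented for illustration. Joining on `year` in addition to `interview` is an assumption made here because interview numbers are assigned per wave; check the documentation of the supplement you are using.

```r
library(data.table)

# Toy panel carrying the identifier columns that build.panel returns.
panel <- data.table(
  pid       = c(848002, 1173001, 848002),
  year      = c(2003, 2003, 2005),
  interview = c(1, 2, 7),
  age       = c(92, 64, 94)
)

# Hypothetical supplement keyed by interview number and wave.
supp <- data.table(
  interview = c(1, 2, 7),
  year      = c(2003, 2003, 2005),
  extra     = c(10.5, 3.2, 8.0)  # made-up values
)

# Left join: keep every panel row, attach supplement columns where present.
merged <- merge(panel, supp, by = c("interview", "year"), all.x = TRUE)
print(merged)
```

Using `all.x = TRUE` keeps panel rows that have no match in the supplement, with `NA` in the added columns.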
If you use `psidR` in your work, please consider citing it. You could just do
> citation(package="psidR")
To cite package ‘psidR’ in publications use:
Florian Oswald (2020). psidR: Build Panel Data Sets from PSID Raw
Data. R package version 2.0. https://github.com/floswald/psidR
A BibTeX entry for LaTeX users is
@Manual{,
title = {psidR: Build Panel Data Sets from PSID Raw Data},
author = {Florian Oswald},
year = {2020},
note = {R package version 2.0},
url = {https://github.com/floswald/psidR},
}
Thanks!