Public Use Microdata Sample (PUMS) data is exciting! But in your excitement over PUMS don’t forget that you are handling a survey and it is important know what margins of error you’re working with.
The PUMS Technical Documentation provides instructions on how to calculate standard error, but some may find it to be a little overwhelming. Here is how you calculate the direct standard error for your PUMS estimtes:
Don’t let this overwhelm you! SEs can’t just be ignored (even if you cross your fingers and hope for your sample size to be ‘good enough’).
Luckily there are survey-specific R packages that can help you navigate error calculations. This post provides an example using the srvyr
package. There are other packages that will do the same job–and of course you can always write your own function. But the srvyr package is easy to use, incorporates dplyr
syntax and piping and gives you a nice, clean result.
The Census provides examples of estimates, standard error and margin of error calculations, just to let you know that you’re understanding the data and you’re on the right track. You can download the PUMS Estimates for User Verification files here.
I have seen several R-specific tutorials out there for using PUMS data, but I have not seen too many that work through direct standard error calculations, and even fewer that work specifically with the srvyr package and PUMS data. I hope that by providing this example I can encourage R users everywhere to address error in their use of PUMS (and remember that data is fallible!).
Read in the PUMS Population record (downloaded from FactFinder):
ppl <- fread(paste0(rootDat, "csv_pca/psam_p06.csv"), data.table = FALSE, stringsAsFactors = FALSE, showProgress = FALSE, verbose = FALSE)
Create age groups that match PUMS Estimates for User Verification
ppl <-
ppl %>%
mutate(age_groups =
cut(AGEP, breaks = c(-1, 4, 9, 14, 19, 24, 34, 44, 54, 59, 64, 74, 84, 100),
labels = c("Age 0-4", "Age 5-9", "Age 10-14",
"Age 15-19", "Age 20-24", "Age 25-34",
"Age 35-44", "Age 45-54", "Age 55-59",
"Age 60-64", "Age 65-74", "Age 75-84",
"Age 85+")))
The goal is to estimate the population in each age group (defined above) and calculate the direct standard error and margin of error for the estimates.
The srvyr
package makes it easy to estimate the standard error using replicate weights.
Define the survey parameters and calculate standard errors for your custom estimates
# create list of replicate weights
repwlist <- rep(paste0("PWGTP",1:80))
repwlist
## [1] "PWGTP1" "PWGTP2" "PWGTP3" "PWGTP4" "PWGTP5" "PWGTP6" "PWGTP7"
## [8] "PWGTP8" "PWGTP9" "PWGTP10" "PWGTP11" "PWGTP12" "PWGTP13" "PWGTP14"
## [15] "PWGTP15" "PWGTP16" "PWGTP17" "PWGTP18" "PWGTP19" "PWGTP20" "PWGTP21"
## [22] "PWGTP22" "PWGTP23" "PWGTP24" "PWGTP25" "PWGTP26" "PWGTP27" "PWGTP28"
## [29] "PWGTP29" "PWGTP30" "PWGTP31" "PWGTP32" "PWGTP33" "PWGTP34" "PWGTP35"
## [36] "PWGTP36" "PWGTP37" "PWGTP38" "PWGTP39" "PWGTP40" "PWGTP41" "PWGTP42"
## [43] "PWGTP43" "PWGTP44" "PWGTP45" "PWGTP46" "PWGTP47" "PWGTP48" "PWGTP49"
## [50] "PWGTP50" "PWGTP51" "PWGTP52" "PWGTP53" "PWGTP54" "PWGTP55" "PWGTP56"
## [57] "PWGTP57" "PWGTP58" "PWGTP59" "PWGTP60" "PWGTP61" "PWGTP62" "PWGTP63"
## [64] "PWGTP64" "PWGTP65" "PWGTP66" "PWGTP67" "PWGTP68" "PWGTP69" "PWGTP70"
## [71] "PWGTP71" "PWGTP72" "PWGTP73" "PWGTP74" "PWGTP75" "PWGTP76" "PWGTP77"
## [78] "PWGTP78" "PWGTP79" "PWGTP80"
Use the function srvyr::as_survey_rep
# create survey design
pumsd <- ppl %>%
as_survey_rep(
variables = c(age_groups), # select grouping variables
weights = PWGTP, # person weight
repweights = repwlist, # list of replicate weights
combined_weights = TRUE, # tells the function that replicate weights are included in the data
mse =TRUE, # tells the function to calc mse
type="other", # statistical method
scale=4/80, # scaling set by ACS
rscale=rep(1,80))
summary(pumsd)
## Call: Called via srvyr
## with 80 replicates and MSE variances.
## Sampling variables:
## - repweights: `PWGTP1 + PWGTP2 + PWGTP3 + PWGTP4 + PWGTP5 + PWGTP6 + PWGTP7 + PWGTP8 + PWGTP9 + PWGTP10 + PWGTP11 + PWGTP12 + PWGTP13 + PWGTP14 + PWGTP15 + PWGTP16 + PWGTP17 + PWGTP18 + PWGTP19 + PWGTP20 + PWGTP21 + PWGTP22 + PWGTP23 + PWGTP24 + PWGTP25 + PWGTP26 + PWGTP27 + PWGTP28 + PWGTP29 + PWGTP30 + PWGTP31 + PWGTP32 + PWGTP33 + PWGTP34 + PWGTP35 + PWGTP36 + PWGTP37 + PWGTP38 + PWGTP39 + PWGTP40 + PWGTP41 + PWGTP42 + PWGTP43 + PWGTP44 + PWGTP45 + PWGTP46 + PWGTP47 + PWGTP48 + PWGTP49 + PWGTP50 + PWGTP51 + \n PWGTP52 + PWGTP53 + PWGTP54 + PWGTP55 + PWGTP56 + PWGTP57 + PWGTP58 + PWGTP59 + PWGTP60 + PWGTP61 + PWGTP62 + PWGTP63 + PWGTP64 + PWGTP65 + PWGTP66 + PWGTP67 + PWGTP68 + PWGTP69 + PWGTP70 + PWGTP71 + PWGTP72 + PWGTP73 + PWGTP74 + PWGTP75 + PWGTP76 + PWGTP77 + PWGTP78 + PWGTP79 + PWGTP80`
## - weights: PWGTP
## Data variables: age_groups (fct)
## Variables:
## [1] "age_groups"
Specify that you want the standard error and coefficient of variation (the ratio of the standard error to the estimated value) for the population counts.
The survey design pumsd
has already defined that PWGTP is the person weights that will be summed.
Finalize the table by multiplying the standard error by 1.645 to get the margin of error.
# calculate child pop count by race and nativity
pop <-
pumsd %>%
group_by(age_groups) %>% # group by age group
summarise( # get the (survey weighted) count of people in each age group
count = survey_total(vartype=c("se", "cv"))) %>% # ask for the standard error and coefficient of variation reported back.
mutate(count_moe = count_se * 1.645) # multiply SE by 1.645 to get MOE at 90% conf level (Census standard)
library(reactable)
reactable(pop, minRows = 15, height = 270, pagination = FALSE,
highlight = TRUE, compact = TRUE,
columns = list(
age_groups = colDef(name = "Age Group"),
count = colDef(name = "Population Count",
format= colFormat(separators = TRUE, digits = 0)),
count_se = colDef(name="Population Count SE",
format = colFormat(separators = TRUE, digits = 0)),
count_cv = colDef(name = "Population Count CV",
format = colFormat(percent = TRUE, digits = 1)),
count_moe = colDef(name="Population Count MOE",
format = colFormat(separators = TRUE, digits = 0))))
Compare results to the PUMS Estimates for User Verification
These population estimates, standard errors and margins of error match what the Census published. So, great!
The coefficient of variation can help you determine the fitness of use of your estimate. Note in this case that all CVs are well within the acceptable range (varies by project/data/geography, but around 30-35% generally considered acceptable).