The Joy of Calculating the Direct Standard Error for PUMS Estimates

Public Use Microdata Sample (PUMS) data is exciting! But in your excitement over PUMS don’t forget that you are handling a survey and it is important know what margins of error you’re working with.

The PUMS Technical Documentation provides instructions on how to calculate standard error, but some may find it to be a little overwhelming. Here is how you calculate the direct standard error for your PUMS estimtes:

Standard error formula using replicate weights

Don’t let this overwhelm you! SEs can’t just be ignored (even if you cross your fingers and hope for your sample size to be ‘good enough’).

Luckily there are survey-specific R packages that can help you navigate error calculations. This post provides an example using the srvyr package. There are other packages that will do the same job–and of course you can always write your own function. But the srvyr package is easy to use, incorporates dplyr syntax and piping and gives you a nice, clean result.

The Census provides examples of estimates, standard error and margin of error calculations, just to let you know that you’re understanding the data and you’re on the right track. You can download the PUMS Estimates for User Verification files here.

I have seen several R-specific tutorials out there for using PUMS data, but I have not seen too many that work through direct standard error calculations, and even fewer that work specifically with the srvyr package and PUMS data. I hope that by providing this example I can encourage R users everywhere to address error in their use of PUMS (and remember that data is fallible!).

Load & Prepare Data

Read in the PUMS Population record (downloaded from FactFinder):

ppl <- fread(paste0(rootDat, "csv_pca/psam_p06.csv"), data.table = FALSE, stringsAsFactors = FALSE, showProgress = FALSE, verbose = FALSE)

Code Age Groups

Create age groups that match PUMS Estimates for User Verification

ppl <- 
ppl %>%
  mutate(age_groups = 
             cut(AGEP, breaks = c(-1, 4, 9, 14, 19, 24, 34, 44, 54, 59, 64, 74, 84, 100),
                 labels = c("Age 0-4", "Age 5-9", "Age 10-14",
                  "Age 15-19", "Age 20-24", "Age 25-34",
                  "Age 35-44", "Age 45-54", "Age 55-59",
                  "Age 60-64", "Age 65-74", "Age 75-84",
                  "Age 85+")))

Get Population Estimates

The goal is to estimate the population in each age group (defined above) and calculate the direct standard error and margin of error for the estimates.

The srvyr package makes it easy to estimate the standard error using replicate weights.

Set up the survey design

Define the survey parameters and calculate standard errors for your custom estimates

First, define your replicate weights

# create list of replicate weights
repwlist <- rep(paste0("PWGTP",1:80))

repwlist

##  [1] "PWGTP1"  "PWGTP2"  "PWGTP3"  "PWGTP4"  "PWGTP5"  "PWGTP6"  "PWGTP7" 
##  [8] "PWGTP8"  "PWGTP9"  "PWGTP10" "PWGTP11" "PWGTP12" "PWGTP13" "PWGTP14"
## [15] "PWGTP15" "PWGTP16" "PWGTP17" "PWGTP18" "PWGTP19" "PWGTP20" "PWGTP21"
## [22] "PWGTP22" "PWGTP23" "PWGTP24" "PWGTP25" "PWGTP26" "PWGTP27" "PWGTP28"
## [29] "PWGTP29" "PWGTP30" "PWGTP31" "PWGTP32" "PWGTP33" "PWGTP34" "PWGTP35"
## [36] "PWGTP36" "PWGTP37" "PWGTP38" "PWGTP39" "PWGTP40" "PWGTP41" "PWGTP42"
## [43] "PWGTP43" "PWGTP44" "PWGTP45" "PWGTP46" "PWGTP47" "PWGTP48" "PWGTP49"
## [50] "PWGTP50" "PWGTP51" "PWGTP52" "PWGTP53" "PWGTP54" "PWGTP55" "PWGTP56"
## [57] "PWGTP57" "PWGTP58" "PWGTP59" "PWGTP60" "PWGTP61" "PWGTP62" "PWGTP63"
## [64] "PWGTP64" "PWGTP65" "PWGTP66" "PWGTP67" "PWGTP68" "PWGTP69" "PWGTP70"
## [71] "PWGTP71" "PWGTP72" "PWGTP73" "PWGTP74" "PWGTP75" "PWGTP76" "PWGTP77"
## [78] "PWGTP78" "PWGTP79" "PWGTP80"

Define the survey design

Use the function srvyr::as_survey_rep

# create survey design
pumsd <- ppl %>%
  as_survey_rep(
    variables = c(age_groups), # select grouping variables
    weights = PWGTP,            # person weight
    repweights = repwlist,      # list of replicate weights
    combined_weights = TRUE,    # tells the function that replicate weights are included in the data
    mse =TRUE,                  # tells the function to calc mse
    type="other",               # statistical method
    scale=4/80,                 # scaling set by ACS
    rscale=rep(1,80))               

summary(pumsd)

## Call: Called via srvyr
## with 80 replicates and MSE variances.
## Sampling variables:
##  - repweights: `PWGTP1 + PWGTP2 + PWGTP3 + PWGTP4 + PWGTP5 + PWGTP6 + PWGTP7 + PWGTP8 + PWGTP9 + PWGTP10 + PWGTP11 + PWGTP12 + PWGTP13 + PWGTP14 + PWGTP15 + PWGTP16 + PWGTP17 + PWGTP18 + PWGTP19 + PWGTP20 + PWGTP21 + PWGTP22 + PWGTP23 + PWGTP24 + PWGTP25 + PWGTP26 + PWGTP27 + PWGTP28 + PWGTP29 + PWGTP30 + PWGTP31 + PWGTP32 + PWGTP33 + PWGTP34 + PWGTP35 + PWGTP36 + PWGTP37 + PWGTP38 + PWGTP39 + PWGTP40 + PWGTP41 + PWGTP42 + PWGTP43 + PWGTP44 + PWGTP45 + PWGTP46 + PWGTP47 + PWGTP48 + PWGTP49 + PWGTP50 + PWGTP51 + \n    PWGTP52 + PWGTP53 + PWGTP54 + PWGTP55 + PWGTP56 + PWGTP57 + PWGTP58 + PWGTP59 + PWGTP60 + PWGTP61 + PWGTP62 + PWGTP63 + PWGTP64 + PWGTP65 + PWGTP66 + PWGTP67 + PWGTP68 + PWGTP69 + PWGTP70 + PWGTP71 + PWGTP72 + PWGTP73 + PWGTP74 + PWGTP75 + PWGTP76 + PWGTP77 + PWGTP78 + PWGTP79 + PWGTP80`
##  - weights: PWGTP
## Data variables: age_groups (fct)
## Variables: 
## [1] "age_groups"

Summarize

Specify that you want the standard error and coefficient of variation (the ratio of the standard error to the estimated value) for the population counts.
The survey design pumsd has already defined that PWGTP is the person weights that will be summed.
Finalize the table by multiplying the standard error by 1.645 to get the margin of error.

# calculate child pop count by race and nativity
pop <-
  pumsd %>%
  group_by(age_groups) %>%    # group by age group
  summarise(                  # get the (survey weighted) count of people in each age group
    count = survey_total(vartype=c("se", "cv"))) %>% # ask for the standard error and coefficient of variation reported back.
  mutate(count_moe = count_se * 1.645)          # multiply SE by 1.645 to get MOE at 90% conf level (Census standard)

View the Results

library(reactable)

reactable(pop, minRows = 15, height = 270, pagination = FALSE,
          highlight = TRUE, compact = TRUE,
          columns = list(
            age_groups = colDef(name = "Age Group"),
            count = colDef(name = "Population Count", 
                           format= colFormat(separators = TRUE, digits = 0)),
            count_se = colDef(name="Population Count SE", 
                              format = colFormat(separators = TRUE, digits = 0)),
            count_cv = colDef(name = "Population Count CV", 
                              format = colFormat(percent = TRUE, digits = 1)),
            count_moe = colDef(name="Population Count MOE", 
                               format = colFormat(separators = TRUE, digits = 0))))

Final Steps:

Compare results to the PUMS Estimates for User Verification
These population estimates, standard errors and margins of error match what the Census published. So, great!
The coefficient of variation can help you determine the fitness of use of your estimate. Note in this case that all CVs are well within the acceptable range (varies by project/data/geography, but around 30-35% generally considered acceptable).