Stargazer Confidence Interval Incorrect?

2.2k views Asked by At

So I am really fond of the stargazer package for displaying the statistics for regression models. I've been using R and Stata together to complete some problems in a textbook. One issue that I have found is that the confidence interval printed by the stargazer package does not correspond to the confidence interval by stata. I determined that the CI in stata is the correct one after doing it by hand.

Because the issue might may possibly lie in how I am handling the data, I offer it here as an optional choice. My primary concern is to determine why the CI's do not respond. From a previous post, here is one possible way of finding the data I am using;

install.packages("devtools")  # if not already installed
library(devtools)
install_git("https://github.com/ccolonescu/PoEdata")

library(PoEdata)   # loads the package in memory
library(multcomp)  # for hypo testing
data(fair4)         # loads the data set of interest

In Stata, the name of the dataset I am using is called fair4.dta. For the data itself, you can use it manually,

structure(list(year = structure(c(1880, 1884, 1888, 1892, 1896, 
1900, 1904, 1908, 1912, 1916, 1920, 1924, 1928, 1932, 1936, 1940, 
1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 
1988, 1992, 1996, 2000, 2004, 2008), label = "year", format.stata = "%9.0g"), 
    vote = structure(c(50.2200012207031, 49.8460006713867, 50.4140014648438, 
    48.2680015563965, 47.7599983215332, 53.1710014343262, 60.0060005187988, 
    54.4830017089844, 54.7080001831055, 51.681999206543, 36.1189994812012, 
    58.2439994812012, 58.8199996948242, 40.8409996032715, 62.4580001831055, 
    54.9990005493164, 53.773998260498, 52.3699989318848, 44.5950012207031, 
    57.7639999389648, 49.9129981994629, 61.3440017700195, 49.5960006713867, 
    61.7890014648438, 48.9480018615723, 44.6969985961914, 59.1699981689453, 
    53.9020004272461, 46.5449981689453, 54.7360000610352, 50.2649993896484, 
    51.2330017089844, 46.5999984741211), label = "Incumbent share of the two-party presidential vote", format.stata = "%9.0g"), 
    party = structure(c(-1, -1, 1, -1, 1, -1, -1, -1, -1, 1, 
    1, -1, -1, -1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 
    -1, -1, 1, 1, -1, -1), label = "= 1 if Democratic incumbent at election time; -1 if a Republican incumbent", format.stata = "%9.0g"), 
    person = structure(c(0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 
    0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 
    1, 0), label = "= 1 if incumbent is running for election and 0 otherwise", format.stata = "%9.0g"), 
    duration = structure(c(1.75, 2, 0, 0, 0, 0, 1, 1.25, 1.5, 
    0, 1, 0, 1, 1.25, 0, 1, 1.25, 1.5, 1.75, 0, 1, 0, 1, 0, 1, 
    0, 0, 1, 1.25, 0, 1, 0, 1), label = "number of terms incumbent administration in power", format.stata = "%9.0g"), 
    war = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
    0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0), label = "= 1 for elections of 1920, 1944, and 1948 and 0 otherwise.", format.stata = "%9.0g"), 
    growth = structure(c(3.87899994850159, 1.58899998664856, 
    -5.55299997329712, 2.76300001144409, -10.0240001678467, -1.42499995231628, 
    -2.4210000038147, -6.2810001373291, 4.16400003433228, 2.22900009155273, 
    -11.4630002975464, -3.87199997901917, 4.6230001449585, -14.4989995956421, 
    11.7650003433228, 3.90199995040894, 4.27899980545044, 3.5789999961853, 
    0.690999984741211, -1.45099997520447, 0.377000004053116, 
    5.10900020599365, 5.04300022125244, 5.91400003433228, 3.75099992752075, 
    -3.59699988365173, 5.44000005722046, 2.17799997329712, 2.66199994087219, 
    3.12100005149841, 1.21899998188019, 2.69000005722046, 0.219999998807907
    ), label = "growth rate GDP in first three quarters of the election year", format.stata = "%9.0g"), 
    inflation = structure(c(1.97399997711182, 1.05499994754791, 
    0.603999972343445, 2.2739999294281, 3.41000008583069, 2.54800009727478, 
    1.44200003147125, 1.87899994850159, 2.17199993133545, 4.2519998550415, 
    0, 5.16099977493286, 0.18299999833107, 7.19999980926514, 
    2.49699997901917, 0.0810000002384186, 0, 0, 2.36199998855591, 
    1.93499994277954, 1.96700000762939, 1.25999999046326, 3.13899993896484, 
    4.81500005722046, 7.63000011444092, 7.83099985122681, 5.25899982452393, 
    2.90599989891052, 3.27999997138977, 2.06200003623962, 1.60500001907349, 
    2.32500004768372, 2.88000011444092), label = "growth rate of GDP deflator during first 15 quarters of admin", format.stata = "%9.0g"), 
    goodnews = structure(c(9, 2, 3, 7, 6, 7, 5, 8, 8, 3, 0, 10, 
    7, 4, 9, 8, 0, 0, 7, 5, 5, 10, 7, 4, 5, 5, 8, 4, 2, 4, 8, 
    1, 3), label = "number of quarters in first 15 with real GDP per capita growth > 3.2", format.stata = "%9.0g")), notes = c("more complete variable definitions in fair.def", 
"1"), .Names = c("year", "vote", "party", "person", "duration", 
"war", "growth", "inflation", "goodnews"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -33L))

So here is the stargazer code that is giving me trouble:

presidential <- read_dta("~/Directory/fair4.dta")
pres.lm = lm(vote ~ growth, data = subset(presidential, 
                                        presidential$year >= 1916)   
stargazer(pres.lm,
            type = "text",
            intercept.bottom = T,
            digits = 5,
            report = "vc*stp",
            ci = T
          )
confint(pres.lm, level = 0.95)

Consider the difference in the confidence intervals.

(0.52948, 1.24241)     # in R, Stargazer
0.5087671  1.263126    # in R, confint(pres.lm)
.5087671    1.263126   # in Stata

I also calculated by hand for the confidence intervals and the confit() and the Stata numbers check out. The t-critical value for this dataset should be t_(N-2 , prob) = t(22,.0025) = -2.073873.

In addition, I made sure to create an entirely new data frame. That is, instead of subsetting within the the lm() argument, I subset it first. When comparing this method to the previous one, I still get the same exact (incorrect) confidence intervals.

# subset into a new dataframe
presidential.1 = subset(presidential, presidential$year >= 1916)

# create the model
pres.lm.2 = lm(vote ~ growth, data = presidential.1)

# compare the two
stargazer(pres.lm,pres.lm.2,
          type = "text",
          intercept.bottom = F,
          digits = 5,
          report = "vc*stp",
          ci = T,
          t.auto = T)

                                         (1)                  (2)         
-----------------------------------------------------------------------
Constant                          50.84840***          50.84840***     
                              (48.86384, 52.83295) (48.86384, 52.83295)
                                  t = 50.21835         t = 50.21835    
                                  p = 0.00000          p = 0.00000     

growth                             0.88595***           0.88595***     
                               (0.52948, 1.24241)   (0.52948, 1.24241) 
                                  t = 4.87126          t = 4.87126     
                                  p = 0.00008          p = 0.00008   

# correct intervals from Stata and R's confint()
growth       0.5087671  1.263126

Am I running the code incorrectly? It really isn't a big deal for me to run the stargazer command and print only the coefficients and the t-stats, but it is kind of disappointing that I would have to run confint() as a separate command given that the output for Stargazer is gorgeous. It is quite odd because the coefficient estimates and the t-statistics are perfect. The confidence intervals are off by varying degrees, and I would like to know what the cause of this might be. Any advice would be greatly appreciated.

1

There are 1 answers

1
paqmo On BEST ANSWER

The simple answer is that stata and confint calculate confidence intervals using the t-distribution, while stargazer's internal method uses the normal distribution. The result is that the former two are more conservative in their estimates and thus have wider CI compared to stargazer. (Well, I'm assuming with stata here, but since it gives the same results as confint I feel it is a safe assumption).

Looking deep into the source code for stargazer (line 688ff) we can find how CIs are calculated:

z.value <- qnorm((1 + .format.ci.level.use)/2)
coef <- .global.coefficients[.global.coefficient.variables[which.variable],i]
se <- .global.std.errors[.global.coefficient.variables[which.variable],i]
ci.lower.bound <- coef - z.value * se
ci.upper.bound <- coef + z.value * se

It uses qnorm to set the critical value.

Compare to confint:

a <- (1 - level)/2
a <- c(a, 1 - a)
fac <- qt(a, object$df.residual) ##Relevant line, uses T-distribution
pct <- format.perc(a, 3)
ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm, 
    pct))
ses <- sqrt(diag(vcov(object)))[parm]
ci[] <- cf[parm] + ses %o% fac

Compare:

#Using normal/z distribution
> pres.lm$coefficients[2] + sqrt(diag(vcov(pres.lm)))[2] %o% c(-qnorm((1 + 0.95)/2), qnorm((1 + 0.95)/2))
            [,1]     [,2]
growth 0.5294839 1.242409

#Using t-distribution with df degrees of freedom
> df <- pres.lm$df.residual
> pres.lm$coefficients[2] + sqrt(diag(vcov(pres.lm)))[2] %o% c(-qt((1 + 0.95)/2, df), qt((1 + 0.95)/2, df))
            [,1]     [,2]
growth 0.5087671 1.263126

Probably the easiest way to handle this if you are committed to stargazer is to use the ci.custom argument:

> stargazer(pres.lm, type = "text", ci.custom = list(confint(pres.lm)))

===============================================
                        Dependent variable:    
                    ---------------------------
                               vote            
-----------------------------------------------
growth                       0.886***          
                          (0.509, 1.263)       

Constant                     50.848***         
                         (48.749, 52.948)      

-----------------------------------------------
Observations                    24             
R2                             0.519           
Adjusted R2                    0.497           
Residual Std. Error       4.798 (df = 22)      
F Statistic           23.729*** (df = 1; 22)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Once the sample size is sufficiently large, the t-distribution converges on the z-distribution and the differences between the CIs become much smaller.

set.seed(432)

x1 <- rnorm(10000, 100, 50)
u <- 2 * rnorm(10000)
y <- 50 + x1 * 0.752 * u

fit <- lm(y ~ x1)

> confint(fit)
                  2.5 %     97.5 %
(Intercept) 39.29108955 54.1821315
x1          -0.02782141  0.1061173
> stargazer(fit, type= "text", ci = T)

===============================================
                        Dependent variable:    
                    ---------------------------
                                 y             
-----------------------------------------------
x1                             0.039           
                          (-0.028, 0.106)      

Constant                     46.737***         
                         (39.292, 54.181)      

-----------------------------------------------
Observations                  10,000           
R2                            0.0001           
Adjusted R2                   0.00003          
Residual Std. Error     168.194 (df = 9998)    
F Statistic            1.313 (df = 1; 9998)    
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

With a sample size of 24, the t-distribution with 22 degrees of freedom has much fatter tails than the z!