17 April 2013

Reinhart & Rogoff: Everyone makes coding mistakes, we need to make it easy to find them + Graphing uncertainty

You may have already seen a lot written on the replication of Reinhart & Rogoff’s (R & R) much cited 2010 paper done by Herndon, Ash, and Pollin. If you haven’t, here is a round up of some of what has been written: Konczal, Yglesias, Krugman, Cowen, Peng, FT Alphaville.

This is an interesting issue for me because it involves three topics I really like: political economy, reproducibility, and communicating uncertainty. Others have already commented on these topics in detail. I just wanted to add to this discussion by (a) talking about how this event highlights a real need for researchers to use systems that make finding and correcting mistakes easy, (b) incentivising mistake finding/correction rather than penalising it, and (c) showing uncertainty.

Systems for Finding and Correcting Mistakes

One of the problems Herndon, Ash, and Pollin found in R & R’s analysis was an Excel coding error. I love to hate on Excel as much as the next R devotee, but I think that is missing the point. The real lesson is not “don’t use Excel”. The real lesson is: we all make mistakes.

(Important point: I refer throughout this post to errors caused by coding mistakes rather than purposeful fabrications and falsifications.)

Coding mistakes are an ever present part of our work. The problem is not that we make coding mistakes. Despite our best efforts we always will. The problem is that we often use tools and practices that make it difficult to find and correct our mistakes.

This is where I can get in some Excel hating: tools and practices that make it difficult to find mistakes include binary files (like Excel’s) that can’t be version controlled in a way that fully reveals the research process, not commenting code, not making your data readily available in formats that make replication easy, and not having a system for quickly fixing mistakes when they are found. Sorry R users, but the last three are definitely not exclusive to Excel.

It took Herndon, Ash, and Pollin a considerable amount of time to replicate R & R’s findings and therefore find the Excel error. This seems partially because R & R did not make their analysis files readily available (Herndon, Ash, and Pollin had to ask for them). I’m not sure how this error is going to be corrected and documented. But I imagine it will be like most research corrections: kind of on the fly, mostly emailing and reposting.

How big of a deal is this? There is some debate over how big of a problem this mistake is. Roger Peng ends his really nice post:

The vibe on the Internets seems to be that if only this problem had been identified sooner, the world would be a better place. But my cynical mind says, uh, no. You can toss this incident in the very large bucket of papers with some technical errors that are easily fixed. Thankfully, someone found these errors and fixed them, and that’s a good thing. Science moves on.

I agree with most of this paragraph. But, given how important R & R’s finding was to major policy debates, it would have been much better if the mistake had been caught sooner rather than later. The tools and practices R & R used made it harder to find and correct the mistake, so policymakers were operating with less accurate information for longer.

Solutions: I’ve written in some detail in the most recent issue of The Political Methodologist about how cloud-based version control systems like GitHub can be used to make finding and correcting mistakes easier. Pull requests, for example, are a really nice way to directly suggest corrections.

Incentivising Error Finding and Correction

Going forward I think it will be interesting to see how this incident shapes researchers’ perceived incentives to make their work easily replicable. Replication is an important part of finding the mistakes that everyone makes. If being found to make a coding mistake (not a fabrication) has a negative impact on your academic career then there are incentives to make finding mistakes difficult by, for example, making replication difficult. Most papers do not receive nearly as much attention as R & R’s. So, for most researchers, making replication difficult will make it pretty unlikely that anyone will replicate your research and you’ll be home free.

This is a perverse incentive indeed.

What can we do? Many journals now require replicable code to accompany published articles. This is a good incentive. Maybe we should go further, and somehow directly incentivise the finding and correction of errors in data sets and analysis code. Ideas could include giving more weight to replication studies at hiring and promotion committees. Maybe even allowing these committees to include information on researchers’ GitHub pull requests that meaningfully improve others’ work by correcting mistakes.

This of course might create future perverse incentives to add errors so that they can then be found. I think this is a bit fanciful. There are surely enough negative social incentives (i.e. embarrassment) surrounding making mistakes to prevent this.

Showing Uncertainty

Roger Peng’s post highlighted the issue of graphing uncertainty, but I just wanted to build it out a little further. The interpretation of the correlation R & R found between GDP growth and government debt could have been tempered significantly, before any mistakes were found, by more directly communicating their original uncertainty. In their original paper, they presented the relationship using bar graphs of average and median GDP growth per grouped debt/GDP level:

Beyond showing the mean and median there is basically no indication of the distribution of the data they are from.

Herndon, Ash, and Pollin put together some nice graphs of these distributions (and avoid that thing economists do of using two vertical axes with two different meanings).

Here is one that gets rid of the groups altogether:

If R & R had shown a simple scatter plot like this (though they did exclude some of the higher GDP growth country-years at the high debt end, so theirs would have looked different), it would have been much more difficult to over-interpret the substantive policy value of a correlation between GDP growth and debt/GDP.
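To make the contrast concrete, here is a minimal sketch using simulated data (not the data used by R & R or by Herndon, Ash, and Pollin). The variable names, the debt/GDP cut-offs, and the made-up relationship between debt and growth are all assumptions for illustration. It puts group means and medians next to the spread they hide, then draws the ungrouped scatter plot.

```r
# Simulated country-years: all values are fake and for illustration only
set.seed(2013)
N <- 200
Debt <- runif(N, min = 0, max = 150)            # debt/GDP (%)
Growth <- 4 - 0.01 * Debt + rnorm(N, sd = 3)    # GDP growth (%)

# Group debt/GDP into bins (the cut-offs are assumptions)
DebtGroup <- cut(Debt, breaks = c(0, 30, 60, 90, Inf),
                 labels = c("0-30", "30-60", "60-90", "90+"),
                 include.lowest = TRUE)

# Mean and median per group, plus the middle 95% of growth values
GroupSummary <- t(sapply(split(Growth, DebtGroup), function(x) {
    c(Mean = mean(x), Median = median(x), quantile(x, c(0.025, 0.975)))
}))
print(round(GroupSummary, 2))

# Scatter plot without groups, with group means overlaid at bin midpoints
plot(Debt, Growth, xlab = "Debt/GDP (%)", ylab = "GDP growth (%)")
points(c(15, 45, 75, 120), GroupSummary[, "Mean"], pch = 19, col = "red")
```

A bar graph would show only the Mean and Median columns. The two quantile columns and the scatter plot show how widely growth varies within each debt group.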

Maybe this wouldn’t have actually changed the policy debate that much. As Mark Blyth argues in his recent book on austerity, “facts never disconfirm a good ideology” (p. 18). But at least Paul Krugman might not have had to debate debt/GDP cutoff points on CNBC (for example time point 12:40):


P.S. To R & R’s credit, they do often make their data available. Their data has been useful for at least one of my papers. However, it is often available in a format that is hard to use for cross-country statistical analysis, including, I would imagine, their own. Though I have never found any errors in the data, reporting and implementing corrections to this data would be piecemeal at best.

11 April 2013

Dropbox & R Data

I'm always looking for ways to download data from the internet into R. Though I prefer to host and access plain-text data sets (CSV is my personal favourite) from GitHub (see my short paper on the topic) sometimes it's convenient to get data stored on Dropbox.

There has been a change in the way Dropbox URLs work and I just added some functionality to the repmis R package. So I thought that I'd write a quick post on how to directly download data from Dropbox into R.

The download method differs depending on whether or not your plain-text data is in a Dropbox Public folder.

Dropbox Public Folder

Dropbox is trying to do away with its public folders. New users need to actively create a Public folder. Regardless, sometimes you may want to download data from one. It used to be that files in Public folders were accessible through non-secure (http) URLs. It's easy to download these into R: just use the read.table command with the URL as the file name. Dropbox recently changed Public links to be secure (https) URLs. These cannot be accessed with read.table.

Instead you can use the source_data command from repmis:

FinURL <- "https://dl.dropbox.com/u/12581470/code/Replicability_code/Fin_Trans_Replication_Journal/Data/public.fin.msm.model.csv"

# Download data
FinRegulatorData <- repmis::source_data(FinURL,
                             sep = ",",
                             header = TRUE)

Non-Public Dropbox Folders

Getting data from a non-Public folder into R was trickier. When you click on a Dropbox-based file's Share Link button you are taken to a secure URL, but not for the file itself. The Dropbox webpage you're taken to is filled with lots of other Dropbox information. I used to think that accessing a plain-text data file embedded in one of these webpages would require some tricky web scraping. Luckily, today I ran across this blog post by Kay Cichini.

With some modifications I was able to easily create a function that could download data from non-Public Dropbox folders. The result is the source_DropboxData command, which is in the most recent version of repmis (v0.2.4). All you need to know is the name of the file you want to download and its Dropbox key. You can find both of these things in the URL for the webpage that appears when you click on Share Link. Here is an example:

https://www.dropbox.com/s/exh4iobbm2p5p1v/fin_research_note.csv

The file name is at the very end (fin_research_note.csv) and the key is the string of letters and numbers in the middle (exh4iobbm2p5p1v). Now we have all of the information we need for source_DropboxData:

FinDataFull <- repmis::source_DropboxData("fin_research_note.csv",
                                  "exh4iobbm2p5p1v",
                                  sep = ",",
                                  header = TRUE)
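If you would rather not pick the file name and key out of the Share Link URL by hand, base R's basename and dirname can do it. This is only a small sketch: it assumes the URL has exactly the form shown above (key, then file name), and the object names ShareURL, FileName, and Key are made up for illustration.

```r
# Share Link URL in the form shown above
ShareURL <- "https://www.dropbox.com/s/exh4iobbm2p5p1v/fin_research_note.csv"

# The file name is the last part of the URL
FileName <- basename(ShareURL)

# The key is the part just before the file name
Key <- basename(dirname(ShareURL))

print(c(FileName, Key))

# These could then be passed on (not run here, needs repmis and a connection):
# FinDataFull <- repmis::source_DropboxData(FileName, Key,
#                                           sep = ",", header = TRUE)
```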

15 February 2013

FillIn: a function for filling in missing data in one data frame with info from another

Update (10 March 2013): FillIn is now part of the budding DataCombine package.


Sometimes I want to use R to fill in values that are missing in one data frame with values from another. For example, I have data from the World Bank on government deficits. However, there are some country-years with missing data. I gathered data from Eurostat on deficits and want to use this data to fill in some of the values that are missing from my World Bank data.

Doing this is kind of a pain so I created a function that would do it for me. It's called FillIn.

An Example

Here is an example using some fake data. (This example and part of the function was inspired by a Stack Exchange conversation between JD Long and Josh O'Brien.)

First let's make two data frames: one with missing values in a variable called fNA, and another with a more complete variable called fFull.

# Create data set with missing values
naDF <- data.frame(a = sample(c(1, 2), 100, replace = TRUE),
                   b = sample(c(3, 4), 100, replace = TRUE),
                   fNA = sample(c(100, 200, 300, 400, NA), 100, replace = TRUE))

# Create full data set
fillDF <- data.frame(a = c(1, 2, 1, 2),
                     b = c(3, 3, 4, 4),
                     fFull = c(100, 200, 300, 400))

Now we just enter some information into FillIn about what the data set names are, what variables we want to fill in, and what variables to join the data sets on.

# Fill in missing f's from naDF with values from fillDF
FilledInData <- FillIn(D1 = naDF, D2 = fillDF, 
                       Var1 = "fNA", Var2 = "fFull", KeyVar = c("a", "b"))

## [1] "16 NAs were replaced."
## [1] "The correlation between fNA and fFull is 0.313"

D1 and Var1 are for the data frame and variable you want to fill in. D2 and Var2 are what you want to use to fill them in with. KeyVar specifies what variables you want to use to join the two data frames.

FillIn lets you know how many missing values it is filling in and what the correlation coefficient is between the two variables you are using. Depending on your missing data issues, this could be an indicator of whether or not Var2 is an appropriate substitute for Var1.
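For a sense of what is going on, here is a rough base R sketch of the same idea: join on the key variables, then copy values across wherever the first variable is missing. This is not FillIn's actual code (which uses data.table), and the small data frames are made up so that the result is easy to check by eye.

```r
# Small fake data sets (values are made up)
naDF <- data.frame(a = c(1, 2, 1, 2, 1),
                   b = c(3, 3, 4, 4, 3),
                   fNA = c(100, NA, 300, NA, NA))
fillDF <- data.frame(a = c(1, 2, 1, 2),
                     b = c(3, 3, 4, 4),
                     fFull = c(100, 200, 300, 400))

# Join on the key variables, keeping every row of naDF
Merged <- merge(naDF, fillDF, by = c("a", "b"), all.x = TRUE)

# Replace missing fNA values with the matching fFull values
Missing <- is.na(Merged$fNA)
Merged$fNA[Missing] <- Merged$fFull[Missing]

message(sum(Missing), " NAs were replaced.")
print(Merged)
```

Note that merge sorts the rows by the key variables, so the row order of the result will generally differ from naDF.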

Installation

FillIn is currently available as a GitHub Gist and can be installed with this code:

devtools::source_gist("4959237")

You will need the devtools package to install it. For it to work properly you will also need the data.table package.

The Full Code

3 February 2013

InstallOldPackages: a repmis command for installing old R package versions

A big problem in reproducible research is that software changes. The code you used to do a piece of research may depend on a specific version of software that has since been changed. This is an annoying problem in R because install.packages only installs the most recent version of a package. It can be tedious to collect the old versions.

On Toby Dylan Hocking's suggestion, I added tools to the repmis package so that you can install, load, and cite specific R package versions. It should work for any package version that is stored on the CRAN archive (http://cran.r-project.org).

To only install old package versions use the new repmis command InstallOldPackages. For example:

# Install old versions of the e1071 and gtools packages.

# Create vectors of the package names and versions to install
# Note the names and version numbers must be in the same order
Names <- c("e1071", "gtools")
Vers <- c("1.6", "2.6.1")

# Install old package versions into the default library
InstallOldPackages(pkgs = Names, versions = Vers)
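If you are curious where these old versions live, source tarballs on the CRAN archive usually follow the pattern Archive/package/package_version.tar.gz. The sketch below just builds those URLs from the same two vectors. It is an illustration of the archive layout, not necessarily how InstallOldPackages works internally, and ArchiveBase and ArchiveURLs are names I made up for the example.

```r
# Package names and versions, in the same order
Names <- c("e1071", "gtools")
Vers <- c("1.6", "2.6.1")

# Base URL of the CRAN source archive
ArchiveBase <- "https://cran.r-project.org/src/contrib/Archive"

# Build one tarball URL per package version
ArchiveURLs <- sprintf("%s/%s/%s_%s.tar.gz", ArchiveBase, Names, Names, Vers)
print(ArchiveURLs)

# A single version could then be installed from source (not run here):
# install.packages(ArchiveURLs[1], repos = NULL, type = "source")
```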

You can also now have LoadandCite install specific package versions:

# Install, load, and cite specific package versions

# Create vectors of the package names and versions to install
# Note the names and version numbers must be in the same order
Names <- c("e1071", "gtools") 
Vers <- c("1.6", "2.6.1")

# Run LoadandCite
LoadandCite(pkgs = Names, versions = Vers, install = TRUE, file = "PackageCites.bib")

See this post for more details on LoadandCite.

Future

I intend to continue improving these capabilities. So please post any suggestions for improvement (or report any bugs) on the GitHub issues page.

31 January 2013

repmis: misc. tools for reproducible research in R

I've started to put together an R package called repmis. It has miscellaneous tools for reproducible research with R. The idea behind the package is to collate commands that simplify some of the common R code used within knitr-type reproducible research papers.

It's still very much in the early stages of development and has two commands:

  • LoadandCite: a command to load all of the R packages used in a paper and create a BibTeX file containing citation information for them. It can also install the packages if they are on CRAN.
  • source_GitHubData: a command for downloading plain-text formatted data stored on GitHub or at any other secure (https) URL.

I've written about why you might want to use source_GitHubData before (see here and here).

You can use LoadandCite in a code chunk near the beginning of a knitr reproducible research document to load all of the R packages you will use in the document and automatically generate a BibTeX file you can draw on to cite them. Here's an example:

# Create vector of package names
PackagesUsed <- c("knitr", "xtable")

# Load and Cite
repmis::LoadandCite(PackagesUsed, file = "PackageCitations.bib") 

LoadandCite draws on knitr's write_bib command to create the bibliographies, so each citation is given a BibTeX key like this: R-package_name. For example the key for the xtable package is R-xtable. Be careful to save the citations in a new .bib file, because LoadandCite overwrites existing files.
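To see the key format for yourself, you can call write_bib directly. This is a minimal sketch that assumes knitr is installed (it is skipped otherwise) and writes to a temporary file, so no existing .bib file is overwritten. BibFile and BibLines are illustrative names.

```r
# Only run if knitr is available
if (requireNamespace("knitr", quietly = TRUE)) {
    # Write the citation for knitr itself to a temporary .bib file
    BibFile <- tempfile(fileext = ".bib")
    knitr::write_bib("knitr", file = BibFile)

    # Show the lines that start each entry; they contain the BibTeX keys
    BibLines <- readLines(BibFile)
    print(grep("^@", BibLines, value = TRUE))
}
```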

Citation of R packages is very inconsistent in academic publications. Hopefully by making it easier to cite packages more people will do so.

Install/Contribute

Instructions for how to install repmis are available here.

Please feel free to fork the package and suggest additional commands that could be included.

6 January 2013

source_GitHubData: a simple function for downloading data from GitHub into R

Update 31 January: I've folded source_GitHubData into the repmis package. See this post.


Update 7 January 2013: I updated the internal workings of source_GitHubData so that it now relies on httr rather than RCurl. Also it is more directly descended from devtools' source_url command.

This has two advantages.

  • Shortened URLs can be used instead of the data sets' full GitHub URL.
  • The ssl.verifypeer issue is resolved. (Though please let me know if you have problems).

The post has been rewritten to reflect these changes.


In previous posts I've discussed how to download data stored in plain-text data files (e.g. CSV, TSV) on GitHub directly into R.

Not sure why it took me so long to get around to this, but I've finally created a little function that simplifies the process of downloading plain-text data from GitHub. It's called source_GitHubData. (The name mimics the devtools syntax for functions like source_gist and source_url. The function's syntax is actually just a modified version of source_url.)

The function is stored in a GitHub Gist HERE (it's also at the end of this post). You can load it directly into R with devtools' source_gist command.

Here is an example of how to use the function to download the electoral disproportionality data I discussed in an earlier post.


# Load source_GitHubData
library(devtools)

# The function's gist ID is 4466237
source_gist("4466237")

# Create Disproportionality data UrlAddress object
# Make sure the URL is for the "raw" version of the file
# The URL was shortened using bitly
UrlAddress <- "http://bit.ly/Ss6zDO"

# Download data
Data <- source_GitHubData(url = UrlAddress)

# Show Data variable names
names(Data)

## [1] "country"            "year"               "disproportionality"

There you go.

Note that the function is set by default to load comma-separated data (CSV). This can easily be changed with the sep argument.

30 December 2012

Update to Graphing Non-Proportional Hazards in R


Update 1 February 2013: I've moved all of the functionality described in this post into an R package called simtvc. Have a look. It is much easier to use.


This is a quick update for a previous post on Graphing Non-Proportional Hazards in R.

In the previous post I showed how to simulate and graph 1,000 non-proportional hazard ratios at roughly every point in time across an observation period. In the previous example I kept the simulation outliers in. Some people have suggested dropping the top and bottom 2.5 percent of simulated values (i.e. keeping the middle 95 percent).

Luckily this can be accomplished with Hadley Wickham's plyr package and three lines of code. The trick is to use plyr's ddply command to subset the data frame at each point in Time where we simulated values. In the previous example the simulated values were in a variable called HRqmv. In each subset we use the quantile command from base R to create logical variables indicating if a simulation of HRqmv is greater than the 0.975 or less than the 0.025 quantile. Then we simply subset the data frame.

Here's how you do it: after all of the values have been simulated and the data set ordered and right before graphing the results add this code:


# Load plyr for ddply (if it is not already loaded)
library(plyr)

# Indicate bottom 2.5% of observations
TVSimPerc <- ddply(TVSim, .(Time), transform, Lower = HRqmv < quantile(HRqmv, c(0.025)))

# Indicate top 2.5% of observations
TVSimPerc <- ddply(TVSimPerc, .(Time), transform, Upper = HRqmv > quantile(HRqmv, c(0.975)))

# Drop simulations outside of the middle 95%
TVSimPerc <- subset(TVSimPerc, Lower == FALSE & Upper == FALSE)

Adding this code to the previous example creates this graph:
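If you want to try the trimming step on its own, without the earlier simulation code, here is a small self-contained sketch. The data is made up (log-normal draws standing in for simulated hazard ratios at three time points), and it uses base R's ave instead of ddply, so it is a stand-in for the technique rather than the code from the example.

```r
# Fake simulations: 1,000 values of HRqmv at each of 3 time points
set.seed(123)
TVSim <- data.frame(Time = rep(1:3, each = 1000),
                    HRqmv = rlnorm(3000))

# 0.025 and 0.975 quantiles of HRqmv within each Time
LowerCut <- ave(TVSim$HRqmv, TVSim$Time,
                FUN = function(x) quantile(x, 0.025))
UpperCut <- ave(TVSim$HRqmv, TVSim$Time,
                FUN = function(x) quantile(x, 0.975))

# Keep only the middle 95% of simulations at each Time
TVSimPerc <- TVSim[TVSim$HRqmv >= LowerCut & TVSim$HRqmv <= UpperCut, ]

# Number of simulations kept at each point in time
print(table(TVSimPerc$Time))
```

With 1,000 simulations per time point, this drops the 25 smallest and 25 largest values at each point in time.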

Here is the full code to replicate the example: