Someone once told me that if you have to write the same line of code more than a few times, you should probably be using a function instead. I haven’t always taken this to heart. When I started working with R, I would load files in by hand: if I had 10 files, I would create 10 objects, one per file, and then append them together. This habit became impractical when I started cleaning my dissertation data. One of my chapters requires appending 185 csv files. Loading 185 csv files individually and then combining them would be a nightmare.

In this post, I will walk you through a recent example where I loaded and appended multiple files programmatically rather than individually in R.

Step 1: Name the files as consistently as possible. This allows for pattern matching.

  • In this example, I have a bunch of files that I have downloaded from ProPublica’s Congress API. The files are named “propub103.csv”, “propub104.csv”, etc.
  • This will allow me to use regular expression matching to avoid typing the names of all of the files.

Step 2: Have all the files in the same folder and set the working directory to that folder.

  • You need to let R know where to look!
  • Setting the working directory in R: setwd("filepath")

Step 3: Create a list of the file names using the list.files function and a regular expression.

files <- list.files(pattern = "^propub10[3-9]\\.csv$")

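  • As the name aptly describes, the list.files function lists the names of the files in a particular folder. Strictly speaking, it returns a character vector of file names rather than an R list. The pattern argument allows you to use a regular expression, i.e., a type of string that describes a search pattern. [list.files documentation]
  • The [3-9] tells the computer to accept any single digit from 3 to 9 in that position, so the pattern finds propub103.csv, propub104.csv, propub105.csv, propub106.csv, propub107.csv, propub108.csv, and propub109.csv.
  • The regular expression that you need varies depending on your naming conventions and needs. Keep in mind that an unescaped . matches any character, so write \\. for a literal period, and end the pattern with $ if the match should stop at the end of the file name. [regex documentation]
  • If you are seeking to create a list of all the csv files in a folder, the easiest thing to do is this: list.files(pattern = "\\.csv$"). Note that pattern takes a regular expression, not a shell wildcard like *.csv.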
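
To see why the details of the pattern matter, here is a minimal sketch in base R. It assumes a throwaway folder created under tempdir() and a few made-up file names (propub110.csv and propub103_csv_notes.txt are hypothetical and only there to show what a loose pattern picks up).

```r
# Make a fresh scratch folder with empty files (hypothetical names, for illustration only)
demo_dir <- tempfile("listfiles_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("propub103.csv",
                                  "propub104.csv",
                                  "propub110.csv",
                                  "propub103_csv_notes.txt")))

# Anchored pattern with an escaped dot: only propub103.csv and propub104.csv match
files <- list.files(path = demo_dir, pattern = "^propub10[3-9]\\.csv$")

# Unescaped dot and no anchors: the notes file also matches,
# because "." stands for any character (here the underscore)
loose <- list.files(path = demo_dir, pattern = "propub10[3-9].csv")

# Every file whose name ends in .csv
all_csv <- list.files(path = demo_dir, pattern = "\\.csv$")

print(files)
print(loose)
print(all_csv)
```

Passing the folder through the path argument, as done here, is an alternative to calling setwd first; either approach should work with the steps in this post.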

Step 4: Combine the files using the bind_rows function from the dplyr library, together with lapply and the fread function from the data.table library.

combined_files <- bind_rows(lapply(files, fread))

  • Here, I’m using the bind_rows function from dplyr, which is part of the tidyverse. It combines a list of data frames into one (much like do.call(rbind, dfs) in base R). [dplyr::bind documentation]
    • FYI: I use dplyr here only because I’m in the habit of using it over data.table. The rbindlist function of the data.table library serves the same purpose. [data.table documentation]
    • If your data frames have differing numbers of columns, bind_rows by default keeps the extra column(s) and fills the missing information with NA. If you use the rbindlist function in data.table, you need to ask for this behavior by setting fill = TRUE.
  • However, in this case, I don’t have a list of data frames; I have a vector of file names. This brings us to the lapply function, which applies a function to each item in a list or vector and returns the results as a list. It takes the list and the function as its primary arguments. [lapply documentation]
  • The function I am passing to lapply here is fread from the data.table library, which reads in regular delimited files. You can also use read.csv or read.table, but fread has worked better for me in these cases. [fread documentation]
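
The column-filling behavior described above can be checked on a toy example. This is only a sketch: the two small csv files, their columns (id, party, state), and the source_file column name are all made up for illustration and are not my ProPublica data. It assumes the dplyr and data.table packages are installed.

```r
library(dplyr)
library(data.table)

# Write two tiny csv files to a scratch folder (made-up data; the second has an extra column)
demo_dir <- tempfile("bind_demo")
dir.create(demo_dir)
write.csv(data.frame(id = 1:2, party = c("D", "R")),
          file.path(demo_dir, "propub103.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, party = c("R", "D"), state = c("TX", "NY")),
          file.path(demo_dir, "propub104.csv"), row.names = FALSE)

# full.names = TRUE returns complete paths, so no setwd is needed
paths <- list.files(demo_dir, pattern = "^propub10[3-9]\\.csv$", full.names = TRUE)

# Read every file into a list of tables, named by file so the source can be tracked
tables <- lapply(paths, fread)
names(tables) <- basename(paths)

# dplyr: the extra column is kept and filled with NA for the first file
combined <- bind_rows(tables, .id = "source_file")

# data.table: the same result requires fill = TRUE
combined_dt <- rbindlist(tables, fill = TRUE, idcol = "source_file")

print(combined)
```

Recording the file name in an extra column is optional, but with many files it makes it easier to trace a strange row back to where it came from.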

The code ends up looking like this:

library(dplyr)
library(data.table)  # provides fread

setwd("/your/path/here/input")

files <- list.files(pattern = "^propub10[3-9]\\.csv$")

combined_files <- bind_rows(lapply(files, fread))
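
In the spirit of the advice at the top of this post, the steps can also be wrapped in a function. Below is a rough sketch using only base R (read.csv and do.call(rbind, ...) in place of fread and bind_rows). The name read_and_append and its arguments are hypothetical, and the sketch assumes every file has the same columns, since rbind does not fill in missing ones.

```r
# Hypothetical helper: read every file in `dir` whose name matches `pattern`
# and stack the results. Assumes all files share the same columns.
read_and_append <- function(dir, pattern) {
  paths <- list.files(path = dir, pattern = pattern, full.names = TRUE)
  if (length(paths) == 0) {
    stop("No files in '", dir, "' match the pattern: ", pattern)
  }
  do.call(rbind, lapply(paths, read.csv, stringsAsFactors = FALSE))
}

# Try it on two small made-up files in a scratch folder
demo_dir <- tempfile("append_demo")
dir.create(demo_dir)
write.csv(data.frame(id = 1:2, party = c("D", "R")),
          file.path(demo_dir, "propub103.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, party = c("R", "D")),
          file.path(demo_dir, "propub104.csv"), row.names = FALSE)

combined_base <- read_and_append(demo_dir, "^propub10[3-9]\\.csv$")
print(combined_base)
```

Stopping with an error when nothing matches is a design choice on my part; it makes a mistyped pattern or folder fail loudly instead of quietly returning an empty result.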

 
