Someone told me that if you have to write the same line of code more than a few times you probably should be using a function to do it. However, I haven’t always taken this to heart. When I started working with R, I would load a number of files in by hand. If I had 10 files, I would have 10 objects for each file, which I would then append together. This habit became impractical when I started cleaning my dissertation data. One of my chapters requires appending of 185 csv files. Loading in 185 csv files individually and then combining would be a nightmare.

In this post,  I will walk you through a recent example where I loaded and appended multiple files programmatically rather than individually in R.

Step 1: Name the files as consistently as possible. This allows for pattern matching.

  • In this example,  I have a bunch of files that I have downloaded from ProPublica’s Congress API. The files are named “propub103.csv”, “propub104.csv”, etc.
  • This will allow me to use regular expression matching to avoid typing the names of all of the files.

 Step 2: Have all the files in the same folder and set the working directory to that folder.

  • You need to let R know where to look!
  • Setting working directory in R:  setwd(“filepath“)

Step 3: Create a list of the file names using the list.files function and a regular expression.

files <- list.files(pattern = “propub10[3-9].csv”)

  • As the name aptly describes, the list.files function creates a list of the names of the files in a particular folder. The pattern argument allows you to use a regular expression, i.e., a type of string that describes a search pattern. [list.files documentation]
  • The [3-9] indicates to the computer to look for propub103.csv, propub104.csv, propub105.csv, propub106.csv, propub107.csv, propub108.csv, propub109.csv
  • The regular expression that you need varies depending on your naming conventions and needs. [ regex documentation]
  • If you are seeking to create a list of all the files in a folder, the easiest thing to do is this: list.files(pattern=”*.csv”)

Step 4: Combine the files using the bind_rows function from the dplyr library and the lapply and fread functions

combined_files <- bind_rows(lapply(files, fread))

  • Here, I’m using the bind_rows function from the tidyverse libraries.  It combines a list of data frames together (the same thing as the do.call(rbind, dfs) function). [dplyr::bind documentation]
    • FYI: I use dplyr here only because I’m in the habit of using it over data.table. The rbindlist function of the data.table library serves the same purpose.  [data.table documentation]
    • If you have differing number of columns in your dataframes, bind_rows by default keeps the extra column(s) and fills the missing information as NA. If you use the rbindlist function in data.table, I believe you need to specify your preference using the fill argument.
  • However, in this case, I don’t have a list of dataframes, I have a list of file names. This brings us to the lapply function, which allows the user to apply a function to each item in a list. The function takes a list and a function as the primary arguments. [lapply documentation]
  • The function I am passing to lapply here is fread (found in data.table package), which reads in regular delimited files. You can also use read.csv or read.table, but fread has worked better for me in these cases. [fread documentation]

The code ends up looking like this:

library(dplyr)

library(data.table)

setwd(“/your/path/here/input”)

files <- list.files(pattern = “propub10[3-9].csv”)

combined_files <- bind_rows(lapply(files, fread))

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s