Cleaning Data with Inconsistent Naming Conventions

Inconsistent naming conventions are a common problem when dealing with social data that you did not collect. I thought I would post a quick solution in R for those who may be intimidated by this type of problem.

Context: I have a dataset with individuals’ full names in a column named “FullName” where last names are listed first, followed by first names and middle initials. I need to create a column of last names and a column of first names from the column of full names.

Problem: Some of the entries had individuals’ last and first names separated by a comma (Smith, John A), whereas others were separated by a space (Smith John A).

Solution:

I start by creating new columns, one for the individuals’ last names, and one for the individuals’ first names.

df$LastName <- NA
df$FirstName <- NA

I then separate the names that are divided by a comma using the sapply and strsplit functions. sapply applies a function to a list or a vector. The function I am applying is a string-splitting function called strsplit, which splits a string entry like "Smith, John A" into "Smith" and "John A". The `[`, 1 part retrieves the first element of the split ("Smith" in the example); likewise, `[`, 2 retrieves the second element ("John A").

df$LastName <- sapply(strsplit(df$FullName, ','), `[`, 1)
df$FirstName <- sapply(strsplit(df$FullName, ','), `[`, 2)

I then do something similar for the names that are divided by a space. The names that were separated by a space will have a missing first name after the previous code, so I used that to my advantage. I used the ifelse function to identify whether the first name was missing (and thus the name wasn't separated by a comma). Then I applied the same strsplit function, but splitting on spaces instead of the comma. The final piece of the ifelse statement indicates what to plug into the variable if the first name wasn't missing. In simpler terms, the ifelse function is organized into three parts: ifelse(condition to test, value to use if TRUE, value to use if FALSE).

df$LastName <- ifelse(is.na(df$FirstName), sapply(strsplit(df$FullName, ' '), `[`, 1), df$LastName)
df$FirstName <- ifelse(is.na(df$FirstName), sapply(strsplit(df$FullName, ' '), `[`, 2), df$FirstName)
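
To see the whole thing work end to end, here is a minimal toy example (the names are made up for illustration):

# made-up data containing both naming conventions
df <- data.frame(FullName = c("Smith, John A", "Jones Mary B"),
                 stringsAsFactors = FALSE)

df$LastName <- NA
df$FirstName <- NA

# first pass: split on the comma
df$LastName <- sapply(strsplit(df$FullName, ','), `[`, 1)
df$FirstName <- sapply(strsplit(df$FullName, ','), `[`, 2)

# second pass: split the comma-less names on spaces
df$LastName <- ifelse(is.na(df$FirstName), sapply(strsplit(df$FullName, ' '), `[`, 1), df$LastName)
df$FirstName <- ifelse(is.na(df$FirstName), sapply(strsplit(df$FullName, ' '), `[`, 2), df$FirstName)

# df$LastName is now c("Smith", "Jones") and df$FirstName is c(" John A", "Mary");
# trimws() can strip the leading space left over from the comma split if needed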

N.B., I realize that I'm running the risk of splitting last names that contain a space, but I chose to accept that risk because the inconsistent naming convention issue was so common. Not dealing with it would create more difficulty in predicting race/ethnicity than splitting the last names that are non-hyphenated but include two surnames (e.g., Hispanic naming conventions such as "Dominguez Jiminez Jose A").

 

Alternative: If you are not concerned with preserving as much of the last name as possible and just want whatever comes first, you can skip all of the above, swap out the comma for a space with gsub, and run this:

df$LastName <- sapply(strsplit(gsub(',', ' ', df$FullName), ' '), `[`, 1)

How to Load and Append Multiple Files in R

Someone told me that if you have to write the same line of code more than a few times, you should probably be using a function to do it. However, I haven't always taken this to heart. When I started working with R, I would load a number of files in by hand. If I had 10 files, I would have 10 objects, one for each file, which I would then append together. This habit became impractical when I started cleaning my dissertation data. One of my chapters requires appending 185 csv files. Loading 185 csv files individually and then combining them would be a nightmare.

In this post,  I will walk you through a recent example where I loaded and appended multiple files programmatically rather than individually in R.

Step 1: Name the files as consistently as possible. This allows for pattern matching.

  • In this example,  I have a bunch of files that I have downloaded from ProPublica’s Congress API. The files are named “propub103.csv”, “propub104.csv”, etc.
  • This will allow me to use regular expression matching to avoid typing the names of all of the files.

 Step 2: Have all the files in the same folder and set the working directory to that folder.

  • You need to let R know where to look!
  • Setting the working directory in R: setwd("filepath")

Step 3: Create a list of the file names using the list.files function and a regular expression.

files <- list.files(pattern = "propub10[3-9].csv")

  • As the name aptly describes, the list.files function creates a list of the names of the files in a particular folder. The pattern argument allows you to use a regular expression, i.e., a type of string that describes a search pattern. [list.files documentation]
  • The [3-9] tells list.files to look for propub103.csv, propub104.csv, propub105.csv, propub106.csv, propub107.csv, propub108.csv, and propub109.csv.
  • The regular expression that you need varies depending on your naming conventions and needs. [ regex documentation]
  • If you are seeking to create a list of all the csv files in a folder, the easiest thing to do is this: list.files(pattern = "*.csv")
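
One caveat: in a regular expression the dot matches any single character, and "*.csv" is really shell-style wildcard thinking. Both patterns above usually work fine in practice, but if you want to be strict you can escape the dot and anchor the extension (variable names here are just for illustration):

files <- list.files(pattern = "propub10[3-9]\\.csv$")   # literal dot, nothing after "csv"
all_csv <- list.files(pattern = "\\.csv$")               # every csv file in the folder

The stricter versions will not accidentally match something like "propub103_csv_backup.txt".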

Step 4: Combine the files using the bind_rows function from the dplyr library and the lapply and fread functions

combined_files <- bind_rows(lapply(files, fread))

  • Here, I'm using the bind_rows function from dplyr (part of the tidyverse). It combines a list of data frames into one (similar to do.call(rbind, dfs) in base R). [dplyr::bind documentation]
    • FYI: I use dplyr here only because I'm in the habit of using it over data.table. The rbindlist function from the data.table library serves the same purpose (see the sketch after the final code below). [data.table documentation]
    • If your data frames have differing numbers of columns, bind_rows by default keeps the extra column(s) and fills in the missing values with NA. If you use rbindlist from data.table, I believe you need to ask for this explicitly with the fill argument.
  • However, in this case I don't have a list of data frames; I have a list of file names. This brings us to the lapply function, which applies a function to each item in a list. It takes a list and a function as its primary arguments. [lapply documentation]
  • The function I am passing to lapply here is fread from the data.table library, which reads regular delimited files. You can also use read.csv or read.table, but fread has worked better for me in these cases. [fread documentation]

The code ends up looking like this:

library(dplyr)
library(data.table)  # needed for fread

setwd("/your/path/here/input")

files <- list.files(pattern = "propub10[3-9].csv")

combined_files <- bind_rows(lapply(files, fread))
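
For comparison, a data.table-only version of the same thing (using rbindlist as mentioned above; fill = TRUE keeps any extra columns and pads missing values with NA) would look roughly like this:

library(data.table)

setwd("/your/path/here/input")

files <- list.files(pattern = "propub10[3-9].csv")

# fread reads each file, rbindlist stacks the resulting data.tables
combined_files <- rbindlist(lapply(files, fread), fill = TRUE)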

 

5 Tips for Reproducible Code for Research

Reproducible code is about transparency of research. Transparency not only enables others to understand and replicate your research, but it also helps you understand your own work. Working on large projects (such as a dissertation) requires many scripts that merge, clean, visualize, and analyze data. Organization, version control, README files, and other tools will help you remember which files and scripts you used, which plots you included, and so on, without requiring a forensic analysis of your folders. So here are my current tips and tricks for making this possible:

N.B., Some of my examples below are in R; however, I am confident that they can be replicated in Python, STATA, or your program of choice.

Tip #1: Organize files in folders that make sense to you and others. For inspiration, here’s a peek into my organizing system:

For me, it makes sense to organize folders into stages of the research process.


  1. At the top level, the "construct" folder includes all the necessary scripts and files to construct the dataset(s) to be used in the analysis (see the sketch after this list).
  2. Within the "code" folder, I have files organized by the stage of construction. For instance, I create two data sets that differ in their level of aggregation, and each level gets its own folder. The numbered labeling structure indicates the order in the construction process.
  3. I have a “tmp” folder to organize the intermediary (or temporary) files needed to construct the dataset.
  4. In the “analysis” folder, I have all the files that I need for the analysis and drafting. Using RMarkdown makes this a little easier, but I do have an additional file for code that hasn’t made it into the draft. I save a PDF of every version of the manuscript in the “drafts” folder.
  5. The trick is to balance the length of file paths against the intelligibility of the organization, which can at times be at odds with each other.
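
To make that concrete, a project organized along these lines might look roughly like this (the folder names are placeholders, not a prescription):

dissertation-chapter/
    construct/
        0-README.txt
        code/
            1-first-aggregation-level/
            2-second-aggregation-level/
        tmp/
    analysis/
        0-README.txt
        code/
        drafts/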

 

Tip #2: Include README files for your folders. README.txt files will help anyone else accessing your files (or even your future self) understand your organization system. What I include in this file depends on the folder it is describing: a description of the folder contents, any notes about labeling that might be relevant, which data/script files are the most important, etc. I have found this helpful even for myself when I'm trying to make sense of what I did a year ago. (I include a "0-" in the file name because I like having these README files at the top of the folder.)
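
As a rough, hypothetical example of what one of these files might contain (adapt to your own project):

0-README.txt  (inside the construct folder)

    Contents: scripts and files used to construct the analysis dataset(s).
    Key file:  code/0-master.R runs all construction scripts in order.
    Labeling:  numeric prefixes (1-, 2-, ...) give the order of the construction steps.
    Notes:     files in tmp/ are intermediate and can be regenerated from the raw inputs.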

Tip #3: Use Master Scripts that link to subscripts. With larger projects, one long master script can be unwieldy and make changing the file annoying (especially with a lot of merges, cleaning, and data manipulation).  Here is what I suggest:

  1. Break down your scripts into specific tasks.
    • Have separate scripts to prep, merge, clean, and manipulate your data.
    • Note the purpose of the file and any relevant information at the top of the file. You may also consider noting the date of the last modification and the name of the last person to modify (if working with others).
  2. Write a Master Script to allow you to run all of your subscripts in one shot when compiling your data.
    • In R, all this requires is using the source function (documentation); a sketch follows this list.
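
Here is a bare-bones sketch of what such a master script could look like; the subscript names are hypothetical:

# 0-master.R: run the full data-construction pipeline in order
setwd("/your/path/here/construct/code")

source("1-prep-raw-files.R")
source("2-merge-files.R")
source("3-clean-variables.R")
source("4-create-measures.R")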

Tip #4: Set a working directory, especially when working with others. There is nothing more frustrating than having to change every file path in a script because someone didn't set a working directory. For readers who have never done this: setting a working directory lets you truncate all of the file paths that follow, which means other users only need to change one line of code and the rest will work without a hitch. It also makes the script more readable.

  • In R: setwd("/your/path/here")
  • In Python: os.chdir("/your/path/here") (after import os)
  • In STATA: cd "/your/path/here"

Tip #5: Version Control. I have a few suggestions here, which are to some degree redundant.

  1. Back up regularly, and back up in a way that saves snapshots of your work in time. Don't completely overwrite your files, just in case you need to revert to a previous version. I back up in two ways: a) quarterly on an external hard drive, and b) all the time using Backblaze, a cloud-based backup service for $5 a month (website). When connected to the internet, Backblaze constantly syncs your files to the cloud, creating time-stamped versions of every file as it changes. Whatever you use to back up, creating time-stamped snapshots is crucial.
  2. Use GitHub. GitHub not only allows you to see the changes in your code but also has functionalities to reconcile those differences. A tutorial to get started with GitHub is forthcoming!

Bonus Tip: Spring Cleaning. Every once in a while, prune and clean out your folders to make sure they are not filled with old files that don't reflect the present state of the project. Trying to find the right file in a sea of outdated files can be time-consuming. In cases where I think I need to actually delete a file, I find that moving old files to a folder labeled "delete" helps: I can leave the file in the "delete" folder for a period of time until I feel confident that there will be no negative consequences to my housekeeping. Also, regular backups will keep you from ever completely losing a file, while not cluttering up your workflow.