Cleaning Data with Inconsistent Naming Conventions

Inconsistent naming conventions is a potential problem when dealing with social data that you did not collect. I thought I would post a quick solution in R for those who may be intimidated by coping with this type of problem.

Context: I have a dataset with individuals’ full names in a column named “FullName” where last names are listed first, followed by first names and middle initials. I need to create a column of last names and a column of first names using the columns of full names.

Problem: Some of the entries had individuals’ last and first names separated by a comma (Smith, John A), whereas others were separated by a space (Smith John A).

Solution:

I start by creating new columns, one for the individuals’ last names, and one for the individuals’ first names.

df$LastName <- NA
df$FirstName <- NA

I then separate the names that are divided by a comma using the sapply and strsplit functions. sapply applies a function to a list or a vector. The function I am applying is a string splitting function called ‘strsplit’. This allows the user to split a string entry like “Smith, John A” into “Smith” and “John A”. The  ”  ‘[‘, 1 ” retrieves the first element of the split,  “Smith” in the example. Likewise ‘[‘, 2 retrieves the second element, or “John A” in the example.

df$LastName <- sapply(strsplit(df$FullName, ‘,’), `[`, 1)
df$FirstName <- sapply(strsplit(df$FullName, ‘,’), `[`, 2)

I then do something similar for the names that are divided by a space. The names that were separated by a space would have had a missing first name after the previous code, so I used that to my advantage. I used the ifelse function to identify if the first name was missing (thus the name wasn’t separated by a column). Then I applied the same strsplit function but instead of splitting at the comma, I separated at the first space. The final piece of the ifelse statement indicates what to plug into the variable if the first name wasn’t missing. In more simpler terms, the ifelse function is organized into three parts:  ifelse(‘1.conditions to look for’, ‘2. function to apply’, ‘3. what to do in case of else’)

df$LastName <- ifelse(is.na(df$FirstName), sapply(strsplit(df$FullName, ‘ ‘), `[`, 1), df$LastName)
df$FirstName <- ifelse(is.na(df$FirstName), sapply(strsplit(df$FullName, ‘ ‘), `[`, 2), df$FirstName)

N.B., I realize that I’m running the risk of splitting last names that are separated by a space, but I made the choice to accept that risk because the inconsistent naming convention issue was so common. Not dealing with this issue would create more difficulty in predicting race/ethnicity than the splitting of last names that are not-hyphenated but include two surnames (e.g., Hispanic naming conventions: “Dominguez Jiminez Jose A”, etc).

 

Alternative: If you are unconcerned with maintaining as much of the integrity of the last name as possible and just want whatever comes first in the last name, you could skip all of the above and swap out the comma for a space  with gsub and run this:

df$LastName <- sapply(strsplit(gsub(‘,’, ‘ ‘, df$FullName), ‘ ‘), `[`, 1)

How to Load and Append Multiple Files in R

Someone told me that if you have to write the same line of code more than a few times you probably should be using a function to do it. However, I haven’t always taken this to heart. When I started working with R, I would load a number of files in by hand. If I had 10 files, I would have 10 objects for each file, which I would then append together. This habit became impractical when I started cleaning my dissertation data. One of my chapters requires appending of 185 csv files. Loading in 185 csv files individually and then combining would be a nightmare.

In this post,  I will walk you through a recent example where I loaded and appended multiple files programmatically rather than individually in R.

Step 1: Name the files as consistently as possible. This allows for pattern matching.

  • In this example,  I have a bunch of files that I have downloaded from ProPublica’s Congress API. The files are named “propub103.csv”, “propub104.csv”, etc.
  • This will allow me to use regular expression matching to avoid typing the names of all of the files.

 Step 2: Have all the files in the same folder and set the working directory to that folder.

  • You need to let R know where to look!
  • Setting working directory in R:  setwd(“filepath“)

Step 3: Create a list of the file names using the list.files function and a regular expression.

files <- list.files(pattern = “propub10[3-9].csv”)

  • As the name aptly describes, the list.files function creates a list of the names of the files in a particular folder. The pattern argument allows you to use a regular expression, i.e., a type of string that describes a search pattern. [list.files documentation]
  • The [3-9] indicates to the computer to look for propub103.csv, propub104.csv, propub105.csv, propub106.csv, propub107.csv, propub108.csv, propub109.csv
  • The regular expression that you need varies depending on your naming conventions and needs. [ regex documentation]
  • If you are seeking to create a list of all the files in a folder, the easiest thing to do is this: list.files(pattern=”*.csv”)

Step 4: Combine the files using the bind_rows function from the dplyr library and the lapply and fread functions

combined_files <- bind_rows(lapply(files, fread))

  • Here, I’m using the bind_rows function from the tidyverse libraries.  It combines a list of data frames together (the same thing as the do.call(rbind, dfs) function). [dplyr::bind documentation]
    • FYI: I use dplyr here only because I’m in the habit of using it over data.table. The rbindlist function of the data.table library serves the same purpose.  [data.table documentation]
    • If you have differing number of columns in your dataframes, bind_rows by default keeps the extra column(s) and fills the missing information as NA. If you use the rbindlist function in data.table, I believe you need to specify your preference using the fill argument.
  • However, in this case, I don’t have a list of dataframes, I have a list of file names. This brings us to the lapply function, which allows the user to apply a function to each item in a list. The function takes a list and a function as the primary arguments. [lapply documentation]
  • The function I am passing to lapply here is fread (found in data.table package), which reads in regular delimited files. You can also use read.csv or read.table, but fread has worked better for me in these cases. [fread documentation]

The code ends up looking like this:

library(dplyr)

library(data.table)

setwd(“/your/path/here/input”)

files <- list.files(pattern = “propub10[3-9].csv”)

combined_files <- bind_rows(lapply(files, fread))