Cleaning Data with Inconsistent Naming Conventions

Inconsistent naming conventions are a potential problem when dealing with social data that you did not collect. I thought I would post a quick solution in R for those who may be intimidated by this type of problem.

Context: I have a dataset with individuals’ full names in a column named “FullName”, where last names are listed first, followed by first names and middle initials. I need to create a column of last names and a column of first names from the column of full names.

Problem: Some of the entries have the last and first names separated by a comma (Smith, John A), whereas in others they are separated only by a space (Smith John A).

Solution:

I start by creating new columns, one for the individuals’ last names, and one for the individuals’ first names.

df$LastName <- NA
df$FirstName <- NA

I then separate the names that are divided by a comma using the strsplit and sapply functions. strsplit is a string splitting function: it breaks an entry like “Smith, John A” at the comma into “Smith” and “ John A” (the second piece keeps the space that followed the comma). It returns a list with one set of pieces per name. sapply applies a function to each element of a list or a vector, and the function I am applying is `[`, R’s subsetting operator. Passing `[`, 1 retrieves the first element of each split, “Smith” in the example. Likewise `[`, 2 retrieves the second element, the first name and middle initial in the example.

df$LastName <- sapply(strsplit(df$FullName, ','), `[`, 1)
# trimws removes the space that followed the comma
df$FirstName <- trimws(sapply(strsplit(df$FullName, ','), `[`, 2))
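
To see what each piece returns, here is a minimal sketch of the comma step on a two-row toy data frame. The rows are made up for illustration, and the intermediate name parts is my own; trimws is used only to drop the space left after the comma.

```r
# Toy data standing in for the real dataset (hypothetical values).
df <- data.frame(
  FullName = c("Smith, John A", "Doe Jane B"),
  stringsAsFactors = FALSE
)

# strsplit returns a list with one character vector per input string.
# parts[[1]] is c("Smith", " John A"); parts[[2]] is "Doe Jane B" (no comma, no split).
parts <- strsplit(df$FullName, ",")

# `[` is the subsetting function, so this takes element 1 of each vector.
df$LastName <- sapply(parts, `[`, 1)

# Element 2 does not exist for the comma-free entry, so it comes back as NA.
df$FirstName <- trimws(sapply(parts, `[`, 2))
```

At this point the comma-free row still holds the whole string “Doe Jane B” in LastName and NA in FirstName.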

I then do something similar for the names that are divided by a space. After the previous step, those names have a missing (NA) first name, because there was no comma to split on, so I used that to my advantage. I used the ifelse function to check whether the first name was missing (meaning the name wasn’t separated by a comma). Where it was, I applied the same strsplit approach, but split at spaces instead of at the comma, keeping the first piece as the last name and the second piece as the first name. Note that strsplit splits at every space, so a first name taken this way does not include the middle initial. The final argument of ifelse says what to keep when the first name wasn’t missing. In simpler terms, ifelse is organized into three parts: ifelse(‘1. condition to test’, ‘2. value to use if true’, ‘3. value to use otherwise’)

df$LastName <- ifelse(is.na(df$FirstName), sapply(strsplit(df$FullName, ' '), `[`, 1), df$LastName)
df$FirstName <- ifelse(is.na(df$FirstName), sapply(strsplit(df$FullName, ' '), `[`, 2), df$FirstName)
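
Putting both steps together, a minimal end-to-end sketch on toy data might look like the following. The rows and the helper names by_comma, by_space and needs_space_split are hypothetical. Storing the NA check in its own variable is a small variation on the code above, so the condition does not depend on the order of the two assignments.

```r
# Toy data standing in for the real dataset (hypothetical values).
df <- data.frame(
  FullName = c("Smith, John A", "Doe Jane B"),
  stringsAsFactors = FALSE
)

# Step 1: comma split; rows without a comma get NA as FirstName.
by_comma <- strsplit(df$FullName, ",")
df$LastName <- sapply(by_comma, `[`, 1)
df$FirstName <- trimws(sapply(by_comma, `[`, 2))

# Step 2: record which rows still need work before overwriting anything.
needs_space_split <- is.na(df$FirstName)
by_space <- strsplit(df$FullName, " ")

# ifelse(condition, value if TRUE, value if FALSE), evaluated row by row.
df$LastName <- ifelse(needs_space_split, sapply(by_space, `[`, 1), df$LastName)
df$FirstName <- ifelse(needs_space_split, sapply(by_space, `[`, 2), df$FirstName)
```

In this sketch the comma row keeps “John A” as its first name, while the space row ends up with “Jane” only, since the middle initial is the third piece of the split.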

N.B., I realize that I’m running the risk of splitting last names that contain a space, but I made the choice to accept that risk because the inconsistent naming convention issue was so common. Not dealing with it would create more difficulty in predicting race/ethnicity than splitting last names that are unhyphenated but include two surnames (e.g., Hispanic naming conventions: “Dominguez Jiminez Jose A”, etc.).
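
If you want a rough sense of how many rows carry that risk, one option is to flag comma-free entries with more words than expected. This is only a heuristic sketch: the threshold of four words and the variable names are my assumptions, not part of the original workflow, and someone with two middle names would be flagged as well.

```r
# Hypothetical examples, including a two-word surname with no comma.
full_names <- c("Smith, John A", "Doe Jane B", "Dominguez Jiminez Jose A")

has_comma <- grepl(",", full_names, fixed = TRUE)

# Count the words in each entry, treating runs of spaces as one separator.
n_tokens <- lengths(strsplit(full_names, " +"))

# Without a comma, four or more words may indicate a multi-word surname.
review <- !has_comma & n_tokens >= 4
full_names[review]
```

Rows flagged this way could be checked by hand before accepting the space-based split.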

Alternative: If you are unconcerned with preserving the full last name and just want whatever word comes first, you could skip all of the above, swap out the comma for a space with gsub, and run this:

df$LastName <- sapply(strsplit(gsub(',', ' ', df$FullName), ' '), `[`, 1)
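
As a quick check of what this shortcut does, here is the same idea applied to a plain character vector. The names are hypothetical, and no_commas and first_word are illustrative variable names.

```r
# Hypothetical examples covering both conventions.
full_names <- c("Smith, John A", "Doe Jane B", "Dominguez Jiminez Jose A")

# Replace each comma with a space; "Smith, John A" becomes "Smith  John A".
no_commas <- gsub(",", " ", full_names)

# Element 1 of each split is everything before the first space.
first_word <- sapply(strsplit(no_commas, " "), `[`, 1)
```

Both conventions are handled in one pass, but a two-word surname is cut down to its first word.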
