<< Previous post: Merging data files with different variables
When is your dataset considered dirty or clean? There are a few issues that can interfere with smoothly running your data analysis: the organization and naming of variables; outliers and errors; and missing data. In each of these issues context is important. We’ll go over the naming of variables first to get you started quickly, and discuss the other two issues at a later stage when they are prone to arise.
First, we’ll take a quick glance at the name of the columns/variables in a dirty dataset. The variable names could have been written with periods (.) to separate words (such as “place.of.birth“). This can interfere with certain commands in R, and therefore it is more common to use an underscore (place_of_birth). Furthermore, since upper case letters in variable names are interpreted differently than lower case, it is easier to write everything in lower case, rather than remembering which exact letters of a variable are capitalized. If you have just a few variables, you can look through them in the Source pane. Alternatively, you can print the variable names in the Console:
 "id.number" "district1" "district2" "district3"
If you want a bit more information, especially on the data type of the values in a variable (character, logical (TRUE/FALSE), numeric, integer etc.), you can use the following:
id.number district1 district2 district3 "character" "character" "character" "character"
More on data types and structures in R can be found here
Specific column names can be changed using the following command and example:
names(data.frame)[names(data.frame) == "previous variable name"] <- "new variable name"
names(data1)[names(data1) == "id.number"] <- "idnumber"
If you have a very large dataset and too many variables are in a wrong format, you can write a loop in R that repeats a certain command. We’ll go over that soon, but for now, you can change each problematic variable name separately right before analyzing that variable.
Second, sometimes the values of what should be one variable can be spread out over multiple variables, and this could interfere with proper data analysis as well. For example if there are 20 different variables with different numbered districts a person could live in (with a subdistrict division of “north” or “west” as value) it would be more efficient to collapse these into one variable, named district. . If a question results in one exclusive answer, it is most likely best represented by one variable. As with other data cleaning issues, context is important, so decide what structure works best for your dataset.
To gather several variables into one:
install.packages("tidyr") library(tidyr) data.frame <- gather(data.frame, "new variable 1", "new variable 2", old variable 1:old variable 3, na.rm = TRUE)
- library(tidyr) – first load the Tidyr package
- new variable 1 – the name of the new variable that will contain the old variable names (variable 1 through variable 3) as values
- new variable 2 – the name of the new variable that will take the values that were contained in variable 1 through variable 3
- old variable 1: old variable 3 – the variables that need to be gathered, numbered appropriately
- na.rm = TRUE – will remove rows that have NA as value
Or for the above table:
install.packages("tidyr") library(tidyr) data1 <- gather(data1, "district", "subdistrict", district1:district3, na.rm = TRUE)
If you’d leave out “na.rm = TRUE”, you will end up with this dataset instead:
data1 <- gather(data1, "district", "subdistrict", district1:district3)
Make sure to double check the class and number of rows of the new variables that have been formed (numeric, integer, character, etc.) with:
str(new data.frame$new variable 1)
It is not a bad habit to also double check the class of existing variables before analyzing them.
The gather() action can be reversed with spread(). For detailed information on gather(), spread() and more, check out the Tidyr documentation.