Organizing and cleaning data: missing values, part 1


Missing values in R are indicated with NA. In contrast to programs like Stata or SAS, R does not treat NA as smaller or larger than any other value: a comparison that involves NA simply returns NA. The value is just not there.
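To make this concrete, here is a minimal sketch using a made-up vector x (not part of the data used in this post). It shows that comparisons with NA return NA, and that is.na() is the function to use when testing for missing values.

```r
# Toy vector with one missing value (made up for illustration)
x <- c(3, NA, 7)

# A comparison with NA is neither TRUE nor FALSE: the result is NA
greater_than_five <- x > 5

# Comparing to NA with == returns NA for every element,
# so it cannot be used to find missing values
equals_na <- x == NA

# is.na() is the correct test for missing values
is_missing <- is.na(x)

print(greater_than_five)
print(equals_na)
print(is_missing)
```

This is also why the loop further down counts missing values with is.na() rather than with ==.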

To see how many values are missing for every variable in your dataset, use the following loop (thank you Zach):


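# Replace data.frame with the name of your own data frame
for (Var in names(data.frame)) {
    # Count the missing values in this variable
    missing <- sum(is.na(data.frame[, Var]))
    # Only report variables that have at least one missing value
    if (missing > 0) {
        print(c(Var, missing))
    }
}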

For example:

for (Var in names(data2)) {
    missing <- sum(is.na(data2[,Var]))
    if (missing > 0) {
        print(c(Var,missing))
    }
}

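Output (the counts appear in quotes because c() converts them to character when combining them with the variable name):

[1] "weight" "432"
[1] "gender" "2"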

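If you prefer to avoid the loop, the same counts can probably be obtained in one step with colSums(). The sketch below uses a small made-up data frame called toy as a stand-in for data2; the variable names and values are assumptions for illustration only.

```r
# Toy data frame standing in for data2 (values are made up)
toy <- data.frame(
    id     = 1:5,
    weight = c(70, NA, 65, NA, 80),
    gender = c("f", "m", NA, "f", "m")
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
missing_counts <- colSums(is.na(toy))

# Keep only the variables with at least one missing value
missing_counts <- missing_counts[missing_counts > 0]

print(missing_counts)
```

Unlike the loop, this keeps the counts numeric, so they can be sorted or used in further calculations.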
 

However, this loop does not count incorrectly labeled missing values. Before continuing with your data analysis, make sure that every missing value is stored as a true NA and not, for example, as the text “NA”, which R treats as an ordinary string/character value.

One crude way to check is to scroll through your data in the Source pane (in RStudio) and look at how NA is displayed. If it appears in grey italics, the value is actually missing, and R will treat it as missing. If it appears in regular black type, it is a string/character. Likewise, missing values may be incorrectly labeled with other strings or numbers, such as “missing”, “<NA>”, “9999” (character) or 9999 (numeric). R treats these like any other value in that variable.

Another method is to create frequency tables for your variables of interest (see Exploring your data). Incorrectly labeled missing values show up as a category of their own, while real missing values are listed in the last category, called <NA>, provided the table is set to include missing values (for example with useNA = "ifany" in table()).
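As a rough illustration of the frequency table approach, the sketch below uses a made-up character vector weight_raw that contains two real missing values and three mislabeled ones. The name and the values are assumptions, not taken from the data in this post.

```r
# Toy variable with real NAs and several mislabeled ones (values are made up)
weight_raw <- c("70", "NA", "missing", NA, "9999", "65", NA)

# By default table() leaves real missing values out;
# useNA = "ifany" adds a <NA> category when any are present
freq <- table(weight_raw, useNA = "ifany")

print(freq)
```

In the printed table, “NA”, “missing” and “9999” each appear as an ordinary category, while the two real missing values are counted together under <NA> at the end.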

Convert any incorrect NA to a real NA using:

data.frame$variable[data.frame$variable == "incorrect missing indicator"] <- NA

Here, substitute “incorrect missing indicator” with whatever incorrect value you have found in your data. Keep the quotation marks if the value is a character (“NA”), and omit them if it is a number (9999).

For example:

data2$weight[data2$weight == "NA"] <- NA
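When a variable contains more than one incorrect indicator, one possible shortcut is to replace them all at once with %in%. The sketch below assumes a made-up data frame toy and a hypothetical list of indicators called bad_values; adjust both to whatever you actually find in your own data.

```r
# Toy data frame standing in for data2 (values are made up)
toy <- data.frame(
    weight = c("70", "NA", "missing", "9999", "65"),
    stringsAsFactors = FALSE
)

# Hypothetical list of incorrect indicators found while inspecting the data
bad_values <- c("NA", "missing", "9999")

# %in% returns only TRUE or FALSE, never NA,
# so it also works when the variable already contains real NAs
toy$weight[toy$weight %in% bad_values] <- NA

# The column is still character at this point;
# convert it once only numbers and real NAs remain
toy$weight <- as.numeric(toy$weight)

print(toy$weight)
print(sum(is.na(toy$weight)))
```

The conversion with as.numeric() is only needed because the incorrect labels forced the whole variable to be stored as character in this toy example.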

Whether to include missing values in your analysis will be discussed in a later post.


 

 
