
Once your variables are in the right shape, you can start examining your data. I like to start by running descriptive statistics and frequency analyses for the variables I’m interested in, to see whether there are any missing values or values outside the expected range. If you created your own data (e.g. from an experiment), you’re most likely very familiar with the specific variables and values. But if you’re dealing with survey data, missing values show up for several reasons: respondents skipped a question by accident or on purpose, or perhaps they weren’t asked a specific question because of their answers to previous questions or their demographics (‘the legitimate skip’). Make sure to check the survey documentation first (both the codebook and the actual questionnaire) to find out the details of your variables of interest.

The first quick command you can use for numeric or logical values is summary( ):

summary(data.frame$variable)

For example:

summary(data2$weight)

Output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
     52     110     123     130     144     353     432

Here you can see right away that you have 432 missing values. These missing values are excluded when the mean and the other statistics are calculated, i.e. removing these cases first would give the same statistics in the summary output.
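To see this behavior on a small scale, here is a minimal sketch using a made-up vector (weight_demo is hypothetical and not part of data2). It compares the summary of a vector with missing values to the summary of the same vector after the NA's are dropped by hand:

```r
# Hypothetical toy vector; the values are made up for illustration.
weight_demo <- c(110, 123, NA, 144, NA, 130)

# summary() reports the number of NA's separately and computes the
# statistics on the non-missing values only.
s_with_na <- summary(weight_demo)
print(s_with_na)

# Dropping the NA's by hand first gives the same statistics.
s_without_na <- summary(weight_demo[!is.na(weight_demo)])
print(s_without_na)

# Compare the two means: both are based on the four observed values.
same_mean <- s_with_na[["Mean"]] == s_without_na[["Mean"]]
print(same_mean)
```

The only difference between the two outputs should be the extra NA's entry in the first one.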

You can also use the same command to generate summaries for your whole data set, but this only makes sense if you have a limited number of variables:

summary(data2)

If you want to just generate the mean of a variable, use the following line, and be sure to include the last part (na.rm = TRUE). If you don’t specify this, the result would be NA if any of the values are missing.

mean(data.frame$variable, na.rm = TRUE)

For example:

mean(data2$weight, na.rm = TRUE)

Output:

129.9744

Notice that this is a slightly different value than the one we obtained with the summary( ) command (130), because summary( ) rounds its output to fewer significant digits.
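A minimal sketch of both points, again with a made-up vector (weight_demo is hypothetical). It shows what happens when na.rm is left out, and uses signif( ) purely to illustrate how rounding to fewer significant digits changes the printed value; the choice of 4 digits here is an assumption for the example, not a statement about how summary( ) works internally:

```r
# Hypothetical toy vector; the values are made up for illustration.
weight_demo <- c(110.5, 123.25, NA, 144.75)

# Without na.rm = TRUE, a single missing value makes the result NA.
mean_default <- mean(weight_demo)
print(mean_default)

# With na.rm = TRUE, the mean is based on the three observed values.
mean_observed <- mean(weight_demo, na.rm = TRUE)
print(mean_observed)

# Rounding the same number to fewer significant digits
# (4 is an arbitrary choice for this illustration).
mean_rounded <- signif(mean_observed, 4)
print(mean_rounded)
```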

A little bit more detail than summary( ) can be obtained with the describe function in the “psych” package (install first):

install.packages("psych")
library(psych)
describe(data.frame$variable)

For example:

describe(data2$weight)

   vars    n   mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 2964 129.97 31.4    123  126.12 25.2  52 353   301 1.51     3.82 0.58

The package also has a ‘by’ version of this function, describeBy, with which you can obtain statistics for subgroups:

describeBy(data.frame$variable1, data.frame$variable2)

Here variable 1 is the variable to describe, and variable 2 is the variable that determines the subgroups.

For example, if I want to obtain statistics on body weight for boys and girls separately:

describeBy(data2$weight, data2$gender)

$`0`
   vars    n   mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 1565 121.69 25.58    117  118.68 19.27  52 324   272 1.72     6.14 0.65

$`1`
   vars    n   mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 1399 139.24 34.58    130  135.64 29.65  69 353   284 1.25     2.52 0.92
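If you don’t have the “psych” package at hand, a rough base R alternative for a few of these subgroup statistics is tapply( ). This is a swapped-in technique for illustration, not what describeBy does internally, and the demo data frame below is hypothetical:

```r
# Hypothetical toy data frame; the values are made up for illustration.
demo <- data.frame(
  weight = c(110, 123, NA, 144, 130, 152),
  gender = c(0, 0, 0, 1, 1, NA)
)

# tapply() splits weight by gender and applies a function to each group.
# Cases with a missing gender do not end up in any group.
group_mean <- tapply(demo$weight, demo$gender, mean, na.rm = TRUE)
group_n    <- tapply(demo$weight, demo$gender, function(x) sum(!is.na(x)))

print(group_mean)
print(group_n)
```

Note that the last case (weight 152, gender missing) is counted in neither subgroup, and the missing weight in group 0 is left out of that group’s n.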

For character and logical values, you can use a table to see the frequencies for different entries. By default, the NA’s are ignored, unless you specify that you want to include them with the following code:

table(data.frame$variable, useNA = "always")

For example:

table(data2$gender, useNA = "always")

   0    1 <NA>
1813 1581    2

If you use the word “ifany” instead of “always” for missing values, you will only see the number of missing values in the table if there are any, and not if there are zero missing values.
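A minimal sketch of the difference between the two settings, using two made-up vectors (gender_complete and gender_missing are hypothetical):

```r
# Hypothetical toy vectors; the values are made up for illustration.
gender_complete <- c(0, 1, 1, 0, 1)   # no missing values
gender_missing  <- c(0, 1, NA, 0, 1)  # one missing value

# "always" adds an <NA> column even when its count is zero.
t_always <- table(gender_complete, useNA = "always")
print(t_always)

# "ifany" leaves the <NA> column out when nothing is missing ...
t_ifany_complete <- table(gender_complete, useNA = "ifany")
print(t_ifany_complete)

# ... and shows it as soon as at least one value is missing.
t_ifany_missing <- table(gender_missing, useNA = "ifany")
print(t_ifany_missing)
```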

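Notice that the n for each gender is higher here than in the descriptive statistics above, because describeBy only counts the cases with a non-missing weight. If you add the missing values from the summary( ) output (NA’s = 432), you can see that everything adds up perfectly to the same total of 3396 cases (1813 + 1581 + 2 = 1399 + 1565 + 432). That only works out because no case has a known weight but a missing gender: such a case would be counted in the table, but in neither of the subgroups.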
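One way to check how the missing values of two variables relate to each other is to cross-tabulate their missingness. The sketch below uses a hypothetical toy data frame (demo is made up and not part of data2):

```r
# Hypothetical toy data frame; the values are made up for illustration.
demo <- data.frame(
  weight = c(110, NA, 123, NA, 144),
  gender = c(0, 1, NA, NA, 1)
)

# Cross-tabulate missingness: TRUE means the value is NA.
na_overlap <- table(
  weight_missing = is.na(demo$weight),
  gender_missing = is.na(demo$gender)
)
print(na_overlap)

# Number of cases with a known weight but a missing gender.
n_weight_only <- sum(!is.na(demo$weight) & is.na(demo$gender))
print(n_weight_only)
```

The cell where both are TRUE counts the cases in which both values are missing; the other off-diagonal cells count the cases where only one of the two is missing.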

To create two-way tables, use the CrossTable function in the “gmodels” package (install first).

install.packages("gmodels")
library(gmodels)
CrossTable(data.frame$variable1, data.frame$variable2, prop.chisq = FALSE, missing.include = TRUE)

Variable 1 will become the row variable, and variable 2 the column variable. This function has a lot of options (check out the help information in R), and one of the default options is to display the Chi-square contribution. Since we’re not interested in that (yet), I’ve turned that off for now.

The last part (missing.include = TRUE) turns on inclusion of the missing values in the table. By default, the table will include proportions of rows, columns, and the whole table. Turning this option off changes all proportions, because the missing values are then no longer part of the cross table.
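To illustrate why the proportions change, here is a rough base R sketch with table( ) and prop.table( ) rather than CrossTable itself (a swapped-in technique, so the layout differs from the CrossTable output). The demo data frame is hypothetical:

```r
# Hypothetical toy data frame; the values are made up for illustration.
demo <- data.frame(
  gender = c(0, 0, 1, 1, NA, 1),
  grade  = c(1, 2, 1, NA, 2, 5)
)

# Two-way counts with and without the missing values.
tab_with_na    <- table(demo$gender, demo$grade, useNA = "ifany")
tab_without_na <- table(demo$gender, demo$grade)

# Proportions of the whole table: the denominator is 6 cases when
# the missing values are included, and 4 cases when they are not.
prop_with_na    <- prop.table(tab_with_na)
prop_without_na <- prop.table(tab_without_na)

print(prop_with_na)
print(prop_without_na)
```

The cell for gender 0 and grade 1 holds one case in both tables, but its proportion differs because the total number of cases in the table differs.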

Instead of proportions, you can also opt to display percentages. In that case, specify the “SPSS format” instead of the default “SAS format” in the function like this:

CrossTable(data.frame$variable1, data.frame$variable2, prop.chisq = FALSE, missing.include = TRUE, format = "SPSS")

Example of “SAS format” output, creating a cross table of gender (0, 1, or missing) and grade (1, 2, 5, or missing):

CrossTable(data2$gender, data2$grade, prop.chisq = FALSE, missing.include = TRUE)
