Exploring data with descriptive statistics and frequency tables


Once your variables are in the right shape, you can start examining your data. I like to start by running descriptive statistics and frequency analyses for the variables I’m interested in, to see whether there are any missing values or values outside the expected range. If you created your own data (e.g. from an experiment), you’re most likely very familiar with the specific variables and values. But if you’re dealing with survey data, missing values show up for several reasons: respondents skipped questions by accident or on purpose, or perhaps they weren’t asked a specific question because of their answers to previous questions or their demographics (the ‘legitimate skip’). Make sure to check the survey documentation first (both the codebook and the actual questionnaire) to find out the details of your variables of interest.

The first quick command you can use for numeric or logical values is summary( ):

summary(data.frame$variable)

For example:

summary(data2$weight)

Output:

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 
52   110     123    130  144     353  432

Here you can see right away that you have 432 missing values. These missing values are left out when the mean and the other statistics are calculated, so removing those cases yourself would not change the rest of the summary output.
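To see this behavior on a small scale, here is a minimal sketch with a made-up vector (the values are hypothetical and only stand in for data2$weight). It compares the NA count from summary( ) with a direct count, and checks that dropping the missing values by hand leaves the mean unchanged.

```r
# Hypothetical toy vector standing in for data2$weight
weight <- c(110, 123, NA, 144, 52, NA, 353)

s <- summary(weight)
print(s)

# Count the missing values directly; this should match the NA's entry above
n_missing <- sum(is.na(weight))
print(n_missing)

# Drop the NAs by hand and summarize again:
# the statistics stay the same, only the NA's entry disappears
s_complete <- summary(weight[!is.na(weight)])
print(s_complete)
```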

You can also use the same command to generate summaries for your whole data set, but this only makes sense if you have a limited number of variables:

summary(data2)

If you just want the mean of a variable, use the following line, and be sure to include the last part (na.rm = TRUE). If you leave it out, the result will be NA whenever any of the values are missing.

mean(data.frame$variable, na.rm = TRUE)

For example:

mean(data2$weight, na.rm = TRUE)

Output:

129.9744

Notice that this value differs slightly from the mean in the summary( ) output. It is the same mean; summary( ) just prints fewer significant digits.
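Both points can be tried out on a small made-up vector (hypothetical values, not the survey data). The sketch below shows what happens without na.rm = TRUE, and how rounding to four significant digits with signif( ) mimics the shorter value that summary( ) prints by default. The digits argument of summary( ) is used here to ask for more digits.

```r
# Hypothetical toy vector with one missing value
weight <- c(110.25, 123.5, NA, 156.173)

# Without na.rm = TRUE the missing value propagates and the result is NA
m_default <- mean(weight)
print(m_default)

# With na.rm = TRUE the mean is computed on the non-missing values
m <- mean(weight, na.rm = TRUE)
print(m)

# Rounding to 4 significant digits gives the shorter value
print(signif(m, 4))

# summary() prints few digits by default; ask for more with digits
print(summary(weight))
print(summary(weight, digits = 7))
```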

The describe function in the “psych” package gives a bit more detail than summary( ) (install the package first):

install.packages("psych")
library(psych)
describe(data.frame$variable)

For example:

describe(data2$weight)

     vars n     mean    sd    median trimmed  mad   min max  range skew  kurtosis se 
X1   1    2964  129.97  31.4  123    126.12   25.2  52  353  301   1.51  3.82     0.58

The package also has a ‘by’ version, describeBy, which gives you the statistics for subgroups:

describeBy(data.frame$variable1, data.frame$variable2)

Here variable 1 is the variable to describe, and variable 2 is the variable that defines the subgroups.

For example, if I want to obtain statistics on body weight for boys and girls separately:

describeBy(data2$weight, data2$gender)
 
$`0`
    vars n     mean   sd    median trimmed mad   min max  range skew  kurtosis se
 X1 1    1565  121.69 25.58 117    118.68  19.27 52  324  272   1.72  6.14     0.65

$`1`
    vars n     mean   sd    median trimmed mad   min max  range skew  kurtosis se
 X1 1    1399  139.24 34.58 130    135.64  29.65 69  353  284   1.25  2.52     0.92
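If you don’t want to install an extra package, a rough base R alternative is tapply( ). This is not what describeBy does internally, just a minimal sketch of the same idea (one statistic per subgroup), using a made-up data frame with hypothetical values.

```r
# Hypothetical toy data: weight with a missing value, gender coded 0/1
df <- data.frame(
  weight = c(110, 123, NA, 144, 130, 98),
  gender = c(0, 0, 0, 1, 1, NA)
)

# Mean weight per gender, leaving out missing weights.
# Cases with a missing gender do not belong to any subgroup and are dropped.
group_mean <- tapply(df$weight, df$gender, mean, na.rm = TRUE)
print(group_mean)

# Number of non-missing weights per gender (comparable to n in describeBy)
group_n <- tapply(!is.na(df$weight), df$gender, sum)
print(group_n)
```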

 

For character and logical values, you can use a table to see the frequencies for different entries. By default, the NA’s are ignored, unless you specify that you want to include them with the following code:

table(data.frame$variable, useNA = "always")

For example:

table(data2$gender, useNA = "always")

   0    1 <NA> 
1813 1581    2 

If you use “ifany” instead of “always”, the table only shows the count of missing values when there is at least one, and leaves that column out when there are none.
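The difference between the useNA settings is easiest to see side by side. Here is a small sketch with two made-up vectors (hypothetical values), one with a missing value and one without.

```r
# Hypothetical toy vectors: one with a missing value, one without
gender   <- c(0, 1, 1, NA, 0, 1)
complete <- c(0, 1, 1)

# Default: NAs are left out of the table
t_no <- table(gender)
print(t_no)

# "always": an NA column is shown, even when its count is zero
t_always <- table(gender, useNA = "always")
print(t_always)
t_always_complete <- table(complete, useNA = "always")
print(t_always_complete)

# "ifany": the NA column only appears when there are missing values
t_ifany_complete <- table(complete, useNA = "ifany")
print(t_ifany_complete)
```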

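Notice that the n for each gender is higher here than in the descriptive statistics above. That is because describeBy( ) only counts cases with a non-missing weight. If you look at the missing values in the summary( ) output (NA’s = 432), you can see that everything adds up: 1813 + 1581 + 2 = 1399 + 1565 + 432 = 3396, the total number of cases. Note that this only works out because the 2 cases with a missing gender are also missing a weight; otherwise the subgroup n’s (1399 + 1565 = 2964) would add up to less than the overall n for weight (2964).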
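Rather than relying on the arithmetic, you can check directly how the missing values of two variables overlap. This is a minimal sketch on a made-up data frame (hypothetical values); with the real data you would use data2$weight and data2$gender instead.

```r
# Hypothetical toy data with missing values in both variables
df <- data.frame(
  weight = c(110, NA, 123, NA, 144, 98),
  gender = c(0, 0, 1, NA, 1, 1)
)

# Cross-tabulate the two missingness indicators
overlap <- table(
  weight_missing = is.na(df$weight),
  gender_missing = is.na(df$gender)
)
print(overlap)

# Number of cases that are missing on both variables
both_missing <- sum(is.na(df$weight) & is.na(df$gender))
print(both_missing)

# Number of cases that are complete on both variables
n_complete <- sum(complete.cases(df[, c("weight", "gender")]))
print(n_complete)
```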

 

To create two-way tables, use the CrossTable function in the “gmodels” package (install first).

install.packages("gmodels")
library(gmodels)

CrossTable(data.frame$variable1, data.frame$variable2, prop.chisq = FALSE, missing.include = TRUE)

Variable 1 will become the row variable, and variable 2 the column variable. This function has a lot of options (check out the help information in R), and one of the default options is to display the Chi-square contribution. Since we’re not interested in that (yet), I’ve turned that off for now.

The last part (missing.include = TRUE) includes the missing values in the table. By default, the table shows proportions of rows, columns, and the whole table. Turning missing.include off changes all of these proportions, because the missing values are no longer part of the cross table.
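To see why the proportions change, here is a rough sketch using base R’s table( ) and prop.table( ) instead of CrossTable. It is an alternative way to get the same kind of numbers, not a reproduction of the CrossTable output, and the data frame is made up (hypothetical values).

```r
# Hypothetical toy data: gender (0/1) and grade (1, 2, 5), both with missing values
df <- data.frame(
  gender = c(0, 0, 1, 1, 1, NA),
  grade  = c(1, 2, 1, 5, NA, 2)
)

# Two-way counts with and without the missing values
counts_all      <- table(df$gender, df$grade, useNA = "ifany")
counts_complete <- table(df$gender, df$grade)

# Proportions of the whole table: the denominator differs (6 cases vs 4 cases)
prop_all      <- prop.table(counts_all)
prop_complete <- prop.table(counts_complete)
print(prop_all)
print(prop_complete)

# Row proportions (margin = 1); use margin = 2 for column proportions
row_prop <- prop.table(counts_complete, margin = 1)
print(row_prop)
```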

Instead of proportions, you can also opt to display percentages. In that case, specify the “SPSS format” instead of the default “SAS format” in the function like this:

CrossTable(data.frame$variable1, data.frame$variable2, prop.chisq = FALSE, missing.include = TRUE, format = "SPSS")

Example of “SAS format” output, creating a cross table of gender (0, 1, or missing) and grade (1, 2, 5, or missing):

CrossTable(data2$gender, data2$grade, prop.chisq = FALSE, missing.include = TRUE)

[Image: CrossTable output in “SAS format” for gender by grade]

 

