Merging data files with different variables



When your data files have been transformed into the right .rdata format (e.g. by using a program like StatTransfer), you can simply open them in RStudio using:

File → Open file… → (find and select the specific .rdata file you would like to open)
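If you prefer typing to clicking, base R's load() function should give the same result from the Console. The sketch below is only an illustration: the data frame and the file name data1.rdata are made up, and the file is first written to a temporary folder so the example runs on any machine.

```r
# Hypothetical example data, saved to a temporary .rdata file
# so that this sketch does not depend on files on your computer
data1 <- data.frame(idnumber = c(101, 102, 103), age = c(34, 51, 28))
path <- file.path(tempdir(), "data1.rdata")
save(data1, file = path)
rm(data1)  # remove it from the Environment to mimic a fresh session

# load() restores the saved objects under their original names
# and returns those names as a character vector
loaded <- load(path)
print(loaded)
str(data1)
```

With your own data, you would replace path with the location of your file, for example load("C:/mydata/data1.rdata") (a hypothetical path).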

Sometimes a complete database comes in multiple files, for example when additional variables have been collected over multiple years in a longitudinal study. In that situation, people who have filled out the surveys often have a unique identifying number, which can be used to merge the files. After opening the separate files (make sure they are both listed in the Environment tab), the following command can then be typed in the Console:

merged_data <- merge(data_frame_1, data_frame_2, by = "variable_1", all = TRUE)

For example:

data3 <- merge(data1, data2, by = "idnumber", all = TRUE)

This will create a new data set (data3) with all the variables from data1 and data2, matched on the identifiers in idnumber. The identifier variable needs to have the same name in both files. If the names are not equal, use the by.x and by.y arguments of merge instead (see the merge help page for details). Make sure to include straight quotation marks (" ") around the variable name. If there are more than two files to merge, you can create intermediate merged files and repeat the process.
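To illustrate both points, here is a minimal sketch with three small, made-up data frames (wave1, wave2, wave3 and their variables are hypothetical). It assumes the identifier is called idnumber in two files and id in the other, and it merges the three files in two steps.

```r
# Hypothetical data from three survey years
wave1 <- data.frame(idnumber = c(1, 2, 3), score_y1 = c(10, 12, 9))
wave2 <- data.frame(id       = c(1, 2, 3), score_y2 = c(11, 14, 8))
wave3 <- data.frame(idnumber = c(1, 2, 3), score_y3 = c(13, 15, 10))

# The identifier has a different name in each file:
# by.x names it in the first data frame, by.y in the second
merged12 <- merge(wave1, wave2, by.x = "idnumber", by.y = "id", all = TRUE)

# The result keeps the name given in by.x, so the third file
# can be merged onto the intermediate file with a plain by
merged_all <- merge(merged12, wave3, by = "idnumber", all = TRUE)
print(merged_all)
```

The intermediate object merged12 is only needed as a stepping stone; once merged_all exists, you can remove it with rm(merged12).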

all = TRUE means that all cases from both data1 and data2 are kept. If the outcomes for a person with identifying number 322 are missing from data2 (maybe they didn’t fill out the survey in the second year), the newly merged file will still contain the information for case 322 from data1. This creates missing values (NA) for the variables that come from data2, but it allows you to quantify and analyze the number of cases that failed to complete all surveys, for example.
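The effect of all = TRUE can be seen in a small sketch like the one below. The data and the variable names mood_y1 and mood_y2 are invented for illustration; person 322 is deliberately left out of the second file.

```r
# Hypothetical data: person 322 did not fill out the second survey
data1 <- data.frame(idnumber = c(321, 322, 323), mood_y1 = c(4, 2, 5))
data2 <- data.frame(idnumber = c(321, 323),      mood_y2 = c(3, 5))

# all = TRUE keeps every case from both files
data3 <- merge(data1, data2, by = "idnumber", all = TRUE)
print(data3)  # the row for 322 has NA for mood_y2

# Count the cases that are missing the second-year outcome
n_missing <- sum(is.na(data3$mood_y2))
print(n_missing)

# For comparison: without all = TRUE, merge() only keeps cases
# that appear in both files, so 322 is dropped
complete_only <- merge(data1, data2, by = "idnumber")
print(nrow(complete_only))
```

Which version you want depends on the analysis: keeping all cases lets you study dropout, while the default keeps only people with data in both files.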

