How can I merge stacked, longitudinal datasets with string variable ID's in Stata?

2.4k views Asked by At

I have been trying unsuccessfully to merge three Stata files that I originally imported from MS Excel against a 6 character string ID code (e.g. n5fpeb). They are stored as str6 variables in the Stata data files.

I have recorded some other variables that also identify each observation--a numeric participant ID and a school ID number, as each participant is a school pupil. My master dataset is in stacked format, because my data is longitudinal. When I attempt a 1:m merge (i.e. merge 1:m id using "C:\Users ... May.dta", generate(_merge1)), Stata returns the following error message : variable id does not uniquely identify observations in the master data.

I've read various guides, but can't figure out why the datasets won't merge. Could I be using the wrong command? Or perhaps the string variables, or multiple string variables, are confusing Stata? I'd like to learn how to cleanly add future observations to my master dataset.

2

There are 2 answers

1
Danielle On BEST ANSWER

The solution to your problem depends on what exactly you are trying to merge with your master dataset. From your description, I think your master data has each participant identified by numeric ID or alternatively by the string ID. Since you mention it is a longitudinal, stacked file, I would guess there is also a year variable (or some other time variable).

If the dataset you are trying to merge in is more observations of the same data (same variables, just more years perhaps), look into the append command.

If the dataset you are trying to merge in is at the participant level, then William is right and you want m:1. This is because you have many observations of the same participant in your master file stacked on top of each other. 1:m expects to find only one copy of each id in the stack.

If the dataset you are trying to merge in is at the particpant-year level (i.e. is also longitudinal), then you want merge 1:1 id year ... (or whatever your time variable is). This will work if there is only one copy of each id and year pair (i.e. only one record for each participant for each time period) in your stack of observations. Be warned, if your data is not clean and there is more than one, this will not work.

Hope this helps!

1
Roberto Ferrer On

I'll try to explain the error

variable id does not uniquely identify observations in the master data

Take a look at this dataset

clear
input ///
id date metric
1 25 45
1 26 98
1 27 54
2 23 09
2 24 13
2 25 87
end

list, sepby(id)

It's longitudinal (panel) data just as you say yours is. Although we have a variable named id, that doesn't mean that the variable uniquely identifies the observation. In fact, it doesn't. If I were to cover (say with a sheet of paper) the information on date and metric, there would be no way for you to identify uniquely each observation based on the variable id; for all three observations of each subject, id takes on the same value.

Now slide your sheet of paper so you can see both id and date. Because the combination of both variables creates unique-valued pairs (i.e. 1/25, 1/26, 1/27, 2/23, ...), then both variables, taken together, uniquely identify observations. Note that the merge command accepts a varlist, so you can specify several variables.

Stata is complaining bacause in your merge 1:m id, it expects id to uniquely identify the observation, but it doesn't. Additionally, it says the problem is with the master data, i.e., the one currently loaded.

Run

duplicates report id
duplicates report id date

after loading the example dataset, and you'll see duplicates when using id alone, but no duplicates for id date.

If you provide representative example datasets (master and using), along with how the final dataset is supposed to look like, you're likely to get specific help with the commands you need. See this page for guidelines.