Merging Datasets - Observations to Variables

Gin Nauwelaerts

Join Date: Feb 2024

Posts: 2
#1

Merging Datasets - Observations to Variables

15 Feb 2024, 03:06

Hi,
I have a question that I think is related to merging datasets.
I have one dataset with many observations about various topics. Then, I have a second dataset with a few observations about owning a business. The respondents are the same, although not all of them are in the business dataset, as this only includes the respondents that own a business. Some respondents own multiple businesses, and in this case, the business dataset has them registered as a separate observation. Hence, there may be multiple observations per respondent in this dataset.

I know that I can merge these with a one to many merge. However, I do not want multiple observation lines for the same respondent. I would like to have one observation line per respondent, including all there businesses and questions about those as variables.

Thus: E.g.:

Dataset 1:
Respondent
1
2
3
4

Dataset 2:
Respondent
2
2
3

Etc.

I would like:
Merged Data:
Respondent Business 1 Question about Business 1 Business 2

1 No ... No

2 Yes ... Yes

3 Yes ... No

4 No ... No

Etc.

Would anyone know how to get this done in Stata?

I hope my question was clear.
Many thanks in advance!
Best regards,
Gin
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10483

15 Feb 2024, 08:14

Context is important here. Why are the IDs in the second dataset repeated? Are they duplicates? Given that the variables are the same, is there any chance that a respondent answers differently on the same question? If so, how do we resolve that? If these are simply duplicates, the procedure is simple.

Code:

*DATASET 2
clear
input float respondent str3 business
2 "Yes"
2 "Yes"
3 "No"
end

duplicates drop *, force
rename business business2
tempfile 2
save `2'

*DATASET 1
clear
input float respondent str3 business
1 "No"
2 "Yes"
3 "Yes"
4 "No"
end
rename business business1
merge 1:1 respondent using `2', nogen
replace business2= "No" if missing(business2)

Res.:

Code:

. l

     +--------------------------------+
     | respon~t   busine~1   busine~2 |
     |--------------------------------|
  1. |        1         No         No |
  2. |        2        Yes        Yes |
  3. |        3        Yes         No |
  4. |        4         No         No |
     +--------------------------------+

Comment

Gin Nauwelaerts

Join Date: Feb 2024

Posts: 2
#3

15 Feb 2024, 08:55

Hi,
Thanks for your reply.
No, they are not duplicates. Every business is registered as a separate observation, so if a respondent is repeated, this means that they own multiple business; and thus that the answers to all the follow-up questions on that specific business are also different. Does that make sense?

Thanks,
Gin
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10483

16 Feb 2024, 00:49

One way to handle this is to create a variable for each business that an individual owns. Presumably, there is a variable that identifies which question relates to which business. You achieve this by reshaping wide the dataset. If an individual does not own multiple businesses, then whether it makes sense to replace his/her observations with "No" for a particular question is something that you need to decide.

Code:

*DATASET 2
clear
input float respondent str3 business float question
2 "Yes" 1
2 "No" 2
3 "No" 1
end

rename business business2_q
reshape wide business2_q, i(respondent) j(question)
tempfile 2
save `2'

*DATASET 1
clear
input float respondent str3 business
1 "No"
2 "Yes"
3 "Yes"
4 "No"
end
rename business business1
merge 1:1 respondent using `2', nogen
foreach q of varlist business2_q*{
    replace `q'= "No" if missing(`q')
}

Res.:

Code:

. l, abbrev(12)

     +------------------------------------------------------+
     | respondent   business1   business2_q1   business2_q2 |
     |------------------------------------------------------|
  1. |          1          No             No             No |
  2. |          2         Yes            Yes             No |
  3. |          3         Yes             No             No |
  4. |          4          No             No             No |
     +------------------------------------------------------+

Alternatively, you can aggregate the data by rephrasing the question. For example:

1. Q: Did you make any losses in the last quarter of 2023 on your main business? A: No
2. Q: Did you make any losses in the last quarter of 2023 on any of your other businesses? A: Yes

can be rephrased to:

Q: Did you make any losses in any of your businesses in the last quarter of 2023? A: Yes

The aggregation thus becomes "Yes" if there is at least one "Yes" response, "No" otherwise.

Code:

*DATASET 2
clear
input float respondent str3 business float question
2 "Yes" 1
2 "No" 2
3 "No" 1
end

bysort respondent: egen business2= max(business=="Yes")
contract respondent business2
drop _freq
tempfile 2
save `2'

*DATASET 1
clear
input float respondent str3 business
1 "No"
2 "Yes"
3 "Yes"
4 "No"
end
rename business business1
merge 1:1 respondent using `2', nogen
replace business2= 0 if missing(business2)
lab def business2 1 "Yes" 0 "No"
lab values business2 business2

Res.:

Code:

. l

     +--------------------------------+
     | respon~t   busine~1   busine~2 |
     |--------------------------------|
  1. |        1         No         No |
  2. |        2        Yes        Yes |
  3. |        3        Yes         No |
  4. |        4         No         No |
     +--------------------------------+

Respondent	Business 1	Question about Business 1	Business 2
1	No	...	No
2	Yes	...	Yes
3	Yes	...	No
4	No	...	No

Announcement