Using joinby to combine monthly and yearly panel datasets

Gustav Egede Hansen

Join Date: May 2021

Posts: 94
#1

Using joinby to combine monthly and yearly panel datasets

20 Aug 2021, 04:46

Hi everybody

I have a quick (maby redundant) question about joinby. I have many different panel datasets that I would like to combine. My main dataset has monthly data on, e.g., salary, while I also have yearly data on gender, educational level, etc. I want to keep the monthly variation, so I do not want to collapse the data and average the salary yearly. I also assume that gender and educational level does not vary throughout the year. The joinby-command seems to do the trick, but I just want to consult the forum to see I am missing something (e.g. would it behave differently if I have much more data) or whether there is a better tool?

Here are two (very simple) example datasets:

Code:

clear input float(id year gender) 1 2010 1 end

And the main dataset:

Code:

clear input float(id year salary month ym) 1 2010 1 1 600 1 2010 1 2 601 1 2010 2 3 602 1 2010 1 4 603 1 2010 2 5 604 1 2010 1 6 605 1 2010 2 7 606 1 2010 1 8 607 1 2010 3 9 608 1 2010 4 10 609 1 2010 5 11 610 1 2010 1 12 611 end format %tm ym

And the joinby:

Code:

joinby id year using "filename.dta"

Best
Gustav
Tags: None
Ken Chui

Join Date: Aug 2014

Posts: 1058
#2

20 Aug 2021, 05:32

The next question is how you would like to deal with unmatched cases in either of the file. When -joinby- is used without any other option, it performs an "inner join" and will omit all unmatched cases from both files. In other words, the final product would only contain cases that are present in BOTH files. Check out -help joinby- and see other options inside -unmatched()- and make sure the results are what you want.
Comment
Gustav Egede Hansen

Join Date: May 2021

Posts: 94
#3

20 Aug 2021, 06:16

Thanks Ken! Really good point! I will definitely check it out. I will probably go with:

Code:

joinby id year using "filename.dta", unmatched(both) _merge(_merge)

and see how that looks before making a decision.

But other than that do you think joinby is the proper approach?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

20 Aug 2021, 07:44

Most users would choose the merge command, which is designed to handle unmatched observations directly. Assuming your example data is stored in Stata datasets yearly and monthly, but with an unmatched observation added to each dataset, we have

Code:

. use yearly, clear

. merge 1:m id year using monthly

    Result                      Number of obs
    -----------------------------------------
    Not matched                             2
        from master                         1  (_merge==1)
        from using                          1  (_merge==2)

    Matched                                12  (_merge==3)
    -----------------------------------------

. list, clean

       id   year   gender   salary   month        ym            _merge  
  1.    1   2009        1        .       .         .   Master only (1)  
  2.    1   2010        1        1       1    2010m1       Matched (3)  
  3.    1   2010        1        1       2    2010m2       Matched (3)  
  4.    1   2010        1        2       3    2010m3       Matched (3)  
  5.    1   2010        1        1       4    2010m4       Matched (3)  
  6.    1   2010        1        2       5    2010m5       Matched (3)  
  7.    1   2010        1        1       6    2010m6       Matched (3)  
  8.    1   2010        1        2       7    2010m7       Matched (3)  
  9.    1   2010        1        1       8    2010m8       Matched (3)  
 10.    1   2010        1        3       9    2010m9       Matched (3)  
 11.    1   2010        1        4      10   2010m10       Matched (3)  
 12.    1   2010        1        5      11   2010m11       Matched (3)  
 13.    1   2010        1        1      12   2010m12       Matched (3)  
 14.    1   2011        .        1       1    2011m1    Using only (2)

Again, the choice of technique is yours, but since you don't mention the merge command, I want to be sure you're aware of it. The joinby command's strength is handling the case where for examle n observations in year 2010 in one dataset need to match to each of m observations in year 2010 in the other dataset, giving n times m observations in the resulting dataset. The merge command cannot handle that.

Last edited by William Lisowski; 20 Aug 2021, 07:48.

Comment

Gustav Egede Hansen

Join Date: May 2021

Posts: 94
#5

20 Aug 2021, 08:03

Thanks William! You really helped me out. I am familiar with merge, but when I tried using it on my example data it did not work intially. But now - when I add an unmatched observation to each dataset - it works like a charm.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

20 Aug 2021, 10:14

I want to be clear – for the benefit of those who read this topic later – that merge will work equally well with your example datasets as presented in post #1. Adding unmatched observations is not a requirement, and their addition was not the solution to the original failure of your merge command.

Code:

. // merge 1:m with yearly as the main ("master") dataset . use yearly, clear . merge 1:m id year using monthly Result Number of obs ----------------------------------------- Not matched 0 Matched 12 (_merge==3) ----------------------------------------- . . // merge m:1 with monthly as the main ("master") dataset . use monthly, clear . merge m:1 id year using yearly Result Number of obs ----------------------------------------- Not matched 0 Matched 12 (_merge==3) -----------------------------------------

You did not show us the merge command that did not work, nor did you show us the error message that resulted, and you didn't ask about merge in post #1 but instead assumed that merge could not work and joinby was what you needed. So I cannot tell you why merge did not work, but my guess is that you made the wrong choice between 1:m, m:1, and 1:1 given however you assigned your datasets to their roles as the main dataset and the using dataset.

Code:

. // merge 1:1 does not work . use yearly, clear . merge 1:1 id year using monthly variables id year do not uniquely identify observations in the using data r(459);

A similar error message would have occurred had I specified m:1, or had I reversed the main and using datasets and used 1:m.

For merge to work, one of the datasets must have no more than 1 observation for any combination of the merge key values. The joinby command relaxes that requirement.
1 like
Comment
Gustav Egede Hansen

Join Date: May 2021

Posts: 94
#7

25 Aug 2021, 05:38

Thanks for the explanation William. You made what is going on very clear, which is highly appreciated!
Comment

Announcement