Append or joinby? Adding cases by matching on one variable and applying a condition for another variable

Amy Thom

Join Date: Jun 2021

Posts: 15
#1

Append or joinby? Adding cases by matching on one variable and applying a condition for another variable

13 Feb 2024, 11:49

I need to append new cases to a master dataset from a using dataset with two conditions

My master dataset contains 248 observations (children aged 5 to 17 years) from 111 families (variable: num_ea)

From my using dataset, I want to append observations for children aged <5 years AND from the same families (so matching on the variable num_ea)

I have read all the help for append and joinby, but I have been unsuccessful with all commands thus far. I'll be grateful for guidance! Thank you
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#2

13 Feb 2024, 12:05

Code:

use master_dataset, clear frame put num_ea, into(families) frame num_ea { duplicates drop } use using_dataset, clear frlink m:1 num_ea, frame(families) keep if !missing(families) frame drop families drop families append using master_dataset

As you did not provide example data, this code is untested and may contain typos or other errors. However, I believe it is substantially correct and will, at the very least, point you in the right direction.
1 like
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#3

13 Feb 2024, 12:07

From your description, I can say that -append- is inappropriate for your use case. Appending data is like a cut-and-paste procedure that just adds observations (rows) from one dataset to those of another.

Merging is when there is some information link between two datasets such that it only makes sense to match/merge one or more observation in one dataset with those of another. Merging can be variously accomplished using -frlink-, -merge- or -joinby- in official Stata. If you would like explicit help with code, please refer to the FAQ and post a minimal working data example (using -dataex-) for the two datasets so that we can try out the correct commands.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#4

13 Feb 2024, 12:18

It is not often I find myself in disagreement with Leonardo Guizzetti, but here we are.

While the situation in #1 is not completely clearly described, it sounds to me like her master data set contains partial data on the members of certain families and she wishes to add to it similar (perhaps even the same variables, one hopes) on the remaining members of those same families. While it is sometimes convenient to do this by -merge-, that leads to a data set in wide layout, with the data on the people (children < 5 years) in the using dataset "side by side" with the data of their other family members in the master data set. Apart from the general problems with data in wide layout, a more specific problem arises here. In at least some, and probably many, families there will be multiple people represented in the master data set and also more than one child < 5 years of age in the using data set. The question then arises how to decide which person's data in the master data set is to be linked to which child < 5's data in the using data set. -merge- will require making a unique choice for that. -joinby- will join every child < 5's data to that of every other person in the household from the master data set. I see no principled way to make a unique choice for -merge-, and the -joinby- solution makes no sense to me at all.

I believe O.P. was correct in seeking to append these two data sets: it then becomes a long data set with one observation per person. She simply faced the technical problem of excluding from the using data set children belonging to households that did not appear in the master data set. That problem has a fairly simple solution, as outlined in #2.
1 like
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#5

13 Feb 2024, 12:23

It's okay to disagree. This is why I asked for example datasets so that we could address any ambiguities. What let me to think that merging was more appropriate was the explicit mention to match on family id (and I was speaking generally about how merging can be accomplished in Stata, not suggesting that all of them were equally valid).

What I envisioned from #1 was that she had two datasets, one with child-level and another with family-level observations, and those needed to be matched by family. From that merged dataset, it's then easy to filter out based on age of the child and retain only matched family-children relationships.
Comment
Amy Thom

Join Date: Jun 2021

Posts: 15
#6

13 Feb 2024, 13:13

Thank you for your posts

Clyde is correct that I want to exclude from the using data set those children belonging to households that don't appear in the master data set

master dataset is 'allHHkids'

num_ea is the family identifier (note there are duplicates of this variable since it identifies multiple children in the same family)

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int num_ea byte num_ind str7 ind_id byte(c_gender age) 1 3 "1i3" 1 17 9 3 "9i3" 1 7 10 4 "10i4" 1 17 15 3 "15i3" 2 16 32 6 "32i6" 2 13 32 5 "32i5" 2 17 35 4 "35i4" 1 9 35 5 "35i5" 2 7 35 3 "35i3" 1 11 56 6 "56i6" 2 14 end
Comment
Amy Thom

Join Date: Jun 2021

Posts: 15
#7

13 Feb 2024, 13:14

I don't want to make a new data frame, per se - I would prefer to append cases to the master data set since there are many more variables than what I am able to show with dataex
Comment
Amy Thom

Join Date: Jun 2021

Posts: 15
#8

13 Feb 2024, 13:19

When I run Clyde's code

frame put num_ea, into(families)
frame num_ea {duplicates drop}

Stata returns

frame num_ea not found
r(111)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#9

13 Feb 2024, 13:46

Re #7 and #8.

The frame created is just a temporary workspace used to create a list of the values of num_ea that appear in the master data set. The actual appending takes place in the default frame, and the other frame (families) gets deleted.

That -frame num_ea { duplicates drop }- is my error. I'm sorry. I meant to say:

Code:

frame families { duplicates drop }

With that change, I believe the code will work correctly. Using the example data provided in #6, and a similar example I based on that for using data, it performs correctly.
Comment

Amy Thom

Join Date: Jun 2021
Posts: 15

#10

13 Feb 2024, 13:58

I ran the following code:

Code:

use "C:\Users\amyet\Dropbox\Academic\PhD_ISA\Work_in_Progress\Working_Data_Folder\StataProject_soc imp analy\Datasets\Stata\Subset1_milkprod_SocialLCA_v1_allHHkids.dta", clear

frame put num_ea, into(families)
frame families {
    duplicates drop
    }

use "C:\Users\amyet\Dropbox\Academic\PhD_ISA\Work_in_Progress\Working_Data_Folder\StataProject_soc imp analy\Datasets\Stata\A03_Menage_JF_recombined.dta", clear
frlink m:1 num_ea, frame(families)
keep if !missing(families)
frame drop families
drop families
append using "C:\Users\amyet\Dropbox\Academic\PhD_ISA\Work_in_Progress\Working_Data_Folder\StataProject_soc imp analy\Datasets\Stata\Subset1_milkprod_SocialLCA_v1_allHHkids.dta"

Unfortunately, the master dataset is still sitting with the same 248 cases/children after running this code. It does not appear that any additional cases/children were added to the master dataset.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29958

#11

13 Feb 2024, 14:12

I don't know why that would be. Here I show you your original master data set, and a toy using data set that I created, modeled on your master data set, with some children matching the master data on num_ea, and others not, followed by a run of the code and the results:

Code:

. use master_dataset, clear

. list, noobs clean

    num_ea   num_ind   ind_id   c_gender   age  
         1         3      1i3          1    17  
         9         3      9i3          1     7  
        10         4     10i4          1    17  
        15         3     15i3          2    16  
        32         6     32i6          2    13  
        32         5     32i5          2    17  
        35         4     35i4          1     9  
        35         5     35i5          2     7  
        35         3     35i3          1    11  
        56         6     56i6          2    14  

. use using_dataset, clear

. list, noobs clean

    num_ea   num_ind   ind_id   c_gender   age  
        37         3      1i3          2     1  
         9         3      9i3          1     2  
        10         4     10i4          2     1  
        15         3     15i3          1     1  
        32         6     32i6          2     4  
        32         5     32i5          2     2  
        35         4     35i4          1     1  
        35         5     35i5          1     1  
        49         3     35i3          2     2  
        56         6     56i6          2     2  

. do "C:\Users\clyde\AppData\Local\Temp\STDb5cc_000000.tmp"

. use master_dataset, clear

.
. frame put num_ea, into(families)

. frame families {
.     duplicates drop

Duplicates in terms of all variables

(3 observations deleted)
. }

.
. use using_dataset, clear

.
. frlink m:1 num_ea, frame(families)
(2 observations in frame default unmatched)

. keep if !missing(families)
(2 observations deleted)

. frame drop families

. drop families

. append using master_dataset

.
end of do-file

. list, noobs clean

    num_ea   num_ind   ind_id   c_gender   age  
         9         3      9i3          1     2  
        10         4     10i4          2     1  
        15         3     15i3          1     1  
        32         6     32i6          2     4  
        32         5     32i5          2     2  
        35         4     35i4          1     1  
        35         5     35i5          1     1  
        56         6     56i6          2     2  
         1         3      1i3          1    17  
         9         3      9i3          1     7  
        10         4     10i4          1    17  
        15         3     15i3          2    16  
        32         6     32i6          2    13  
        32         5     32i5          2    17  
        35         4     35i4          1     9  
        35         5     35i5          2     7  
        35         3     35i3          1    11  
        56         6     56i6          2    14  

.

You can clearly see that a correctly selected subset of the children in the using data set now appear in the combined result, at the beginning of the data set.

So something is different between what we are doing. I wonder to what extent your actual data sets deviate from the examples I am using. And is it possible that the coding of num_ea is different in the two data sets, so that there actually aren't any matches on the numeric values of num_ea between the two data sets?

In order for me to troubleshoot, you will have to post back with example data from both data sets, examples which reproduce the problem you are encountering.

Comment

An Dao

Join Date: Nov 2021
Posts: 18

#12

19 Feb 2024, 23:12

Dear Statalist,

I am learning how to use "local" syntax to restore parameters for further analysis. However, the commands did not go through, resulted in all the AIC values assigned with missing values

My commands and example data set are shown below, please give your advice. Thank you in advance

* Fitting model

qui zinb fiveyoddmfs SES4 babysex1 age_fdv plauque_2y Nodenvisit_L5Y motherage if split != 0, inflate(SES4) level(97.5) // model zinb for zero inflated distribution
ereturn list

* Using local to call and store value of e(ll) and e(k) from the return list

local ll_0=e(ll)
local parameter_0=e(k)

* Calculate AIC using values of e(ll) and e(k) from the above local calls

qui gen AIC = -2 * log(`ll_0') + (2 * `parameter_0')

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte fiveyoddmfs float(SES4 babysex1) double age_fdv long(plauque_2y Nodenvisit_L5Y) byte motherage
.  -5.300604 .  . 1 .  .
.   4.120511 0 24 1 1 33
0   .3542542 0 64 1 2 30
0  4.7029696 0 36 2 2 34
0  -3.051065 1  . 1 . 20
.  -.6050887 0  . 3 . 29
.  1.3122363 .  . 1 .  .
.  -3.945837 .  . 1 .  .
0   5.222175 1  . 2 2 36
.  -3.770617 0  . 1 . 26
.          . .  . 1 .  .
.   -2.95614 1  . 3 . 23
. -2.1549318 0 24 3 1 31
. -1.8057196 .  . 1 .  .
6  3.7224984 1  . 1 . 35
. -1.3404984 .  . 1 .  .
.  -2.700492 1 24 2 2 21
.   3.407414 0 24 1 2 35
.          . .  . 1 .  .
.          . .  . 1 .  .
.   1.185773 . 12 2 .  .
.  -5.068585 .  . 1 .  .
1    2.89098 1 30 2 3 33
.          . .  . 1 .  .
.   1.494044 0  . 1 .  .
0  1.8064017 1 24 2 1 24
.  1.6100533 0  . 4 . 20
.  1.9896634 0  . 1 1 35
2   3.118999 0 52 3 2 35
0          . 0 24 3 1 18
0 -2.3078828 0 30 2 3 41
.          . .  . 1 .  .
.   3.743583 .  . 1 .  .
.          . .  . 1 .  .
. -2.8125024 .  . 1 .  .
.          . 0  . 1 .  .
0   2.664365 1 24 3 2 27
0  2.1556137 1 24 3 3 31
1  -3.484615 1 18 3 3 23
0          . 1 17 3 5 25
.          . .  . 1 .  .
.  -.8042078 .  . 1 .  .
0          . 1  . 3 1 40
.          . .  . 1 .  .
.  -8.238112 .  . 1 .  .
0   2.557625 0 48 2 1 30
.   3.182209 1  . 1 .  .
0   .7825772 1 48 3 3 29
.   1.440104 0  . 1 . 39
.   1.826258 1  . 1 . 32
end
label values plauque_2y plauque_2y
label def plauque_2y 1 ".", modify
label def plauque_2y 2 "0", modify
label def plauque_2y 3 "1", modify
label def plauque_2y 4 "2", modify
label values Nodenvisit_L5Y Nodenvisit_L5Y
label def Nodenvisit_L5Y 1 "0", modify
label def Nodenvisit_L5Y 2 "1", modify
label def Nodenvisit_L5Y 3 "2", modify
label def Nodenvisit_L5Y 5 "4", modify

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29958

#13

20 Feb 2024, 08:50

There is nothing wrong with your local macros. The following code shows that they are correctly initialized and available:

Code:

. * Fitting model
.
. qui zinb fiveyoddmfs SES4 babysex1 age_fdv plauque_2y Nodenvisit_L5Y motherage, inflate(SES4) level(97.5) // model zinb for zero inflated di
> stribution

. ereturn list

scalars:
               e(rank) =  10
                  e(N) =  11
                 e(ic) =  31
                  e(k) =  10
               e(k_eq) =  3
               e(k_dv) =  1
          e(converged) =  1
                 e(rc) =  0
                 e(ll) =  -5.769437992233001
         e(k_eq_model) =  1
               e(ll_0) =  -8.529217385233322
               e(chi2) =  5.519558786000642
                  e(p) =  .7868719185032242
             e(N_zero) =  8
               e(df_m) =  6
              e(k_aux) =  1

macros:
            e(cmdline) : "zinb fiveyoddmfs SES4 babysex1 age_fdv plauque_2y Nodenvisit_L5Y motherage, inflate(SES4) level(97.5)"
                e(cmd) : "zinb"
            e(predict) : "zip_p"
            e(inflate) : "logit"
           e(chi2type) : "LR"
  e(deriv_useminbound) : "off"
                e(opt) : "moptimize"
                e(vce) : "oim"
              e(title) : "Zero-inflated negative binomial regression"
               e(user) : "zinb_llf"
          e(ml_method) : "e2"
          e(technique) : "nr"
              e(which) : "max"
             e(depvar) : "fiveyoddmfs"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 10
                  e(V) :  10 x 10
               e(ilog) :  1 x 20
           e(gradient) :  1 x 10

functions:
             e(sample)   

.
. * Using local to call and store value of e(ll) and e(k) from the return list
.
. local ll_0=e(ll)

. local parameter_0=e(k)

.
. * Calculate AIC using values of e(ll) and e(k) from the above local calls
.
. display `ll_0'
-5.769438

. display `parameter_0'
10

The reason you are getting all missing values for your AIC is because your formula for it is incorrect. As you can see, the value of `ll_0' is a negative number, so you cannot take its logarithm. The formula you are using is the general AIC formula, but it is not directly applicable to -zinb-, which is a more complex model than a simple regression. I do not know the correct formula to use here. Happily, though, you do not need to know that. The command -estat ic- will provide you with the correct value.

Comment

An Dao

Join Date: Nov 2021

Posts: 18
#14

20 Feb 2024, 23:19

Dear Clyde,

Thank you for your explanation for the reason why AIC values were all missing.
I've known that the command "estat ic" can help to get the value of AIC and BIC. However, in my case, i would likevto create a new frame in my data set which includes "fold n_t n rmse rmse_u ll AIC BIC). Unfortunately, i do not know how to call AIC and BIC values after "estat ic". Therefore, I headed to create AIC, BIC manually as you can see at my command in #12.

The below is the whole commands that i've run. The matter remains lying in the issue you've identified. I would find the way to call the values of "ll, AIC, BIC" after using post estimation "estat ic" or to find any other ways that help to call these values. Because, I would like finally to operate the last syntax of the below command, which is " frame post results (0) (`n_t') (r(N)) (`rmse_0') (`rmse_u_0') (`ll_0') (AIC_0) (BIC_0)".

Please give your advice, Thank you.

set seed 1234
gen rand = uniform()
egen split = cut(rand), group(10) // split data set into 10 folds assigned value from 0 to 9
tab split

frame create results fold n_t n rmse rmse_u ll AIC BIC

*Fit the model using training set (split != `i')
qui zinb fiveyoddmfs SES4 babysex1 age_fdv plauque_2y Nodenvisit_L5Y motherage if split != 0, inflate(SES4) level(97.5) // model zinb for zero inflated distribution

* Calculate AIC and BIC
local ll_0=e(ll)
local parameter_0=e(k)
local df_m = e(df_m) // declare variables with local scope. The purpose of using "local" is to limit the scope of a variable to the block of code in which it is declare
local n_t = e(N) // number of the observation that the model has been fitted
qui gen AIC_0 = -2 * log(`ll_0') + (2 * `parameter_0')
qui gen BIC_0 = -2 * log(`ll_0') + (`parameter_0') * log(`n_t') if split==0

* Predict yhat of the test set using parameters of the training set
predict yhat_test0 if split == 0

* Estimate residuals in the test set
qui gen res2_0 = fiveyoddmfs - yhat_test0 if split ==0 // residual
qui replace res2_0 = (res2_0)^2 if split == 0 // square residuals

// Calculate RMSE of unused group using coefficients of training set (split == `i')

qui sum res2_0 if split == 0, meanonly
local rmse_0 = sqrt(r(mean))
local rmse_u_0 = sqrt(r(sum)/(r(N) - `df_m' - 1))

// Save fold, n_t (n of training set), n (n of unused group), rmse and rmse_u (unbiased) in frame results:

frame post results (0) (`n_t') (r(N)) (`rmse_0') (`rmse_u_0') (`ll_0') (AIC_0) (BIC_0)

frame results: list, noob

Last edited by An Dao; 20 Feb 2024, 23:26.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#15

21 Feb 2024, 11:17

Unfortunately, i do not know how to call AIC and BIC values after "estat ic".

-estat ic- leaves it results in a matrix called r(S). Being in -r()-, rather than -e()-, these results are volatile, so you have to pick them up in a permanent matrix immediately.

So something like this:

Code:

// INSIDE YOUR LOOP qui zinb fiveyoddmfs SES4 babysex1 age_fdv plauque_2y Nodenvisit_L5Y motherage if split != 0, inflate(SES4) level(97.5) quietly estat ic local aic = r(S)[1, 5] ... frame post results ...(`aic')...

By the way, I think what is wrong with your original calculation is that you were treating e(ll) as if it were the likelihood, and then you were trying to take its logarithm. But e(ll) is, in fact, already the log likelihood. So I think if you change your original equation to

Code:

gen AIC = -2 * `ll_0' + (2 * `parameter_0')

you will get correct results. But, of course, there really is no reason for you to do this calculation at all, as I have just shown you how to pull AIC directly from the -estat ic- output.
Comment

Announcement