How to generate a variable based on conditions on other variables?

Titir Bhattacharya

Join Date: Mar 2019
Posts: 226

How to generate a variable based on conditions on other variables?

30 Jul 2022, 08:57

Hi statalist,

Apologies for the vague question title. Hopefully the following details would be able to elaborate on the question properly.

Please consider the following example data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str15 caseid byte(interview_year birthorder_01 birthorder_02 birthorder_03 birthyear_01 birthyear_02 birthyear_03 csection_1 csection_2 csection_3)
"    2  1   9  1" 92 2 1 . 88 85  . . . .
"    2  1  17  2" 92 4 3 2 92 89 85 . . .
"    2  1  33  1" 92 3 2 1 77 73 67 . . .
"    2  1  33  3" 92 1 . . 89  .  . . . .
"    2  1  33  2" 92 . . .  .  .  . . . .
"    2  1  41  2" 92 3 2 1 91 86 82 . . .
"    2  1  49  2" 92 3 2 1 64 62 58 . . .
"    2  1  57  2" 92 5 4 3 88 85 84 0 . .
"    2  1  65  2" 92 3 2 1 91 86 85 . . .
"    2  1  73  2" 92 4 3 2 91 89 86 . . .
"    2  1  89  2" 92 6 5 4 71 70 67 . . .
"    2  1  89  4" 92 1 . . 91  .  . . . .
"    2  1  89  7" 92 . . .  .  .  . . . .
"    2  1 105  2" 92 . . .  .  .  . . . .
"    2  1 113  2" 92 3 2 1 86 85 82 . . .
"    2  1 121  4" 92 2 1 . 91 88  . . . .
"    2  1 121  2" 92 6 5 4 76 71 69 . . .
"    2  1 129  1" 92 1 . . 83  .  . . . .
"    2  1 137  3" 92 1 . . 91  .  . . . .
"    2  1 145  2" 92 3 2 1 92 89 86 . . .
"    2  1 153  3" 92 3 2 1 87 82 79 . . .
"    2  1 177  2" 92 1 . . 65  .  . . . .
"    2  1 201  2" 92 1 . . 78  .  . . . .
"    2  1 209  6" 92 7 6 5 65 64 63 . . .
"    2  1 209  2" 92 3 2 1 86 83 79 . . .
"    2  1 217  2" 92 2 1 . 88 84  . . . .
"    2  1 233  2" 92 3 2 1 88 84 81 . . .
"    2  1 241  2" 92 1 . . 88  .  . . . .
"    2  1 249  2" 92 . . .  .  .  . . . .
"    2  1 249  3" 92 3 2 1 89 86 83 . . .
"    2  1 257  2" 92 6 5 4 82 80 71 . . .
"    2  1 265  2" 92 . . .  .  .  . . . .
"    2  1 273  1" 92 6 5 4 78 73 66 . . .
"    2  1 281  2" 92 3 2 1 87 84 83 . . .
"    2  1 289  2" 92 4 3 2 87 85 82 . . .
"    2  2  24  2" 92 . . .  .  .  . . . .
"    2  2  54  2" 92 5 4 3 74 73 70 . . .
"    2  2  54  4" 92 3 2 1 90 88 87 . . .
"    2  2  54  7" 92 1 . . 91  .  . 0 . .
"    2  2  84  2" 92 4 3 2 91 88 87 0 0 .
"    2  2  99  3" 92 2 1 . 90 87  . . . .
"    2  2 114  2" 92 6 5 4 80 77 75 . . .
"    2  2 129  2" 92 . . .  .  .  . . . .
"    2  2 159  2" 92 6 5 4 81 78 75 . . .
"    2  2 174  2" 92 4 3 2 81 78 76 . . .
"    2  2 189  3" 92 1 . . 92  .  . . . .
"    2  2 189  1" 92 1 . . 67  .  . . . .
"    2  2 219  2" 92 . . .  .  .  . . . .
"    2  2 234  2" 92 5 4 3 77 75 74 . . .
"    2  2 249  2" 92 4 3 2 86 83 81 . . .
"    2  2 264  7" 92 3 2 1 78 75 72 . . .
"    2  2 264  8" 92 . . .  .  .  . . . .
"    2  2 294  1" 92 2 1 . 76 73  . . . .
"    2  2 309  2" 92 4 3 2 77 70 68 . . .
"    2  2 309  5" 92 4 3 2 89 87 86 0 . .
"    2  2 324  5" 92 3 2 1 85 82 79 . . .
"    2  2 339  2" 92 3 2 1 85 83 80 . . .
"    2  2 354  2" 92 2 1 . 88 85  . . . .
"    2  2 384  2" 92 1 . . 90  .  . . . .
"    2  2 414  3" 92 1 . . 90  .  . 0 . .
"    2  2 429  2" 92 5 4 3 88 87 85 . . .
"    2  2 444  3" 92 4 3 2 86 82 81 . . .
"    2  2 459  2" 92 1 . . 90  .  . 0 . .
"    2  2 474  2" 92 3 2 1 76 73 70 . . .
"    2  3   1  2" 92 3 2 1 90 87 84 . . .
"    2  3  19  2" 92 . . .  .  .  . . . .
"    2  3  25  2" 92 . . .  .  .  . . . .
"    2  3  37  2" 92 6 5 4 83 81 75 . . .
"    2  3  37  4" 92 . . .  .  .  . . . .
"    2  3  43  4" 92 . . .  .  .  . . . .
"    2  3  49  2" 92 3 2 1 78 76 72 . . .
"    2  3  55  2" 92 2 1 . 82 78  . . . .
"    2  3  67  2" 92 5 4 3 91 87 85 . . .
"    2  3  73  2" 92 3 2 1 86 85 82 . . .
"    2  3  79  1" 92 4 3 2 81 79 73 . . .
"    2  3  85  5" 92 4 3 2 89 88 86 . . .
"    2  3  91  5" 92 1 . . 91  .  . 0 . .
"    2  3  97  2" 92 1 . . 89  .  . . . .
"    2  3 109  2" 92 3 2 1 86 84 82 . . .
"    2  3 115  2" 92 . . .  .  .  . . . .
"    2  3 121  2" 92 2 1 . 82 71  . . . .
"    2  3 127  2" 92 . . .  .  .  . . . .
"    2  3 133  2" 92 . . .  .  .  . . . .
"    2  3 139  2" 92 4 3 2 81 79 77 . . .
"    2  3 145  2" 92 2 1 . 90 85  . 0 . .
"    2  3 151  2" 92 3 2 1 91 85 81 . . .
"    2  3 163  2" 92 2 1 . 85 82  . . . .
"    2  3 169  2" 92 2 1 . 87 85  . . . .
"    2  3 175  2" 92 4 3 2 70 68 66 . . .
"    2  3 181  1" 92 1 . . 85  .  . . . .
"    2  3 187  6" 92 1 . . 91  .  . 0 . .
"    2  3 187  2" 92 . . .  .  .  . . . .
"    2  3 193  2" 92 4 3 2 92 91 84 . . .
"    2  3 199  2" 92 3 2 1 84 80 77 . . .
"    2  3 205  1" 92 3 2 1 76 73 69 . . .
"    2  3 211  2" 92 2 1 . 70 67  . . . .
"    2  3 211  4" 92 1 . . 91  .  . . . .
"    2  3 217  6" 92 . . .  .  .  . . . .
"    2  3 229  2" 92 5 4 3 73 71 66 . . .
"    2  3 241  1" 92 2 1 . 72 70  . . . .
end
label values csection_1 LABH
label values csection_2 LABH
label values csection_3 LABH
label def LABH 0 "no", modify

This data provides record of a woman's birth history for the last 3 births, where suffix _01 indicates the latest birth. I want to consider births that have taken place in the last 5 years. I'll consider birthorder_01 if

Code:

interview_year-birthyear_01<=5

, consider birthorder_02 if

Code:

interview_year-birthyear_02<=5

and so on.

Now amongst these births that occur in the last 5 years, I want to see if any of the births were a c-section delivery. So ideally I'd like to generate a variable "ever_csection" which would take value 1 if for any of the eligible births , there was csection delivery. If none of the eligible births in the last 5 years was csection delivery, then the variable would be 0.

Some problems I'm facing:
1. how to limit the conditions for generating "ever_csection" variable to only births that happened in the last 5 years

2. for some women, births have happened in the last 5 years, but the corresponding csection variable is missing. For instance, for caseid 2191, there is one eligible birth. But the corresponding csection variable is missing. I would like the "ever_csection" variable to have value 0. Similarly, caseid 21172 has had 2 eligible births, but the corresponding csection information is missing, in which case the "ever_csection" variable will take value 0. There can be other such combinations where one eligible birth has no csection and the other eligible birth has has misisng on csection, in which case "ever_csection would be 0". Some women might not have any eligible births in which case "ever_csection" would be missing. I could try writing out all these individual conditions but in the full data I have a longer birth history and it becomes time consuming to write out all the individual conditions.

To be precise conditions would be as follows:
1. any eligible birth has csection, ever_csection=1
2. none of the eligible births has csection, ever_csection=0
3. some eligible births don't have csection, some eligible births have missing on csection, ever_csection=0
4. for all eligible births, csection is missing, ever_csection=0
5. no eligible births, ever_csection=missing

Is there any concise way of incorporating all these conditions in a code?

I would be grateful if anyone here can help me out.

Please let me know if you need further clarification on any part of the question.

Thanks,

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

30 Jul 2022, 15:02

The major obstacle here is that your data are in wide layout. Like most things in Stata data management and analysis, this is best done with the data converted to long. In fact, it's just a one liner then.

Code:

reshape long birthorder_0 birthyear_0 csection_, i(caseid) j(seq) rename *_0 * rename csection_ csection by caseid, sort: egen ever_csection = /// max(cond(interview_year-birthyear <= 5, csection == 1, .))
Comment
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#3

31 Jul 2022, 06:11

Originally posted by Clyde Schechter View Post

The major obstacle here is that your data are in wide layout. Like most things in Stata data management and analysis, this is best done with the data converted to long. In fact, it's just a one liner then.

Code:

reshape long birthorder_0 birthyear_0 csection_, i(caseid) j(seq) rename *_0 * rename csection_ csection by caseid, sort: egen ever_csection = /// max(cond(interview_year-birthyear <= 5, csection == 1, .))

Thanks Clyde for replying. I tried this in my full data but facing a little problem. In my full data, I have 20 birth orders, and 20 birth years, and 6 csection variables named like
csection_1...csection_6 The birth order/birth year variables are named like
birthorder_01... birthorder_20 and birthyear_01... birthyear_20. The code works perfectly for variables like csection, but for birth order and birth year variables, it registers only till birthorder_09/birthyear_09 Could you suggest some modifications to the code? Thanks again
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#4

31 Jul 2022, 10:53

The code works perfectly for variables like csection, but for birth order and birth year variables, it registers only till birthorder_09/birthyear_09

I don't understand what this means. The code does not work "for" any particular variables. The variable ever_csection is, ultimately, a property of the caseid: it takes on the same value for all observations of the same caseid because it is identifying whether or not any csection took place among births in the last five years.

The best interpretation I can place on your question is that you have data on up to 20 births for these women, but you have information about csections for only the first 6 of them. If the data have nothing to say about c-sections for the rest of the births, there is no way that any code will "work for" the rest of those births.

I suspect that I am misunderstanding something here. So please post back showing the kind of results you are getting, and show where and how they are different from what you want, as well as explaining as best you can how you arrive at the results you want.

Added:
Cancel that: I see what you mean. The problem is that the starting variables have inconsistent naming patterns: for birthyear and birthorder, the serial number is 0-padded, whereas for csection it is not. The code I wrote dealt with that in a way that would leave birthyears and birthorders 10+ hanging, because it assumed that birthyears and birthorders variables always have a leading 0. So the solution is to make the original naming pattern consistent. The easiest way to do that is to eliminate the leading zeroes from the original variable names.

Code:

rename birthyear_0* birthyear_* rename birthorder_0* birthorder_* reshape long birthorder_ birthyear_ csection_, i(caseid) j(seq) rename *_ * by caseid, sort: egen ever_csection = /// max(cond(interview_year-birthyear <= 5, csection == 1, .))

Last edited by Clyde Schechter; 31 Jul 2022, 11:04.
Comment
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#5

01 Aug 2022, 01:18

Originally posted by Clyde Schechter View Post

I don't understand what this means. The code does not work "for" any particular variables. The variable ever_csection is, ultimately, a property of the caseid: it takes on the same value for all observations of the same caseid because it is identifying whether or not any csection took place among births in the last five years.

The best interpretation I can place on your question is that you have data on up to 20 births for these women, but you have information about csections for only the first 6 of them. If the data have nothing to say about c-sections for the rest of the births, there is no way that any code will "work for" the rest of those births.

I suspect that I am misunderstanding something here. So please post back showing the kind of results you are getting, and show where and how they are different from what you want, as well as explaining as best you can how you arrive at the results you want.

Added:
Cancel that: I see what you mean. The problem is that the starting variables have inconsistent naming patterns: for birthyear and birthorder, the serial number is 0-padded, whereas for csection it is not. The code I wrote dealt with that in a way that would leave birthyears and birthorders 10+ hanging, because it assumed that birthyears and birthorders variables always have a leading 0. So the solution is to make the original naming pattern consistent. The easiest way to do that is to eliminate the leading zeroes from the original variable names.

Code:

rename birthyear_0* birthyear_* rename birthorder_0* birthorder_* reshape long birthorder_ birthyear_ csection_, i(caseid) j(seq) rename *_ * by caseid, sort: egen ever_csection = /// max(cond(interview_year-birthyear <= 5, csection == 1, .))

Thanks Clyde. This worked perfectly.

I have a small doubt about the logic behind max(cond()). I understand that cond(x,a,b) returns a if x is true and returns b if x is not true. However, I'm having difficulty understanding how it works when combined with max. It would be great if you could help me understand the logic behind the code or suggest resources which explain how the code works.

Thanks
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#6

01 Aug 2022, 09:31

Here's how it works. Consider all the observations of a single caseid. There are those where interview year is more than 5 years beyond birthyear. For those, cond(interview_year-birthyear <= 5, csection == 1, .) returns missing value. For those with interview_year within 5 years of birth year but with csection either missing or 0, cond(interview_year-birthyear <= 5, csection == 1, .) has a true condition so it returns the value of csection ==1, which is false, so the return value is 0. (Stata returns 0 for false logical expressions.). Finally, for those with interview_year witin 5 years of birthyear and csection == 1, the condition is again true, and this time csection == 1 is true. When calculating the result of a logical expression, Stata returns 1 for true. So cond(interview_year-birthyear <= 5, csection == 1, .) returns 1.

So among all the observations of a caseid, the results of evaluating cond(interview_year-birthyear <= 5, csection == 1, .) can be a mixture of missing values, zeroes, and ones. Now, if there are any observations that have interview year within 5 years of birth year and csection == 1, the value for that observation, as already noted, will be 1. And therefore the largest value among all of the caseid's observations will be 1, since 0, 1, and missing are the only possibilities. So max(cond(interview_year-birthyear <= 5, csection == 1, .)) will be 1.

Now, what if none of them have csection == 1 in an interview year within 5 years of birth year, but there is at least one observation with timing in that 5 year range. There will be no one's in this scenario, only some mix of zeroes and missing value, or perhaps all zeroes. (It won't be all missing values because we have stipulated there is at least one observation with an interview year in the 5 year range.) So the largest value will be 0, and max(cond(interview_year-birthyear <= 5, csection == 1, .)) will be 0.

Finally, there is the case where no observations have interview year within 5 years of birthyear. In that case, the condition of cond() is always false, and therefore all of the values of cond(interview_year-birthyear <= 5, csection == 1, .) are missing values. Then max(cond(interview_year-birthyear <= 5, csection == 1, .)) returns missing value.
Comment
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#7

01 Aug 2022, 12:46

Originally posted by Clyde Schechter View Post

Here's how it works. Consider all the observations of a single caseid. There are those where interview year is more than 5 years beyond birthyear. For those, cond(interview_year-birthyear <= 5, csection == 1, .) returns missing value. For those with interview_year within 5 years of birth year but with csection either missing or 0, cond(interview_year-birthyear <= 5, csection == 1, .) has a true condition so it returns the value of csection ==1, which is false, so the return value is 0. (Stata returns 0 for false logical expressions.). Finally, for those with interview_year witin 5 years of birthyear and csection == 1, the condition is again true, and this time csection == 1 is true. When calculating the result of a logical expression, Stata returns 1 for true. So cond(interview_year-birthyear <= 5, csection == 1, .) returns 1.

So among all the observations of a caseid, the results of evaluating cond(interview_year-birthyear <= 5, csection == 1, .) can be a mixture of missing values, zeroes, and ones. Now, if there are any observations that have interview year within 5 years of birth year and csection == 1, the value for that observation, as already noted, will be 1. And therefore the largest value among all of the caseid's observations will be 1, since 0, 1, and missing are the only possibilities. So max(cond(interview_year-birthyear <= 5, csection == 1, .)) will be 1.

Now, what if none of them have csection == 1 in an interview year within 5 years of birth year, but there is at least one observation with timing in that 5 year range. There will be no one's in this scenario, only some mix of zeroes and missing value, or perhaps all zeroes. (It won't be all missing values because we have stipulated there is at least one observation with an interview year in the 5 year range.) So the largest value will be 0, and max(cond(interview_year-birthyear <= 5, csection == 1, .)) will be 0.

Finally, there is the case where no observations have interview year within 5 years of birthyear. In that case, the condition of cond() is always false, and therefore all of the values of cond(interview_year-birthyear <= 5, csection == 1, .) are missing values. Then max(cond(interview_year-birthyear <= 5, csection == 1, .)) returns missing value.

Thank you so much Clyde for patiently explaining the working of the code. I really appreciate it.
Comment

Announcement

How to generate a variable based on conditions on other variables?

Comment

Comment

Comment

Comment

Comment

Comment