Looping (forvalues) to calculate pack-years of cigarette smoked

Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#1

Looping (forvalues) to calculate pack-years of cigarette smoked

11 Dec 2015, 12:31

Dear members of Statalist,

I am trying to derive pack years of cigarette from information on peoples smoking history using forvalue looping command. My main challenge is to take into account the different types of cigarette smoked (1= filtered, 2= handrolled) during same years of smoking. Example data set below shows 2 smoking patterns in 5 peoples lives

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str5 ID float(cig1_from cig1_to cig1_type cig1_amt cig2_from cig2_to cig2_type cig2_amt packday1 packday2) "131-1" 18 28 1 15 888 888 88 88 .75 0 "131-2" 14 18 1 5 19 25 1 10 .25 .5 "132-1" 16 25 1 5 16 25 2 5 .25 1.25 "132-2" 15 16 1 5 17 21 1 20 .25 1 "133-1" 14 23 1 20 20 30 2 24 1 6 end

The data for 131-2 is read as person smoked 5 filtered cigarettes from 14 to 18 years (including both years=5 yrs in total) and smoked 10 filtered cigarettes (cigarette type is coded 1 if filtered and 2 if handrolled) [Pattern 1) and filtered from 19 to 25 years including both periods [Pattern 2]. Packday1 and packday2 are derived by running code for step 1, as given below.I am following 3 steps for deriving pack-years which is intensity of smoking (packs per day) multiplied by total duration in years over the persons whole life. 1) Create pack intensity of smoking for each pattern (packday1, packday2 ) , 2) Create age specific intensities and 3) Its not a problem.

Code for Step 1 is given below , giving me packday1 and packday2 ( as shown in above data.)

forvalues a=1/2{
gen packday`a' =.
replace packday`a' = (cig`a'_amt/20) if cig`a'_type==1 // 1 pack of filtered cig=20 cigs
replace packday`a' = (cig`a'_amt/4) if cig`a'_type==2 // 1 pack of handrolled cig=4 cigs
replace packday`a'= 0 if cig`a'_type==88 // 88 or 888 represents absence of smoking
}

Issue is in running the code for the second step in people like 132-1 and 133-1. Code I used is given below.

*Create 30 variables=0, representing each year of life over 30 years.

forvalues i = 1(1)30 {
gen a`i' =0
}

*Replace the life year variables with corresponding packday(intensity) of smoking, if the person smoked (i.e, Gives me intensity per year of life)

forval b=1 (1) 30{
replace a`b' = packday1 if cig1_from <= `b'& (cig1_to+1) >`b' & cig1_type !=88
}
forval b=1 (1) 30{
replace a`b' = packday2 if cig2_from <= `b'& (cig2_to+1) >`b' & cig2_type !=88
}

For 132-1 whose data reads as "s/he smoked 5 filtered cigs between from 16 to 25 years. But S/he smoked 5 handrolled cigs in this same time period as well". For this person, after running the step 2 code, age variables from 16 to 25 has to be replaced by total intensity of 1.5 (intensity of filtered + handrolled) but running the above code is giving me only 1.25, the value of unfiltered cig. A similar issue for 133-1. How should I fix this error due to difference in type of cig smoked?

Any help is appreciated.

Thanks
Thekke purakkal
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29953

11 Dec 2015, 13:53

Well, as is usually the case, you are handicapped in your efforts by use of a wide data layout. This is much easier to do in long. Also, having missing values or implicit zeroes coded as 88 or 888 just makes trouble. The following code gets your more or less what you asked for, although the a* variables it generates only begin with 14 as that is the youngest age referenced in your data. It is not hard to generate a1-a13 and set them to zero if you really need that.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 ID float(cig1_from cig1_to cig1_type cig1_amt cig2_from cig2_to cig2_type cig2_amt packday1 packday2)
"131-1" 18 28 1 15 888 888 88 88 .75    0
"131-2" 14 18 1  5  19  25  1 10 .25   .5
"132-1" 16 25 1  5  16  25  2  5 .25 1.25
"132-2" 15 16 1  5  17  21  1 20 .25    1
"133-1" 14 23 1 20  20  30  2 24   1    6
end

drop packday* // THIS IS TO BE GENERATED!

preserve // SAVE ORIGINAL DATA

// RESHAPE TO LONG AND REPLACE ENCODED MISSINGS WITH MISSING
reshape long cig@_from cig@_to cig@_type cig@_amt, i(ID) j(episode)
mvdecode cig_*, mv(88 888)

// CALCULATE PACKDAYS PER FORMULA
gen packday = cig_amt/20 if cig_type == 1
replace packday = cig_amt/4 if cig_type == 2
replace packday = 0 if missing(packday)

// CREATE A SEPARATE OBSERVATION FOR EACH YEAR OF AGE
expand cig_to-cig_from + 1
by ID episode, sort: gen int age = cig_from - 1 + _n
drop if missing(age)

// ADD UP PACKDAYS IN EACH YEAR OF AGE
// CREATE A FILE WITH JUST THIS DATA
// RESHAPED WIDE SO IT CAN MERGE BACK TO ORIGINAL
collapse (sum) a=packday, by(ID age)
reshape wide a, i(ID) j(age)
tempfile age_intensity
save `age_intensity'

// BRING BACK ORIGINAL AND MERGE
restore
merge 1:m ID using `age_intensity'

That said, wherever you're going next with this data analysis, it's probably better to keep the data in long rather than reverting to the wide layout that caused so much trouble in the first place. So, if it were me, I'd probably stop after the -collapse- statement and just work from there. But that's up to you. If you really do go back to the long form, I strongly encourage you to replace those 88 and 888 codes with true missing values or zeroes as appropriate. They're never going to help you, and they will often get in the way.

Comment

Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#3

11 Dec 2015, 16:19

Hello Prof. Schechter,

Thank you very much for the quick response. I will try this out and get back to you. Basically the data is from a case-control study. Hence the long data format was not used initially. I myself had had to reshape to long for some time to event model analysis. So I understand what you mean.

Thanks
Thekke Purakkal
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#4

08 Jan 2016, 22:37

Hello Prof. Schechter,

Thank you very much for your help with the above codes. I made some modifications (due to soem specifications in my data set) and calculated the cumulative life time consumption of pack_years in my case-control data set. I also had an alcohol variable in similar form and I calculated the lifetime cumulative alcohol consumption for each participant in my data. I have been running various analyses with these and my next task is testing interaction. For example I want to know if there is any difference in risk for my binary outcome among alcohol consumers stratified by another variable(binary). I have posted related queries on this in the topic Plotting restricted cubic-spline graph, stratified by another binary variable. I had seen you and some others members here comment on using the margins command for visualizing interactions. I tried these. The graphs were excellent. However, I came to understand that mine is a case-control data and because I do not have an idea about the baseline risk in the population, using margins and dy/dx commands for interactions won't be a feasible option (What is your take on this?). Furthermore, although alcohol is not associated with the binary outcome in its linear form in my data, a 'statistically significant' non linear association has been identified using restricted cubic splines. Hence, I am using RC splines in my interaction analysis as well. Furthermore, I am spline transforming the centered (by mean) alcohol variable. My plans are,
1) First assess deviation from multiplicativity or additivity performing interaction between binary form of alcohol variable (0,1)) and the binary third variable (Z).
2) See how a restricted cubic splines curve for alcohol behaves in the 2 groups of the z variable (2 different graphs). The graphs would have log odds in the y axis and alcohol in liters (continuous) on the X axis.

In this scenario, what is your best advise on how to proceed with the interaction analysis?

thanks

Last edited by Thekke Purakkal; 08 Jan 2016, 22:41.
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#5

01 Feb 2016, 08:51

Hi,
I have a complication related to question asked in #1. Following advise by Clyde in #2, I am using the long data format (given below). The question asked in #1 was for calculation of pack years. However, now, for each individual, I want to calculate the total duration of smoking over the lifetime and frequency [average number of cigarette (if tobacco type is 2, it will be converted to cigarette) smoked per day].
The complication here is that for individuals like 131-2 who has continuous episodes of smoking over life, e.g., 14-18, 19-25 etc over their total smoking period of 14 to 25, the duration of smoking during episode 1 must be 18-14=4 years and duration of 2nd episode must include both 19 and 25, that is 7 years, and not 25-19=6 years. For an individual like 133-2, the duration of smoking over life from 20 to 45 with 2 discontinuous episodes, i.e 20 to 30 and 40 to 45 with no smoking between these 2 periods, the duration for episode 1 must be 30-10=10 years and for 2 must be 45-40=5 years.
Also, for calculating frequency , which I define as number of cigarettes smoked per day, individuals have different frequencies of smoking in different episodes. So how can I get an approximation of average cigarette smoked per day by each person over the life time? In my full data set, I have 8 episodes and I have information of smoking history from 1 to 95 years of age. 88 and 888 are for a purpose and cannot be coded as missing data. They have to be retained as it is.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str5 ID byte episode float(cig_from cig_to cig_type cig_amt) "131-1" 1 18 28 1 15 "131-1" 2 88 88 888 88 "131-2" 1 14 18 1 5 "131-2" 2 19 25 1 10 "132-1" 1 16 25 1 5 "132-1" 2 16 25 2 5 "132-2" 1 15 16 1 5 "132-2" 2 17 21 1 20 "133-1" 1 14 23 1 20 "133-1" 2 20 30 2 24 "133-2" 1 20 30 1 10 "133-2" 2 40 45 1 15 end

thanks for your consideration.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#6

01 Feb 2016, 09:24

I don't understand why for some episodes you want to include both endpoints and not for others. If there is some rule for identifying from the data which is which, please explain it. It seems to me that you should always include both endpoints. For example, 131-2 smoked at one intensity from 14-18 and another from 19-25. But this means he/she smoked continuously from 14 through 25, which is 25-14 + 1 = 12 years, which is not 7 + 4, but 7 + 5.

Anyway, for now I'm going to ignore that issue and just assume that the correct way forward is to count both endpoints of every interval. Based on that assumption, I think the code below will do what you want. If you really need to cut the end year off of some of the intervals and can identify how to tell which is which, post back and I'll try to modify it accordingly.

Code:

mvdecode cig_from cig_to cig_amt, mv(88) mvdecode cig_type, mv(888) gen episode_duration = cig_to - cig_from + 1 egen total_smoke_years = total(episode_duration), by(ID cig_type) gen episode_pack_years = episode_duration*cig_amt egen total_pack_years = total(episode_pack_years), by(ID cig_type) gen avg_pack_intensity = total_pack_years/total_smoke_years

Note: You don't say if you want this separately by cigarette type or all combined. I've assumed separately in the above. If you want it all combined, just remove cig_type from the -by(ID cig_type)- options in both places and you'll have it.

By the way, I cannot think of a single Stata calculation that will work better (or even correctly) with those 88 and 888 codes for missing values (which are, in your case, implicitly zeroes). I strongly recommend that you create a new data set with all of those changed to Stata missing values, or perhaps to zeroes where applicable. Sooner or later you are likely to run some calculation, forgetting to deal with the 88's as missing/0, and you will get bizarre results (or worse, plausible but seriously incorrect results).
1 like
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#7

01 Feb 2016, 10:18

Thank you Clyde for the quick response.
Regarding the inclusion of both endpoints, I had ran into some trouble by using both endpoints with some observations in my data. I will check it in detail and get back to you if the problem still persist. For now, I will try to use both endpoints as you advised.
In our data set, we have codes of 99/999 and 88/888 for various variables. The 99/999 are missing values and will be coded as so (.). 88/888 carry a meaning of "not applicable" which as yous aid will be coded to zero. Hence, I cannot code 88/888 as missing values because their meaning is different from the real missing values in the data set (99/999), I have been adding a recode of 88 or 888 to zero in all my calculations where i encounter 88/888.
Regarding types, I have 3, of which cigarette types 1 and 2 are the same qualitatively and type 3 is a hand rolled cigarette Ultimately I want all combined.
In the example above, I have given only my question regarding cigarettes, In my full data, I have information on various types of tobacco products in the above manner(cigarettes, cigars and pipes). I need to combine all the products to one, i.e, tobacco smoked. , All cigars and pipes will be converted to equivalents of cigarettes (i am using these conversions: 1 pack of commercial cigarette =20 commercial cigarettes, 1 hand rolled cigarette= cigar=5 commercial cigarettes.1 pipe =4 commercial cigarettes).So I have to ultimately calculate duration, frequency (average number of cigarettes smoked per day) and pack-years of cigarettes for tobacco smoked over life time,
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#8

01 Feb 2016, 10:54

So, to turn the 88/888 values to zero, the simplest code is similar to what I did with -mvdecode- above:

Code:

recode cig_from cig_to cig_amt (88 = 0) recode cig_type (888 = 0)
Comment

Thekke Purakkal

Join Date: Dec 2015
Posts: 95

01 Feb 2016, 10:55

Thank you.
A more elaborate sample data set on 10 observations with info on 3 episodes each for cigarette, cigar n pipe smoking is given below if you would like to have a look at it and caution me of any thing that need to be taken care of while combining cigarettes, cigars n pipes while computing total duration of indulgence in smoking tobacco (any of the products) over life time, average tobacco smoked per day over life time and cumulative pack-years over life time.
thanks

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 ID byte(episode status) int(cig_from cig_to) byte cig_type float cig_amt int(cgr_from cgr_to) float cgr_amt int(pip_from pip_to) float pip_amt
"011-2" 1 0 13 16  1 10  17  35  8  29  30  1
"011-2" 2 0 17 31  1 25 888   . 88 888   . 88
"011-2" 3 0 32 50  1 40 888   . 88 888   . 88
"042-2" 1 0 14 62  1  6  15  62  1  13  17  5
"042-2" 3 0  .  . 88 88 888   . 88 888   . 88
"042-2" 2 0  .  . 88 88 888   . 88 888   . 88
"050-2" 2 0 10 34  1 20 888   . 88 888   . 88
"050-2" 1 0  8  9  3  3  28  34 .5  25  35  5
"050-2" 3 0 39 69  1 25 888   . 88 888   . 88
"087-2" 2 0 18 37  1 40 888 888 88 888 888 88
"087-2" 3 0 38 48  1 60 888 888 88 888 888 88
"087-2" 1 0 12 17  3 20 888 888 88 888 888 88
"088-2" 1 0  .  . 88 88 888 888 88 888 888 88
"088-2" 3 0  .  . 88 88 888 888 88 888 888 88
"088-2" 2 0  .  . 88 88 888 888 88 888 888 88
"089-1" 3 1  .  . 88 88 888 888 88 888   . 88
"089-1" 2 1  .  . 88 88 888 888 88 888   . 88
"089-1" 1 1 14 34  1 30 888 888 88  34  71  4
"089-2" 3 0 56 71  1 22 888 888 88 888 888 88
"089-2" 2 0 48 53  1 18 888 888 88 888 888 88
"089-2" 1 0 23 40  1 18 888 888 88 888 888 88
"090-1" 3 1  .  . 88 88 888 888 88 888 888 88
"090-1" 2 1  .  . 88 88 888 888 88 888 888 88
"090-1" 1 1 19 58  1 20 888 888 88 888 888 88
"090-2" 2 0  .  . 88 88 888 888 88 888 888 88
"090-2" 3 0  .  . 88 88 888 888 88 888 888 88
"090-2" 1 0  .  . 88 88 888 888 88 888 888 88
"094-1" 1 1 15 70  1 20  50  52  4 888 888 88
"094-1" 2 1  .  . 88 88 888   . 88 888 888 88
"094-1" 3 1  .  . 88 88 888   . 88 888 888 88
end
label values status Stat
label def Stat 0 "Control", modify
label def Stat 1 "Case", modify

Last edited by Thekke Purakkal; 01 Feb 2016, 10:58.

Comment

Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#10

01 Feb 2016, 11:50

Hi,

I ran the code in #6. Removed cig_type as I wanted combined results. Things look good for all observations except 131-1. Shouldn't his/her total duration be 12 instead of 11 years as duration in episode 2 is 0?. Because of this, the average cigarettes smoked per day is calculated as 13.75 instead of 15.

Also I performed the same calculations for cigar and pipe and calculated the ave_pack_intensities of cigar and pipe smoked per day over the life time. To get the combined ave_pack_intensities of tobacco smoked (i.e., cigarette plus cigar plus pipe), do I just add up all the three ave_pack_intensities or are there any caveates?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#11

01 Feb 2016, 12:55

There are a few things that confuse me in the data in #9. You have a hybrid between long and wide data with respect to the type of tobacco. On the one hand, type of cigarette is reflected in long data: you have separate observations for types 1, 2, and 3. But then the cigar and pipe data is set out beside these in wide format. Moreover, the age intervals for cigar and pipe data don't correspond to the intervals for the cigarettes, so I don't understand how those particular data got matched up with each other. I think this hybrid layout will get you into trouble sooner or later. So I would first convert the entire data set into a single long layout:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str5 ID byte(episode status) int(cig_from cig_to) byte cig_type float cig_amt int(cgr_from cgr_to) float cgr_amt int(pip_from pip_to) float pip_amt "011-2" 1 0 13 16 1 10 17 35 8 29 30 1 "011-2" 2 0 17 31 1 25 888 . 88 888 . 88 "011-2" 3 0 32 50 1 40 888 . 88 888 . 88 "042-2" 1 0 14 62 1 6 15 62 1 13 17 5 "042-2" 3 0 . . 88 88 888 . 88 888 . 88 "042-2" 2 0 . . 88 88 888 . 88 888 . 88 "050-2" 2 0 10 34 1 20 888 . 88 888 . 88 "050-2" 1 0 8 9 3 3 28 34 .5 25 35 5 "050-2" 3 0 39 69 1 25 888 . 88 888 . 88 "087-2" 2 0 18 37 1 40 888 888 88 888 888 88 "087-2" 3 0 38 48 1 60 888 888 88 888 888 88 "087-2" 1 0 12 17 3 20 888 888 88 888 888 88 "088-2" 1 0 . . 88 88 888 888 88 888 888 88 "088-2" 3 0 . . 88 88 888 888 88 888 888 88 "088-2" 2 0 . . 88 88 888 888 88 888 888 88 "089-1" 3 1 . . 88 88 888 888 88 888 . 88 "089-1" 2 1 . . 88 88 888 888 88 888 . 88 "089-1" 1 1 14 34 1 30 888 888 88 34 71 4 "089-2" 3 0 56 71 1 22 888 888 88 888 888 88 "089-2" 2 0 48 53 1 18 888 888 88 888 888 88 "089-2" 1 0 23 40 1 18 888 888 88 888 888 88 "090-1" 3 1 . . 88 88 888 888 88 888 888 88 "090-1" 2 1 . . 88 88 888 888 88 888 888 88 "090-1" 1 1 19 58 1 20 888 888 88 888 888 88 "090-2" 2 0 . . 88 88 888 888 88 888 888 88 "090-2" 3 0 . . 88 88 888 888 88 888 888 88 "090-2" 1 0 . . 88 88 888 888 88 888 888 88 "094-1" 1 1 15 70 1 20 50 52 4 888 888 88 "094-1" 2 1 . . 88 88 888 . 88 888 888 88 "094-1" 3 1 . . 88 88 888 . 88 888 888 88 end label values status Stat label def Stat 0 "Control", modify label def Stat 1 "Case", modify // CREATE DATA SET IN PURE LONG FORMAT preserve keep ID episode cgr* pip* reshape long @_to @_from @_amt, i(ID episode) j(_type) string label define _type 1 "cig1" 2 "cig2" 3 "cig3" 4 "cgr" 5 "pip" encode _type, gen(type) label(type) drop _type rename _* * tempfile cgr_pip des save `cgr_pip' restore drop cgr* pip* rename cig_* * append using `cgr_pip' recode amt (88 = 0) mvdecode from to type, mv(88 888) // SPREAD STATUS WHERE MISSING by ID episode (status), sort: assert status == status[1] if !missing(status) by ID episode (status): replace status = status[1] if missing(status)

Next, I still find the use of 88 and 888 problematic. Perhaps you are very careful in always dealing with them at every calculation, but most mortals will slip up on this sooner or later. As I understand it (correct me if I'm wrong) the use of these codes signifies that no smoking (of the given type) was taking place. In that case it seems to me that the amount should be coded as 0, and the to, from, and type should be coded as missing because they are simply undefined. This is reflected in my -recode- and -mvdecode- commands above.

The data now lend themselves to creating a new variable encoding the "cigarette equivalents" of the various types of tobacco:

Code:

// CONVERSION FACTORS AMONG TOBACCO TYPES gen cig_eq = 1 if inlist(type, 1, 2) replace cig_eq = 5 if inlist(type, 3, 4) replace cig_eq = 4 if type == 5

Finally, if you want total smoking years, total cigarette-equivalent use, and average intensity, those would be:

Code:

// CALCULATE TOTAL SMOKING AND TOTAL DURATIONS: ALL TYPES COMBINED gen duration = to - from + 1 egen total_smoke_years = total(duration), by(ID) egen total_cig_equivalents = total(cig_eq*amt), by(ID) gen average_intensity = total_cig_equivalents/total_smoke_years

As for

Things look good for all observations except 131-1. Shouldn't his/her total duration be 12 instead of 11 years as duration in episode 2 is 0?. Because of this, the average cigarettes smoked per day is calculated as 13.75 instead of 15.

I don't agree, but perhaps I'm missing something. The second "episode" for that person is a non-smoking episode. There are no cigarettes and there is no duration. So I do not see why you want to add one more year to the total smoking duration for this person. If episodes of non-smoking like this are always to be counted as 1 year, then you can achieve that by inserting -replace episode_duration = 1 if amt == 0- in the code just after the -gen duration...- command. But if you are trying to include periods of non-smoking in the denominator (which may be reasonable to do, depending on your research question), I don't understand the basis for assuming that such periods are always 1 year when there is no information on the start and stop ages.

Finally, I'll just note that you cannot get average intensities for all types combined by adding up (nor by averaging) the type-specific intensities. You have to separately add up the numerators and the denominators, and then take the ratio of those totals. This is what the code in my final code block above does.
1 like
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#12

01 Feb 2016, 15:11

Thank you very much Clyde. First of all I am extremely sorry and I sincerely apologize for the following comment in 10.

Things look good for all observations except 131-1. Shouldn't his/her total duration be 12 instead of 11 years as duration in episode 2 is 0?. Because of this, the average cigarettes smoked per day is calculated as 13.75 instead of 15.

.

My intention was to convey that the duration should be 11 but the code is giving me 12 for the data in #5. I am sorry for confusing you. I should have tripple checked it,

Secondly, my original data set is in wide format (as I mentioned in #1) but following your suggestion in #2, I reshaped the example codes in #5 and #9 into long format for just these sections in my full wide data set. The mistake you noted (hybrid between long and wide) happened when I was reshaping the data. I am sorry for this as well.

I will run your codes and see if its working for me in total.

Thank you very much once again for your time.
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#13

01 Feb 2016, 16:41

Hello Clyde,
I found out the mistake that occurred while reshaping the data from wide to long. Actually, the cigarette related information ( from, to , type , amount) are part of the question f02, those of cigar is related to f03 and pipe to f04. Some how these question numbers got deleted while reshaping. I am sorry for this. I don't know if you have anymore time to spare on this. However, I am attaching the wide data ( Original Wide format Sample file.dta) with this message ( as running dataex command on wide format gives me the error "input statement exceeds line size limit. Try specifying fewer variables" ). You can use it to cross check the reshape code I used to convert it to long format and the code for resulting long data by dataex, given below.

Code:

reshape long f02cig@_from f02cig@_to f02cig@_type f02cig@_amt f03cgr@_from f03cgr@_to f03cgr@_amt f04pip@_from f04pip@_to f04pip@_amt , i(ID) j(episode)

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str5 ID byte(episode status age) int(f02cig_from f02cig_to) byte f02cig_type float f02cig_amt int(f03cgr_from f03cgr_to) float f03cgr_amt int(f04pip_from f04pip_to) float f04pip_amt "011-2" 1 0 57 13 16 1 10 17 35 8 29 30 1 "011-2" 2 0 57 17 31 1 25 888 . 88 888 . 88 "011-2" 3 0 57 32 50 1 40 888 . 88 888 . 88 "042-2" 1 0 62 14 62 1 6 15 62 1 13 17 5 "042-2" 2 0 62 . . 88 88 888 . 88 888 . 88 "042-2" 3 0 62 . . 88 88 888 . 88 888 . 88 "050-2" 1 0 69 8 9 3 3 28 34 .5 25 35 5 "050-2" 2 0 69 10 34 1 20 888 . 88 888 . 88 "050-2" 3 0 69 39 69 1 25 888 . 88 888 . 88 "087-2" 1 0 58 12 17 3 20 888 888 88 888 888 88 "087-2" 2 0 58 18 37 1 40 888 888 88 888 888 88 "087-2" 3 0 58 38 48 1 60 888 888 88 888 888 88 "088-2" 1 0 66 . . 88 88 888 888 88 888 888 88 "088-2" 2 0 66 . . 88 88 888 888 88 888 888 88 "088-2" 3 0 66 . . 88 88 888 888 88 888 888 88 "089-1" 1 1 71 14 34 1 30 888 888 88 34 71 4 "089-1" 2 1 71 . . 88 88 888 888 88 888 . 88 "089-1" 3 1 71 . . 88 88 888 888 88 888 . 88 "089-2" 1 0 77 23 40 1 18 888 888 88 888 888 88 "089-2" 2 0 77 48 53 1 18 888 888 88 888 888 88 "089-2" 3 0 77 56 71 1 22 888 888 88 888 888 88 "090-1" 1 1 60 19 58 1 20 888 888 88 888 888 88 "090-1" 2 1 60 . . 88 88 888 888 88 888 888 88 "090-1" 3 1 60 . . 88 88 888 888 88 888 888 88 "090-2" 1 0 58 . . 88 88 888 888 88 888 888 88 "090-2" 2 0 58 . . 88 88 888 888 88 888 888 88 "090-2" 3 0 58 . . 88 88 888 888 88 888 888 88 "094-1" 1 1 73 15 70 1 20 50 52 4 888 888 88 "094-1" 2 1 73 . . 88 88 888 . 88 888 888 88 "094-1" 3 1 73 . . 88 88 888 . 88 888 888 88 end label values status Stat label def Stat 0 "Control", modify label def Stat 1 "Case", modify

I hope the present data is perfectly in long format with distinction between questions f02 (cigarettes), f03 (cigars) and f04 (pipes). I hope this also solves the issue you raised in #11 through your comment,

Moreover, the age intervals for cigar and pipe data don't correspond to the intervals for the cigarettes, so I don't understand how those particular data got matched up with each other.

Since the history of cigarette smoking , cigar and pipe are taken one after the other through different sets of questions, there can be overlaps between the intervals.

I don't know if the code you provided me with in #11 would apply to this data nor do I know if you have anymore time to spend on this. But being hopeful to get help in calculating the average intensities for all types combined from this dataset. To explain the variables, ID is participant id, status is case control status, f02 contains questions regarding cigarette [from age, to age, type (1=2= equivalent to normal cigarette, 3= handrolled) and amount in number of cigs per day], f03 pertains to info on from, to and amount of cigar smoked and f04 on those of pipe smoked. I have also included an additional variable age, that represents the age of the individual during enrollment into study. It will be useful in crosschecking if the episodes are within the value of age specified for a particular individual.

Thank you for your consideration.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29953

#14

01 Feb 2016, 18:10

So, bringing the data into the simplest shape for analysis is a little different, but the rest of the code works pretty much as is. Here it is beginning to end:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 ID byte(episode status age) int(f02cig_from f02cig_to) byte f02cig_type float f02cig_amt int(f03cgr_from f03cgr_to) float f03cgr_amt int(f04pip_from f04pip_to) float f04pip_amt
"011-2" 1 0 57 13 16  1 10  17  35  8  29  30  1
"011-2" 2 0 57 17 31  1 25 888   . 88 888   . 88
"011-2" 3 0 57 32 50  1 40 888   . 88 888   . 88
"042-2" 1 0 62 14 62  1  6  15  62  1  13  17  5
"042-2" 2 0 62  .  . 88 88 888   . 88 888   . 88
"042-2" 3 0 62  .  . 88 88 888   . 88 888   . 88
"050-2" 1 0 69  8  9  3  3  28  34 .5  25  35  5
"050-2" 2 0 69 10 34  1 20 888   . 88 888   . 88
"050-2" 3 0 69 39 69  1 25 888   . 88 888   . 88
"087-2" 1 0 58 12 17  3 20 888 888 88 888 888 88
"087-2" 2 0 58 18 37  1 40 888 888 88 888 888 88
"087-2" 3 0 58 38 48  1 60 888 888 88 888 888 88
"088-2" 1 0 66  .  . 88 88 888 888 88 888 888 88
"088-2" 2 0 66  .  . 88 88 888 888 88 888 888 88
"088-2" 3 0 66  .  . 88 88 888 888 88 888 888 88
"089-1" 1 1 71 14 34  1 30 888 888 88  34  71  4
"089-1" 2 1 71  .  . 88 88 888 888 88 888   . 88
"089-1" 3 1 71  .  . 88 88 888 888 88 888   . 88
"089-2" 1 0 77 23 40  1 18 888 888 88 888 888 88
"089-2" 2 0 77 48 53  1 18 888 888 88 888 888 88
"089-2" 3 0 77 56 71  1 22 888 888 88 888 888 88
"090-1" 1 1 60 19 58  1 20 888 888 88 888 888 88
"090-1" 2 1 60  .  . 88 88 888 888 88 888 888 88
"090-1" 3 1 60  .  . 88 88 888 888 88 888 888 88
"090-2" 1 0 58  .  . 88 88 888 888 88 888 888 88
"090-2" 2 0 58  .  . 88 88 888 888 88 888 888 88
"090-2" 3 0 58  .  . 88 88 888 888 88 888 888 88
"094-1" 1 1 73 15 70  1 20  50  52  4 888 888 88
"094-1" 2 1 73  .  . 88 88 888   . 88 888 888 88
"094-1" 3 1 73  .  . 88 88 888   . 88 888 888 88
end
label values status Stat
label def Stat 0 "Control", modify
label def Stat 1 "Case", modify

reshape long @_from @_to @_type @_amt, i(ID episode) j(what_smoked) string
recode _amt (88 888 = 0)
mvdecode _from _to _type, mv(88 888)
assert inlist(_type, 1, 2, 3) if !missing(_type)
gen type = _type
replace type = 4 if what_smoked == "f03cgr"
replace type = 5 if what_smoked == "pipe"
label define type    1    "Type 1 cigarette (f02)" ///
                    2    "Type 2 cigarette (f02)" ///
                    3    "Hand-rolled cigarette (f02)" ///
                    4    "Cigar (f03)" ///
                    5    "Pipe (f04)"
label values type type
drop _type
rename _* *

// NOW CALCULATE DURATION OF EACH OBSERVATION
gen duration = to - from + 1

// CREATE CIGARETTE EQUIVALENTS
gen cig_eq = 1 if inlist(type, 1, 2)
replace cig_eq = 5 if inlist(type, 3, 4)
replace cig_eq = 4 if type == 5
label var cig_eq "Number of cigarettes equivalent to 1 of this type"

// CALCULATE TOTAL SMOKING, TOTAL DURATION, AND AVERAGE
// SMOKING INTENSITY FOR ALL TYPES COMBINED
egen total_cig_eq_years = total(cig_eq*amt*duration), by(ID)
egen total_years = total(duration), by(ID)
gen smoking_intensity = total_cig_eq_years/total_years

By the way, thank you for using -dataex-. What I have done would have taken much longer had you not done so, and, in truth, I probably would not have bothered.

Comment

Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#15

01 Feb 2016, 18:17

Thank-you very much for your time and help with this Clyde. I really appreciate it and once again, sorry for all inconvenience/ errors from my side.
Comment

Announcement