Looping (forvalues) to calculate pack-years of cigarette smoked

Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#16

02 Feb 2016, 04:06

Hi, I ran into a problem again.
Please find attached a snapshot (.png format) of the Data Editor view in Stata (version 13 SE) after running the code on the example data. It shows the details for individuals 011-2, 042-2 and 050-2. Their ages are 57, 62 and 69 respectively (as shown by the age variable). However, the calculated durations of tobacco use are 59, 102 and 76 years respectively. The code disregards the overlap in ages at which the individual smoked different types of tobacco and adds each episode up separately. For example, 011-2 has durations of 4, 19, 2, 15 and 19 years calculated for each episode and type of tobacco, and the duration code adds all of these up to get tota_years = 59 years. However, the first episode of cigar smoking (17 to 35 years) overlaps with the first episode of pipe smoking (29-30), the 2nd episode of cigarette smoking (17-31) and the 3rd episode of cigarette smoking (32-50). In effect this person smoked some form of tobacco from 13 to 50 years of age, so the duration should be 50-13+1 = 38 years. This will also affect the calculated intensity. How can we fix this, or am I missing something?
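The arithmetic for 011-2 can be checked with a short sketch. It is only an illustration, not the code that produced the attached output: the five episodes are typed in by hand from the example data, and the names product, episode_years, obs_no, yr and first are made up for this check. Each episode is expanded to one observation per year of age, and the distinct ages are then counted.

```stata
clear
input str6 product int(from to)
"cig"   13 16
"cig"   17 31
"cig"   32 50
"cigar" 17 35
"pipe"  29 30
end

// Naive total: add up every episode's length, overlaps included
gen episode_years = to - from + 1
quietly summarize episode_years
display "sum of episode lengths = " r(sum)

// One observation per year of age within each episode
gen long obs_no = _n
expand episode_years
bysort obs_no: gen yr = from + _n - 1

// Flag the first occurrence of each age, then count the flags
bysort yr: gen byte first = (_n == 1)
quietly count if first
display "distinct ages smoked = " r(N)
```

The first figure is the 59 years described above; the second is 38, which matches 50-13+1 because every age from 13 to 50 is covered by at least one episode.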
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#17

02 Feb 2016, 05:47

Also, in my data set there are individuals who abstained from any type of tobacco smoking for some years between active periods of smoking. For any such individual, those years of abstinence have to be subtracted. For example, if cigarette episode 1 is 13-20 years and cigar episode 1 is 25-30 years, and the person did not smoke any tobacco between 21 and 24 years, the duration of tobacco smoking should not be 30-13+1 but (20-13+1) + (30-25+1). Right? 050-2 and 089-2 are examples in the sample data.

Last edited by Thekke Purakkal; 02 Feb 2016, 05:55.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#18

02 Feb 2016, 09:20

I understand the problem in #16 and it is solvable. But in the data I have seen from you, I don't understand how you can tell how long the abstinence intervals are. The only examples of abstinence I have seen in your data are where everything is coded 88 or 888, which means that there is no indication of when the abstinence interval started and stopped. Please clarify.
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#19

02 Feb 2016, 10:16

OK, I will try to explain. 011-2 smoked cigarettes between 13 and 50 years of age with varying frequencies. S/he also smoked cigars and a pipe, but those years were within the range 13-50, so there is no period of abstinence.

050-2 is a 69-year-old person. S/he has 3 episodes of cigarette smoking: 8-9 years, 10-34 years and 39-69 years. This person also smoked cigars from 28 to 34 years and a pipe from 25 to 35 years. So between 35 and 39 (or 36 and 38), the individual abstained from smoking any form of tobacco, then resumed cigarette smoking at 39 years and was still smoking when the data were collected.

The case of individual 089-2 is similar: a 77-year-old who smoked only cigarettes during their lifetime. The three episodes are 23-40, 48-53 and 56-71. This person abstained from smoking between 40 and 48 years and between 53 and 56 years.

I do not know, but perhaps it is easier to visualize this by looking at the data on the three tobacco products separately, as given in the format below? Or perhaps it is just that I am more used to seeing these data in wide format rather than long. I don't know any other way to explain the presence of abstinence.

In total I actually have up to 7 episodes of cigarette data on each individual in my data set, 4 for cigar and 4 for pipe, for around 800 individuals. Here I have given only 3 episodes for each product as example data. I do have a variable for smoking status (past, present or non-smoker) for all three types of product. Non-smoker is coded 0, and all entries of f02, f03 and f04 are coded 88/888 for these individuals, so this variable can also be used in the code, e.g. replace "x"=0 if smoking status==0, wherever required. I do not know what other information you would like in order to explain the situation further. Please let me know and I will provide it (e.g. information on other variables, the original data in wide format, etc.).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 ID byte(episode status age) int(f02cig_from f02cig_to) byte f02cig_type float f02cig_amt int(f03cgr_from f03cgr_to) float f03cgr_amt int(f04pip_from f04pip_to) float f04pip_amt
"011-2" 1 0 57 13 16  1 10  17  35  8  29  30  1
"011-2" 2 0 57 17 31  1 25 888   . 88 888   . 88
"011-2" 3 0 57 32 50  1 40 888   . 88 888   . 88
"042-2" 1 0 62 14 62  1  6  15  62  1  13  17  5
"042-2" 2 0 62  .  . 88 88 888   . 88 888   . 88
"042-2" 3 0 62  .  . 88 88 888   . 88 888   . 88
"050-2" 1 0 69  8  9  3  3  28  34 .5  25  35  5
"050-2" 2 0 69 10 34  1 20 888   . 88 888   . 88
"050-2" 3 0 69 39 69  1 25 888   . 88 888   . 88
"087-2" 1 0 58 12 17  3 20 888 888 88 888 888 88
"087-2" 2 0 58 18 37  1 40 888 888 88 888 888 88
"087-2" 3 0 58 38 48  1 60 888 888 88 888 888 88
"088-2" 1 0 66  .  . 88 88 888 888 88 888 888 88
"088-2" 2 0 66  .  . 88 88 888 888 88 888 888 88
"088-2" 3 0 66  .  . 88 88 888 888 88 888 888 88
"089-1" 1 1 71 14 34  1 30 888 888 88  34  71  4
"089-1" 2 1 71  .  . 88 88 888 888 88 888   . 88
"089-1" 3 1 71  .  . 88 88 888 888 88 888   . 88
"089-2" 1 0 77 23 40  1 18 888 888 88 888 888 88
"089-2" 2 0 77 48 53  1 18 888 888 88 888 888 88
"089-2" 3 0 77 56 71  1 22 888 888 88 888 888 88
"090-1" 1 1 60 19 58  1 20 888 888 88 888 888 88
"090-1" 2 1 60  .  . 88 88 888 888 88 888 888 88
"090-1" 3 1 60  .  . 88 88 888 888 88 888 888 88
"090-2" 1 0 58  .  . 88 88 888 888 88 888 888 88
"090-2" 2 0 58  .  . 88 88 888 888 88 888 888 88
"090-2" 3 0 58  .  . 88 88 888 888 88 888 888 88
"094-1" 1 1 73 15 70  1 20  50  52  4 888 888 88
"094-1" 2 1 73  .  . 88 88 888   . 88 888 888 88
"094-1" 3 1 73  .  . 88 88 888   . 88 888 888 88
end
label values status Stat
label def Stat 0 "Control", modify
label def Stat 1 "Case", modify

OK, I checked the data once again. All missing values (".") and 88/888 codes can be considered as zero. That is how the data are structured for the time being. I checked for genuinely missing information in the tobacco-related section of the data set and there are no actual "missing" values, unlike other sections, where genuinely missing information is coded as 99/999.

Last edited by Thekke Purakkal; 02 Feb 2016, 10:57. Reason: Added additional info after the code regarding 88/888's and missing values

Clyde Schechter

Join Date: Apr 2014
Posts: 30095

#20

02 Feb 2016, 11:17

OK, so my understanding now is that periods of abstinence are recognized as gaps in the age intervals shown in the data. So the total period over which you want to calculate the average smoking intensity is from the youngest "from" age through the oldest "to" age (inclusive), and without double-counting years where there was more than one type of smoking going on.

If you are used to working with spreadsheets, I can understand how wide layout might seem easier to understand. But going to wide layout is seldom the answer in Stata. The only commands I encounter with any frequency that work better with wide data are -ttest- (which I hardly ever use any more) and various -graph- commands. Anyway, my approach still puts your data into long format. This time, I go on to expand each observation into a group of observations, one for each year between its from and to ages. I then calculate the cigarette-equivalent smoking intensity in each year. Then I add that up over all observations on each person, and identify their youngest from and oldest to ages. (This is what the -collapse- command below does.) The rest is easy.

Code:

*Example generated by -dataex-. To install: ssc install dataex
clear
input str5 ID byte(episode status age) int(f02cig_from f02cig_to) byte f02cig_type float f02cig_amt int(f03cgr_from f03cgr_to) float f03cgr_amt int(f04pip_from f04pip_to) float f04pip_amt
"011-2" 1 0 57 13 16  1 10  17  35  8  29  30  1
"011-2" 2 0 57 17 31  1 25 888   . 88 888   . 88
"011-2" 3 0 57 32 50  1 40 888   . 88 888   . 88
"042-2" 1 0 62 14 62  1  6  15  62  1  13  17  5
"042-2" 2 0 62  .  . 88 88 888   . 88 888   . 88
"042-2" 3 0 62  .  . 88 88 888   . 88 888   . 88
"050-2" 1 0 69  8  9  3  3  28  34 .5  25  35  5
"050-2" 2 0 69 10 34  1 20 888   . 88 888   . 88
"050-2" 3 0 69 39 69  1 25 888   . 88 888   . 88
"087-2" 1 0 58 12 17  3 20 888 888 88 888 888 88
"087-2" 2 0 58 18 37  1 40 888 888 88 888 888 88
"087-2" 3 0 58 38 48  1 60 888 888 88 888 888 88
"088-2" 1 0 66  .  . 88 88 888 888 88 888 888 88
"088-2" 2 0 66  .  . 88 88 888 888 88 888 888 88
"088-2" 3 0 66  .  . 88 88 888 888 88 888 888 88
"089-1" 1 1 71 14 34  1 30 888 888 88  34  71  4
"089-1" 2 1 71  .  . 88 88 888 888 88 888   . 88
"089-1" 3 1 71  .  . 88 88 888 888 88 888   . 88
"089-2" 1 0 77 23 40  1 18 888 888 88 888 888 88
"089-2" 2 0 77 48 53  1 18 888 888 88 888 888 88
"089-2" 3 0 77 56 71  1 22 888 888 88 888 888 88
"090-1" 1 1 60 19 58  1 20 888 888 88 888 888 88
"090-1" 2 1 60  .  . 88 88 888 888 88 888 888 88
"090-1" 3 1 60  .  . 88 88 888 888 88 888 888 88
"090-2" 1 0 58  .  . 88 88 888 888 88 888 888 88
"090-2" 2 0 58  .  . 88 88 888 888 88 888 888 88
"090-2" 3 0 58  .  . 88 88 888 888 88 888 888 88
"094-1" 1 1 73 15 70  1 20  50  52  4 888 888 88
"094-1" 2 1 73  .  . 88 88 888   . 88 888 888 88
"094-1" 3 1 73  .  . 88 88 888   . 88 888 888 88
end
label values status Stat
label def Stat 0 "Control", modify
label def Stat 1 "Case", modify

reshape long @_from @_to @_type @_amt, i(ID episode) j(what_smoked) string
recode _amt (88 888 = 0)
mvdecode _from _to _type, mv(88 888)
assert inlist(_type, 1, 2, 3) if !missing(_type)
gen type = _type
replace type = 4 if what_smoked == "f03cgr"
replace type = 5 if what_smoked == "f04pip" // -reshape- STORES THE STUB "f04pip" IN what_smoked
label define type    1    "Type 1 cigarette (f02)" ///
                    2    "Type 2 cigarette (f02)" ///
                    3    "Hand-rolled cigarette (f02)" ///
                    4    "Cigar (f03)" ///
                    5    "Pipe (f04)"
label values type type
drop _type
rename _* *

// NOW EXPAND DATA TO ONE OBSERVATION PER YEAR
assert missing(from, to) if amt == 0
drop if amt == 0 // THESE HAVE NO INFORMATION ABOUT DURATION
gen long obs_no = _n
expand to - from + 1
by obs_no, sort: gen current_age = from + _n - 1
by obs_no (current_age), sort: assert current_age[_N] == to

// CREATE CIGARETTE EQUIVALENTS & SMOKING INTENSITY
// FOR EACH YEAR OF DATA
gen cig_eq = 1 if inlist(type, 1, 2)
replace cig_eq = 5 if inlist(type, 3, 4)
replace cig_eq = 4 if type == 5
gen smoking_intensity = amt*cig_eq

// AND COLLAPSE TO A SUMMARY RECORD FOR EACH PERSON
// WITH TOTAL CIGARETTE EQUIVALENTS SMOKED (ALL TYPES ALL YEARS)
// AND EARLIEST START AND LATEST END DATES
collapse (firstnm) survey_age = age (sum) smoking_total = smoking_intensity (min) from (max) to, by(ID)

// CALCULATE TOTAL YEARS OBSERVED, INCLUDING ABSTINENT PERIODS
gen years_observed = to - from + 1

// CALCULATE AVERAGE SMOKING INTENSITY
gen average_smoking_intensity = smoking_total/years_observed

Notes:

1. This code leaves in memory a data set with just one observation per ID, showing his/her age when surveyed, total smoking in cigarette equivalents over all observed periods, the youngest from age, the oldest to age, the years observed and the average smoking intensity. If you want to save this and -merge- it back to your original data (or, what would probably be more productive for further analysis is to merge it back to the data as it was right after the -rename _* *- command) you can.

2. There is one possible way in which this does not quite get what you intend. In some cases, for example 011-2, 087-2, 089-2, 090-1, and 094-1, the variable age has a larger value than the oldest "to" age reported. For these people, I have not included the time from the oldest "to" age to their age as of survey as a period of abstinence--I have treated it as a period in which they were not observed. However, you may have wanted it counted as abstinence. If that is the case, just change -gen years_observed = to - from + 1- to -gen years_observed = survey_age - from + 1-.
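To make the difference in note 2 concrete, here is a minimal sketch for two of the individuals. The summary values (survey age, youngest from, oldest to) are typed in by hand from the example data rather than produced by the code above, and years_to_survey is a made-up name used only to show both denominators side by side.

```stata
clear
input str5 ID int(survey_age from to)
"011-2" 57 13 50
"089-2" 77 23 71
end

// Denominator as coded above: youngest "from" through oldest "to"
gen years_observed = to - from + 1

// Alternative from note 2: count time up to the survey as abstinence
gen years_to_survey = survey_age - from + 1

list, noobs
```

For 011-2 the two denominators are 38 and 45 years; for 089-2 they are 49 and 55 years. The larger denominator gives a lower average intensity for the same smoking total.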

I hope this is finally what you were looking for.

Last edited by Clyde Schechter; 02 Feb 2016, 11:19.


Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#21

02 Feb 2016, 11:42

Thank you very much Clyde. I will run the code and get back to you soon.
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#22

02 Feb 2016, 12:38

Hello Clyde,

I don't know if I missed something, but can you help me understand this? I think what I want is what you describe in the comment below, minus the periods of abstinence.

So the total period over which you want to calculate the average smoking intensity is from the youngest "from" age through the oldest "to" age (inclusive), and without double-counting years where there was more than one type of smoking going on.

This is what I was trying to explain in #17. Wouldn't that make sense? I mean, we shouldn't count periods of abstinence towards the duration, and therefore the intensity, right? Periods of abstinence don't have any intensity associated with them.

The following code, as you mentioned, calculates years observed including periods of abstinence.

Code:

// CALCULATE TOTAL YEARS OBSERVED, INCLUDING ABSTINENT PERIODS
gen years_observed = to - from + 1

But what we want is years observed minus abstinent period. For example, for 089-2, since there are abstinent periods between 40 and 48 years and 53 and 56 years, the total duration cannot be 71-23+1= 49 as received after running the above code, but (40-23+1)+(53-48+1)+(71-56+1) = 40. Similar situation with 050-2 where there is an abstinent period between 35 and 39 yrs of age and duration cannot be 69-8+1.

Can this be fixed?
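The figures for 089-2 can be checked with a small sketch that finds the gaps between episodes directly. It is only an illustration for a single person whose episodes are typed in by hand. The names prev_to, gap, span, gap_years, first_from and last_to are invented for this check, and real data with many individuals would need the same steps done within each ID.

```stata
clear
input int(from to)
23 40
48 53
56 71
end

sort from to
// Latest "to" age among all earlier episodes (running maximum)
gen int prev_to = to[_n-1]
replace prev_to = max(prev_to, prev_to[_n-1])

// Whole years falling strictly between earlier coverage and this episode
gen int gap = cond(missing(prev_to), 0, max(0, from - prev_to - 1))

quietly summarize gap
scalar gap_years = r(sum)
quietly summarize from
scalar first_from = r(min)
quietly summarize to
scalar last_to = r(max)
scalar span = scalar(last_to) - scalar(first_from) + 1

display "span = " scalar(span) ", abstinent = " scalar(gap_years) ///
    ", active = " scalar(span) - scalar(gap_years)
```

The span is 49 years, of which 9 are abstinent (ages 41-47 and 54-55), leaving the 40 active years worked out by hand above. The running maximum matters when episodes overlap, since a long early episode can cover the start of several later ones.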
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#23

02 Feb 2016, 13:54

Thekke, perhaps it is just a misunderstanding, but it feels to me like you are changing your mind and contradicting what you said in #17.

Whether it makes sense to include or exclude periods of abstinence in your assessment of average smoking intensity depends on how you understand and are planning to use the resulting rates. If you want to know the average intensity over the entire period of time reported in the surveys, then you would include abstinent periods. If you want the average intensity over the time periods when there was some smoking going on, then you would exclude abstinent periods. Both ways of calculating the rate are legitimate and meaningful--but they mean different things. While it can make sense to exclude periods of abstinence, it makes no sense to me at all to consider subtracting them.

Think of it this way, in a different context. Suppose somebody is rating how many lines of code I write per hour. If my employer pays me by the month and does not dock me for holidays, sick time, etc., he or she would rate my efficiency using a denominator that includes every hour of every potential working day of the month, whether I was actually working those hours or not. On the other hand, if my employer does dock me for time away from work, he/she would rate my efficiency based only on those hours that I was actually "clocked in and on the job." Either one is a valid estimate of my efficiency as a coder, but they mean different things and have different uses.

Only you know where you are going with this research. So let me modify the code again to give you both of these options. I show here only the "endgame," everything up to this point in the code being the same as in #20. Replace all the code in #20 after -gen smoking_intensity = amt*cig_eq- with this:

Code:

// COLLAPSE TO ONE OBSERVATION PER ID EACH YEAR
// SHOWING TOTAL SMOKING INTENSITY FOR THE YEAR
collapse (firstnm) survey_age = age (sum) smoking_intensity (min) from (max) to, by(ID current_age)
gen byte active_year = (smoking_intensity > 0)

// AND COLLAPSE FURTHER TO A SUMMARY RECORD FOR EACH PERSON
// WITH TOTAL CIGARETTE EQUIVALENTS SMOKED (ALL TYPES ALL YEARS)
// AND EARLIEST START AND LATEST END DATES
collapse (first) survey_age (sum) smoking_total = smoking_intensity ///
    (sum) total_years_smoking = active_year (min) from (max) to, by(ID)

// CALCULATE TOTAL YEARS OBSERVED, INCLUDING ABSTINENT PERIODS
gen years_observed = to - from + 1

// CALCULATE AVERAGE SMOKING INTENSITY
gen avg_intensity_all_time = smoking_total/years_observed
gen avg_intensity_active_only = smoking_total/total_years_smoking
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#24

02 Feb 2016, 14:21

Thank you, Clyde, for showing me a different angle on the strategy and what each option means qualitatively. The words "subtract" and "minus" probably caused the misunderstanding, as I was really referring to active smoking years. Pardon my use of language. My first option is to go with the total active years of smoking because, in my data set, there are individuals with more than 10-15 years of abstinence as well as individuals with more than 5 abstinence periods. I guess not accounting for them would introduce some kind of misclassification. And since your code now gives me both options, I can compute my results using both strategies and compare the difference quantitatively.

I ran the code and it works perfectly now. I will extend this to my full data set. Thank you very much for all your time and effort on this. I really appreciate it.