Hi
I would like some advice on the best way to combine (or merge) several datasets to create one large dataset with longitudinal data. I have ten datasets, each corresponding to a calendar year between 2005 and 2014. A dataset may contain between 12,000 and 25,000 subjects and up to 15 variables. When combined, one should be able to follow a subject over the years. There will be new subjects included each year and several will also drop out. There is an ID variable that identifies subjects as well as a variable indicating the year. The datasets contain both time-constant and time varying variables. Some time-constant variables such as birthdate, date of diagnosis, ethnicity are recorded each year and should be comparable over the years once all datasets are merged together.
Examples of three datasets (made-up):
dataset 1:
dataset 2:
dataset 3:
I would like to merge the above three datasets to have one dataset in the long format so I can run longitudinal analysis (for e.g. multilevel or GEE models).
Ideally, I would like each subject to have as many records (or 'rows') as data measurements over time:
Combined dataset:
1. My first approach was to use append but this results in a dataset being added below the previous (master) dataset and I don't obtain the above structure I would like.
2.Additionally, I'm guessing the names of the variables have to be identical across all the datasets (they currently differ in the suffix - perhaps this is what is creating the problem with append?).
Thanks
/Amal
I would like some advice on the best way to combine (or merge) several datasets to create one large dataset with longitudinal data. I have ten datasets, each corresponding to a calendar year between 2005 and 2014. A dataset may contain between 12,000 and 25,000 subjects and up to 15 variables. When combined, one should be able to follow a subject over the years. There will be new subjects included each year and several will also drop out. There is an ID variable that identifies subjects as well as a variable indicating the year. The datasets contain both time-constant and time varying variables. Some time-constant variables such as birthdate, date of diagnosis, ethnicity are recorded each year and should be comparable over the years once all datasets are merged together.
Examples of three datasets (made-up):
dataset 1:
Code:
input ID birthdate05 diagdate05 bmi05 chol05 year05 1 1981 1981 18 180 2005 2 1979 1997 21 169 2005 3 2001 2004 17 188 2005 4 2000 2005 23 179 2005 5 1982 1994 . 159 2005 6 1999 2001 22 173 2005 7 1996 2000 21 181 2005
Code:
input ID birthdate07 diagdate07 bmi07 chol07 year07 1 1981 1981 18 180 2007 2 1979 1997 21 169 2007 3 2001 2004 17 188 2007 5 1982 1994 19 159 2007 6 1999 2001 22 173 2007 16 2000 2004 21 181 2007
Code:
input ID birthdate11 diagdate11 bmi11 chol11 year011 1 1981 1981 18 180 2011 2 1979 1997 21 169 2011 3 2001 2004 17 188 2011 4 2000 2005 23 . 2011 5 1982 1994 19 159 2011 16 1999 2001 22 173 2011 19 1996 2000 21 181 2011 21 2009 2010 21 . 2011
I would like to merge the above three datasets to have one dataset in the long format so I can run longitudinal analysis (for e.g. multilevel or GEE models).
Ideally, I would like each subject to have as many records (or 'rows') as data measurements over time:
Combined dataset:
Code:
input ID birthdate diagdate bmi chol year 1 1981 1981 18 180 2005 1 1981 1981 18 180 2007 1 1981 1981 18 180 2011 2 1979 1997 21 169 2005 2 1979 1997 21 169 2007 2 1979 1997 21 169 2011 3 2001 2004 17 188 2005 3 2001 2004 17 188 2007 3 2001 2004 17 188 2011 4 2000 2005 23 179 2005 4 2000 2005 23 . 2011 5 1982 1994 . 159 2005 5 1982 1994 19 159 2007 5 1982 1994 19 159 2011 6 1999 2001 22 173 2005 6 1999 2001 22 173 2007 7 1996 2000 21 181 2005 16 2000 2004 21 181 2007 16 1999 2001 22 173 2011 19 1996 2000 21 81 2011 21 2009 2010 21 . 2011
1. My first approach was to use append but this results in a dataset being added below the previous (master) dataset and I don't obtain the above structure I would like.
2.Additionally, I'm guessing the names of the variables have to be identical across all the datasets (they currently differ in the suffix - perhaps this is what is creating the problem with append?).
Thanks
/Amal
Comment