Hello,
I have a data set where for each observation, there are unique codes in string variables D1 through D15 (not every observation will have a value for all 15, but almost all have at least a D1 value).
There are almost 10,000 different unique codes represented among the string variables D1-D15.
I am trying to tabulate the most common D* codes among the observations that exist along side my particular code of interest.
That is to say, my particular D* code of interest is 'ABCD'. Others might be 'CDEF' or 'FGHI', etc.... my code of interest may appear in D1 or any other D* variable from D1 to D15, and any code should only appear once for each observation if it appears at all for that observation.
I want to find out a list of the 10 or so most common D-codes for all observations *only* of those observations that have diagnostic code 'ABCD' as a value in at least one of the variables D1 to D15.
First, I did:
keep if D1==”ABCD” | D2==”ABCD” | … all the way to |D15==”ABCD”, and that thankfully shrunk my observations down to 10s of thousands, down from many millions. But this is still too big for table commands or tabm commands using that module.
I was able to run tabulate commands:
tabulate D1, sort ---to get the list of the most common codes within D1 sorted in descending order, but I need a simple way to combine all the ABCDs and CDEFs and FGHIs that would come up for 'tabulate D2, sort' and 'tabulate D3, sort' etc... for this data file, and I’m not sure the best way to do this.
I ran [tabulate D1, sort … tabulate D2, sort …etc…] and put the results in an excel spreadsheet to be able to get a rough idea of how each list looks side-by-side, and see which codes were coming up most often, but it seems like it would be a pain to sum the frequencies of each code in excel, especially since I have to do this for different sets of variables as well, across several different data files.
Someone suggested generating non-string variables for each of the most common D-codes I see coming up from the tabulate command and assigning them 1 or 0 depending on if they come up or not and that might be an easy way to count the most common D-codes, but is that the easiest way?
It would mean a do file consisting of something like:
gen ABCD = 0
forvalues i =1/15 {
replace ABCD=1 if D`i'== "ABCD"
set more off
}
gen CDEF= 0
forvalues i =1/15 {
replace badass=1 if D`i'== "CDEF"
set more off
}
and so on, and then tabulating the variables that come up most often with a 1 among the newly generated variables, but this feels a bit tedious and round-about to me...
Can someone suggest another way to find the most common string values that come up for my variables D1-D15 with a dataset of some thousands upon thousands of observations?
Thanks
I have a data set where for each observation, there are unique codes in string variables D1 through D15 (not every observation will have a value for all 15, but almost all have at least a D1 value).
There are almost 10,000 different unique codes represented among the string variables D1-D15.
I am trying to tabulate the most common D* codes among the observations that exist along side my particular code of interest.
That is to say, my particular D* code of interest is 'ABCD'. Others might be 'CDEF' or 'FGHI', etc.... my code of interest may appear in D1 or any other D* variable from D1 to D15, and any code should only appear once for each observation if it appears at all for that observation.
I want to find out a list of the 10 or so most common D-codes for all observations *only* of those observations that have diagnostic code 'ABCD' as a value in at least one of the variables D1 to D15.
First, I did:
keep if D1==”ABCD” | D2==”ABCD” | … all the way to |D15==”ABCD”, and that thankfully shrunk my observations down to 10s of thousands, down from many millions. But this is still too big for table commands or tabm commands using that module.
I was able to run tabulate commands:
tabulate D1, sort ---to get the list of the most common codes within D1 sorted in descending order, but I need a simple way to combine all the ABCDs and CDEFs and FGHIs that would come up for 'tabulate D2, sort' and 'tabulate D3, sort' etc... for this data file, and I’m not sure the best way to do this.
I ran [tabulate D1, sort … tabulate D2, sort …etc…] and put the results in an excel spreadsheet to be able to get a rough idea of how each list looks side-by-side, and see which codes were coming up most often, but it seems like it would be a pain to sum the frequencies of each code in excel, especially since I have to do this for different sets of variables as well, across several different data files.
Someone suggested generating non-string variables for each of the most common D-codes I see coming up from the tabulate command and assigning them 1 or 0 depending on if they come up or not and that might be an easy way to count the most common D-codes, but is that the easiest way?
It would mean a do file consisting of something like:
gen ABCD = 0
forvalues i =1/15 {
replace ABCD=1 if D`i'== "ABCD"
set more off
}
gen CDEF= 0
forvalues i =1/15 {
replace badass=1 if D`i'== "CDEF"
set more off
}
and so on, and then tabulating the variables that come up most often with a 1 among the newly generated variables, but this feels a bit tedious and round-about to me...
Can someone suggest another way to find the most common string values that come up for my variables D1-D15 with a dataset of some thousands upon thousands of observations?
Thanks
Comment