Finding the most common string variable values among a range of string variables that also include a specific string value.

Pramma Elayaperumal

Join Date: Feb 2018

Posts: 5
#1

Finding the most common string variable values among a range of string variables that also include a specific string value.

08 Feb 2018, 18:04

Hello,

I have a data set where for each observation, there are unique codes in string variables D1 through D15 (not every observation will have a value for all 15, but almost all have at least a D1 value).

There are almost 10,000 different unique codes represented among the string variables D1-D15.

I am trying to tabulate the most common D* codes among the observations that exist along side my particular code of interest.

That is to say, my particular D* code of interest is 'ABCD'. Others might be 'CDEF' or 'FGHI', etc.... my code of interest may appear in D1 or any other D* variable from D1 to D15, and any code should only appear once for each observation if it appears at all for that observation.

I want to find out a list of the 10 or so most common D-codes for all observations *only* of those observations that have diagnostic code 'ABCD' as a value in at least one of the variables D1 to D15.

First, I did:

keep if D1==”ABCD” | D2==”ABCD” | … all the way to |D15==”ABCD”, and that thankfully shrunk my observations down to 10s of thousands, down from many millions. But this is still too big for table commands or tabm commands using that module.

I was able to run tabulate commands:

tabulate D1, sort ---to get the list of the most common codes within D1 sorted in descending order, but I need a simple way to combine all the ABCDs and CDEFs and FGHIs that would come up for 'tabulate D2, sort' and 'tabulate D3, sort' etc... for this data file, and I’m not sure the best way to do this.

I ran [tabulate D1, sort … tabulate D2, sort …etc…] and put the results in an excel spreadsheet to be able to get a rough idea of how each list looks side-by-side, and see which codes were coming up most often, but it seems like it would be a pain to sum the frequencies of each code in excel, especially since I have to do this for different sets of variables as well, across several different data files.

Someone suggested generating non-string variables for each of the most common D-codes I see coming up from the tabulate command and assigning them 1 or 0 depending on if they come up or not and that might be an easy way to count the most common D-codes, but is that the easiest way?

It would mean a do file consisting of something like:

gen ABCD = 0
forvalues i =1/15 {
replace ABCD=1 if D`i'== "ABCD"
set more off
}

gen CDEF= 0
forvalues i =1/15 {
replace badass=1 if D`i'== "CDEF"
set more off
}

and so on, and then tabulating the variables that come up most often with a 1 among the newly generated variables, but this feels a bit tedious and round-about to me...

Can someone suggest another way to find the most common string values that come up for my variables D1-D15 with a dataset of some thousands upon thousands of observations?

Thanks
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#2

08 Feb 2018, 21:06

Well, when you ask for help with code but don't provide example data to work with, you are taking a gamble. As I have no data to test it on, I can't assure you that my solution will work with your actual data; only that it works with my imaginary data. So in the future, always show example data when asking for help with code. And show it in a way that makes it possible for those who want to help you to use it effectively, which means using the -dataex- command. (See Forum FAQ #12 if you don't know about -dataex-.)

I think this job, like most things in Stata, is easier if you go to long data layout.

Code:

gen id = _n // SKIP THIS STEP IF YOU HAVE AN ID VARIABLE ALREADY reshape long D, i(id) j(_j) drop if missing(D) by id, sort: egen keeper = max(D == "ABCD") keep if keeper keep D by D, sort: gen occurrences = _N duplicates drop gsort -occurrences list in 1/10

The result will be a list of the 10 most frequently occurring codes that appear in observations that contain ABCD somewhere.

If you want more (or less) than the top 10, change the last line of the code accordingly. Note also that if there are two codes tied for tenth place, one of them will show up and the other will not--the tie will be broken arbitrarily and will not be reproducible.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35241

09 Feb 2018, 02:21

To Clyde's code, prefix this sandbox:

Code:

clear 
set obs 50000 
set seed 1776 
qui forval j = 1/15 { 
   gen D`j' = "" 
   forval k = 1/4 { 
        replace D`j' = D`j' + substr("ABCD", runiformint(1, 4), 1)  
   }
}

Comment

Pramma Elayaperumal

Join Date: Feb 2018

Posts: 5
#4

10 Feb 2018, 02:01

Thank you Clyde and Nick very much!

Using Clyde's code was sufficient for what I needed, and I was able to tweak it to perform a similar operation for another range of variables.

Sorry for not adding my data to the post, I'll be sure to do that if I need help in the future.

Thank you thank you thank you!
Comment

Salome-Joelle Gass

Join Date: Oct 2021
Posts: 30

04 May 2022, 08:06

I have a similar issue. Each patient has an observation for each day (but they are added on a rolling basis, so the number of patients changes each day). For each day I have created a variable that shows how many pills were taken that day. My end goal is to find the mode across those variables of the number of pills taken each day for each patient so that at the end there is only one observation for each client with the mode of the number of pills.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 date str7 idcode float(intake22761 intake22762 intake22763 intake22764 intake22765 intake22766 intake22767 intake22768)
"2022-04-26" "013640" 1 . . . . . . .
"2022-04-27" "014202" . 1 . . . . . .
"2022-04-27" "022228" . 1 . . . . . .
"2022-04-27" "026318" . 1 . . . . . .
"2022-04-27" "028036" . 1 . . . . . .
"2022-04-27" "013640" . 1 . . . . . .
"2022-04-27" "012819" . 1 . . . . . .
"2022-04-27" "034389" . 1 . . . . . .
"2022-04-27" "016370" . 0 . . . . . .
"2022-04-28" "014202" . . 1 . . . . .
"2022-04-28" "022228" . . 1 . . . . .
"2022-04-28" "026318" . . 1 . . . . .
"2022-04-28" "028036" . . 1 . . . . .
"2022-04-28" "013640" . . 1 . . . . .
"2022-04-28" "012819" . . 2 . . . . .
"2022-04-28" "034389" . . 1 . . . . .
"2022-04-28" "011796" . . 1 . . . . .
"2022-04-28" "016370" . . 0 . . . . .
"2022-04-28" "041704" . . 2 . . . . .
"2022-04-29" "014202" . . . 1 . . . .
"2022-04-29" "003984" . . . 2 . . . .
"2022-04-29" "025110" . . . 1 . . . .
"2022-04-29" "022228" . . . 1 . . . .
"2022-04-29" "026318" . . . 1 . . . .
"2022-04-29" "037986" . . . 1 . . . .
"2022-04-29" "028036" . . . 1 . . . .
"2022-04-29" "024318" . . . 1 . . . .
"2022-04-29" "028124" . . . 1 . . . .
"2022-04-29" "026644" . . . 2 . . . .
"2022-04-29" "020905" . . . 2 . . . .
"2022-04-29" "013640" . . . 2 . . . .
"2022-04-29" "012819" . . . 2 . . . .
"2022-04-29" "007176" . . . 1 . . . .
"2022-04-29" "028487" . . . 2 . . . .
"2022-04-29" "029325" . . . 1 . . . .
"2022-04-29" "023833" . . . 1 . . . .
"2022-04-29" "034389" . . . 1 . . . .
"2022-04-29" "011796" . . . 1 . . . .
"2022-04-29" "034893" . . . 1 . . . .
"2022-04-29" "005646" . . . 1 . . . .
"2022-04-29" "034788" . . . 1 . . . .
"2022-04-29" "019488" . . . 2 . . . .
"2022-04-29" "016370" . . . 0 . . . .
"2022-04-29" "014945" . . . 1 . . . .
"2022-04-29" "041704" . . . 2 . . . .
"2022-04-30" "014202" . . . . 1 . . .
"2022-04-30" "003984" . . . . 1 . . .
"2022-04-30" "025110" . . . . 1 . . .
"2022-04-30" "022228" . . . . 1 . . .
"2022-04-30" "026318" . . . . 1 . . .
"2022-04-30" "037986" . . . . 1 . . .
"2022-04-30" "028036" . . . . 2 . . .
"2022-04-30" "024318" . . . . 1 . . .
"2022-04-30" "017144" . . . . 1 . . .
"2022-04-30" "005371" . . . . 2 . . .
"2022-04-30" "028124" . . . . 1 . . .
"2022-04-30" "026644" . . . . 1 . . .
"2022-04-30" "020905" . . . . 1 . . .
"2022-04-30" "013640" . . . . 1 . . .
"2022-04-30" "012819" . . . . 1 . . .
"2022-04-30" "007176" . . . . 1 . . .
"2022-04-30" "028487" . . . . 2 . . .
"2022-04-30" "029325" . . . . 1 . . .
"2022-04-30" "030197" . . . . 1 . . .
"2022-04-30" "023833" . . . . 1 . . .
"2022-04-30" "034621" . . . . 1 . . .
"2022-04-30" "034389" . . . . 1 . . .
"2022-04-30" "023969" . . . . 1 . . .
"2022-04-30" "011796" . . . . 1 . . .
"2022-04-30" "007670" . . . . 1 . . .
"2022-04-30" "036948" . . . . 1 . . .
"2022-04-30" "016124" . . . . 1 . . .
"2022-04-30" "034893" . . . . 1 . . .
"2022-04-30" "005646" . . . . 1 . . .
"2022-04-30" "034788" . . . . 1 . . .
"2022-04-30" "019488" . . . . 2 . . .
"2022-04-30" "028959" . . . . 1 . . .
"2022-04-30" "016370" . . . . 0 . . .
"2022-04-30" "014945" . . . . 2 . . .
"2022-04-30" "041258" . . . . 1 . . .
"2022-04-30" "041704" . . . . 1 . . .
"2022-05-01" "014202" . . . . . 2 . .
"2022-05-01" "003984" . . . . . 1 . .
"2022-05-01" "025110" . . . . . 1 . .
"2022-05-01" "022228" . . . . . 1 . .
"2022-05-01" "026318" . . . . . 1 . .
"2022-05-01" "037986" . . . . . 1 . .
"2022-05-01" "028036" . . . . . 1 . .
"2022-05-01" "024318" . . . . . 1 . .
"2022-05-01" "017144" . . . . . 1 . .
"2022-05-01" "005371" . . . . . 1 . .
"2022-05-01" "028124" . . . . . 1 . .
"2022-05-01" "026644" . . . . . 1 . .
"2022-05-01" "020905" . . . . . 1 . .
"2022-05-01" "013640" . . . . . 2 . .
"2022-05-01" "012819" . . . . . 2 . .
"2022-05-01" "007176" . . . . . 1 . .
"2022-05-01" "028487" . . . . . 2 . .
"2022-05-01" "029325" . . . . . 1 . .
"2022-05-01" "030197" . . . . . 1 . .
end
label values intake22761 intakes
label values intake22762 intakes
label values intake22763 intakes
label values intake22764 intakes
label values intake22765 intakes
label values intake22766 intakes
label values intake22767 intakes
label values intake22768 intakes
label def intakes 1 "Once per day", modify
label def intakes 0 "2 times per day", modify
label def intakes 2 "Took more than once per day", modify

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35241
#6

04 May 2022, 10:04

Your data example implies that egen, rowtotal() would give you what you want.

I don't know a row mode function or command. If you really needed a row mode, I think you might need to reshape long, use egen, mode() and then reshape wide.
Comment

Announcement