How to select values of a variable according to discrepancies in appearance between 2 groups?

Mattia Di Segni

Join Date: Dec 2024

Posts: 8
#1

31 Dec 2024, 10:55

Dear Sirs,
I'm working with Stata 18 for Mac (Intel 64-bit).
I'm using a dataset with 100 variables and 481,000 records, extracted from a relational database. The subjects in the database have been dichotomized according to an integer cut-off, which divides them into 2 groups, representing 88.87% and 11.13% of the total sample, respectively. Is there any means to use a variable with a lot of values (such as ICD-9 diagnoses) to select those variables in which the expected distribution for each category are greater than the expected value for the less represented group and to have it reported in a table?
Thank you for your support and Happy New Year
Mattia
Mike Lacy

Join Date: Apr 2014

Posts: 2407
#2

31 Dec 2024, 11:53

What you ask for is not entirely clear, at least not to me and perhaps to others. I'm guessing that you want to find out which variables among your 100 have distributions that differ between your two groups. Is that correct?

If that is what you want, there will be many more issues about which kinds and amounts of such differences would be most important to you because, among other things, having "a lot of values" for each variable will considerably complicate things. Another thought is that judging "different" on the basis of some hypothesis test is not likely to be helpful, as with 481,000 records, many (most? all?) differences are likely to appear as "significant." Knowing how big "a lot" is could be useful here, as would knowing what purpose you want information about such differences to serve.
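As a rough illustration of the sample-size point, here is a minimal sketch using Stata's immediate command -csi-. The counts are invented: they only mimic two groups of about 11.13% and 88.87% of 481,000 records, with a hypothetical diagnosis present in 2.2% of the smaller group and 2.0% of the larger one.

```stata
* Hypothetical counts, not taken from the poster's data.
* csi takes: exposed cases, unexposed cases, exposed noncases, unexposed noncases
* Smaller group: 1,178 of 53,535 records have the diagnosis (about 2.2%)
* Larger group:  8,549 of 427,465 records have the diagnosis (about 2.0%)
csi 1178 8549 52357 418916

* The prevalence difference is tiny, yet the test still rejects at the 5% level
display "prevalence difference = " %6.4f r(rd) "   p-value = " %6.4f r(p)
```

A difference of roughly 0.2 percentage points is flagged as "significant" here purely because of the number of records, which is why a threshold on the size of the difference may be more informative than a p-value.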

My apologies if I have interpreted your situation and interests incorrectly.
Clyde Schechter

Join Date: Apr 2014

Posts: 29976
#3

31 Dec 2024, 11:55

Is there any means to use a variable with a lot of values (such as ICD-9 diagnoses) to select those variables in which the expected distribution for each category are greater than the expected value for the less represented group

I don't understand what this means.

If you do not get a response from somebody else today, I suggest posting back with a) example data, b) a detailed explanation of the desired calculation, and c) an example of the calculation (done by hand) applied to a few observations from your example data. In doing that, be particularly careful to explain what you mean by "select those variables." Does this call for a list of the names of the variables? Or purging the data set of all the other variables? Or creating a new data set containing only the variables meeting the criterion? Setting characteristics for those variables? Renaming them in some systematic way? Something else? Also explain what is meant by "expected distribution." And explain "for each category... ." Each category of what?

For showing the example data, please use the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
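For readers who have not used -dataex- before, a minimal sketch of a typical call follows. It uses the auto data set that ships with Stata purely as a stand-in; the variable names are from that data set, not from the poster's data.

```stata
* Load a built-in example data set (stand-in for your own data)
sysuse auto, clear

* Produce a copy-and-paste-ready listing of selected variables, first 5 observations
dataex make price mpg foreign in 1/5
```

The output in the Results window, including its opening and closing code delimiters, can be pasted directly into a forum post.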

Added: Crossed with #2. Mike Lacy's questions raise additional unclear aspects of the question that I had not yet even perceived.
Mattia Di Segni

Join Date: Dec 2024

Posts: 8
#4

31 Dec 2024, 12:31

Dear Mike and Clyde,
Thank you for your availability. I'll try to make my explanation and my needs clearer.
This is a dataset from an ER. It contains all accesses to the ER in 5 years.
Among the variables there are:
ID number of patient according to the source software;

ID number of access to ER;

patient age at access;

gender;

reason of access (39 unique values);

diagnosis;

related ICD-9 diagnosis code (5,975 unique values);

frequency of access for each patient each year (computed through egen);

category of patient according to the number of access.

The latter is dichotomous; relative frequencies are 88.87% and 11.13%, respectively.
I need to explore and select those values of exit diagnosis and reason of access that differ between the two groups. In my opinion, the best way to identify this significance would be to find a threshold of expected values related to the prevalence of each group. Then I'd test the significance of the difference in their distribution between the two groups for each unique value.
Is there any means to automate the identification of such values (and speed it up compared with checking them one by one)? Could it be done without employing programming languages other than Stata?
Thank you for your support
Happy New Year
Clyde Schechter

Join Date: Apr 2014

Posts: 29976
#5

31 Dec 2024, 13:51

I think I understand what you want now. Mike Lacy already pointed out in #2 that testing for significant differences in a data set this large is probably a waste of time. And repeatedly performing statistical significance tests dozens of times does violence to the very concept of statistical significance and will net you plenty of "false positives" even in a data set of moderate size.
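To put a rough number on the false-positive concern, here is a small arithmetic sketch. It assumes one test per distinct value of the two variables described in #4 (39 reasons of access and 5,975 ICD-9 codes), each at the conventional 0.05 level, and it assumes, for the sake of argument, that no value truly differs between the groups.

```stata
* Number of tests: one per distinct value (counts taken from post #4)
local ntests = 39 + 5975
local alpha  = 0.05

* Expected number of "significant" results if no true differences exist
display "tests: `ntests'   expected false positives: " %5.1f (`ntests' * `alpha')

* A Bonferroni-style per-test threshold that would keep the family-wise rate at alpha
display "Bonferroni per-test threshold: " %10.8f (`alpha' / `ntests')
```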

That said, this kind of automated comparison of the prevalence of each value of a variable between two subsets of the data, defined by some other variable, can be done in Stata. But to develop and test the code, example data to work with is needed. So please use the -dataex- command to create a suitable example and post back with that. If you are concerned about confidentiality issues, I suggest omitting from the example the patient and access ID numbers, age, and gender as these, as far as I can tell from the description, are of no relevance to the question being asked. And with those variables omitted, the data contains no identifiers, so there are no confidentiality problems with posting it.

Mattia Di Segni

Join Date: Dec 2024
Posts: 8
#6

01 Jan 2025, 03:44

Dear Clyde,
I understand and share what you stated about the significance issue with such a volume of data. For this reason I wanted to select those values that may differ.
Please find attached an example taken from the dataset; I hope the Italian text does not cause any problems.
I selected Anno (year), Causaaccesso (reason of access), Diagnosi (diagnosis), accesso (access), Frequenza (frequency of access each year for each ID), and frequent_user (the dichotomous variable that classifies patients by yearly access to the ER).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int Anno str63 Causaaccesso str185 Diagnosi float(accesso Frequenza frequent_user)
2019 "Altri sintomi o disturbi"                 " " 1 1 0
2018 "Altri sintomi o disturbi"                 " " 1 3 0
2018 "Altri sintomi o disturbi"                 " " 1 1 0
2019 "Alterazione del ritmo"                    " " 1 1 0
2018 "Altri sintomi o disturbi"                 " " 1 1 0
2018 "Dispnea"                                  " " 1 2 0
2015 "Altri sintomi o disturbi"                 " " 1 3 0
2018 "Incidenti in altri luoghi"                " " 1 1 0
2020 "Altri sintomi o disturbi"                 " " 1 1 0
2017 "Altri sintomi o disturbi"                 " " 1 3 0
2020 "Altri sintomi o disturbi"                 " " 1 2 0
2018 "Altri sintomi o disturbi"                 " " 1 2 0
2015 "Altri sintomi o disturbi"                 " " 1 2 0
2017 "Altri sintomi o disturbi"                 " " 1 2 0
2016 "Altri sintomi o disturbi"                 " " 1 2 0
2022 "Sintomi o disturbi odontostomatologici"   " " 1 2 0
2017 "Sintomi o disturbi otorinolaringoiatrici" " " 1 3 0
2018 "Sintomi o disturbi dermatologici"         " " 1 1 0
2017 ""                                         " " 1 2 0
2016 "Altri sintomi o disturbi"                 " " 1 1 0
2015 "Altri sintomi o disturbi"                 " " 1 1 0
2021 "Alterazione del ritmo"                    " " 1 4 0
2016 "Altri sintomi o disturbi"                 " " 1 3 0
2015 "Altri sintomi o disturbi"                 " " 1 3 0
2017 "Altri sintomi o disturbi"                 " " 1 1 0
2018 "Sintomi o disturbi urologici"             " " 1 2 0
2022 "Altri sintomi o disturbi"                 " " 1 3 0
2019 "Altri sintomi o disturbi"                 " " 1 1 0
2021 ""                                         " " 1 1 0
2019 "incidente sul lavoro"                     " " 1 2 0
2019 "Incidenti in altri luoghi"                " " 1 2 0
2022 "Altri sintomi o disturbi"                 " " 1 3 0
2016 "Altri sintomi o disturbi"                 " " 1 4 0
2017 "Altri sintomi o disturbi"                 " " 1 1 0
2018 "Altri sintomi o disturbi"                 " " 1 4 0
2021 "Altri sintomi o disturbi"                 " " 1 1 0
2018 ""                                         " " 1 2 0
2015 "Altri sintomi o disturbi"                 " " 1 1 0
2016 ""                                         " " 1 2 0
2020 "Psichiatrico"                             " " 1 1 0
2018 "Altri sintomi o disturbi"                 " " 1 1 0
2019 "Altri sintomi o disturbi"                 " " 1 3 0
2016 "Altri sintomi o disturbi"                 " " 1 3 0
2017 "Altri sintomi o disturbi"                 " " 1 2 0
2018 "Altri sintomi o disturbi"                 " " 1 2 0
2017 "Altri sintomi o disturbi"                 " " 1 1 0
2020 ""                                         " " 1 1 0
2018 ""                                         " " 1 4 0
2018 "Altri sintomi o disturbi"                 " " 1 2 0
2019 "Altri sintomi o disturbi"                 " " 1 3 0
end
label values frequent_user frequent_user
label def frequent_user 0 "Non-frequent user", modify
label var Anno "Anno" 
label var Causaaccesso "Causa accesso" 
label var Diagnosi "Diagnosi" 
label var Frequenza "Frequenza annuale di accessi" 
label var frequent_user "Tipo di utenza"

Thank you for your help!


Mike Lacy

Join Date: Apr 2014

Posts: 2407
#7

01 Jan 2025, 11:27

My understanding (?) would be that Causaaccesso and Diagnosi would represent two of the 100 variables whose distributions you want to compare. Is that correct? If so, your example is not helpful regarding Diagnosi, as that variable is missing for all observations. Also, I'd point out that no comparison of two groups can be made with your example, as all observations are 0 for "frequent_user." Can you offer an improved example? (Again, perhaps I misunderstand.)

Two more questions: Do you want Anno (year) taken into account in the analysis, perhaps by making your comparisons separately for each year? I can understand reasons why you might or might not want to do this. Also, do you want Frequenza taken into account, perhaps as some kind of weight? Or was it just used to create the frequent_user variable, and is not otherwise to be included in the analysis?

Clyde Schechter

Join Date: Apr 2014
Posts: 29976
#8

01 Jan 2025, 11:48

Thanks. Your example data isn't quite suitable for developing this because all of the observations have frequent_user == 0, and because all values of variable Diagnosi are missing. So I developed the code from a modified version of your data example that overcomes these limitations.

Code:

clear*
* Example generated by -dataex-. For more info, type help dataex
clear
input int Anno str63 Causaaccesso str4 Diagnosi float(accesso Frequenza frequent_user)
2019 "Altri sintomi o disturbi"                 "Dx 1" 1 1 0
2018 "Altri sintomi o disturbi"                 "Dx 2" 1 3 1
2018 "Altri sintomi o disturbi"                 "Dx 2" 1 1 0
2019 "Alterazione del ritmo"                    "Dx 3" 1 1 0
2018 "Altri sintomi o disturbi"                 "Dx 1" 1 1 0
2018 "Dispnea"                                  "Dx 2" 1 2 0
2015 "Altri sintomi o disturbi"                 "Dx 5" 1 3 0
2018 "Incidenti in altri luoghi"                "Dx 5" 1 1 0
2020 "Altri sintomi o disturbi"                 "Dx 4" 1 1 1
2017 "Altri sintomi o disturbi"                 "Dx 1" 1 3 0
2020 "Altri sintomi o disturbi"                 "Dx 4" 1 2 0
2018 "Altri sintomi o disturbi"                 "Dx 1" 1 2 0
2015 "Altri sintomi o disturbi"                 "Dx 4" 1 2 0
2017 "Altri sintomi o disturbi"                 "Dx 3" 1 2 0
2016 "Altri sintomi o disturbi"                 "Dx 2" 1 2 0
2022 "Sintomi o disturbi odontostomatologici"   "Dx 3" 1 2 0
2017 "Sintomi o disturbi otorinolaringoiatrici" "Dx 3" 1 3 1
2018 "Sintomi o disturbi dermatologici"         "Dx 1" 1 1 1
2017 ""                                         "Dx 1" 1 2 0
2016 "Altri sintomi o disturbi"                 "Dx 4" 1 1 0
2015 "Altri sintomi o disturbi"                 "Dx 5" 1 1 0
2021 "Alterazione del ritmo"                    "Dx 4" 1 4 0
2016 "Altri sintomi o disturbi"                 "Dx 3" 1 3 0
2015 "Altri sintomi o disturbi"                 "Dx 4" 1 3 0
2017 "Altri sintomi o disturbi"                 "Dx 1" 1 1 1
2018 "Sintomi o disturbi urologici"             "Dx 4" 1 2 0
2022 "Altri sintomi o disturbi"                 "Dx 4" 1 3 0
2019 "Altri sintomi o disturbi"                 "Dx 4" 1 1 1
2021 ""                                         "Dx 1" 1 1 0
2019 "incidente sul lavoro"                     "Dx 2" 1 2 0
2019 "Incidenti in altri luoghi"                "Dx 2" 1 2 0
2022 "Altri sintomi o disturbi"                 "Dx 3" 1 3 1
2016 "Altri sintomi o disturbi"                 "Dx 5" 1 4 0
2017 "Altri sintomi o disturbi"                 "Dx 3" 1 1 0
2018 "Altri sintomi o disturbi"                 "Dx 3" 1 4 0
2021 "Altri sintomi o disturbi"                 "Dx 2" 1 1 0
2018 ""                                         "Dx 3" 1 2 0
2015 "Altri sintomi o disturbi"                 "Dx 3" 1 1 0
2016 ""                                         "Dx 4" 1 2 0
2020 "Psichiatrico"                             "Dx 2" 1 1 0
2018 "Altri sintomi o disturbi"                 "Dx 5" 1 1 0
2019 "Altri sintomi o disturbi"                 "Dx 2" 1 3 0
2016 "Altri sintomi o disturbi"                 "Dx 4" 1 3 0
2017 "Altri sintomi o disturbi"                 "Dx 1" 1 2 0
2018 "Altri sintomi o disturbi"                 "Dx 1" 1 2 0
2017 "Altri sintomi o disturbi"                 "Dx 2" 1 1 0
2020 ""                                         "Dx 3" 1 1 0
2018 ""                                         "Dx 5" 1 4 1
2018 "Altri sintomi o disturbi"                 "Dx 5" 1 2 0
2019 "Altri sintomi o disturbi"                 "Dx 2" 1 3 0
end
label values frequent_user frequent_user
label def frequent_user 0 "Non-frequent user", modify
label def frequent_user 1 "Frequent user", modify
label var Anno "Anno"
label var Causaaccesso "Causa accesso"
label var Diagnosi "Diagnosi"
label var Frequenza "Frequenza annuale di accessi"
label var frequent_user "Tipo di utenza"

frame create differences_found    str32 variable str144 value
foreach v of varlist Causaaccesso Diagnosi {
    levelsof `v', local(levels)
    foreach l of local levels {
        gen dv = (`v' == `"`l'"') if !missing(`v')
        quietly cs dv frequent_user
        if r(p) < 0.05 {
            frame post differences_found ("`v'") (`"`l'"')
        }
        drop dv
    }
}

frame differences_found: list, noobs clean

Note: This code assumes that all of the variables you are interested in doing this with are string variables, as Causaaccesso and Diagnosi are in your example. This code will not work for numeric variables being contrasted across the two groups. If you do have such variables, post back with a new example that includes at least one such, if you want help modifying the code to accommodate that.
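For what it is worth, one possible way to let the same loop handle numeric variables as well is sketched below. This is an untested-on-real-data variation, not the code from this post: the toy data, the numeric variable triage, and the 0.40 threshold are all invented, and the approach assumes numeric variables are integer-coded (so that the values returned by -levelsof- compare exactly).

```stata
clear
input byte triage str1 Causa byte frequent_user
2 "X" 1
2 "X" 1
2 "Y" 1
1 "Y" 1
1 "X" 0
1 "X" 0
1 "X" 0
1 "X" 0
1 "Y" 0
1 "Y" 0
1 "Y" 0
2 "Y" 0
end

capture frame drop differences_found
frame create differences_found str32 variable str144 value
foreach v of varlist triage Causa {
    * _rc is 0 only when the variable is a string variable
    capture confirm string variable `v'
    local is_string = (_rc == 0)
    quietly levelsof `v', local(levels)
    foreach l of local levels {
        if `is_string' {
            quietly gen byte dv = (`v' == `"`l'"') if !missing(`v')
        }
        else {
            * numeric values are compared without quotes
            quietly gen byte dv = (`v' == `l') if !missing(`v')
        }
        quietly cs dv frequent_user
        * hypothetical criterion: prevalence difference above 0.40
        if r(rd) > 0.40 {
            frame post differences_found ("`v'") (`"`l'"')
        }
        drop dv
    }
}
frame differences_found: list, noobs clean
```

In this toy example only triage == 2 is posted, since it occurs in 3 of 4 frequent users but 1 of 8 non-frequent users. Numeric values are stored in the results frame as text.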

I have followed your lead by testing for statistically significant differences, but, at the risk of beating a dead horse, I don't think this is a good way to do it. I would be more likely to do it by imposing a threshold on the prevalence difference, or prevalence ratio. If you become persuaded to follow one of those paths, change -if r(p) < 0.05- to -if r(rd) > #- or -if r(rr) > #-, replacing # by the threshold value you choose for the difference or ratio, respectively.
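A minimal sketch of the threshold variant (a condition on r(rd) in place of the r(p) test) follows. The toy data, the 0.40 threshold, and the two extra columns prev_diff and prev_ratio are illustrative assumptions, not part of the code above.

```stata
clear
input str4 Diagnosi byte frequent_user
"A" 1
"A" 1
"A" 1
"A" 1
"B" 1
"C" 1
"A" 0
"A" 0
"B" 0
"B" 0
"B" 0
"B" 0
"B" 0
"B" 0
"B" 0
"C" 0
"C" 0
"C" 0
"C" 0
"C" 0
"C" 0
end

* Hypothetical cut-off on the prevalence difference (frequent minus non-frequent)
local threshold 0.40

capture frame drop differences_found
frame create differences_found str32 variable str144 value double prev_diff double prev_ratio
foreach v of varlist Diagnosi {
    quietly levelsof `v', local(levels)
    foreach l of local levels {
        quietly gen byte dv = (`v' == `"`l'"') if !missing(`v')
        quietly cs dv frequent_user
        * Keep the value only if the prevalence difference exceeds the threshold
        if r(rd) > `threshold' {
            frame post differences_found ("`v'") (`"`l'"') (r(rd)) (r(rr))
        }
        drop dv
    }
}
frame differences_found: list, noobs clean
```

Here only "A" is posted (prevalence 4/6 among frequent users versus 2/15 among the others). Wrapping the condition in abs() would also catch values that are less common among frequent users. If the ratio is used instead, note that Stata treats a missing value as larger than any number, so a condition such as -if r(rr) > # & !missing(r(rr))- may be safer.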

Note that in addition to listing all of the variables and values for which differences (by your chosen criterion) are found, the code creates a data set of the same information in frame differences_found, which you might want to save or perform additional work with.
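As a hedged sketch of what saving or further work with that frame might look like: the single row posted below is an invented stand-in for real results, and the file name is hypothetical.

```stata
* Stand-in for the frame built by the loop; the posted row is invented
capture frame drop differences_found
frame create differences_found str32 variable str144 value
frame post differences_found ("Diagnosi") ("Dx 1")

* Save the frame's contents as a Stata data file (hypothetical file name)
frame differences_found: save "differences_found.dta", replace

* Or switch to the frame, work with it, and switch back
frame change differences_found
sort variable value
list, noobs clean
frame change default
```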

Added: Crossed with #7.


Mattia Di Segni

Join Date: Dec 2024
Posts: 8
#9

01 Jan 2025, 14:41

Dear Mike and Clyde,
Thank you for your patience. I have added more records (now 100) with Diagnosi filled in, and more variables: age (Eta), patient ID (PatIDSANCore), urgency (Urgenza), accesso (number of accesses, generated through egen: count(accesso), by(Anno PatientID)), Frequenza (frequency) and user type (frequent_user). The label commands are included.
Please find attached the example.
To answer Mike's question, Anno was used together with PatientID to get the frequency of access for each patient and categorize them according to the frequency of access. I cannot rule out that Frequenza may turn out to be useful later in the analysis.
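For readers following along, the construction described here (counting accesses per patient and year with egen, then dichotomizing) might be sketched as follows. The toy data, the short patient IDs, and the cut-off of 3 are invented for illustration; the actual cut-off used for frequent_user is not stated in this thread.

```stata
clear
input int Anno str9 PatientID
2015 "P1"
2015 "P1"
2015 "P1"
2015 "P2"
2016 "P1"
2016 "P2"
2016 "P2"
end

* One record per access, so each row counts as one access
gen byte accesso = 1

* Number of accesses for each patient within each year
bysort PatientID Anno: egen Frequenza = count(accesso)

* Hypothetical integer cut-off; replace with the one actually used
local cutoff 3
gen byte frequent_user = (Frequenza >= `cutoff')

list, sepby(PatientID Anno) noobs
```

In this sketch the classification is made per patient-year; a patient-level classification would need an extra step (for example, taking the maximum of Frequenza within PatientID).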

Thank you for your support

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int Anno str9 PatIDSANCore int Eta str9 Urgenza str63 Causaaccesso str5 CodDiag str185 Diagnosi float(accesso Frequenza frequent_user)
2015 "S10400387" 74 "Media"     "incidente domestico"                                             "82100" "FRATTURA DI PARTE NON SPECIFICATA DEL FEMORE"                                                                 1  1 0
2015 "S129680"   59 "Normale"   "Dolore toracico"                                                 "78650" "DOLORE TORACICO NON SPECIFICATO"                                                                              1  2 0
2015 "S579969"   29 "Normale"   "Altri sintomi o disturbi"                                        "7296"  "CORPO ESTRANEO RESIDUO NEI TESSUTI MOLLI"                                                                     1  2 0
2015 "S385482"   33 "Normale"   "Sintomi o disturbi otorinolaringoiatrici"                        "38840" "PERCEZIONI UDITIVE ABNORMI, NON SPECIFICATE"                                                                  1  1 0
2015 "S277706"   40 "Normale"   "Altri sintomi o disturbi"                                        "7233"  "SINDROME CERVICOBRACHIALE (DIFFUSA)"                                                                          1  2 0
2015 "S680871"   20 "Medioalta" "Dolore addominale"                                               "78906" "DOLORE ADDOMINALE EPIGASTRICO"                                                                                1  3 0
2015 "S114281"   55 "Normale"   "Sintomi o disturbi urologici"                                    "7296"  "CORPO ESTRANEO RESIDUO NEI TESSUTI MOLLI"                                                                     1  2 0
2015 "S189764"   53 "Normale"   "Altri sintomi o disturbi"                                        "6010"  "PROSTATITE ACUTA"                                                                                             1  1 0
2015 "S1190221"  58 "Normale"   "Altri sintomi o disturbi"                                        "71516" "ARTROSI LOCALIZZATA PRIMARIA, GINOCCHIO"                                                                      1 42 1
2015 "S842185"   75 "Media"     "Altri sintomi o disturbi"                                        "7802"  "SINCOPE E COLLASSO"                                                                                           1  2 0
2015 "S178446"   84 "Media"     "Altri sintomi o disturbi"                                        "82100" "FRATTURA DI PARTE NON SPECIFICATA DEL FEMORE"                                                                 1  5 1
2015 "S10395760" 74 "Media"     "Altri sintomi o disturbi"                                        "5589"  "ALTRA E NON SPECIFICATA GASTROENTERITE E COLITE NON INFETTIVA"                                                1  1 0
2015 "S230574"   85 "Media"     "Altri sintomi o disturbi"                                        "7806"  "FEBBRE"                                                                                                       1  1 0
2015 "S1479506"   4 "Media"     "Altri sintomi o disturbi"                                        "9221"  "CONTUSIONE DELLA PARTE TORACICA"                                                                              1  1 0
2015 "S352113"   38 "Media"     "incidente in strada"                                             "92300" "CONTUSIONE DELLA REGIONE DELLA SPALLA"                                                                        1  1 0
2015 "S1743018"   2 "Media"     "incidente domestico"                                             "8500"  "CONCUSSIONE CON NESSUNA PERDITA DI COSCIENZA"                                                                 1  2 0
2015 "S712584"   26 "Normale"   "Alterazione del ritmo"                                           "81383" "FRATTURA CHIUSA RADIO E ULNA, NON SPECIFICATA"                                                                1  1 0
2015 "S563670"   30 "Normale"   "incidente domestico"                                             "7241"  "RACHIALGIA DORSALE"                                                                                           1  1 0
2015 "S516790"   88 "Normale"   "Altri sintomi o disturbi"                                        "7847"  "EPISTASSI"                                                                                                    1  3 0
2015 "S1749259"   1 "Normale"   "Altri sintomi o disturbi"                                        "38870" "OTALGIA, NON SPECIFICATA"                                                                                     1 15 1
2015 "S517599"   46 "Normale"   "Altri sintomi o disturbi"                                        "78096" "DOLENZIA GENERALIZZATA"                                                                                       1  2 0
2015 "S150223"   88 "Media"     "Altri sintomi o disturbi"                                        "33811" "DOLORE ACUTO DA TRAUMA"                                                                                       1  3 0
2015 "S539104"   32 "Normale"   "Altri sintomi o disturbi"                                        "71511" "ARTROSI LOCALIZZATA PRIMARIA, SPALLA"                                                                         1  1 0
2015 "S376190"   34 "Media"     "Altri sintomi o disturbi"                                        "7242"  "LOMBALGIA"                                                                                                    1  3 0
2015 "S10390879" 53 "Media"     "Altri sintomi o disturbi"                                        "7802"  "SINCOPE E COLLASSO"                                                                                           1  1 0
2015 "S604212"   26 "Normale"   "incidente sul lavoro"                                            "9300"  "CORPO ESTRANEO DELLA CORNEA"                                                                                  1  5 1
2015 "S130897"   54 "Normale"   "Altri sintomi o disturbi"                                        "4011"  "IPERTENSIONE ESSENZIALE BENIGNA"                                                                              1  1 0
2015 "S69040"    67 "Normale"   "Altri sintomi o disturbi"                                        "5921"  "CALCOLOSI URETERALE"                                                                                          1  2 0
2015 "S24960"    57 "Medioalta" "Altri sintomi o disturbi"                                        "7802"  "SINCOPE E COLLASSO"                                                                                           1  2 0
2015 "S123213"   56 "Normale"   "Altri sintomi o disturbi"                                        "8441"  "DISTORSIONE E DISTRAZIONE LEGAMENTO COLLATERALE MEDIALE DEL GINOCCHIO"                                        1  2 0
2015 "S715540"   18 "Normale"   "Sintomi o disturbi otorinolaringoiatrici"                        "38010" "OTITE ESTERNA INFETTIVA, NON SPECIFICATA"                                                                     1  1 0
2015 "S54554"    77 "Normale"   "Altri sintomi o disturbi"                                        "8250"  "FRATTURA DEL CALCAGNO, CHIUSA"                                                                                1  3 0
2015 "S191562"   63 "Normale"   "Sintomi o disturbi odontostomatologici"                          "52560" "RESTAURO INSODDISFACENTE DI UN DENTE NON SPECIFICATO"                                                         1  1 0
2015 "S711304"   19 "Media"     "incidente in montagna (sciistico,slittino,snowboard, alpinismo)" "85011" "CONCUSSIONE CON BREVE PERDITA DI COSCIENZA CON PERDITA DI COSCIENZA DI DURATA INFERIORE O UGUALE A 30 MINUTI" 1  2 0
2015 "S10402030" 38 "Media"     "Altri sintomi o disturbi"                                        "78907" "DOLORE ADDOMINALE GENERALIZZATO"                                                                              1  1 0
2015 "S704215"   18 "Normale"   "Altri sintomi o disturbi"                                        "83104" "LUSSAZIONE CHIUSA, ACROMIOCLAVICOLARE (ARTICOLAZIONE)"                                                        1  3 0
2015 "S1560512"  23 "Normale"   "Altri sintomi o disturbi"                                        "78900" "DOLORE ADDOMINALE DI SEDE NON SPECIFICATA"                                                                    1  2 0
2015 "S302795"   39 "Normale"   "Altri sintomi o disturbi"                                        "5589"  "ALTRA E NON SPECIFICATA GASTROENTERITE E COLITE NON INFETTIVA"                                                1  1 0
2015 "S275165"   42 "Nessuna"   "Altri sintomi o disturbi"                                        "8930"  "FERITA DELLE DITA DEL PIEDE SENZA MENZIONE DI COMPLICAZIONI"                                                  1  3 0
2015 "S1066026"  11 "Media"     "Altri sintomi o disturbi"                                        "341"   "SCARLATTINA"                                                                                                  1  4 0
2015 "S158737"   73 "Normale"   "Altri sintomi o disturbi"                                        "44101" "DISSEZIONE DELL'AORTA, TORACICA"                                                                              1  2 0
2015 "S10399633" 50 "Normale"   "Altri sintomi o disturbi"                                        "8260"  "FRATTURA DI UNA O PIU FALANGI DEL PIEDE, CHIUSA"                                                              1  1 0
2015 "S609627"   48 "Media"     "Altri sintomi o disturbi"                                        "78659" "ALTRO DOLORE TORACICO"                                                                                        1  2 0
2015 "S365135"   50 "Normale"   "Sintomi o disturbi urologici"                                    "V1302" "ANAMNESI PERSONALE DI INFEZIONE URINARIA (DEL TRATTO)"                                                        1  3 0
2015 "S26555"    66 "Media"     "Altri sintomi o disturbi"                                        "78650" "DOLORE TORACICO NON SPECIFICATO"                                                                              1  1 0
2015 "S709622"   18 "Normale"   "altro incidente sportivo (non in montagna)"                      "7296"  "CORPO ESTRANEO RESIDUO NEI TESSUTI MOLLI"                                                                     1  3 0
2015 "S876997"   45 "Normale"   "Altri sintomi o disturbi"                                        "8500"  "CONCUSSIONE CON NESSUNA PERDITA DI COSCIENZA"                                                                 1  1 0
2015 "S10395467" 84 "Normale"   "Altri sintomi o disturbi"                                        "8930"  "FERITA DELLE DITA DEL PIEDE SENZA MENZIONE DI COMPLICAZIONI"                                                  1  1 0
2015 "S36981"    75 "Normale"   "Altri sintomi o disturbi"                                        "92300" "CONTUSIONE DELLA REGIONE DELLA SPALLA"                                                                        1  3 0
2015 "S10362405" 71 "Normale"   "Sintomi o disturbi otorinolaringoiatrici"                        "46400" "LARINGITE ACUTA SENZA MENZIONE DI OSTRUZIONE"                                                                 1  1 0
2015 "S192595"   79 "Normale"   "Alterazione del ritmo"                                           "7851"  "PALPITAZIONI"                                                                                                 1  2 0
2015 "S251385"   78 "Normale"   "Altri sintomi o disturbi"                                        "8260"  "FRATTURA DI UNA O PIU FALANGI DEL PIEDE, CHIUSA"                                                              1  2 0
2015 "S80972"    91 "Normale"   "Altri sintomi o disturbi"                                        "8730"  "ALTRE FERITE DEL CUOIO CAPELLUTO SENZA MENZIONE DI COMPLICAZIONI"                                             1  1 0
2015 "S1092998"  58 "Normale"   "altro incidente sportivo (non in montagna)"                      "8360"  "LACERAZIONE DELLA CARTILAGINE O DEL MENISCO MEDIALE DEL GINOCCHIO, RECENTE"                                   1  1 0
2015 "S1240124"  44 "Media"     "Altri sintomi o disturbi"                                        "4660"  "BRONCHITE ACUTA"                                                                                              1  3 0
2015 "S1668162"  29 "Normale"   "Altri sintomi o disturbi"                                        "4660"  "BRONCHITE ACUTA"                                                                                              1  3 0
2015 "S217269"   51 "Media"     "Altri sintomi o disturbi"                                        "3609"  "MALATTIE DEL GLOBO OCULARE NON SPECIFICATI"                                                                   1  1 0
2015 "S10404937" 67 "Normale"   "Altri sintomi o disturbi"                                        "6869"  "INFEZIONI LOCALIZZATE NON SPECIFICATE DELLA CUTE E DEL TESSUTO SOTTOCUTANEO"                                  1  1 0
2015 "S264484"   51 "Normale"   "Sintomi o disturbi otorinolaringoiatrici"                        "931"   "CORPO ESTRANEO NELL'ORECCHIO"                                                                                 1  2 0
2015 "S10408771" 56 "Normale"   "Sintomi o disturbi oculistici"                                   "3719"  "ALTERAZIONI CORNEALI NON SPECIFICATE"                                                                         1  1 0
2015 "S1275432"  25 "Normale"   "Altri sintomi o disturbi"                                        "5259"  "MALATTIA NON SPECIFICATA DEI DENTI E DELLE STRUTTURE DI SUPPORTO"                                             1  3 0
2015 "S1189186"  33 "Normale"   "Altri sintomi o disturbi"                                        "8470"  "DISTORSIONE E DISTRAZIONE DEL COLLO"                                                                          1  2 0
2015 "S695557"   52 "Normale"   "Altri sintomi o disturbi"                                        "55001" "ERNIA INGUINALE MONOLATERALE O NON SPECIFICATA, RICORRENTE, CON GANGRENA"                                     1  2 0
2015 "S669624"   20 "Media"     "Altri sintomi o disturbi"                                        "78601" "IPERVENTILAZIONE"                                                                                             1  1 0
2015 "S1452825"  85 "Media"     "Dolore addominale"                                               "56211" "DIVERTICOLITE DEL COLON (SENZA"                                                                               1  1 0
2015 "S296248"   38 "Normale"   "incidente sul lavoro"                                            "3609"  "MALATTIE DEL GLOBO OCULARE NON SPECIFICATI"                                                                   1  1 0
2015 "S222001"   49 "Normale"   "Sintomi o disturbi urologici"                                    "5990"  "INFEZIONE DEL SISTEMA URINARIO, SITO NON SPECIFICATO"                                                         1  6 1
2015 "S10383729"  2 "Normale"   "incidente domestico"                                             "52181" "DENTE LESIONATO"                                                                                              1  1 0
2015 "S263110"   76 "Normale"   "Altri sintomi o disturbi"                                        "3384"  "SINDROME DA DOLORE CRONICO"                                                                                   1  1 0
2015 "S74475"    82 "Normale"   "Altri sintomi o disturbi"                                        "72700" "SINOVITE E TENOSINOVITE NON SPECIFICATE"                                                                      1  1 0
2015 "S235926"   85 "Normale"   "incidente domestico"                                             "920"   "CONTUSIONE DELLA FACCIA, DEL CUOIO CAPELLUTO E DEL COLLO ESCLUSO L'OCCHIO"                                    1  6 1
2015 "S452975"   45 "Normale"   "Altri sintomi o disturbi"                                        "6010"  "PROSTATITE ACUTA"                                                                                             1  3 0
2015 "S1755422"   1 "Normale"   "Altri sintomi o disturbi"                                        "542"   "GENGIVOSTOMATITE ERPETICA"                                                                                    1  2 0
2015 "S584402"   54 "Normale"   "Altri sintomi o disturbi"                                        "8930"  "FERITA DELLE DITA DEL PIEDE SENZA MENZIONE DI COMPLICAZIONI"                                                  1  4 0
2015 "S477087"   43 "Normale"   "incidente domestico"                                             "92300" "CONTUSIONE DELLA REGIONE DELLA SPALLA"                                                                        1  2 0
2015 "S337351"   53 "Normale"   "incidente sul lavoro"                                            "8820"  "FERITA DELLA MANO, ESCLUSE LE DITA DA SOLE, SENZA MENZIONE DI COMPLICAZIONI"                                  1  3 0
2015 "S439485"   92 "Normale"   "Altri sintomi o disturbi"                                        "5589"  "ALTRA E NON SPECIFICATA GASTROENTERITE E COLITE NON INFETTIVA"                                                1  9 1
2015 "S110029"   55 "Normale"   "Sintomi o disturbi otorinolaringoiatrici"                        "462"   "FARINGITE ACUTA"                                                                                              1  2 0
2015 "S328663"   42 "Normale"   "incidente sul lavoro"                                            "84500" "DISTORSIONE E DISTRAZIONE DI SITO NON SPECIFICATO DELLA CAVIGLIA"                                             1  1 0
2015 "S193408"   56 "Normale"   "Sintomi o disturbi urologici"                                    "7880"  "COLICA RENALE"                                                                                                1  2 0
2015 "S1451205"  37 "Media"     "Altri sintomi o disturbi"                                        "33819" "ALTRI DOLORI ACUTI"                                                                                           1  4 0
2015 "S10408137" 20 "Normale"   "Altri sintomi o disturbi"                                        "78079" "ALTRO MALESSERE ED AFFATICAMENTO"                                                                             1  1 0
2015 "S468434"   61 "Normale"   "Sintomi o disturbi dermatologici"                                "6929"  "DERMATITE DA CAUSE NON SPECIFICATE"                                                                           1  2 0
2015 "S769077"   16 "Normale"   "Sintomi o disturbi otorinolaringoiatrici"                        "47400" "TONSILLITE CRONICA"                                                                                           1  3 0
2015 "S403895"   89 "Media"     "Altri sintomi o disturbi"                                        "56039" "ALTRO INTASAMENTO DELL'INTESTINO"                                                                             1  1 0
2015 "S355136"   59 "Normale"   "Altri sintomi o disturbi"                                        "71516" "ARTROSI LOCALIZZATA PRIMARIA, GINOCCHIO"                                                                      1  7 1
2015 "S269736"   42 "Normale"   "Altri sintomi o disturbi"                                        "92300" "CONTUSIONE DELLA REGIONE DELLA SPALLA"                                                                        1  1 0
2015 "S1130835"  35 "Media"     "Altri sintomi o disturbi"                                        "78650" "DOLORE TORACICO NON SPECIFICATO"                                                                              1  5 1
2015 "S10406331" 28 "Normale"   "Altri sintomi o disturbi"                                        "7231"  "CERVICALGIA"                                                                                                  1  2 0
2015 "S407589"   81 "Normale"   "Altri sintomi o disturbi"                                        "71516" "ARTROSI LOCALIZZATA PRIMARIA, GINOCCHIO"                                                                      1  3 0
2015 "S398571"   87 "Normale"   "Altri sintomi o disturbi"                                        "8419"  "DISTORSIONE E DISTRAZIONE DI SITO NON SPECIFICATO DEL GOMITO E DELL'AVAMBRACCIO"                              1  1 0
2015 "S538937"   36 "Normale"   "Dolore addominale"                                               "78900" "DOLORE ADDOMINALE DI SEDE NON SPECIFICATA"                                                                    1  1 0
2015 "S80601"    78 "Media"     "Altri sintomi o disturbi"                                        "81381" "FRATTURA CHIUSA DEL RADIO, NON SPECIFICATA"                                                                   1  1 0
2015 "S54252"    70 "Media"     "Altri sintomi o disturbi"                                        "30000" "STATO ANSIOSO NON SPECIFICATO"                                                                                1  5 1
2015 "S166881"   85 "Normale"   "Altri sintomi o disturbi"                                        "7243"  "SCIATALGIA"                                                                                                   1  3 0
2015 "S10391299" 21 "Normale"   "Altri sintomi o disturbi"                                        "6826"  "ARTO INFERIORE ECCETTO IL PIEDE"                                                                              1  1 0
2015 "S555088"   30 "Normale"   "Altri sintomi o disturbi"                                        "78903" "DOLORE ADDOMINALE DEL QUADRANTE INFERIORE DESTRO"                                                             1  5 1
2015 "S1672550"  26 "Normale"   "Altri sintomi o disturbi"                                        "68100" "FLEMMONE E ASCESSO, NON SPECIFICATO"                                                                          1  1 0
2015 "S504293"   43 "Media"     "Sintomi o disturbi otorinolaringoiatrici"                        "462"   "FARINGITE ACUTA"                                                                                              1  4 0
2015 "S371775"   35 "Normale"   "Altri sintomi o disturbi"                                        "8500"  "CONCUSSIONE CON NESSUNA PERDITA DI COSCIENZA"                                                                 1  3 0
end
label values Eta Eta_cat
label values frequent_user frequent_user
label def frequent_user 0 "Non-frequent user", modify
label def frequent_user 1 "Frequent user", modify
label var Anno "Anno" 
label var PatIDSANCore "Pat.ID SANCore" 
label var Eta "Eta" 
label var Urgenza "Urgenza" 
label var Causaaccesso "Causa accesso" 
label var CodDiag "Cod. Diag." 
label var Diagnosi "Diagnosi" 
label var Frequenza "Frequenza annuale di accessi" 
label var frequent_user "Tipo di utenza"


Clyde Schechter

Join Date: Apr 2014

Posts: 29976
#10

01 Jan 2025, 15:08

OK. The following code can handle both numeric and string variables.

Code:

capture frame drop differences_found
frame create differences_found str32 variable str144 value

foreach v of varlist Eta-Diagnosi {
    display "Processing `v'"
    quietly levelsof `v', local(levels)
    foreach l of local levels {
        if substr("`:type `v''", 1, 3) == "str" {
            gen dv = (`v' == `"`l'"') if !missing(`v')
        }
        else {
            gen dv = `l'.`v'
        }
        quietly cs dv frequent_user
        if r(p) < 0.05 {
            frame post differences_found ("`v'") (`"`l'"')
        }
        drop dv
    }
}

frame differences_found: list, noobs clean

In the example data, each observation corresponds to a distinct ID. However, you now state that both ID and Anno were used to calculate Frequenza. I cannot tell from that statement whether this means that in the full data set the same ID may have separate observations for more than one year, or whether the overall frequency across years was later calculated and only a single observation per ID was retained (rendering the Anno variable less meaningful if not altogether meaningless). If the same ID can have multiple observations in the full data set, that is another argument against using the p-value as your criterion for difference: in that situation the p-values are incorrect, because the observations from which they are calculated are not independent. (Yes, it is possible to revise the code to calculate them taking non-independence into account, but it leads to other complications in the code that don't strike me as worth working out given that the overall use of p-values for this purpose is, in my view, inappropriate.)
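
If it helps to settle that question, here is a minimal sketch for checking whether IDs repeat in the full data set. It assumes the variable names shown in the example data (PatIDSANCore and Anno), and I have not run it against the full data.

```stata
* How many IDs appear in more than one observation?
duplicates report PatIDSANCore

* Do ID and year together identify observations uniquely?
* (capture noisily shows the error message, if any, without stopping a do-file)
capture noisily isid PatIDSANCore Anno
```

If -duplicates report- shows only one copy of every ID, the independence concern above does not apply.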
Mattia Di Segni

Join Date: Dec 2024

Posts: 8
#11

02 Jan 2025, 14:48

Dear Clyde,
Thank you for replying.
I'm not sure I made clear what I want to select.
I want to find out why some users visit the ER more than 4 times in a year, and to do so I want to explore the variables Reason of Access (Causaaccesso) and Diagnosis (Diagnosi). Frequenza was calculated as I explained in #9, to dichotomize the users.

I want to select those values of Diagnosi and Causaaccesso that seem to be recorded more frequently for frequent_user==1 (labeled as Frequent user) than for frequent_user==0 (labeled as Non-frequent user). I would not use any p-value as my selection criterion. Instead, I am thinking of multiplying the prevalence of frequent_user by the total of each single row to get a cut-off and compare it with the row percentage of each reason or diagnosis of the frequent users to select them, hopefully to test them later with regression models. Does it sound reasonable?
I have not yet checked whether any patient IDs change their frequency-of-access category over time. I still haven't thought of a way to check it, though I fear I'll have to create 16 new dichotomous variables (2 for each year), unless I use a categorical encoding of -1, 0, +1 for each year and patient ID.

Last edited by Mattia Di Segni; 02 Jan 2025, 15:07. Reason: Added more indications about aims
Clyde Schechter

Join Date: Apr 2014

Posts: 29976
#12

02 Jan 2025, 15:31

multiplying the prevalence of frequent_user by the total of each single row to get a cut-off and compare it with the row percentage of each reason or diagnosis of the frequent users to select them, hopefully to test them later with regression models. Does it sound reasonable?

I don't understand it. What does "total of each single row" mean? What is the "row percentage of each reason or diagnosis?"

If the goal is to identify diagnoses or reasons of access that are associated with being a frequent user, I would do that by examining the risk ratios: the ratio of the prevalence of the diagnosis or reason of access in the frequent user group to the prevalence of the same diagnosis or reason of access in the infrequent user group. Now, in a small sample such as the example data, this doesn't work out very well, because many of the conditions never occur at all in the frequent user group, and the prevalence of any of the diagnoses/reasons of access is in most cases only somewhere in the 1-2% range, representing a handful of cases. But in the very large total data set this should work reasonably well. In the code below, I have arbitrarily set a threshold of 1.1 for the frequent:infrequent user prevalence ratio. And I have also modified the code so that the risk ratios are part of the output.

Code:

capture frame drop differences_found
frame create differences_found str32 variable str144 value float risk_ratio

foreach v of varlist Diagnosi Causaaccesso {
    display "Processing `v'"
    quietly levelsof `v', local(levels)
    foreach l of local levels {
        if substr("`:type `v''", 1, 3) == "str" {
            gen dv = (`v' == `"`l'"') if !missing(`v')
        }
        else {
            gen dv = `l'.`v'
        }
        quietly cs dv frequent_user
        if r(rr) >= 1.1 & !missing(r(rr)) {
            frame post differences_found ("`v'") (`"`l'"') (`r(rr)')
        }
        drop dv
    }
}

frame differences_found: list, noobs clean
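
As a hand check of what -cs- reports as the risk ratio, the prevalence ratio for a single level can be computed directly. This is only a sketch: it assumes Diagnosi is a string variable, as in the example data, the diagnosis string is an arbitrary pick from that example, and dv, p1 and p0 are illustrative names.

```stata
* Indicator for one diagnosis (arbitrary choice, for illustration only)
gen byte dv = (Diagnosi == "ARTROSI LOCALIZZATA PRIMARIA, GINOCCHIO") if !missing(Diagnosi)

* Prevalence of that diagnosis among frequent users
quietly summarize dv if frequent_user == 1
local p1 = r(mean)

* Prevalence of that diagnosis among non-frequent users
quietly summarize dv if frequent_user == 0
local p0 = r(mean)

* Frequent:infrequent prevalence ratio (missing if p0 is zero)
display "Prevalence ratio = " `p1'/`p0'

drop dv
```

The displayed value should agree with r(rr) from -cs dv frequent_user- for the same level.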

Added: To check which patients change frequency of access categories over time:

Code:

by PatIDSANCore (frequent_user), sort: gen byte category_change = ///
    frequent_user[1] != frequent_user[_N]

Note that this does not distinguish the direction of change, only whether a change, in either direction, occurred at some point.
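
If the direction matters, one possible extension is to compare each patient's first and last recorded year. This is a sketch only: it assumes the full data have one observation per PatIDSANCore per Anno, and change_direction is an illustrative name.

```stata
* -1 = frequent to non-frequent, 0 = same category, +1 = non-frequent to frequent,
* comparing the earliest and latest year recorded for each patient
bysort PatIDSANCore (Anno): gen byte change_direction = ///
    sign(frequent_user[_N] - frequent_user[1])
```

This compares only the endpoints, so a patient who changes category and then changes back is coded 0.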

Last edited by Clyde Schechter; 02 Jan 2025, 15:39.
Mattia Di Segni

Join Date: Dec 2024

Posts: 8
#13

02 Jan 2025, 15:45

Thank you!