Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with Value Labels in Categorical Data

    Dear Statalist,

    I have several chi-square tests to be run on categorical variables, so to automate this process I am trying to take advantage of the data stored after the tabulate and sum commands to write those results to a tab delimited text file that can be opened in Excel. This works fine, except that depending on the variable (some are dichotomous, some have three or four categories) there are zeros for some rows. Rather than remove the zeros in the resultant Excel file, is there a way to only display the results for the non-missing value labels of each variable only? The code used within the write command is given below. Columns are written for N0 _var__num, N1_var_num and total_var_num (num representing each row of the categorical variable) and p. I am new to this, so there is probably a more sophisticated way of getting the same result. Thanks.

    foreach var of varlist Cat1 Cat2 Cat3 {
    qui tab `var' Obese, chi2 row
    local p=r(p)

    foreach num of numlist 0/4 {
    foreach i of numlist 0/1 {

    local forlab: value label `var'
    local label: label `forlab' `num'

    qui sum Obese if `var' ==`num'
    matrix N`i'_`num' =r(N)
    svmat N`i'_`num' , names(N`i'_`var'`num')
    }
    qui sum Obese if `var' ==`num'
    matrix total_`num' = r(N)
    svmat total_`num' , names(total_`var'`num')

    local cat=`num'+1

  • #2
    This shows some related technique. By construction Cat2 to Cat5 are categorical variables with 2 to 5 categories.

    My strong recommendation on strategy: Don't fiddle around with lots of matrices and extra variables. Just set up a few variables and then populate them with results.

    Code:
    clear
    
    * sandbox 
    
    set obs 100 
    set seed 2803 
    
    forval j = 2/5 { 
       gen Cat`j' = runiformint(1, `j') 
    }
    
    gen Obese = runiform() > 0.7 
    
    list 
    
    * technique 
    
    gen p = . 
    gen which = "" 
    local i = 0 
    
    quietly forval k = 1/5 { 
        gen N`k' = . 
    } 
    
    quietly forval j = 2/5 { 
        local ++i  
            
        tab Obese Cat`j', chi2  
        replace which = "Cat`j'" in `i' 
        replace p = r(p) in `i' 
        
        forval k = 1/5 { 
            count if Cat`j' == `k' 
            replace N`k' = r(N) in `i' 
        }
    }
    
    l which p N* in 1/`i' 
    
        
         +-------------------------------------------+
         | which          p   N1   N2   N3   N4   N5 |
         |-------------------------------------------|
      1. |  Cat2   .8279632   53   47    .    .    . |
      2. |  Cat3   .7307598   29   24   47    .    . |
      3. |  Cat4   .0666899   21   24   32   23    . |
      4. |  Cat5   .4489087   22   19   19   19   21 |
         +-------------------------------------------+

    Comment


    • #3
      Just as a side note, if you wish to tabulate and get the chi-square of one variable versus several other, you may prefer to use - tab2 Obese Cat1 Cat2 Cat3, firstonly chi2 - to reach the result under a single line. The option - missing - is only to "treat missing values like other values", hence the lack of this option won't provide missing values. I didn't understand why to remove zeroes, since this is something that "happens" really happened in this case and is part and parcel of the estimation process. To end, even though I haven't yet delved with it, matcell, matrow and matcol, as provided by tabulate, may be related to your query.
      Best regards,

      Marcos

      Comment


      • #4
        Dear Nick and Marcos,
        Thanks for your replies. This is a sample of the output I got when I ran the code of categorical variables versus gender. Num 0 /4 is specified in the code to accommodate region which has four categories. Obese is a dichotomous variable so the remainder of the rows show zeros. BMI also has rows with zeros as it has three categories. What I want to do is to only display results for the range of value labels for each variable (i.e. remove the greyed out rows) in the table.
        Variable Label Female Male Total p
        Obese Not Obese 16 16 16 0.82
        Obese Obese 4 4 4 0.82
        Obese 2 0 0 0 0.82
        Obese 3 0 0 0 0.82
        Obese 4 0 0 0 0.82
        Region_Coded 0 0 0 0 0.51
        Region_Coded South-East 8 8 8 0.51
        Region_Coded North-East 4 4 4 0.51
        Region_Coded Western 4 4 4 0.51
        Region_Coded Southern 4 4 4 0.51
        BMICat 0 0 0 0 0.79
        BMICat Under/Normal Weight 0 0 0 0.79
        BMICat Overweight 0 0 0 0.79
        BMICat Obese 0 0 0 0.79
        BMICat 4 0 0 0 0.79

        Comment


        • #5
          You have a "problem" with the labels. I preferred to classify it as a "problem", instead of using the term "mess" to describe this mishap. For example, I strongly believe an individual can be either obese or nonobese. Maybe 2,3 and 4 were codes for missing values, who knows. Why nobody was included in a BMI cat, this is something for you to check. Missing values overall? The same for Region_coded 0. Just tidy the labels up, and some issues will reach the appropriate solution. Surely, p-values will change. By the way, had such zeros been "true", you'd be supposed to avoid a chi-squared test. Hopefully that helps.
          Best regards,

          Marcos

          Comment


          • #6
            Thanks, Marcos,

            Please see the corrected code and output table, below. I am trying to limit the output to only the valid (min-max) value labels for each variable. Obese is coded as 0/1, so there are zeros for the Ns in the remaining rows.Region is coded as 1/4 and BMICat as 1/3. Is there a simple way to adjust the code to restrict the output to these values (i.e. only valid values for the value labels) ?

            foreach var of varlist Obese Region_Coded BMICat {
            qui tab `var' Sex, chi2 row
            local p =r(p)

            foreach num of numlist 0/4 {
            foreach i of numlist 0/2 {

            local forlab: value label `var'
            local label: label `forlab' `num'

            qui sum Sex if `var' ==`num' & Sex==`i'
            matrix N`i'_`num' =r(N)
            svmat N`i'_`num' , names(N`i'_`var'`num')
            }

            qui sum Sex if `var' ==`num'
            matrix total_`num' = r(N)
            svmat total_`num' , names(total_`var'`num')

            local cat=`num'+1

            Variable Label N0 N1 Total p
            Obese Not Obese 9 7 16 0.82
            Obese Obese 2 2 4 0.82
            Obese 2 0 0 0 0.82
            Obese 3 0 0 0 0.82
            Obese 4 0 0 0 0.82
            Region_Coded 0 0 0 0 0.51
            Region_Coded South-East 5 3 8 0.51
            Region_Coded North-East 3 1 4 0.51
            Region_Coded Western 1 3 4 0.51
            Region_Coded Southern 2 2 4 0.51
            BMICat 0 0 0 0 0.79
            BMICat Under/Normal Weight 5 5 10 0.79
            BMICat Overweight 4 2 6 0.79
            BMICat Obese 2 2 4 0.79
            BMICat 4 0 0 0 0.79

            Comment

            Working...
            X