
  • Help deciding what tests to use

    Hi all,
    I am comparing two variables, ILLWK (whether or not a person was ill in a reference week) and DAYSILL (how many days they were ill, if so), across 4 different occupations. ILLWK takes the value 1 if the person was ill in the week and 0 otherwise. DAYSILL can take any integer from 0 to 7. I have over 48,000 observations, as I am using quarterly data from 2012 to 2017. The majority of responses for both variables are 0, but around 1,000 observations report being ill, with DAYSILL taking a value between 1 and 7. I was planning on running an ANOVA; however, my data are not normally distributed and the variances are not equal. This rules out running a Kruskal-Wallis test and also a Welch test, leaving the Brown-Forsythe test. However, I have read that ANOVA is quite robust when you have thousands of observations, leaving me in doubt about which tests to run.

    Thanks for your help

  • #2
    Jamie:
    whatever -anova- can do, -regress- can do it better.
    In your case, -ILLWK- is not a predictor of the length of sick leave: if the person was not ill, she/he totaled zero days of sick leave. What happens in different professional clusters can be interesting, though. Moreover, you seem to have panel data: assuming that you can treat -DAYSILL- as a continuous regressand, you may want to try:
    Code:
    xtset panelid quarter
    xtreg DAYSILL i.professional_cluster
    I assume that you have other predictors or controls that you can plug into the right-hand side of your regression equation.
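To make the -anova- versus -regress- point concrete, here is a minimal sandbox sketch. The data are simulated (the variable names follow the thread, the values are made up), and the robust variant at the end is an illustration of one way to handle unequal variances, not something taken from the posts above.

```stata
* Sandbox with simulated data: names follow the thread, values are made up
clear
set seed 12345
set obs 400
* four occupations, 100 observations each
gen byte SOC10M = 1 + mod(_n, 4)
* response bounded between 0 and 7, mostly zeros
gen byte DAYSILL = rbinomial(7, 0.05)

* One-way ANOVA
anova DAYSILL SOC10M
scalar F_anova = e(F)

* The same model as a regression on occupation indicators
regress DAYSILL i.SOC10M
scalar F_reg = e(F)
display "ANOVA F = " F_anova "   regression F = " F_reg

* Heteroskedasticity-robust joint test of the occupation indicators
regress DAYSILL i.SOC10M, vce(robust)
testparm i.SOC10M
```

The two F statistics should agree, since the one-way ANOVA and the regression on indicators are the same linear model. With -regress- you can then add further predictors and switch the variance estimator.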
    Kind regards,
    Carlo
    (StataNow 19.0)


    • #3
      Originally posted by Carlo Lazzaro (#2)
      Hi Carlo,
      Thanks for your reply. I have seen many people say this, and it was my plan to run some panel regressions as well. I was probably going to ask in another thread, but whenever I try to set up my data as a panel I keep getting the errors shown in the pictures.

      I do have other predictors such as sex, age range, region of place of work etc.

      • #4
        Jamie:
        what if you type:
        Code:
        xtset newid3 quarter
        ?
        Kind regards,
        Carlo
        (StataNow 19.0)


        • #5
          This gives me error r(111), "variable quarter not found"


          • #6
            Jamie:
            can you please post an excerpt of your data via -dataex-? Thanks.
            Kind regards,
            Carlo
            (StataNow 19.0)


            • #7
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input byte DAYSILL str14 ILLWK str5 YEARQ long SOC10M float(newid newid2 newid3)
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 4 4 0 1
              0 "No" "y12q1" 4 4 0 1
              0 "No" "y12q1" 4 4 0 1
              0 "No" "y12q1" 1 1 0 1
              0 "No" "y12q1" 1 1 0 1
              0 "No" "y12q1" 4 4 0 1
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 4 4 0 1
              0 "No" "y12q1" 1 1 0 1
              0 "No" "y12q1" 4 4 0 1
              0 "No" "y12q1" 4 4 0 1
              0 "No" "y12q1" 1 1 0 1
              0 "No" "y12q1" 1 1 0 1
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 1 1 0 1
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 4 4 0 1
              0 "No" "y12q1" 3 3 0 1
              0 "No" "y12q1" 3 3 0 1
              end
              format %tq newid3
              label values DAYSILL DAYSILL
              label values SOC10M SOC10M0
              label def SOC10M0 1 "2211 Medical practitioners", modify
              label def SOC10M0 2 "2229 Therapy professionals nec", modify
              label def SOC10M0 3 "2231  Nurses", modify
              label def SOC10M0 4 "2314 Secondary education teaching professionals", modify
              I used egen on SOC10, ILLWK and YEARQ to create newid, newid2 and newid3


              • #8
                Jamie:
                two general comments about your -dataex- excerpt (which makes things a bit clearer):
                - you do not actually have a -panelid- (or I cannot find it at a glance);
                - assuming that -newid3- is your -timevar-, you have a repeated-time-values-within-panel problem that can easily be avoided by -xtset-ting your dataset with -panelid- only. The usual warning is that the trick will work as long as you do not plan to use time-series features, such as lags and leads.
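A minimal sandbox sketch of that trick follows. It assumes a person identifier exists: -panelid- here is hypothetical and simulated, as are all the values.

```stata
* Sandbox with simulated data; panelid is a hypothetical person identifier
clear
set seed 2020
set obs 200
* 50 people with 4 records each
gen long panelid = ceil(_n / 4)
gen byte professional_cluster = 1 + mod(panelid, 4)
* only two distinct quarters per person, so time values repeat within panel
gen int quarter = yq(2012, 1) + mod(_n, 2)
format quarter %tq
gen byte DAYSILL = rbinomial(7, 0.05)

* Fails: repeated time values within panel
capture noisily xtset panelid quarter
scalar rc_with_time = _rc

* Works: declare the panel variable only
xtset panelid
xtreg DAYSILL i.professional_cluster
```

Without a time variable declared, time-series operators such as L. and F. are not available, which is the usual warning mentioned above.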
                Kind regards,
                Carlo
                (StataNow 19.0)


                • #9
                  Maybe I am missing something but would newid3 not work as a panelid as it is in byte format?


                  • #10
                    newid3 cannot be a panel identifier. Its being a numeric variable -- with storage type (not format) byte -- is necessary but far from sufficient. The problem is that you tell us that

                    I used egen on SOC10, ILLWK and YEARQ to create newid, newid2 and newid3
                    so that means that all the observations for each quarter (and in some versions of your identifier those that have identical responses) map to the same identifier. Panel identifiers should be given in advance as part of the data, not defined with reference to equal outcomes and/or dates.
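A quick way to see this for yourself is sketched below, using a few rows in the shape of the -dataex- excerpt. Using -isid- as the check is my suggestion here, not something used earlier in the thread.

```stata
* A few rows in the shape of the -dataex- excerpt above
clear
input byte DAYSILL str5 YEARQ long SOC10M
0 "y12q1" 3
0 "y12q1" 4
0 "y12q1" 4
0 "y12q1" 1
end

* Recreate an identifier built from the date alone
egen newid3 = group(YEARQ)

* -isid- errors out if the variable does not uniquely identify observations
capture noisily isid newid3
scalar rc_isid = _rc

* Every observation in the quarter shares the same value
tabulate newid3
```

A nonzero return code from -isid- only shows that the variable does not distinguish observations. Even a variable that passes -isid- is not thereby a person identifier.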

                    More positively, #7 indicates that newid3, being based on YEARQ, is not a panel identifier but a time variable. Fortunately, string values like y12q1 and y17q4 sort in the right order and so are mapped by egen, group() to values 1 up. However, if not all quarters are represented in the dataset, the time variable created will not respect gaps.
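The point about gaps can be illustrated with a small made-up example in which 2012q3 is deliberately absent. The names t_group and t_q are just for this sketch.

```stata
* Made-up example: 2012q3 is deliberately missing
clear
input str5 YEARQ
"y12q1"
"y12q2"
"y12q4"
end

* egen, group() numbers the distinct values 1, 2, 3 and so hides the gap
egen t_group = group(YEARQ)

* a genuine quarterly date keeps the gap between q2 and q4
gen t_q = yq(2000 + real(substr(YEARQ, 2, 2)), real(substr(YEARQ, -1, 1)))
format t_q %tq

list
```

In the grouped variable q2 and q4 are one step apart, while in the quarterly date they are two steps apart, which matters for anything that depends on spacing in time.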

                    As far as I can tell from the variables you show us, each observation is for one person, as no variable otherwise identifies a person. Naturally you may have an identifier you're not showing us, but if so nothing in this thread so far makes that clear. Or, perhaps identifiers were suppressed for good reason, in which case you have no obvious scope for discovering one yourself.

                    Here is how to get a proper Stata quarterly date out of your date variable. The code could be shortened considerably with some risk of making it harder to follow.

                    Code:
                    * just to make a sandbox
                    clear
                    set obs 1
                    gen YEARQ = "y12q1"
                    
                    * start here
                    gen year = real(substr(YEARQ, 2, 2))
                    gen quarter = real(substr(YEARQ, -1, 1))
                    gen qdate = yq(2000 + year, quarter)
                    format qdate %tq
                    
                    list
                    
                         +---------------------------------+
                         | YEARQ   year   quarter    qdate |
                         |---------------------------------|
                      1. | y12q1     12         1   2012q1 |
                         +---------------------------------+
                    As your response is bounded between 0 and 7 that seems to me to call for a generalised linear model with binomial family, except we need to know about possible dependence between observations.
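A minimal sketch of such a model on simulated data follows. The values are made up, observations are treated as independent (which is exactly the open question), and the robust standard errors are my assumption rather than a recommendation from the thread.

```stata
* Sandbox with simulated data: names follow the thread, values are made up
clear
set seed 42
set obs 1000
gen byte SOC10M = 1 + mod(_n, 4)
* days ill out of 7, with a slightly different probability per occupation
gen byte DAYSILL = rbinomial(7, 0.03 + 0.01*SOC10M)

* Binomial GLM with denominator 7 and logit link
glm DAYSILL i.SOC10M, family(binomial 7) link(logit) vce(robust)

* Average predicted response by occupation
margins SOC10M
```

Other predictors such as sex, age range and region can be added to the right-hand side in the usual way.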
                    Last edited by Nick Cox; 28 Feb 2020, 13:22.


                    • #11
                      Hi Nick,

                      Thanks for your advice; I will look into running glm. I ran the code; however, I am still told that I have repeated time values within panel when I run xtset qdate DAYSILL.

                      And just to confirm, there is no variable identifying individual observations in the dataset.


                      • #12
                        Again, qdate cannot possibly be a panel identifier, as it is a time variable. Also, DAYSILL is not a time variable.

                        As the help for xtset explains, you can specify

                        xtset panelid

                        or


                        xtset panelid timevar

                        But absent a panel identifier neither is possible in your case.
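To see why the command in #11 complains, here is a tiny sandbox with made-up values. The first variable given to xtset is read as the panel identifier and the second as the time variable.

```stata
* Sandbox: six observations from the same quarter, all with DAYSILL == 0
clear
set obs 6
gen qdate = yq(2012, 1)
format qdate %tq
gen byte DAYSILL = 0

* qdate is taken as the panel variable and DAYSILL as the time variable,
* so every "panel" contains the same "time" value many times over
capture noisily xtset qdate DAYSILL
scalar rc_xtset = _rc
display "xtset return code: " rc_xtset
```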
