Help deciding what tests to use

Jamie Webster

Join Date: Feb 2020

Posts: 16
#1

Help deciding what tests to use

27 Feb 2020, 15:43

Hi all,
I am comparing two variables, ILLWK (whether or not a person was ill in a reference week) and DAYSILL (how many days they were ill, if so), across 4 different occupations. ILLWK takes the value 1 if the person was ill in the week and 0 otherwise. DAYSILL can take any integer from 0 to 7. I have over 48,000 observations, as I am using quarterly data from 2012 to 2017. The majority of responses for both variables are 0, but around 1,000 observations report being ill and take a value between 1 and 7. I was planning on running an ANOVA test; however, my data are not normally distributed and the variances are not equal. As I understand it, this rules out running a Kruskal-Wallis test and also a Welch test, leaving the Brown-Forsythe test. However, I have read that ANOVA is quite robust when you have thousands of observations, which leaves me in doubt about which tests to run.

Thanks for your help
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17672
#2

28 Feb 2020, 00:22

Jamie:
whatever -anova- can do, -regress- can do better.
In your case, -ILLWK- is not a predictor of the length of sick leave: if the person was not ill, she/he totaled zero days of sick leave. What happens in different professional clusters can be interesting, though. Moreover, you seem to have panel data: assuming that you can treat -DAYSILL- as a continuous regressand, you may want to try:

Code:

xtset panelid quarter
xtreg DAYSILL i.professional_cluster

I assume that you have other predictors or controls that you can plug into the right-hand side of your regression equation.
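
As a rough illustration of the -anova- versus -regress- point, a one-way comparison across occupations could be sketched as below. This is only a sketch under assumptions: it uses the occupation variable SOC10M from the -dataex- excerpt posted later in this thread, ignores any panel structure, and the choice of robust standard errors (as one way to relax the equal-variance assumption) is mine rather than something stated above.

```stata
* Sketch only: SOC10M is the occupation variable from the -dataex- excerpt
* later in the thread; no panel structure is assumed here.

* mean days ill compared across the four occupations,
* with heteroskedasticity-robust standard errors
regress DAYSILL i.SOC10M, vce(robust)

* joint test that all occupation coefficients are zero
* (the question a one-way ANOVA would ask)
testparm i.SOC10M
```

Further predictors (sex, age range, region) would simply be added to the right-hand side of the -regress- command.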

Kind regards,
Carlo
(StataNow 19.0)
Jamie Webster

Join Date: Feb 2020

Posts: 16
#3

28 Feb 2020, 06:20

Hi Carlo,
Thanks for your reply. I have seen many people say this, and my plan was to run some panel regressions as well. I was probably going to ask in another thread, but whenever I try to set up my data as a panel I keep getting the following errors, as shown in the pictures.

I do have other predictors such as sex, age range, region of place of work etc.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17672
#4

28 Feb 2020, 06:59

Jamie:
what if you type:

Code:

xtset newid3 quarter

?

Kind regards,
Carlo
(StataNow 19.0)
Jamie Webster

Join Date: Feb 2020

Posts: 16
#5

28 Feb 2020, 07:18

This gives me error r(111), "variable quarter not found"
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17672
#6

28 Feb 2020, 07:43

Jamie:
can you please post an excerpt of your data via -dataex-? Thanks.

Kind regards,
Carlo
(StataNow 19.0)

Jamie Webster

Join Date: Feb 2020

Posts: 16
#7

28 Feb 2020, 08:51

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte DAYSILL str14 ILLWK str5 YEARQ long SOC10M float(newid newid2 newid3)
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 4 4 0 1
0 "No" "y12q1" 4 4 0 1
0 "No" "y12q1" 4 4 0 1
0 "No" "y12q1" 1 1 0 1
0 "No" "y12q1" 1 1 0 1
0 "No" "y12q1" 4 4 0 1
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 4 4 0 1
0 "No" "y12q1" 1 1 0 1
0 "No" "y12q1" 4 4 0 1
0 "No" "y12q1" 4 4 0 1
0 "No" "y12q1" 1 1 0 1
0 "No" "y12q1" 1 1 0 1
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 1 1 0 1
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 4 4 0 1
0 "No" "y12q1" 3 3 0 1
0 "No" "y12q1" 3 3 0 1
end
format %tq newid3
label values DAYSILL DAYSILL
label values SOC10M SOC10M0
label def SOC10M0 1 "2211 Medical practitioners", modify
label def SOC10M0 2 "2229 Therapy professionals nec", modify
label def SOC10M0 3 "2231  Nurses", modify
label def SOC10M0 4 "2314 Secondary education teaching professionals", modify

I used egen on SOC10, ILLWK and YEARQ to create newid, newid2 and newid3


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17672
#8

28 Feb 2020, 09:31

Jamie:
two general comments about your -dataex- excerpt (which makes things a bit clearer):
- you actually do not have a -panelid- (or I cannot find it at a glance);
- assuming that -newid3- is your -timevar-, you have a repeated time values within panel problem that can easily be avoided by -xtset-ting your dataset with -panelid- only. The usual warning is that the trick will work as long as you do not plan to use time-series related commands, such as lags and leads.

Kind regards,
Carlo
(StataNow 19.0)
Jamie Webster

Join Date: Feb 2020

Posts: 16
#9

28 Feb 2020, 12:28

Maybe I am missing something but would newid3 not work as a panelid as it is in byte format?
Nick Cox

Join Date: Mar 2014

Posts: 35418
#10

28 Feb 2020, 13:11

newid3 cannot be a panel identifier. Its being a numeric variable -- with storage type (not format) byte -- is necessary but far from sufficient. The problem is that you tell us that

I used egen on SOC10, ILLWK and YEARQ to create newid, newid2 and newid3

so that means that all the observations for each quarter (and in some versions of your identifier those that have identical responses) map to the same identifier. Panel identifiers should be given in advance as part of the data, not defined with reference to equal outcomes and/or dates.

More positively, #3 indicates that newid3 being based on YEARQ is not a panel identifier but a time variable, it being fortunate that string values like y12q1 and y17q4 will sort in the right order and so be mapped by egen, group() to values 1 up. However, if all quarters are not represented in the dataset the time variable created will not respect gaps.

As far as I can tell from the variables you show us, each observation is for one person, as no variable otherwise identifies a person. Naturally you may have an identifier you're not showing us, but if so nothing in this thread so far makes that clear. Or, perhaps identifiers were suppressed for good reason, in which case you have no obvious scope for discovering one yourself.

Here is how to get a proper Stata quarterly date out of your date variable. The code could be shortened considerably with some risk of making it harder to follow.

Code:

* just to make a sandbox
clear
set obs 1
gen YEARQ = "y12q1"

* start here
gen year = real(substr(YEARQ, 2, 2))
gen quarter = real(substr(YEARQ, -1, 1))
gen qdate = yq(2000 + year, quarter)
format qdate %tq

list

     +---------------------------------+
     | YEARQ   year   quarter    qdate |
     |---------------------------------|
  1. | y12q1     12         1   2012q1 |
     +---------------------------------+
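
The earlier caveat that a time variable built with egen, group() will not respect gaps can be seen in a small sandbox. This is only an illustrative sketch: the three quarters below are made-up values (2012q3 is deliberately left out), and the variable name t is hypothetical.

```stata
* Sketch only: made-up sandbox data with 2012q3 deliberately missing
clear
input str5 YEARQ
"y12q1"
"y12q2"
"y12q4"
end

* group() numbers the distinct values 1, 2, 3, so the missing quarter vanishes
egen t = group(YEARQ)

* a proper quarterly date keeps the hole between 2012q2 and 2012q4
gen qdate = yq(2000 + real(substr(YEARQ, 2, 2)), real(substr(YEARQ, -1, 1)))
format qdate %tq

list
```

With t, the step from the second to the third observation looks like one quarter; with qdate it is correctly two quarters, which matters for anything using lags or leads.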

As your response is bounded between 0 and 7 that seems to me to call for a generalised linear model with binomial family, except we need to know about possible dependence between observations.
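
A minimal sketch of such a model follows, assuming DAYSILL is treated as the number of days ill out of a possible 7 and SOC10M (from the -dataex- excerpt) is the occupation variable. The logit link and the robust standard errors are my assumptions; robust standard errors do not by themselves deal with dependence between observations, and clustering would need an identifier that the posted data do not contain.

```stata
* Sketch only: DAYSILL is modelled as days ill out of 7 trials
* family(binomial 7) sets the binomial denominator to 7 for every observation
glm DAYSILL i.SOC10M, family(binomial 7) link(logit) vce(robust)

* joint test of the occupation terms
testparm i.SOC10M
```

Other predictors such as sex, age range and region could be added after i.SOC10M in the usual way.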

Last edited by Nick Cox; 28 Feb 2020, 13:22.
Jamie Webster

Join Date: Feb 2020

Posts: 16
#11

28 Feb 2020, 14:43

Hi Nick,

Thanks for your advice; I will look into running glm. I ran the code, but I am still told I have repeated time values within panel when I run xtset qdate DAYSILL.

And just to confirm, there is no variable identifying individual observations in the dataset.
Nick Cox

Join Date: Mar 2014

Posts: 35418
#12

28 Feb 2020, 15:01

Again, qdate cannot possibly be a panel identifier, as it is a time variable. Also, DAYSILL is not a time variable.

As the help for xtset explains, you can specify

xtset panelid

or

xtset panelid timevar

But absent a panel identifier neither is possible in your case.
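
For readers who do have a candidate identifier, one way to check it before calling xtset might look like the sketch below. Note the assumptions: panelid is a hypothetical person identifier that does not exist in the data posted in this thread, and qdate is the quarterly date built in #10.

```stata
* Sketch only: panelid is hypothetical; the posted data have no person identifier

* -isid- exits with an error unless each panelid-qdate pair is unique,
* which is what -xtset panelid qdate- requires
isid panelid qdate

* shows how many observations share each value of a candidate identifier;
* in the posted excerpt every observation has newid3 equal to 1
duplicates report newid3
```

If -isid- fails, xtset with both variables will report repeated time values within panel, as happened earlier in the thread.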