  • Discrete explanatory variable

    Hello,

    I'm running a fractional response model. One of the explanatory variables of women's status is discrete in nature. Its values lie in the interval [0,6], and higher values imply a higher status. Can I treat it as a continuous variable?

  • #2
    Maybe. If you do treat it as a continuous variable, you will be constraining the model to the following assumptions: the difference between outcomes associated with, say, values 3 and 4, is the same as the difference between outcomes associated with values 1 and 2, or 2 and 3, or 4 and 5, or 5 and 6. Same here means same direction and same size. If this is true, or close to true, then treating it as a continuous variable is fine. One way to check this is to first use it as a discrete variable and then examine the coefficients you get for each level. If each coefficient differs from the preceding one by roughly the same amount, then treating the variable as continuous will be fine. At the other extreme, if going from one level to the next sometimes takes the outcome in different directions, you would be ill advised to treat the variable as continuous.
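
    As a minimal sketch of that check (assuming Stata 14+'s -fracreg-, and hypothetical names: y for the fractional outcome, wstatus for the 0-6 regressor), the two specifications might look like:

    Code:
     fracreg probit y i.wstatus    // one coefficient per level
     * if successive level coefficients differ by roughly equal amounts,
     * the linear-in-levels version below is defensible
     fracreg probit y c.wstatus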

    Comment


    • #3
      Sure, I'll check that. But I have four such variables: two of them take values from 0-6, one from 0-14, and the last one from 0-3. If I end up treating them as discrete (based on comparing the discrete and continuous specifications), there will be a great many categories and the model might not be parsimonious. What should I do?

      Comment


      • #4
        Well, first, parsimony is nice, but you shouldn't trade accuracy for parsimony. After all, the ultimate in parsimony is a model with only a constant term. It's the analog of the stopped clock that is right twice a day.

        The four variables you describe add up to 29 degrees of freedom, assuming you do not include interaction terms. There are various rules of thumb about how many observations you need per degree of freedom. The most lenient such rule I know of is 30. So if you had a little under 900 observations to fit the model with, you'd be OK. If your data set is smaller than that, and you can't get more data, and if the coefficients don't look good for treating any of them as continuous, you could consider things like combining categories within the variables. Maybe you don't lose much predictive accuracy by reducing that 0-14 variable to just four or five categories. Maybe you can get those 0-6 categories down to 0-3 or something like that.
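
        For example (a sketch only; mobility is a hypothetical name for the 0-14 variable), categories could be combined with -recode-:

        Code:
         recode mobility (0/2=0) (3/5=1) (6/8=2) (9/11=3) (12/14=4), gen(mobility5)
         tabulate mobility mobility5    // check the mapping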

        Also, try graphing the outcome variable against each of your discrete variables. You might see a non-linear relationship that you can nevertheless describe functionally with fewer degrees of freedom. For example, maybe the relationship between outcome and 0-6 fails the continuous variable trial, but on graphical exploration it looks U-shaped. Then maybe using the variable and its square will work.
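
        A quadratic specification along those lines might look like (again a sketch, with hypothetical names y and wstatus):

        Code:
         fracreg probit y c.wstatus##c.wstatus
         margins, at(wstatus=(0(1)6))    // trace the fitted curve over the levels
         marginsplot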

        At the end of the day, building a model is as much art as it is science. You have to try things out. Of course, there is always the risk that in your trials you will stumble upon a solution that beautifully fits the random noise in your data, only to have the model fail spectacularly when you try it on another data set. There's always that risk.

        Comment


        • #5
          Thank you for the insightful points, Clyde. I'll definitely work on them.

          Comment


          • #6
            Hello, Clyde. I tried graphing my outcome of interest against each of my discrete explanatory variables (2 of them attached below), but I'm not able to figure out anything meaningful from the graphs. I also combined categories within variables and obtained the marginal effects after running a fractional response model as follows:

            Code:
             margins, dydx(dm fm av fr)
            
            Average marginal effects                        Number of obs     =     48,202
            Model VCE    : Robust
            
            Expression   : Conditional mean of c_vector, predict()
            dy/dx w.r.t. : 2.dm 3.dm 2.fm 3.fm 2.av 3.av 4.av 5.av 2.fr
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                      dm |
                    2-3  |  -.0071423    .002056    -3.47   0.001     -.011172   -.0031127
                    4-6  |  -.0005733   .0029887    -0.19   0.848    -.0064311    .0052844
                         |
                      fm |
                    2-3  |  -.0050407   .0033678    -1.50   0.134    -.0116414      .00156
                    4-6  |  -.0076704   .0033267    -2.31   0.021    -.0141907   -.0011501
                         |
                      av |
                    3-5  |  -.0059851   .0043137    -1.39   0.165    -.0144398    .0024696
                    6-8  |  -.0093236   .0032793    -2.84   0.004     -.015751   -.0028962
                   9-11  |  -.0111638   .0037065    -3.01   0.003    -.0184284   -.0038993
                  12-14  |  -.0202523   .0030152    -6.72   0.000     -.026162   -.0143427
                         |
                      fr |
                    2-3  |  -.0284811   .0016918   -16.83   0.000     -.031797   -.0251651
            ------------------------------------------------------------------------------
            Note: dy/dx for factor levels is the discrete change from the base level.
            Can you please suggest something based on these results? Also, how do I interpret them, given that my dependent variable lies in the interval [0,1]?
            Attached Files

            Comment


            • #7
              The problem with the graphs is over-plotting: it is impossible to know whether any individual marker represents one observation or many.

              You'd get a better picture (literally) with (e.g.)

              Code:
              scatter c_vector attitude_violence, ms(Oh) jitter(2)
              where 2 is just a suggestion to be tuned.

              Beyond that, the conditional means can be seen through (e.g.)

              Code:
              tabstat c_vector, by(attitude_violence) 
              
              egen mean_viol = mean(c_vector), by(attitude_violence)
              egen tag = tag(attitude_violence) 
              line mean_viol attitude_violence if tag, sort
              It's not obvious to me why you would want to coarsen these predictor variables. That would just be throwing away information that might be helpful, and doing so arbitrarily.

              Comment


              • #8
                Hello, Nick.

                I followed the code for the scatter; it graphs the data well, but I'm still confused about the interpretation. I also obtained the graph for the conditional means, and I am not able to figure out a clear-cut pattern. I'm really confused as to how I should treat the variables in the regression analysis.

                Attached Files

                Comment


                • #9
                  Looks like there is no pattern.
                  ---------------------------------
                  Maarten L. Buis
                  University of Konstanz
                  Department of history and sociology
                  box 40
                  78457 Konstanz
                  Germany
                  http://www.maartenbuis.nl
                  ---------------------------------

                  Comment


                  • #10
                    Hello, Maarten. I'm getting some patterns for the other three variables, that is, for decision-making, freedom of mobility, and access to financial resources.
                    Attached Files

                    Comment
