Estimation Technique for Regression Equation with Continuous Dependent Variable and Categorical independent variables

Deborah Cipi

Join Date: May 2024

Posts: 10
#1


15 May 2024, 00:54

Hi everyone! I urgently need your help. I have a dataset of companies from the UK. For each company, I have the difference in hourly pay between men and women, expressed as a percentage, which is my dependent variable. I also have three independent variables, all of which are categorical. The first is industry category, coded from 1 to 22; the second is company size, coded from 1 to 6; and the third is the region where the company is located, coded from 1 to 11. I want to study how these three independent variables influence the gender wage gap (the dependent variable). My professor says that I must index the categorical variables in Stata. How do I do that? Thank you in advance for your time.
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#2

15 May 2024, 01:28

First some general advice: if your professor says something you don't understand, then you need to tell your professor that you don't understand. Professors do not (in general) have the capability to read minds, so as long as you remain quiet they will assume you understand. Most professors will want to help you learn, but they can only do so if you interact with them.

Since you did not tell us the variable names, I will make something up: the gender wage gap will be in a variable called gwg, industry will be in a variable called indus, size in a variable called size and region in a variable called region. In that case you would type:

reg gwg i.indus i.size i.region

For more information see help regress and help fvvarlist. Also, don't forget the PDF manuals: they contain more detailed information than the help files.
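
As a minimal sketch of what the i. prefix does, the commands below use the auto dataset that ships with Stata instead of the poster's data, so price, rep78 and foreign are stand-ins chosen for illustration only. Each i.variable enters the model as a set of indicator (dummy) variables, with one category left out as the base level.

```stata
* Illustration only: auto.dta ships with Stata; it is not the poster's data
sysuse auto, clear

* i.rep78 and i.foreign enter as sets of indicator variables;
* by default the lowest category of each variable is the omitted base level
regress price i.rep78 i.foreign

* ib3.rep78 makes category 3 of rep78 the base level instead
regress price ib3.rep78 i.foreign

* joint test that all rep78 indicators are zero
testparm i.rep78
```

With the made-up names from above, the same pattern would read reg gwg i.indus i.size i.region, and the coefficients are differences from each variable's base category.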

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Deborah Cipi

Join Date: May 2024

Posts: 10
#3

15 May 2024, 01:39

You are right, but he already got angry at me because I ran the regression without indexing the categorical variables the first time, and I didn't want to worsen his opinion of me. I know how to run the regression, but I don't know how to construct the indexes for the three categorical variables.
Deborah Cipi

Join Date: May 2024

Posts: 10
#4

15 May 2024, 01:54

Can someone please help me construct the indexes for these categorical variables in Stata?
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#5

15 May 2024, 01:56

Did you see the i. in front of the independent variables?

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#6

15 May 2024, 02:13

Deborah:
as recommended by the FAQ, an example/excerpt of your dataset (shared via -dataex-) can help. Thanks.
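
A minimal sketch of such a data example, assuming a Stata version in which -dataex- is already installed (otherwise it can be obtained from SSC) and using the shipped auto dataset as a stand-in for your own data:

```stata
* Illustration only: auto.dta is a stand-in for the poster's dataset
sysuse auto, clear

* write a copy-and-paste data example for the first 10 observations
* of the listed variables to the Results window
dataex price rep78 foreign in 1/10
```

The resulting block can be pasted into a post between CODE delimiters, and repliers can run it to recreate the excerpt.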

Kind regards,
Carlo
(Stata 19.0)
Deborah Cipi

Join Date: May 2024

Posts: 10
#7

15 May 2024, 02:38

Thank you! Is the procedure the same for ordinal variables like the size of company, and for nominal variables like the type of industry and the region?
Deborah Cipi

Join Date: May 2024

Posts: 10
#8

15 May 2024, 02:49

Thank you in advance for your time to anyone who reads this. I have attached the file with only the variables that I have for the regression. The dependent variable is a continuous variable representing the gender wage gap, expressed as a percentage, for each company. The independent variables are an ordinal variable, company size, divided into categories from 1 to 6, and two nominal variables: the regions, represented by codes from 1 to 11, and the industry sector, represented by codes from 1 to 20.
Attached Files

index regression.xlsx (169.2 KB, 1 view)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#9

15 May 2024, 03:28

Deborah:
I beg your pardon for being pedantic, but nobody on this list would download spreadsheets coming from unknown sources (due to the risk of active content).
You are encouraged to use -dataex- or, at least, a .dta format file. Thanks.

Kind regards,
Carlo
(Stata 19.0)
Deborah Cipi

Join Date: May 2024

Posts: 10
#10

15 May 2024, 03:41

I apologize. I didn't know it.
Attached Files

index regression.dta (80.9 KB, 1 view)
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#11

15 May 2024, 04:03

Originally posted by Deborah Cipi

Thank you! Is the procedure the same for ordinal variables like the size of company, and for nominal variables like the type of industry and the region?

Yes
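
To make that concrete, here is a small sketch using the shipped auto dataset as a stand-in (rep78 is an ordered 1 to 5 rating and is not one of your variables). The i. prefix treats an ordinal variable exactly like a nominal one: one indicator per non-base category. The alternative of entering it as a continuous variable is shown only for contrast, since it imposes a single linear slope across categories.

```stata
* Illustration only: auto.dta is a stand-in for the poster's dataset
sysuse auto, clear

* ordinal variable entered with i.: one indicator per non-base category,
* with no ordering imposed on the coefficients
regress price i.rep78, baselevels

* same variable entered as continuous: a single slope per one-step increase
regress price c.rep78
```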

Dirk Enzmann

Join Date: Apr 2014

Posts: 523
#12

15 May 2024, 04:23

In general, I would recommend that you read the advice you receive more closely. Maarten Buis advised you to look at Stata's help by using help regress and help fvvarlist. The help in Stata (not an acronym!) is excellent, and you should also look at the PDF documentation, which is accessible from the help files. Had you done so, I assume that you wouldn't have asked your follow-up question in #3. Additionally, Carlo Lazzaro advised you to follow the recommendations in the Forum's FAQ. Had you done so and read the FAQ (e.g. #0), you wouldn't have attached the files as in #8 and #10. Finally, I recommend that you use help command for each command you don't understand completely.

I am writing this not because I am pedantic (which I may be) but because I believe that closely following the advice you receive, and doing your share of the work by reading what is suggested, will help you more than the immediate advice you get by asking the Forum for help.

Last edited by Dirk Enzmann; 15 May 2024, 04:28.
Deborah Cipi

Join Date: May 2024

Posts: 10
#13

15 May 2024, 04:33

Originally posted by Maarten Buis

Yes

So the ordinal variables like the company size, and the nominal variables like industry category and region, are not indexed in different ways in this regression?

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17673

#14

15 May 2024, 06:02

Deborah:
you should make yourself more familiar with the Statalist rules.
That said, building on your .dta file (please learn how to share a data example/excerpt via -dataex-, which carries no risk of repliers downloading active content), you can proceed as follows:

Code:

. gen ln_DiffMeanHourlyPercent=ln( DiffMeanHourlyPercent)

. regress ln_DiffMeanHourlyPercent i.RegionCode i.IndustrySectorCode i.EmployerSizecode, robust

Linear regression                               Number of obs     =      6,256
                                                F(34, 6221)       =      26.05
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1204
                                                Root MSE          =     .92171

------------------------------------------------------------------------------------
                   |               Robust
ln_DiffMeanHourl~t | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------------+----------------------------------------------------------------
        RegionCode |
                2  |   .1719134   .0718272     2.39   0.017     .0311073    .3127196
                3  |   .2133954   .0610139     3.50   0.000      .093787    .3330037
                4  |   .0436778   .0749884     0.58   0.560    -.1033254    .1906811
                5  |   .1382207   .0667634     2.07   0.038     .0073414    .2690999
                6  |  -.0827402   .1919834    -0.43   0.667    -.4590941    .2936136
                7  |   .0892374    .078761     1.13   0.257    -.0651614    .2436362
                8  |   .2251707   .0653417     3.45   0.001     .0970784     .353263
                9  |   .1172461   .0738838     1.59   0.113    -.0275916    .2620838
               10  |   .0058935   .0905899     0.07   0.948     -.171694     .183481
               11  |   .0947933   .0704681     1.35   0.179    -.0433486    .2329352
                   |
IndustrySectorCode |
                2  |   .3597039   .2165566     1.66   0.097    -.0648219    .7842297
                3  |  -.0439148   .1833722    -0.24   0.811    -.4033876     .315558
                4  |    .298753   .2023624     1.48   0.140    -.0979472    .6954532
                5  |  -.2177145    .235142    -0.93   0.355     -.678674     .243245
                6  |   .5294608   .1857151     2.85   0.004      .165395    .8935266
                7  |   .1806306   .1840547     0.98   0.326    -.1801801    .5414413
                8  |  -.1691013   .1900056    -0.89   0.374    -.5415779    .2033753
                9  |  -.6617052   .1918709    -3.45   0.001    -1.037838    -.285572
               10  |   .2723101   .1849734     1.47   0.141    -.0903016    .6349219
               11  |   .6965266   .1845571     3.77   0.000     .3347309    1.058322
               12  |   .3836583   .2072682     1.85   0.064    -.0226589    .7899755
               13  |   .3344334   .1844582     1.81   0.070    -.0271683    .6960351
               14  |   .0391356   .1850471     0.21   0.833    -.3236205    .4018917
               15  |  -.3197152   .2994744    -1.07   0.286    -.9067885     .267358
               16  |   .0697266   .1901965     0.37   0.714    -.3031243    .4425775
               17  |  -.3545895   .1944688    -1.82   0.068    -.7358154    .0266364
               18  |   .6443213   .2082458     3.09   0.002     .2360877    1.052555
               19  |   .0026552   .1959132     0.01   0.989    -.3814023    .3867127
               20  |   -.055708   .2865058    -0.19   0.846    -.6173583    .5059423
                   |
  EmployerSizecode |
                2  |  -.0706938   .0715729    -0.99   0.323    -.2110015    .0696138
                3  |  -.0769075   .0727363    -1.06   0.290    -.2194957    .0656808
                4  |  -.1846449   .0736806    -2.51   0.012    -.3290842   -.0402055
                5  |   -.218186    .090523    -2.41   0.016    -.3956423   -.0407296
                6  |  -.2457864   .1263792    -1.94   0.052    -.4935333    .0019605
                   |
             _cons |   2.297391   .1989316    11.55   0.000     1.907416    2.687366
------------------------------------------------------------------------------------

. estat ovtest

Ramsey RESET test for omitted variables
Omitted: Powers of fitted values of ln_DiffMeanHourlyPercent

H0: Model has no omitted variables

F(3, 6218) =   2.35
  Prob > F = 0.0706

. linktest

      Source |       SS           df       MS      Number of obs   =     6,256
-------------+----------------------------------   F(2, 6253)      =    428.27
       Model |  723.908211         2  361.954106   Prob > F        =    0.0000
    Residual |  5284.73535     6,253  .845151984   R-squared       =    0.1205
-------------+----------------------------------   Adj R-squared   =    0.1202
       Total |  6008.64357     6,255  .960614479   Root MSE        =    .91932

------------------------------------------------------------------------------
ln_DiffMea~t | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        _hat |   1.210942   .3420908     3.54   0.000      .540327    1.881558
      _hatsq |  -.0437125   .0705351    -0.62   0.535    -.1819856    .0945605
       _cons |  -.2493378   .4111786    -0.61   0.544    -1.055389    .5567135
------------------------------------------------------------------------------

Once the dependent variable is ln-transformed, the number of observations drops (values <=0 cannot be logged), but your regression looks, technically speaking, fine (heteroskedasticity was accounted for via -robust- standard errors).
That said, please note that:
- you now have a log-linear regression (see the interpretation of coefficients in any decent econometrics textbook);
- your R-sq is not that high. This might be due to the lack of non-categorical predictors on the right-hand side of your regression equation;
- make yourself familiar with the -estat hettest-, -estat ovtest- and -linktest- postestimation commands by reading the related entries in the Stata .pdf manuals.
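
A minimal sketch of that workflow, using the shipped auto dataset as a stand-in for your data (ln_price, rep78 and foreign are illustrative names only). In a log-linear model, the coefficient b on an indicator translates into an approximate percentage difference from the base category of 100*(exp(b)-1); for the RegionCode 2 coefficient above, that is roughly 19 percent.

```stata
* Illustration only: auto.dta is a stand-in for the poster's dataset
sysuse auto, clear
generate ln_price = ln(price)

* fit with conventional standard errors first and test for heteroskedasticity
regress ln_price i.rep78 i.foreign
estat hettest

* refit with robust standard errors
regress ln_price i.rep78 i.foreign, vce(robust)

* percentage difference between foreign and domestic cars implied by the model
display 100*(exp(_b[1.foreign]) - 1)

* same calculation for the RegionCode 2 coefficient reported above
display 100*(exp(.1719134) - 1)

* specification checks
estat ovtest
linktest
```

The display commands come before -linktest- on purpose, so that the stored coefficients they read are those of the regression of interest.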

Kind regards,
Carlo
(Stata 19.0)


Deborah Cipi

Join Date: May 2024
Posts: 10

#15

15 May 2024, 06:16

Originally posted by Carlo Lazzaro

I can't thank you enough! God bless you!
