
  • Estimation Technique for Regression Equation with Continuous Dependent Variable and Categorical Independent Variables

    Hi everyone! I urgently need your help. I have a dataset of companies from the UK. For each company, I have the difference in hourly pay between men and women, expressed as a percentage, which is my dependent variable. I also have three independent variables, all of them categorical. The first is industry category, coded from 1 to 22; the second is company size, coded from 1 to 6; and the third is the region where the company is located, coded from 1 to 11. I want to study how these three independent variables influence the gender wage gap (the dependent variable). My professor says that I must index the categorical variables in Stata. How do I do that? Thank you in advance for your time.

  • #2
    First some general advice: If your professor says something you don't understand, then you need to tell your professor that you don't understand. Professors do not (in general) have the capability to read minds. So as long as you remain quiet they will assume you understand. Most professors will want to help you learn, but they can only do so if you interact with them.

    Since you did not tell us the variable names, I will make something up: the gender wage gap will be in a variable called gwg, industry will be in a variable called indus, size in a variable called size and region in a variable called region. In that case you would type:

    reg gwg i.indus i.size i.region

    For more information see help regress and help fvvarlist. Also, don't forget the pdf manuals: they contain more detailed information than the help files.
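To make the role of the i. prefix concrete, here is a minimal sketch using Stata's shipped auto dataset rather than the poster's data (the variables price and rep78 stand in for gwg and the categorical predictors, and the names rep_, b_fv and b_manual are made up for illustration). It suggests that "indexing" a categorical variable with i. amounts to entering one indicator (dummy) variable per category, with one category left out as the base.

```stata
* Stata's shipped example data; rep78 is a 5-level categorical variable
sysuse auto, clear

* Factor-variable notation: Stata builds the indicators on the fly and
* omits the lowest level (rep78 == 1) as the base category
regress price i.rep78
scalar b_fv = _b[2.rep78]

* The same model with hand-made indicator (dummy) variables rep_1 ... rep_5
tabulate rep78, generate(rep_)
regress price rep_2 rep_3 rep_4 rep_5
scalar b_manual = _b[rep_2]

* Both approaches should give the same coefficient for level 2
display b_fv "  " b_manual

* A different base category can be requested with ib#., here level 3
regress price ib3.rep78
```

With the factor-variable notation there is no need to create or keep track of the dummy variables yourself, which is why the one-line command above is usually preferred.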
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      You are right, but he already got angry at me because I ran the regression without indexing the categorical variables the first time, and I didn't want to worsen his opinion of me. I know how to run the regression, but I don't know how to construct the indexes for the three categorical variables.



      • #4
        Can someone please help me construct the indexes for these categorical variables in Stata?



        • #5
          Did you see the i. in front of the independent variables?
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            Deborah:
            as recommended by the FAQ, an example/excerpt of your dataset (shared via -dataex-) can help. Thanks.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              Thank you! Is the procedure the same for ordinal variables like the size of company, and for nominal variables like the type of industry and the region?



              • #8
                Thank you in advance for your time to anyone who reads this. I have attached the file with only the variables that I have for the regression. The dependent variable is a continuous variable representing the gender wage gap, expressed as a percentage, for each company. The independent variables are an ordinal variable, company size, divided into categories from 1 to 6, and two nominal variables: the region, represented by codes from 1 to 11, and the industry sector, represented by codes from 1 to 20.
                Attached Files



                • #9
                  Deborah:
                  I beg your pardon for being pedantic, but nobody on this list would download electronic spreadsheets coming from unknown sources (due to risk of active contents).
                  You are encouraged to use -dataex- or, at least, a .dta format file. Thanks.
                  Kind regards,
                  Carlo
                  (Stata 19.0)
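For readers unfamiliar with -dataex-, a minimal sketch of its use follows. It runs on Stata's shipped auto dataset, so the variable names are stand-ins and not those of the poster's file; the assumption is a recent Stata release in which -dataex- is already installed (on older releases it may need to be installed from SSC first).

```stata
* Load example data shipped with Stata
sysuse auto, clear

* Print the first 10 observations of three variables as plain text
* that can be pasted directly into a forum post
dataex price rep78 foreign in 1/10
```

The output is a block of text that others can copy into their own Stata session to recreate the excerpt, including storage types and value labels, without downloading any attachment.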



                  • #10
                    I apologize. I didn't know it.
                    Attached Files



                    • #11
                      Originally posted by Deborah Cipi
                      Thank you! Is the procedure the same for ordinal variables like the size of company, and for nominal variables like the type of industry and the region?
                      Yes
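As a hedged illustration of what "the same procedure" means in practice, the sketch below again uses Stata's auto dataset as a stand-in: rep78 (a 1 to 5 repair rating) plays the role of an ordinal variable such as company size, and foreign plays the role of a nominal variable. The scalar name df_rep is made up for illustration.

```stata
sysuse auto, clear

* Both predictors get the i. prefix, whether ordinal or nominal;
* each level gets its own coefficient relative to the base level
regress price i.rep78 i.foreign

* Joint test that all rep78 indicators are zero
testparm i.rep78
scalar df_rep = r(df)

* For comparison only: c. treats rep78 as continuous, imposing a single
* slope, i.e. equal steps in price between adjacent rating levels
regress price c.rep78 i.foreign
```

With i., the ordering of the categories is simply not used, so an ordinal variable is handled exactly like a nominal one. The c. version is a different, more restrictive model and is shown only to mark the contrast.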

                      ---------------------------------
                      Maarten L. Buis
                      University of Konstanz
                      Department of history and sociology
                      box 40
                      78457 Konstanz
                      Germany
                      http://www.maartenbuis.nl
                      ---------------------------------



                      • #12
                        In general, I would recommend that you read the advice you receive more closely. Maarten Buis advised you to look at Stata's help by using help regress and help fvvarlist. The help in Stata (not an acronym!) is excellent, and you should also look at the PDF manuals, which are accessible from the help files. Had you done so, I assume that you wouldn't have asked your follow-up question in #3. Additionally, Carlo Lazzaro advised you to follow the recommendations in the Forum's FAQ. Had you done so and read the FAQ (e.g. #0), you wouldn't have attached the file as in #8 and again in #10. Finally, I recommend that you use help command for each command you don't understand completely.

                        I am writing this not because I am pedantic (which may be) but because I believe that closely following the advice you receive, and doing your share of the work by reading what is suggested, will help you more than the immediate advice you get by asking the Forum for help.
                        Last edited by Dirk Enzmann; 15 May 2024, 04:28.



                        • #13
                          Originally posted by Maarten Buis

                          Yes
                          So the ordinal variables like the company size, and the nominal variables like industry category and region, are not indexed in different ways in this regression?



                          • #14
                            Deborah:
                            you should make yourself more familiar with the Statalist rules.
                            That said, elaborating on your .dta file (please, learn how to share a data example/excerpt via -dataex-, with no risk of downloading active contents on the repliers' side), you can go as follows:
                            Code:
                            . gen ln_DiffMeanHourlyPercent=ln( DiffMeanHourlyPercent)
                            
                            . regress ln_DiffMeanHourlyPercent i.RegionCode i.IndustrySectorCode i.EmployerSizecode, robust
                            
                            Linear regression                               Number of obs     =      6,256
                                                                            F(34, 6221)       =      26.05
                                                                            Prob > F          =     0.0000
                                                                            R-squared         =     0.1204
                                                                            Root MSE          =     .92171
                            
                            ------------------------------------------------------------------------------------
                                               |               Robust
                            ln_DiffMeanHourl~t | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                            -------------------+----------------------------------------------------------------
                                    RegionCode |
                                            2  |   .1719134   .0718272     2.39   0.017     .0311073    .3127196
                                            3  |   .2133954   .0610139     3.50   0.000      .093787    .3330037
                                            4  |   .0436778   .0749884     0.58   0.560    -.1033254    .1906811
                                            5  |   .1382207   .0667634     2.07   0.038     .0073414    .2690999
                                            6  |  -.0827402   .1919834    -0.43   0.667    -.4590941    .2936136
                                            7  |   .0892374    .078761     1.13   0.257    -.0651614    .2436362
                                            8  |   .2251707   .0653417     3.45   0.001     .0970784     .353263
                                            9  |   .1172461   .0738838     1.59   0.113    -.0275916    .2620838
                                           10  |   .0058935   .0905899     0.07   0.948     -.171694     .183481
                                           11  |   .0947933   .0704681     1.35   0.179    -.0433486    .2329352
                                               |
                            IndustrySectorCode |
                                            2  |   .3597039   .2165566     1.66   0.097    -.0648219    .7842297
                                            3  |  -.0439148   .1833722    -0.24   0.811    -.4033876     .315558
                                            4  |    .298753   .2023624     1.48   0.140    -.0979472    .6954532
                                            5  |  -.2177145    .235142    -0.93   0.355     -.678674     .243245
                                            6  |   .5294608   .1857151     2.85   0.004      .165395    .8935266
                                            7  |   .1806306   .1840547     0.98   0.326    -.1801801    .5414413
                                            8  |  -.1691013   .1900056    -0.89   0.374    -.5415779    .2033753
                                            9  |  -.6617052   .1918709    -3.45   0.001    -1.037838    -.285572
                                           10  |   .2723101   .1849734     1.47   0.141    -.0903016    .6349219
                                           11  |   .6965266   .1845571     3.77   0.000     .3347309    1.058322
                                           12  |   .3836583   .2072682     1.85   0.064    -.0226589    .7899755
                                           13  |   .3344334   .1844582     1.81   0.070    -.0271683    .6960351
                                           14  |   .0391356   .1850471     0.21   0.833    -.3236205    .4018917
                                           15  |  -.3197152   .2994744    -1.07   0.286    -.9067885     .267358
                                           16  |   .0697266   .1901965     0.37   0.714    -.3031243    .4425775
                                           17  |  -.3545895   .1944688    -1.82   0.068    -.7358154    .0266364
                                           18  |   .6443213   .2082458     3.09   0.002     .2360877    1.052555
                                           19  |   .0026552   .1959132     0.01   0.989    -.3814023    .3867127
                                           20  |   -.055708   .2865058    -0.19   0.846    -.6173583    .5059423
                                               |
                              EmployerSizecode |
                                            2  |  -.0706938   .0715729    -0.99   0.323    -.2110015    .0696138
                                            3  |  -.0769075   .0727363    -1.06   0.290    -.2194957    .0656808
                                            4  |  -.1846449   .0736806    -2.51   0.012    -.3290842   -.0402055
                                            5  |   -.218186    .090523    -2.41   0.016    -.3956423   -.0407296
                                            6  |  -.2457864   .1263792    -1.94   0.052    -.4935333    .0019605
                                               |
                                         _cons |   2.297391   .1989316    11.55   0.000     1.907416    2.687366
                            ------------------------------------------------------------------------------------
                            
                            . estat ovtest
                            
                            Ramsey RESET test for omitted variables
                            Omitted: Powers of fitted values of ln_DiffMeanHourlyPercent
                            
                            H0: Model has no omitted variables
                            
                            F(3, 6218) =   2.35
                              Prob > F = 0.0706
                            
                            . linktest
                            
                                  Source |       SS           df       MS      Number of obs   =     6,256
                            -------------+----------------------------------   F(2, 6253)      =    428.27
                                   Model |  723.908211         2  361.954106   Prob > F        =    0.0000
                                Residual |  5284.73535     6,253  .845151984   R-squared       =    0.1205
                            -------------+----------------------------------   Adj R-squared   =    0.1202
                                   Total |  6008.64357     6,255  .960614479   Root MSE        =    .91932
                            
                            ------------------------------------------------------------------------------
                            ln_DiffMea~t | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                            -------------+----------------------------------------------------------------
                                    _hat |   1.210942   .3420908     3.54   0.000      .540327    1.881558
                                  _hatsq |  -.0437125   .0705351    -0.62   0.535    -.1819856    .0945605
                                   _cons |  -.2493378   .4111786    -0.61   0.544    -1.055389    .5567135
                            ------------------------------------------------------------------------------
                            Once ln-transformed, while the number of your observations drops (values <=0 cannot be logged), your regression looks, technically speaking, fine (heteroskedasticity was accounted for via -robust- standard errors).
                            That said, please note that:
                            - you now have a log-linear regression (see the interpretation of coefficients in any decent econometrics textbook);
                            - your R-sq is not that high. This might be due to the lack of non-categorical predictors in the right-hand side of your regression equation;
                            - make yourself familiar with the -estat hettest-, -estat ovtest- and -linktest- postestimation commands by reading the related Stata .pdf manual entries.
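On the interpretation point, a small worked sketch may help. It takes the coefficient on RegionCode 2 from the output above and applies the usual textbook conversion for an indicator variable in a regression with a logged outcome, 100*(exp(b) - 1). The scalar names are made up for illustration, and the conversion is a standard approximation, not something computed in the original post.

```stata
* Coefficient on RegionCode 2, copied from the regression output above
scalar b_region2 = .1719134

* Approximate percent difference in the (unlogged) outcome between
* region 2 and the base region 1, holding the other predictors fixed
scalar pct_region2 = 100*(exp(b_region2) - 1)

display %5.1f pct_region2
```

This comes to roughly 18.8. Because the outcome is itself a pay gap expressed in percent, the figure describes a relative difference in the size of the gap (for example, a gap of 10 percent versus one of about 11.9 percent), not a difference in percentage points.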
                            Kind regards,
                            Carlo
                            (Stata 19.0)



                            • #15
                              Originally posted by Carlo Lazzaro in #14
                              I can't thank you enough! God bless you!
