
  • Create factor dummy variables from a long variable

    Hello, I have a variable called categories which is stored as a numeric long (it was a string variable and I used the command encode).
    What I want to do is create 4 category dummy variables:
    below30
    30to60
    60to90
    90above
    (with below30 as the reference category, of course).



    How can I do it?

  • #2
    Lola:
    do you mean something along the following lines?
    Code:
    . set obs 10
    number of observations (_N) was 0, now 10
    
    . g cat=1 in 1/3
    (7 missing values generated)
    
    . replace cat=2 in 4/6
    (3 real changes made)
    
    . replace cat=3 in 7/8
    (2 real changes made)
    
    . replace cat=4 if cat==.
    (2 real changes made)
    
    
    . tab cat, gen(cat_dummies)
    
            cat |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          3       30.00       30.00
              2 |          3       30.00       60.00
              3 |          2       20.00       80.00
              4 |          2       20.00      100.00
    ------------+-----------------------------------
          Total |         10      100.00
    
    . list
    
         +-------------------------------------------------+
         | cat   cat_du~1   cat_du~2   cat_du~3   cat_du~4 |
         |-------------------------------------------------|
      1. |   1          1          0          0          0 |
      2. |   1          1          0          0          0 |
      3. |   1          1          0          0          0 |
      4. |   2          0          1          0          0 |
      5. |   2          0          1          0          0 |
         |-------------------------------------------------|
      6. |   2          0          1          0          0 |
      7. |   3          0          0          1          0 |
      8. |   3          0          0          1          0 |
      9. |   4          0          0          0          1 |
     10. |   4          0          0          0          1 |
         +-------------------------------------------------+
    
    .
    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      Hi Carlo, thank you a lot for your reply!

      I don't really understand what you did, but
      I have a table like this in a CSV file:
      Country   gdpgrowth   debttogdpratio
      France    2.5         0-30%
      Uk        1.4         30-60%
      .....

      In my file debttogdpratio is saved as a string variable.

      So I want a regression along the lines of reg gdpgrowth 30to60 60to90 above90

      where
      30to60
      60to90
      above90

      are the categories for debttogdpratio,
      but I just don't know how to create these categories considering the nature of my variable debttogdpratio.




      • #4
        In #1 you showed a numeric variable called categories with four distinct values. So the implication of #2 from Carlo Lazzaro is that


        Code:
        tab categories, gen(categories)
        will generate 4 new indicator variables (you say dummy variables).

        Doing the same on the string variable debttogdpratio is a bad idea as this output shows. I see no (consistent) examples of this variable in posts to date but tabulate will use the alphanumeric sort order of the distinct strings, and it's likely that new variables will be named out of order:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str8 whatever
        "below 30"
        "30-60"  
        "60-90"  
        "above 90"
        end
        
        tab whatever, gen(whatever)
        
           whatever |      Freq.     Percent        Cum.
        ------------+-----------------------------------
              30-60 |          1       25.00       25.00
              60-90 |          1       25.00       50.00
           above 90 |          1       25.00       75.00
           below 30 |          1       25.00      100.00
        ------------+-----------------------------------
              Total |          4      100.00
        
        list
        
             +------------------------------------------------------+
             | whatever   whatev~1   whatev~2   whatev~3   whatev~4 |
             |------------------------------------------------------|
          1. | below 30          0          0          0          1 |
          2. |    30-60          1          0          0          0 |
          3. |    60-90          0          1          0          0 |
          4. | above 90          0          0          1          0 |
             +------------------------------------------------------+
        That said,

        1. Unless you have a very old version of Stata, in which case you should be telling us about it, use factor variable notation, not explicit indicator variables.

        2. Unless you have no such information, the precise debt/gdp ratio is better used as a single predictor. Why degrade it arbitrarily to four classes? (If need be, use its logarithm.) Also, the class boundaries are ambiguous. Which way would 60% jump?

        You'd get faster answers as you want them with

        * more careful proof-reading of your posts, which are messy (e.g. lacking in punctuation)

        * explicit data examples using dataex, as we request in the FAQ Advice.

        P.S. This is the 5th thread you've started since 7 November. You have yet to close any of the previous four threads.
        Last edited by Nick Cox; 12 Nov 2019, 03:37.
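
        Point 2 above can be sketched as follows. This is only an illustration under stated assumptions: the thread never shows a numeric debt/GDP ratio, so debtratio (in percent) and gdpgrowth are simulated here, and the variable names debtratio and debtclass are hypothetical. The sketch fits the continuous predictor first and then, if classes are still wanted, uses irecode() so that the treatment of boundary values is explicit.

        ```stata
        * Hypothetical, simulated data: not taken from the thread
        clear
        set seed 2019
        set obs 100
        generate debtratio = runiform(0, 150)
        generate gdpgrowth = 3 - 0.01*debtratio + rnormal()

        * Continuous predictor, as suggested in point 2
        regress gdpgrowth debtratio

        * If classes are still wanted, make the boundaries explicit.
        * irecode() returns 0 if x <= 30, 1 if 30 < x <= 60,
        * 2 if 60 < x <= 90, and 3 if x > 90,
        * so a ratio of exactly 60 lands in the 30-60 class here.
        generate byte debtclass = irecode(debtratio, 30, 60, 90)
        label define debtclass 0 "below 30" 1 "30-60" 2 "60-90" 3 "above 90"
        label values debtclass debtclass
        tabulate debtclass
        ```

        With these cut points the lowest class also contains a ratio of exactly 30; a different convention would need different cut points or a different function.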



        • #5
          Hi Nick.

          Thank you for your reply.

          Basically the only thing I am trying to do is do a regression with categories as I want to get the coefficient on each category. By that I mean I want to see how having a debt to GDP ratio between 0-30 affects GDP growth, how having a debt to GDP ratio between 30-60 affects GDP growth, and so on.
          I don't know what factor variable notation is.

          (I will close my old posts; I didn't know I had to do it.)



          • #6
            Basically the only thing I am trying to do is do a regression with categories as I want to get the coefficient on each category.
            Maybe you wish something like:
            Code:
            sysuse auto
            regress mpg weight i.rep78 foreign
            margins rep78
            marginsplot

            Hopefully that helps
            Best regards,

            Marcos



            • #7
              #5


              Basically the only thing I am trying to do is do a regression with categories as I want to get the coefficient on each category. By that I mean I want to see how having a debt to GDP ratio between 0-30 affects GDP growth, how having a debt to GDP ratio between 30-60 affects GDP growth, and so on.
              I don't know what factor variable notation is.
              That's evident. I am suggesting that there are better ways to use debt/GDP ratio as a predictor. It's not born categorical.

              Code:
              . search factor variable
              leads to documentation.


              (I will close my old posts; I didn't know I had to do it.)
              It's not compulsory, but it is a really good idea to tell people what worked (or didn't work) and to give thanks. Again, we do explain in the FAQ Advice:

              16.1 Close by giving a summary and thanks

              Trying to wrap up a thread you started is helpful, especially if you report what solved your problem. You can then thank those who tried to help. Conversely, ignoring answers is less sociable, even if those answers did not solve your problem. "Thanks in advance" does not absolve you from either expectation.

              Please note that a Like on a post is not publicly visible as coming from you and, while friendly, also does not absolve you from either expectation.
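
              For readers who have not met factor-variable notation, a minimal sketch follows. It assumes a numeric variable categories with codes 1 to 4, as described in #1; the data are simulated and are not the poster's. The prefix i. tells Stata to create the indicators on the fly, and ib#. changes the base (reference) level. See help fvvarlist for the full syntax.

              ```stata
              * Hypothetical data standing in for the poster's file
              clear
              set seed 2019
              set obs 40
              generate byte categories = ceil(_n/10)   // codes 1 to 4
              generate gdpgrowth = 3 - 0.5*categories + rnormal()

              * i. creates the indicators; the lowest level (1) is the base
              regress gdpgrowth i.categories

              * ib#. picks a different base level, here level 2
              regress gdpgrowth ib2.categories
              ```

              No separate indicator variables need to be generated or stored, and commands such as margins can then work from the same specification.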



              • #8
                Originally posted by Marcos Almeida View Post

                Maybe you wish something like:
                Code:
                sysuse auto
                regress mpg weight i.rep78 foreign
                margins rep78
                marginsplot

                Hopefully that helps

                Hi Marcos,
                when I try to do that I get an error message: 'string variables may not be used as factor variables'



                • #9
                  You are not giving us the exact code you used, but we can guess that you used the string variable debttogdpratio.

                  So, use categories instead. You've already told us -- and more crucially Stata has told you -- that it is numeric. The information is there in #1.
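
                  One caveat when relying on an encoded variable: encode numbers the distinct strings in alphanumeric order unless a value label is defined beforehand. A hedged sketch follows, using the two rows shown in #3 plus two made-up rows; the countries "A" and "B", their values, the label text ">90%" and the name debtcat are assumptions for illustration only.

                  ```stata
                  clear
                  input str8 country gdpgrowth str8 debttogdpratio
                  "France" 2.5 "0-30%"
                  "Uk"     1.4 "30-60%"
                  "A"      0.9 "60-90%"
                  "B"      0.3 ">90%"
                  end

                  * Define the value label first so the codes follow the intended order
                  label define debtcat 1 "0-30%" 2 "30-60%" 3 "60-90%" 4 ">90%"

                  * noextend makes encode stop if a string is not in the label
                  encode debttogdpratio, generate(debtcat) label(debtcat) noextend

                  * Check the storage type and the code-to-label mapping
                  describe debtcat
                  label list debtcat
                  ```

                  The numeric variable can then be used as i.debtcat, with the lowest class as the base level by default.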



                  • #10
                    If, as you mentioned in #5, you wish to perform a regression analysis "with categories", you cannot use string variables in the command.
                    Best regards,

                    Marcos



                    • #11
                      Lola:
                      String variables cannot be included in -regress-.
                      While I share Nick Cox's wise comment that debt/GDP ratio was not born categorical (#7), if you still want to stick with the categorical flavour, you can do something along the following lines:
                      Code:
                      .  set obs 10
                      . g cat=1 in 1/3
                      . replace cat=2 in 4/6
                      . replace cat=3 in 7/8
                      . replace cat=4 if cat==.
                      . tab cat, gen(cat_dummies)
                      
                              cat |      Freq.     Percent        Cum.
                      ------------+-----------------------------------
                                1 |          3       30.00       30.00
                                2 |          3       30.00       60.00
                                3 |          2       20.00       80.00
                                4 |          2       20.00      100.00
                      ------------+-----------------------------------
                            Total |         10      100.00
                      
                      . g y=runiform()*1000
                      
                      . reg y i.cat_dummies1 1.cat_dummies2 i.cat_dummies3
                      
                            Source |       SS           df       MS      Number of obs   =        10
                      -------------+----------------------------------   F(3, 6)         =      1.49
                             Model |  345909.264         3  115303.088   Prob > F        =    0.3105
                          Residual |  465715.256         6  77619.2094   R-squared       =    0.4262
                      -------------+----------------------------------   Adj R-squared   =    0.1393
                             Total |   811624.52         9  90180.5022   Root MSE        =     278.6
                      
                      --------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      ---------------+----------------------------------------------------------------
                      1.cat_dummies1 |  -464.7458   254.3279    -1.83   0.117    -1087.064    157.5721
                      1.cat_dummies2 |  -299.4321   254.3279    -1.18   0.284      -921.75    322.8858
                      1.cat_dummies3 |  -518.3106   278.6022    -1.86   0.112    -1200.026    163.4046
                               _cons |   715.5471   197.0015     3.63   0.011     233.5017    1197.592
                      --------------------------------------------------------------------------------
                      Just an aside: you cannot include all 4 indicator variables together with the constant (https://en.wikipedia.org/wiki/Dummy_...le_(statistics)).
                      If you do, Stata will omit one of them because of perfect collinearity:
                      Code:
                      . reg y i.cat_dummies1 1.cat_dummies2 i.cat_dummies3 i.cat_dummies4
                      note: 1.cat_dummies4 omitted because of collinearity
                      
                            Source |       SS           df       MS      Number of obs   =        10
                      -------------+----------------------------------   F(3, 6)         =      1.49
                             Model |  345909.264         3  115303.088   Prob > F        =    0.3105
                          Residual |  465715.256         6  77619.2094   R-squared       =    0.4262
                      -------------+----------------------------------   Adj R-squared   =    0.1393
                             Total |   811624.52         9  90180.5022   Root MSE        =     278.6
                      
                      --------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      ---------------+----------------------------------------------------------------
                      1.cat_dummies1 |  -464.7458   254.3279    -1.83   0.117    -1087.064    157.5721
                      1.cat_dummies2 |  -299.4321   254.3279    -1.18   0.284      -921.75    322.8858
                      1.cat_dummies3 |  -518.3106   278.6022    -1.86   0.112    -1200.026    163.4046
                      1.cat_dummies4 |          0  (omitted)
                               _cons |   715.5471   197.0015     3.63   0.011     233.5017    1197.592
                      --------------------------------------------------------------------------------
                      
                      .
                      Kind regards,
                      Carlo
                      (StataNow 18.5)
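
                      The same model can be written without storing any indicator variables. The sketch below rebuilds the toy data from this post and uses factor-variable notation with category 4 as the base level, which corresponds to entering cat_dummies1 to cat_dummies3. The seed is an arbitrary choice, so the coefficients will not match the output shown above.

                      ```stata
                      clear
                      set seed 2019
                      set obs 10
                      generate byte cat = 1 in 1/3
                      replace cat = 2 in 4/6
                      replace cat = 3 in 7/8
                      replace cat = 4 if missing(cat)
                      generate y = runiform()*1000

                      * Category 4 as base: same model as entering the indicators for 1, 2 and 3
                      regress y ib4.cat

                      * Default base (category 1): same fit, different contrasts
                      regress y i.cat
                      ```

                      Either way one level serves as the reference, so the collinearity note does not arise.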



                      • #12
                        Greetings all,
                        Please help!
                        In my data analysis I have generated seven dummy variables. I omitted one dummy from the model (only 6 entered the regression). But the output indicates that 5 dummies are omitted due to collinearity. Why has this happened and how can I fix it? The correlation matrix shows that all the dummy variables are significant.



                        • #13
                          Julian:
                          without seeing what you typed and what Stata gave you back, it is virtually impossible to reply positively.
                          A tentative answer would be that multicollinearity is simply a matter of fact related to your dataset (as per the correlation matrix): all you can do is change your model specification, with no guarantee that the problem will be solved.
                          Kind regards,
                          Carlo
                          (StataNow 18.5)



                          • #14
                            Thanks for your quick response. This is how I generated the dummy variables: tabulate industry, gen(inddummy). I got eight dummy variables. In the regression I included 7 dummy variables (i.e. inddummy1-inddummy7) with this command: xtreg aroa ln_TA tmtsize leverage inddummy1-inddummy7 agedivc edudiv tenuredivc av_tenure, fe
                            Output: inddummy1-inddummy7 omitted because of collinearity
                            Last edited by julian mwanana; 13 Nov 2019, 01:37.



                            • #15
                              Julian:
                              the issue is not the way the dummies were generated (which is right), but that they cannot coexist on the right-hand side of your regression.
                              As an aside, posters are kindly requested to tell interested listers if they use a community-contributed programme (like -asdoc-) (please see the FAQ). Thanks.
                              Kind regards,
                              Carlo
                              (StataNow 18.5)
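
                              One possible explanation, offered as an assumption since the thread does not show the data: if industry never changes within a panel unit, its indicators are perfectly collinear with the fixed effects that xtreg, fe removes, and Stata will omit them. The sketch below reproduces that situation with simulated data; firm, year, industry, x and y are hypothetical names.

                              ```stata
                              clear
                              set seed 2019
                              set obs 40
                              generate firm = ceil(_n/5)                  // 8 firms, 5 years each
                              bysort firm: generate year = _n
                              generate byte industry = mod(firm, 4) + 1   // constant within firm
                              generate x = rnormal()
                              generate y = 1 + 0.5*x + rnormal()
                              xtset firm year

                              * A within standard deviation of zero means no change inside a firm
                              xtsum industry

                              * The industry indicators are omitted because of collinearity
                              xtreg y x i.industry, fe
                              ```

                              If xtsum reports no within variation for the real industry variable, the omission is a property of the fixed-effects estimator, not of how the indicators were generated.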

