

  • Auto data: graphing encoded variable with a dummy variable

    Hello everyone!
    I'm a high-school student writing a research paper in applied statistics, and I'm a complete newbie in Stata. Right now I'm facing a problem for which I couldn't find a clear solution on this forum.
    My data set is rather simple. In fact, it's so simple that I'll use "auto.dta" in Stata 16.0 to explain my problem. I did the following things:
    1) I encoded the string "make"
    encode make, gen(make1)
    2) I created a dummy variable that is 1 when price is less than 5000 and 0 otherwise
    gen price5000=cond(price<5000, 1, 0)
    3) I tried to create a bar graph using the dummy variable
    catplot make1, by(price5000)
    but I get this
    [Attached image: Graph.png]

    I can see that the model labels are all jumbled, but I want to know why the frequencies don't come out the way I expected.
    Even the slightest bit of help is much appreciated.
    Best regards,
    Last edited by Tursynbay Yeskendir; 30 Dec 2021, 15:10.

  • #2
    Your dataset (that is, auto.dta) has precisely one observation for each model, which is what your plots are telling you.
    . sysuse auto, clear
    (1978 automobile data)
    . encode make, gen(make1)
    . gen price5000=cond(price<5000, 1, 0)
    . tab make1 price5000
                      |       price5000
       Make and model |         0          1 |     Total
    ------------------+----------------------+----------
          AMC Concord |         0          1 |         1
            AMC Pacer |         0          1 |         1
           AMC Spirit |         0          1 |         1
            Audi 5000 |         1          0 |         1
             Audi Fox |         1          0 |         1
             BMW 320i |         1          0 |         1
        Buick Century |         0          1 |         1
        Buick Electra |         1          0 |         1
        Buick LeSabre |         1          0 |         1
           Buick Opel |         0          1 |         1
      (output omitted)
            VW Dasher |         1          0 |         1
            VW Diesel |         1          0 |         1
            VW Rabbit |         0          1 |         1
          VW Scirocco |         1          0 |         1
            Volvo 260 |         1          0 |         1
    ------------------+----------------------+----------
                Total |        37         37 |        74
    What was it that you were expecting?
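    If you want to check this directly rather than read it off the table, here is a minimal sketch. The variable name n_per_make is only my choice for illustration.

    ```stata
    sysuse auto, clear
    * isid exits with an error unless make uniquely identifies the observations
    isid make
    * count the observations recorded for each model
    bysort make: generate n_per_make = _N
    tabulate n_per_make
    ```

    If every model appears exactly once, isid runs silently and the only value tabulated for n_per_make is 1.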


    • #3
      Thank you for your reply, William Lisowski!
      I didn't notice that auto.dta has a unique string for each model. In my data, there are multiple observations for each encoded string.
      Here is an example of my dataset:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input double v1 float v2 byte v3 float(v4 v5) long v6 float v7 int v8
      4.14 215 4 2 2  19 2013 0
      4.32 209 5 3 2 170 2012 0
      4.32 209 5 2 2 170 2012 0
      4.56 208 2 2 2 170 2012 0
       4.7 213 4 2 2  19 2013 0
      4.93 209 4 2 2 198 2012 0
      4.94 208 5 3 2  19 2012 0
      4.94 208 5 1 2  19 2012 0
      5.08 212 5 1 2 114 2013 0
      5.09 211 5 1 1 142 2012 0
      5.14 208 4 1 2 168 2012 0
      5.16 211 5 3 1 142 2012 0
      5.21 209 5 2 1 138 2012 0
      5.21 208 5 2 1 138 2012 0
      5.25 212 5 2 1 138 2013 0
      5.25 213 5 2 1 138 2013 0
      5.29 211 5 2 1 142 2012 0
       5.3 214 5 2 1 138 2013 0
       5.3 212 5 3 2 114 2013 0
      5.32 215 5 2 1 138 2013 0
      5.33 211 5 1 2 114 2012 0
      5.39 211 5 2 1 138 2012 0
      5.39 210 5 2 1 138 2012 0
      5.42 211 5 3 2 114 2012 0
      5.43 209 5 2 2  92 2012 0
      5.43 209 5 1 2 114 2012 0
      5.45 209 5 3 2 114 2012 0
      5.45 213 5 1 2 114 2013 0
      5.52 208 5 1 2 114 2012 0
      5.53 215 6 2 2  19 2013 0
      5.57 208 4 3 2 168 2012 0
      5.59 208 5 3 2 114 2012 0
      5.61 213 5 3 2 114 2013 0
      5.64 215 3 2 2  19 2013 0
      5.66 208 2 3 2 170 2012 0
      5.67 210 5 1 2 114 2012 0
      5.69 218 5 1 1 142 2014 1
       5.7 210 5 3 2 114 2012 0
      5.79 208 5 2 2  92 2012 0
      5.82 214 5 1 1 142 2013 0
      5.84 215 4 2 2 198 2013 0
      5.85 208 4 1 2 170 2012 0
      5.89 214 5 1 2 114 2013 0
      5.91 213 5 2 2 114 2013 0
      5.92 210 5 1 2 172 2012 0
      5.92 212 5 2 2 114 2013 0
      6.01 208 5 1 2  44 2012 0
      6.01 209 4 1 2 170 2012 0
      6.03 209 4 1 2 168 2012 0
      6.04 208 5 3 2  44 2012 0
      6.06 214 6 2 2  19 2013 0
       6.1 217 5 1 2 114 2014 1
       6.1 214 5 3 2 114 2013 0
      6.12 213 5 3 2 196 2013 0
      6.12 213 5 2 2 196 2013 0
      6.12 214 4 2 2  57 2013 0
      6.12 208 5 2 2  44 2012 0
      6.17 208 4 1 2  19 2012 0
      6.17 212 6 2 2  19 2013 0
      6.18 213 6 2 2  19 2013 0
       6.2 215 4 2 2  57 2013 0
      6.21 216 4 2 2 198 2014 1
      6.21 214 5 1 2  46 2013 0
      6.21 216 5 1 2 114 2014 1
      6.26 217 3 1 2 174 2014 1
      6.27 215 5 1 2 114 2013 0
      6.28 208 5 2 2   1 2012 0
      6.29 208 5 2 2  46 2012 0
       6.3 208 5 2 2  34 2012 0
       6.3 216 4 2 2  57 2014 1
      6.31 213 5 3 2  44 2013 0
      6.31 208 5 1 2 172 2012 0
      6.32 214 3 2 2  19 2013 0
      6.36 217 5 3 2 170 2014 1
      6.36 210 5 1 2  42 2012 0
      6.36 217 5 3 2 114 2014 1
      6.38 209 5 2 2  34 2012 0
      6.39 214 5 1 2   1 2013 0
      6.41 210 5 3 2  42 2012 0
      6.42 216 5 3 2 114 2014 1
      6.43 210 5 1 2  44 2012 0
      6.44 211 5 1 2  44 2012 0
      6.44 218 5 1 2 114 2014 1
      6.45 214 5 2 2 114 2013 0
      6.45 208 5 1 2  42 2012 0
      6.45 215 5 1 1 168 2013 0
      6.45 219 5 1 2 114 2014 1
      6.46 217 5 1 2  46 2014 1
      6.47 215 6 3 2  19 2013 0
      6.48 212 5 1 2  42 2013 0
      6.48 212 5 3 2  44 2013 0
      6.49 215 5 3 2 114 2013 0
       6.5 208 5 3 2  42 2012 0
      6.51 215 3 3 2  19 2013 0
      6.52 209 4 1 2  19 2012 0
      6.53 209 5 1 2  42 2012 0
      6.55 208 3 1 2 174 2012 0
      6.56 216 3 1 2 174 2014 1
      6.57 209 5 1 1 169 2012 0
      6.57 214 5 3 2  46 2013 0
      end
      format %tq v2
      v6 is a numeric variable holding the encoded strings, and v8 is a binary variable.
      When I use
      catplot v6 if v8==1, blabel(bar) var1opts(sort(1) descending) percent(v8)
      it doesn't show me the percentage of 1s in v8 within each category of v6.
      Now I know that catplot is just a wrapper for graph bar, so I tried to do this:
      graph hbar v8 if v8==1, over(v6) ytitle(%) yla(0 0.25 "25" .5 "50" .75 "75" 1 "100")
      but it doesn't resolve my issue either.
      What are the other ways of showing a bar graph for categories of v6 in terms of v8?
      Last edited by Tursynbay Yeskendir; 31 Dec 2021, 01:46.


      • #4
        catplot is from SSC, as you are asked to explain (FAQ Advice #12). catplot shows results (frequencies, fractions, percents) for the observations selected, not for those observations as a share of what might have been shown. So with if v8==1 only the 1s are left, and they are 100% of what is selected.
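        To make the effect of the if qualifier concrete, here is a minimal sketch on made-up data. The variable names mirror #3, but the values are invented purely for illustration.

        ```stata
        clear
        input long v6 byte v8
        19  0
        19  0
        114 1
        114 0
        114 1
        114 0
        170 1
        170 0
        end
        * after if v8 == 1 only the 1s remain, so their mean is 1 (that is, 100%)
        summarize v8 if v8 == 1, meanonly
        display r(mean)
        * row percents over all observations give the share of 1s within each category
        tabulate v6 v8, row nofreq
        ```

        The table uses every observation, so each row shows how that category splits between 0 and 1.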

        The frequencies of a 0, 1 variable can be summarized by the mean of that variable. So if the values are 0,0,0,1,1,1,1,1,1,1 (three 0s and seven 1s), the mean is 0.7, which is just the fraction of values that are 1. You can recast that mean as a percent by fixing the axis labels. A bar chart with percents for such data shows redundantly that percent of 0 = 100 - percent of 1, or conversely. Rather than two bars, one piece of information is enough.
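        A quick way to convince yourself that the mean of an indicator is the fraction of 1s is this minimal sketch, with a toy variable x (my own name) holding three 0s and seven 1s:

        ```stata
        clear
        input byte x
        0
        0
        0
        1
        1
        1
        1
        1
        1
        1
        end
        * the mean of a 0/1 variable ...
        summarize x, meanonly
        display "mean of x        = " r(mean)
        display "as a percent     = " 100 * r(mean)
        * ... equals the number of 1s divided by the number of observations
        count if x == 1
        display "count of 1s / N  = " r(N) / _N
        ```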

        With your example data (thanks!) there are several zeros for fraction of 1s so for that reason alone I would not use a bar chart. It's easier to spot those categories with graph dot.

        graph dot by default shows means, which is what you seem to need.

        set scheme s1color 
        graph dot v8, over(v6, sort(1)) linetype(line) lines(lc(gs12) lw(thin)) l1title(use better text here) ytitle(... and here say % of whatever) ysc(r(-0.02 .)) yla(0 .2 "20" .4 "40" .6 "60", grid)
        Some of the code here is a matter of taste. I find that the default grid with graph dot often degrades on export to other software, so I reach in and use thin grey solid lines instead. As the results include zeros, I extend the scale slightly below zero, which lifts the category axis to the left.

        [Attached image: Tursynbay.png]


        • #5
          Nick Cox, Thank you very much, you've cleared up a lot of things for me!
          However, I don't see a problem with using a bar graph. Here I used your code, replacing dot with hbar:
          [Attached image: Graph1.png]

          Now I want to apologize for not explaining my problem clearly. What I'm looking for is a way of showing this bar graph without the zeros. Do I need to create a new variable to store the proportions/fractions? If so, how would one do that? Thanks again for your reply.
          Last edited by Tursynbay Yeskendir; 31 Dec 2021, 16:25.


          • #6
            Is this about what you want?

            EDIT: Belay my original response; Nick Cox's graph was much better the first time.

            (Example data as posted in #3.)
            graph dot v8, over(v6, sort(1)) linetype(line) lines(lc(gs12) lw(thin)) l1title(use better text here) ytitle(... and here say % of whatever) ysc(r(-0.02 .)) yla(0 .2 "20" .4 "40" .6 "60", grid)
            Last edited by Jared Greathouse; 31 Dec 2021, 17:31.


            • #7
              Does this help?

              set scheme s1color
              egen mean = mean(v8), by(v6)
              graph dot v8 if mean > 0, over(v6, sort(1))
              with extra options, or other options, as needed.
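              If you prefer to keep hbar and a percent scale, the same idea carries over. This is only a sketch on made-up data: the values and the variable name pct1 are mine, and the options are a matter of taste.

              ```stata
              clear
              input long v6 byte v8
              19  0
              19  0
              114 1
              114 0
              114 1
              114 0
              170 1
              170 0
              170 1
              170 1
              end
              * per-category percent of 1s, stored on every observation of that category
              egen pct1 = mean(100 * v8), by(v6)
              * drop categories with no 1s; pct1 is constant within v6, so its mean is the percent itself
              graph hbar (mean) pct1 if pct1 > 0, over(v6, sort(1) descending) blabel(bar, format(%4.1f)) ytitle("% of observations with v8 == 1")
              ```

              Storing the percent in a variable also lets you list or tabulate it, not just graph it.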


              • #8
                That works very well. Thank you for answering my basic questions. I'm sure that you are busy but I'm a slow learner, so I really appreciate it! Happy New Year!

