Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Multiple variables on a histogram for a count if each variable = 1

    Hi,

    I am new to STATA so please excuse my ignorance, this may be a rather simple question. I am using STATA/IC 15.1

    I have a dataset where some questions provided multiple tickboxes that the respondent can tick, rather than a question where only one answer can be selected. When I exported the data this has of course generated a variable in STATA for each tickbox, with the values being 0/1 "Unchecked"/"Checked". I converted the 0 "Unchecked" values to missing values (not sure if this was the right thing to do, but I was trying a few things out). It's easy for me to find out how many of each reason were checked, I am just having difficulty converting this to one graph with all the variables, which is what I really need to do.

    So the data looks something like this (this particular question was 'Reasons for undergoing [procedure]', with a number of options)

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(reasons___1 reasons___2 reasons___3 reasons___4 reasons___5 reasons___6)
    1 . 1 . 1 .
    1 . 1 . . .
    1 . 1 . . .
    1 . 1 . . 1
    1 . 1 1 . .
    1 . 1 1 1 1
    1 . 1 . . 1
    1 . 1 1 . .
    1 . . . . 1
    1 . 1 1 . .
    1 . . . 1 1
    1 . 1 1 . .
    . . 1 . . .
    . . . . . 1
    1 . 1 . 1 1
    1 . 1 1 . .
    1 . . . . 1
    1 . 1 1 . .
    1 . 1 1 . .
    1 . 1 1 . .
    1 . 1 1 . .
    1 . 1 . . .
    1 . 1 1 . 1
    1 . 1 . 1 .
    1 . . . . .
    1 . 1 1 1 1
    1 . . . . 1
    . . . . 1 1
    1 . 1 . 1 .
    1 . 1 1 . .
    1 1 . . 1 .
    1 . . . . 1
    1 . . 1 . .
    . . . . . 1
    
    end
    label values reasons___1 reasons___1_
    label def reasons___1_ 1 "Checked", modify
    label values reasons___2 reasons___2_
    label def reasons___2_ 1 "Checked", modify
    label values reasons___3 reasons___3_
    label def reasons___3_ 1 "Checked", modify
    label values reasons___4 reasons___4_
    label def reasons___4_ 1 "Checked", modify
    label values reasons___5 reasons___5_
    label def reasons___5_ 1 "Checked", modify
    label values reasons___6 reasons___6_
    label def reasons___6_ 1 "Checked", modify
    I would like to produce a simple graph like the one that I can produce in Excel from the raw data in a matter of seconds using the =countif function, that looks something like this (N.B. this is not the actual data, I just made up these numbers):
    graphexample.PNG



    If anyone could assist me or point me in the right direction (I feel like I must be missing something obvious?), I would be very grateful.
    Last edited by Hilary Bowman-Smart; 18 Feb 2018, 19:12.

  • #2
    Welcome to Statalist.

    I find that Stata graphics are so powerful and flexible that there are no simple questions, much less simple answers.

    Here is an approach that will start you on your way, based on the same data your provided (and let me thank you for using dataex to supply a healthy sample of your data on your first post!):
    Code:
    generate id = _n
    reshape long reasons___, i(id) j(number)
    drop if missing(reasons___)
    drop reasons___
    // see what the data looks like now
    list if id<=5, sepby(id) noobs
    // here are our counts as a table
    tab number
    // here they are as a histogram
    twoway histogram number, frequency discrete gap(50)
    Here's some output
    Code:
    . // see what the data looks like now
    . list if id<=5, sepby(id) noobs
    
      +-------------+
      | id   number |
      |-------------|
      |  1        1 |
      |  1        3 |
      |  1        5 |
      |-------------|
      |  2        1 |
      |  2        3 |
      |-------------|
      |  3        1 |
      |  3        3 |
      |-------------|
      |  4        1 |
      |  4        3 |
      |  4        6 |
      |-------------|
      |  5        1 |
      |  5        3 |
      |  5        4 |
      +-------------+
    
    . // here are our counts as a table
    . tab number
    
         number |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |         30       32.97       32.97
              2 |          1        1.10       34.07
              3 |         23       25.27       59.34
              4 |         14       15.38       74.73
              5 |          9        9.89       84.62
              6 |         14       15.38      100.00
    ------------+-----------------------------------
          Total |         91      100.00
    I also don't find posting graphics simple on Statalist, so maybe this won't look right for several edits.

    Click image for larger version

Name:	Histogram.png
Views:	1
Size:	47.9 KB
ID:	1430499

    Well, consider that a proof of concept after about 8 attempts. (If you click on the graphic, you will see it bigger.) Now comes the hard work, wading through the help files to pick the right options to get it to look just the way you want. But this is a start, at least, and hopefully will generalize well to your actual data.
    Last edited by William Lisowski; 18 Feb 2018, 20:02.

    Comment


    • #3
      Hi William,

      Thanks for the detailed response! That suggested code has worked for me, thank you. I'll just do a bit of relabelling and then the graph should come up quite nicely. I have a number of questions I need to do this for so this is immensely helpful.

      Comment


      • #4
        Hi William,

        While this has worked for one question this now poses difficulty for the rest of my dataset as instead of 118 observations I now have 314 obs for all questions (i.e. if one respondent has ticked 3 reasons, there are now 3 duplicated observations with the same responses throughout the rest of the survey).

        This also poses a (minor) problem when calculating percentages i.e. when doing tab [var], it shows 98 as 31.21% (i.e. 98/314) rather than what the data is actually showing, which is that 98 out of 118 respondents (83%) ticked reason 1.

        When I tried this code for the next question (which related to the year in which the procedure was performed - some respondents have had the procedure multiple times in different years which is why they can tick multiple options), I received this error message:

        reshape long procedure_year___, i(id) j(year)
        (note: j = 2013 2014 2015 2016 2017)
        variable id does not uniquely identify the observations
        Your data are currently wide. You are performing a reshape long. You specified
        i(id) and j(year). In the current wide form, variable id should
        uniquely identify the observations. Remember this picture:

        long wide
        +---------------+ +------------------+
        | i j a b | | i a1 a2 b1 b2 |
        |---------------| <--- reshape ---> |------------------|
        | 1 1 1 2 | | 1 1 3 2 4 |
        | 1 2 3 4 | | 2 5 7 6 8 |
        | 2 1 5 6 | +------------------+
        | 2 2 7 8 |
        +---------------+
        Type reshape error for a list of the problem observations.
        r(9);

        When I typed reshape error it said that 290 of 314 observations have duplicate i values.

        I'm not entirely clear what's happening here, or how I could go about getting it back to 118 obs for further analysis?

        Worse comes to worst I could just go over each of these questions by going back to the clean data each time? Though this may pose an issue when I'm putting together my do file.

        Comment


        • #5
          The problem is that your data structure is now awkward for anything but the reason it was reshaped.

          There are other ways to get such graphs while keeping your original data structure, unfit for purpose though it is in some ways.

          For the first graph you want just 6 numbers and they're all straightforward counts, so let's exploit those facts.

          This is a script you can put in a do-file editor window and then edit for later use.

          Code:
          gen count = . 
          gen reason = _n 
          
          quietly forval j = 1/6 { 
              count if reasons___`j' == 1 
              replace count = r(N) in `j'
          } 
          
          graph hbar (asis) count, over(reason) ytitle(Number of responses) title(Reasons)
          This is a little clumsy because the observations you're using aren't aligned with the rest of the data, but many Excel users apparently are accustomed to that.

          Note that 6 is naturally among the details that may need editing. Also, you clearly can have histograms if you prefer. I find that once the text labels get at all lengthy, (conventional) histograms are hard to optimise.

          On STATA: please have a look at https://www.statalist.org/forums/help #18.

          Comment


          • #6
            Nick anticipates my response to your follow-on question. The reshape of the data was intended to support production of the histogram you showed, not the general analysis of your data. The way you return to 118 observations is to again use the dataset, having produced your histogram.

            Let me follow on Nick's advice with a piece of advice I have been thinking about.

            I converted the 0 "Unchecked" values to missing values (not sure if this was the right thing to do, but I was trying a few things out)
            It is not the right thing to do. You know the respondent was shown the question and the item was unchecked, so you should represent that knowledge with an actual value - 0 is a good choice. Missing values would be appropriate, for example, had the question not been asked of the respondent. Your question was "why did you undergo [procedure]" and it was presumably not asked of respondents who not undergone a procedure, so if there were any such respondents in your data, missing values would be appropriate - in this case for all of the possible responses to this question. Bottom line: you should be using the data with 1/0 coding, not 1/. coding. Missing values have very specific meaning to Stata and are not in general interchangeable with non-minssing values.

            In particular, the percentage who ticked reason 1 is just the average value of the variable reason___1 if it is coded 1/0.




            Comment


            • #7
              William's good point can be related to #5, namely that you can count zeros too and work out percents or proportions before plotting. Naturally if all missings should be zeros, then the denominator for a percent or proportion is just the number of responses of any kind.

              Comment


              • #8
                Hi Nick and William, thanks for the help! Very valuable advice. (Sorry for the use of STATA, will keep in mind for the future!)

                Comment


                • #9
                  Hi all,

                  Just thought I would update that I found a pretty satisfactory way to do something like this that is quite simple, at least for my purposes.

                  Code:
                  graph bar (sum) reasons___1 reasons___2 reasons___3, bargap(1) blabel(bar) ytitle(Number of responses) title(Reasons for seeking Procedure) legend(label(1 "Reason 1") label(2 "Reason 2") label(3 "Reason 3"))
                  (etc etc for all the reasons variables I have)

                  It produces a colour legend rather than labelling along the axis but I don't mind that.

                  Comment

                  Working...
                  X