  • Generating graphs for Tukey band IQR

    Hello,
    I would like to generate a graph that shows Tukey-band (IQR) outliers. I flag the outliers with the code below:
    Code:
    global varlist emp            ///
                         labor_cost           ///
                         wage             ///
                         K_tan         ///
                         K_intan        ///
                         I_tan                  ///
                         I_intan                ///
                         sales_L             ///
                         VA_L
        
    ** Tukey IQR 
    sort $id $yr
    
    * Quartiles, IQR, and Tukey fences by year for each variable in $varlist
    foreach var of global varlist {
        bys $yr: egen Q1_`var' = pctile(`var'), p(25)
        bys $yr: egen Q3_`var' = pctile(`var'), p(75)
        
        gen IQR_`var' = Q3_`var' - Q1_`var'
        
        gen lower_band_`var' = Q1_`var' - 1.5 * IQR_`var'
        gen upper_band_`var' = Q3_`var' + 1.5 * IQR_`var'
        
        * Missing values sort above every number, so exclude them from the flag
        gen outlier_`var' = (`var' < lower_band_`var' | `var' > upper_band_`var') & !missing(`var')
    }    
    * Browse the flagged observations, one variable at a time
    foreach var of global varlist {
        br $id $yr `var' Q1_`var' Q3_`var' IQR_`var' lower_band_`var' upper_band_`var' outlier_`var' if outlier_`var' == 1
    }
    I then tried the code below to generate the graph:
    Code:
    foreach var of global varlist {    
    bys $yr: egen mean_`var' = mean(`var')
        
        * Create the box plot with Tukey IQR visualization
        twoway rbar lower_band_`var' upper_band_`var' $yr, /// 
            fcolor(gs12) lcolor(black) barw(0.5) || /// 
            rbar Q1_`var' Q3_`var' $yr, fcolor(gs12) lcolor(black) barw(0.5) || ///
            rspike Q1_`var' Q3_`var' $yr, lcolor(black) || ///
            rspike lower_band_`var' upper_band_`var' $yr, lcolor(black) || ///
            rcap Q1_`var' Q3_`var' $yr, msize(*6) lcolor(black) || ///
            scatter outlier_`var' $yr, mcolor(black) || ///
            scatter mean_`var' $yr, msymbol(Oh) msize(*2) fcolor(gs12) mcolor(black) /// 
            legend(off) ///
            ytitle("Tukey Band") graphregion(fcolor(gs15))         
    
        graph export "${Graph}/Tukey_`var'.jpg", as(jpg) replace
        }
    But it is not generating what I was thinking of..
    I had something like this in mind:



    It doesn't need to look exactly like this; if another type of graph can visualize Tukey bands, please suggest it.
    Could someone help me generate the graph?

    Thanks a lot !!

  • #2
    Anne-Claire Jo:

    In several threads you've started recently you've presented code that won't run if only because (1) you use global macros that aren't defined and (2) you don't give data examples or use datasets distributed with Stata or at least easily accessible. So you're expecting that we read your code, which is often quite lengthy, understand it all, and can then translate to what you want. You're falling short of the explained standards for good questions here. It's entirely likely that your real data are too large or too complicated to show or use directly, but that problem is soluble by using examples that suit the questions you have.

    In this particular thread, the problem seems to be

    it is not generating what I was thinking of..
    How are we supposed to answer that? I can't see why you think this is a question that can be answered without extended study of your code, faking example data, and so on. You don't show the graph(s) that you do get either.

    Your example graph comes from some other software, not a problem except that I don't know, and you don't explain, why some points are stars and some are circles and why some points appear to be jittered. But as you say, you don't want exactly the same graph. So, what do you want?

    Despite all that, two much more positive comments seem possible.

    1. You want a graph that shows "outliers" according to the Tukey criterion of points lying outside (lower quartile MINUS 1.5 IQR, upper quartile PLUS 1.5 IQR). But this is precisely how graph box and graph hbox operate by default. (Note in passing that Tukey did not regard his criterion as a rule for identifying outliers in any sense stronger than as points to be plotted individually in combination with a box and so-called whiskers.)

    2. From the use of br (meaning browse), you want to inspect a listing of such data points. That is already possible through extremes from SSC.


    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . extremes price, iqr
    
      +-----------------------+
      | obs:    iqr:    price |
      |-----------------------|
      |  53.   1.559    9,690 |
      |  55.   1.580    9,735 |
      |  41.   1.877   10,371 |
      |   9.   1.877   10,372 |
      |  11.   2.349   11,385 |
      |-----------------------|
      |  26.   2.401   11,497 |
      |  74.   2.633   11,995 |
      |  64.   3.096   12,990 |
      |  28.   3.318   13,466 |
      |  27.   3.378   13,594 |
      |-----------------------|
      |  12.   3.800   14,500 |
      |  13.   4.455   15,906 |
      +-----------------------+
    https://journals.sagepub.com/doi/pdf...867X0900900309 would be a reference that helps to explain your code to others.
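
    As a minimal sketch of point 1 (an illustration on the auto data shipped with Stata, not the poster's data): the default box plot already draws individually any point more than 1.5 IQR beyond the nearer quartile. Here foreign merely stands in for a grouping variable such as year, and extremes has to be installed from SSC before the listing above can be reproduced.
    Code:
    * one-off installation of the community-contributed extremes command
    ssc install extremes

    sysuse auto, clear

    * one box per group; points outside the Tukey fences are plotted individually
    graph box price, over(foreign) ytitle("Price (USD)")

    * horizontal version of the same plot
    graph hbox price, over(foreign)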


    • #3
      Nick Cox, first of all, sorry for the confusion.
      I couldn't show the full data because of confidentiality, so I tried to show as much as I could.
      I'm sorry, but I don't really understand why you think it's fake code, because I really am using it in my do-file.
      Maybe it's because I use so many globals and locals, but as I explained earlier, I MUST use them because this do-file will be reused in the future with other datasets whose structure I can't yet foresee. That is why I need to generalize and anticipate every circumstance (missing variables, different variable names, and so on).


      • #4
        Your answer in #3 does not engage at all with the positive parts of #2, principally that what you want seems already to be coded up in Stata.

        I am not saying that your code is fake, just that no-one can run it. Please read what I said again,

        I can't see why you think this is a question that can be answered without extended study of your code, faking example data, and so on.
        That is what we seem to be asked to do: write code in terms of faked example data (because your real data aren't accessible to us, which is fair enough), when your code is not so simple that an experienced Stata user can just read it and understand it.

        It's important for you, but frankly not really for us, that your code needs eventually to be made more general. If that goal means that you insist on asking questions that cannot be answered, I can't even try to help further. As well as myself, I think most of the people who have answered in your threads are more experienced in Stata than yourself. That's not a criticism, but the point is simple. If we can't follow exactly what you're asking, there isn't much point in pursuing discussion. Otherwise put, it's not good strategy to try to write code that deals at once with all possible complications.


        • #5
          Incidentally, it's not at all essential to write general code using globals, and experienced user-programmers essentially never do that.

          The ideal in Stata is to write general commands and let the syntax statement (or equivalent syntax) handle the variable names you supply.

          That in turn is not essential. I wrote nothing but do files in early Stata use, and switching to writing commands for general tasks came slowly but surely.
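
          A minimal sketch of that idea, offered as an illustration rather than as code from this thread: the command name tukeyflag, the out_ prefix, and the by() and k() options are all hypothetical. The syntax statement picks up whatever variable names the user supplies, so no globals are needed.
          Code:
          * tukeyflag: hypothetical command name, for illustration only
          capture program drop tukeyflag
          program define tukeyflag
              version 14
              * by() names the grouping variable(s); k() is the fence multiplier
              syntax varlist(numeric) [if] [in], BY(varlist) [K(real 1.5)]
              * novarlist: apply if/in only, so missings are handled per variable
              marksample touse, novarlist

              foreach v of local varlist {
                  confirm new variable out_`v'
                  tempvar q1 q3
                  bysort `by': egen `q1' = pctile(`v') if `touse', p(25)
                  bysort `by': egen `q3' = pctile(`v') if `touse', p(75)

                  * 1 if outside the fences, 0 otherwise; missing values are never flagged
                  gen byte out_`v' = (`v' < `q1' - `k' * (`q3' - `q1') | ///
                      `v' > `q3' + `k' * (`q3' - `q1')) & !missing(`v') if `touse'
              }
          end

          * usage on a dataset shipped with Stata
          sysuse auto, clear
          tukeyflag price mpg, by(foreign)
          list make foreign price if out_price == 1

          With panel data, the same call would take the year variable in by() and any list of numeric variables, whatever their names.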


          • #6
            Thanks for the explanation, and apologies again for the confusion.
            It's true that I am just a beginner, and I need to work on how I formulate my questions.
            To be honest, I don't know much about Stata, so I have had to try almost everything to make things work.
            I will use the do-file in the future with other data, and it will also be sent to and run by collaborators who have their own (different) datasets.
            This Tukey outlier exercise was meant to show how to set the threshold for jump variables later on.
            The code I showed here really is part of my do-file and contains many globals and locals (for loops). I'm not sure how to make it more understandable while keeping it directly applicable to my do-file (because, as I mentioned, it contains globals and locals).
