  • Creating a histogram with categorical data

    Hi everyone,

    I am trying to create a histogram with a string variable. In the example dataset below, I have a string variable called SchoolName which identifies where someone currently attends college. I also have a numeric variable called Studentid which identifies a unique student. I want to create a histogram with the string variable SchoolName.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte Studentid str25 SchoolName
     1 "Purdue University"        
     2 "Purdue University"        
     3 "Purdue University"        
     4 "Purdue University"        
     5 "Purdue University"        
     6 "Indiana University"       
     7 "Indiana University"       
     8 "Indiana University"       
     9 "Indiana University"       
    10 "University of Michigan"   
    11 "University of Michigan"   
    12 "University of Michigan"   
    13 "University of Michigan"   
    14 "University of Michigan"   
    15 "University of Michigan"   
    16 "University of Michigan"   
    17 "University of Michigan"   
    18 "Michigan State University"
    19 "Michigan State University"
    20 "University of Maryland"   
    21 "University of Maryland"   
    22 "University of Maryland"   
    23 "University of Maryland"   
    24 "University of Maryland"   
    25 "University of Maryland"   
    26 "University of Maryland"   
    27 "University of Maryland"   
    28 "University of Maryland"   
    29 "University of Maryland"   
    30 "University of Maryland"   
    31 "University of Maryland"   
    32 "University of Maryland"   
    33 "University of Illinois"   
    34 "University of Illinois"   
    35 "University of Illinois"   
    36 "University of Illinois"   
    37 "University of Illinois"   
    38 "University of Illinois"   
    39 "University of Illinois"   
    40 "University of Iowa"       
    41 "University of Minnesota"  
    42 "University of Minnesota"  
    43 "University of Minnesota"  
    44 "University of Minnesota"  
    45 "University of Minnesota"  
    46 "University of Minnesota"  
    47 "University of Minnesota"  
    48 "University of Minnesota"  
    49 "University of Minnesota"  
    50 "University of Minnesota"  
    51 "University of Minnesota"  
    52 "University of Minnesota"  
    53 "University of Minnesota"  
    54 "University of Minnesota"  
    55 "University of Minnesota"  
    56 "University of Minnesota"  
    57 "University of Minnesota"  
    58 "University of Minnesota"  
    59 "University of Nebraska"   
    60 "University of Nebraska"   
    61 "University of Nebraska"   
    62 "University of Nebraska"   
    63 "Northwestern University"  
    64 "Northwestern University"  
    65 "Northwestern University"  
    66 "Northwestern University"  
    67 "Northwestern University"  
    68 "Northwestern University"  
    69 "Northwestern University"  
    70 "Ohio State University"    
    71 "Ohio State University"    
    72 "Ohio State University"    
    73 "Ohio State University"    
    74 "Ohio State University"    
    75 "Ohio State University"    
    76 "Ohio State University"    
    77 "Ohio State University"    
    78 "Ohio State University"    
    79 "Ohio State University"    
    80 "Ohio State University"    
    81 "University of Wisconsin"  
    82 "University of Wisconsin"  
    83 "University of Wisconsin"  
    84 "University of Wisconsin"  
    85 "University of Wisconsin"  
    86 "University of Wisconsin"  
    87 "University of Wisconsin"  
    88 "University of Wisconsin"  
    89 "University of Wisconsin"  
    90 "University of Wisconsin"  
    end

    I would like to create a histogram identifying only the top three schools that are most frequently attended. I would like the histogram to include a label with the frequency count and a label with the name of the school. I am currently using Stata 16. Thank you so much.

  • #2
    Thanks for the data example.

    Try

    Code:
    isid SchoolName Studentid
    contract SchoolName
    gsort -_freq, gen(rank)
    graph hbar _freq if rank <= 3, over(SchoolName, des)   blabel(bar, pos(outside)) ytitle("students")
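
    Note that contract replaces the data in memory with one observation per school. A minimal sketch of a variation on the code above, assuming you want the original data back afterwards and a cutoff that is easy to change. The auto data shipped with Stata stand in for the school data, and the local macro name top is only for illustration. Run it as a do-file so that the local macro and the /// continuation work.

    ```stata
    * Sketch only: rep78 stands in for SchoolName
    sysuse auto, clear
    local top 3                      // hypothetical cutoff; change to 20 for the top 20
    preserve                         // contract overwrites the data in memory
    contract rep78                   // one observation per category, count in _freq
    gsort -_freq, gen(rank)          // rank 1 = most frequent category
    graph hbar (asis) _freq if rank <= `top', over(rep78, sort(1) descending) ///
        blabel(bar, pos(outside)) ytitle("cars")
    restore                          // the original observations are back
    ```

    Here sort(1) descending puts the longest bar at the top, and restore leaves the data as they were before contract.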



    • #3
      Justin Blasongame's solution is good for the question asked -- but what's with the readership that can only cope with 3 bars?

      This is just to note that


      Code:
      graph hbar (count), over(SchoolName, sort(1) descending)
      gets you most of the way towards a nice graph, and as Justin showed you can show bar labels too (in which case the axis labels are dispensable).
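
      A minimal sketch of that combination, using the auto data shipped with Stata as a stand-in for the school data. The choice of rep78 is only for illustration, and yscale(off) is, I believe, one way to suppress the axis once the bars carry their own labels.

      ```stata
      sysuse auto, clear
      * Bars sorted by frequency, each labelled with its count,
      * so the numeric axis is dropped as redundant
      graph hbar (count), over(rep78, sort(1) descending) blabel(bar) yscale(off)
      ```

      With the data in #1, SchoolName would replace rep78.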



      • #4
        Hi Justin. Thank you so much for your answer. This is exactly what I needed. As I am running each line of your code, it is so clear how you are attacking this problem. The solution is super intuitive and thank you for your help. I really appreciate it.


        Hi Nick. First off, you are a legend and I jumped when I saw that you had replied. You have made Statalist such a useful space to get help about statistics. Everyone in my office talks about getting help from you via Statalist. As for your question, let me explain my situation. I actually have a dataset of over 50,000 unique students who attend about 2,000 unique schools. For my project, I want to identify the top 20 most attended schools. Because my actual data is restricted, I created a sample dataset and posted it with my question. The sample dataset is composed of Big Ten schools and there are 12 unique members (I'm not sure why there's no name change). Since the sample dataset only has 12 unique schools, I arbitrarily chose 3 as the threshold for my histogram. Justin's solution can easily be applied to my actual dataset of 50,000 unique students and 2,000 unique schools, so I can use it to create a histogram showing the top 20 schools. I am still relatively new to posting on Statalist, but in the future I can provide a fuller explanation of my situation. Thanks again for all you have done.



        • #5
          Thanks for your very nice words, which I much appreciate.

          Here's a way to adapt #3 to the top 20 schools, which can be modified for "any value of 20".

          Code:
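          * neg_count is minus the number of students at each school
          bysort SchoolName : gen neg_count = -_N
          * tag is -1 for one observation per school and 0 otherwise
          by SchoolName : gen tag = -(_n==1)
          * tagged observations sort first, most attended school at the top
          sort tag neg_count
          gen wanted = _n <= 20
          * spread the flag from the tagged observation to every student at that school
          bysort SchoolName (wanted) : replace wanted = wanted[_N]

          graph hbar (count) if wanted, over(SchoolName, sort(1) descending)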

          This is a complete reproducible example (in which the value of "20" is 3). Anyone wanting to follow how it works can run it and pepper the code with as many list statements as desired (or keep the Data Editor open).

          Code:
          sysuse auto, clear
          bysort rep78 : gen count = _N
          by rep78 : gen tag = _n==1
          gsort -tag -count
          gen wanted = _n <= 3  & tag 
          bysort rep78 (wanted) : replace wanted = wanted[_N]
          graph hbar (count) if wanted, over(rep78)
          tab rep78
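
          For example, a sketch of where list statements might go in the code above. Which variables to list, and the final cross-tabulation, are just one choice among many.

          ```stata
          sysuse auto, clear
          bysort rep78 : gen count = _N
          by rep78 : gen tag = _n==1
          * One tagged observation per category, with its frequency
          list rep78 count if tag
          gsort -tag -count
          gen wanted = _n <= 3 & tag
          * Only the three most frequent categories are flagged so far
          list rep78 count wanted if tag
          bysort rep78 (wanted) : replace wanted = wanted[_N]
          * The flag has now been spread to every observation in those categories
          tab rep78 wanted, missing
          ```

          The missing option on the last line shows that observations with missing rep78 form a category of their own, which here is not among the top three.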



          • #6
            Thank you Nick for this alternative approach. This works as well and you use a larger dataset which is more similar to my actual dataset. You never disappoint and you are always so helpful. Thanks for setting such a great example. I hope to pay it forward.
