Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Listing top 5 unique values in a large dataset

    Hi,

    I have a large dataset with 39,764 observations of which many have repeating values. The variable kommune_karaktergennemsnit (the school grade average of a municipality) has 22 unique values. The variable kommune_id (the name of the municipality) has 96 unique values. How do I list the top (and bottom) 5 (n) unique values of kommune_karaktergennemsnit – ideally with the corresponding values of kommune_id:

    kommune_id kommune_karaktergennemsnit
    1. Lyngby-Taarbæk 8.6
    2. Hørsholm 8.5
    3. Gentofte 8.3
    4. Rudersdal 8.1
    5. Allerød 8.0

    Thanks!

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(elev_id kommune_id) double(elev_karaktergennemsnit kommune_karaktergennemsnit)
    20912 56 8.8 8.6
    20978 56 8.1 8.6
    20997 56 9.4 8.6
    21023 56 6.9 8.6
    21004 56 9.4 8.6
    21098 56 9.3 8.6
    20692 56  10 8.6
    20736 56 7.9 8.6
    20932 56 7.9 8.6
    20776 56  10 8.6
    20714 56 8.8 8.6
    20961 56 8.1 8.6
    21120 56 9.3 8.6
    20833 56 8.7 8.6
    20765 56  10 8.6
    20979 56 8.1 8.6
    21052 56 7.1 8.6
    21102 56 9.3 8.6
    20664 56  10 8.6
    21080 56 9.3 8.6
    end
    label values kommune_id kommune_n
    label def kommune_n 56 "Lyngby-Taarbæk", modify

  • #2
    Code:
    ssc desc extremes

    Comment


    • #3
      Also:

      Code:
      frame put kommune_id kommune_karaktergennemsnit, into(wanted)
      frame wanted{
          contract *
          gsort -kommune_karaktergennemsnit
          l in 1/5
          l in -5/l
      }
      frame drop wanted
      Note that ties are broken arbitrarily.
      Last edited by Andrew Musau; 09 Mar 2024, 08:15.

      Comment


      • #4
        Thanks, Nick, ‘extremes’ gives me the five highest values, but since the first 500 values are identical, it doesn't produce the desired result:

        +------------------------------------+
        | 39760. 8.6 Lyngby-Taarbæk |
        | 39761. 8.6 Lyngby-Taarbæk |
        | 39762. 8.6 Lyngby-Taarbæk |
        | 39763. 8.6 Lyngby-Taarbæk |
        | 39764. 8.6 Lyngby-Taarbæk |
        +------------------------------------+

        The ‘, frequencies’ option seems to be for one variable only – am I missing something?

        Anders

        Comment


        • #5
          Thanks, Andrew, works great – and thanks for introducing me to frame!

          Anders

          Comment


          • #6
            See also groups from the Stata Journal

            Code:
            search st0496, entry
            Otherwise I think extremes fits your case too, but I am away from a computer to check.

            Comment


            • #7
              Looking at this more carefully:

              I agree that extremes does not handle your desired case. I am confident now why not. It is because it is very likely that there is not a one-to-one correspondence between the top five distinct (a much better word than unique here) values on one variable and the values on any other variable.

              See Section 2 of https://journals.sagepub.com/doi/pdf...867X0800800408 on distinct and unique.

              Here is an artificial example where there is one-to-one correspondence by construction. groups handles this circumstance quite cleanly.

              Code:
              . sysuse auto, clear
              (1978 automobile data)
              
              . egen rank = rank(mpg)
              
              . groups mpg rank, select(5)
              
                +------------------------------+
                | mpg   rank   Freq.   Percent |
                |------------------------------|
                |  12    1.5       2      2.70 |
                |  14    5.5       6      8.11 |
                |  15    9.5       2      2.70 |
                |  16   12.5       4      5.41 |
                |  17   16.5       4      5.41 |
                +------------------------------+
              
              . groups mpg rank, select(-5)
              
                +------------------------------+
                | mpg   rank   Freq.   Percent |
                |------------------------------|
                |  30   68.5       2      2.70 |
                |  31     70       1      1.35 |
                |  34     71       1      1.35 |
                |  35   72.5       2      2.70 |
                |  41     74       1      1.35 |
                +------------------------------+
              Last edited by Nick Cox; 09 Mar 2024, 10:49.

              Comment


              • #8
                Super helpful, thanks Nick!

                Also made a note of distinct and unique.

                Anders

                Comment

                Working...
                X