Listing top 5 unique values in a large dataset

Anders Schepelern

Join Date: Feb 2024

Posts: 11
#1

Listing top 5 unique values in a large dataset

09 Mar 2024, 07:47

Hi,

I have a large dataset with 39,764 observations of which many have repeating values. The variable kommune_karaktergennemsnit (the school grade average of a municipality) has 22 unique values. The variable kommune_id (the name of the municipality) has 96 unique values. How do I list the top (and bottom) 5 (n) unique values of kommune_karaktergennemsnit – ideally with the corresponding values of kommune_id:

kommune_id kommune_karaktergennemsnit

1. Lyngby-Taarbæk 8.6

2. Hørsholm 8.5

3. Gentofte 8.3

4. Rudersdal 8.1

5. Allerød 8.0

Thanks!

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long(elev_id kommune_id) double(elev_karaktergennemsnit kommune_karaktergennemsnit) 20912 56 8.8 8.6 20978 56 8.1 8.6 20997 56 9.4 8.6 21023 56 6.9 8.6 21004 56 9.4 8.6 21098 56 9.3 8.6 20692 56 10 8.6 20736 56 7.9 8.6 20932 56 7.9 8.6 20776 56 10 8.6 20714 56 8.8 8.6 20961 56 8.1 8.6 21120 56 9.3 8.6 20833 56 8.7 8.6 20765 56 10 8.6 20979 56 8.1 8.6 21052 56 7.1 8.6 21102 56 9.3 8.6 20664 56 10 8.6 21080 56 9.3 8.6 end label values kommune_id kommune_n label def kommune_n 56 "Lyngby-Taarbæk", modify
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35211
#2

09 Mar 2024, 08:07

Code:

ssc desc extremes
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

09 Mar 2024, 08:10

Also:

Code:

frame put kommune_id kommune_karaktergennemsnit, into(wanted)
frame wanted{
    contract *
    gsort -kommune_karaktergennemsnit
    l in 1/5
    l in -5/l
}
frame drop wanted

Note that ties are broken arbitrarily.

Last edited by Andrew Musau; 09 Mar 2024, 08:15.

Comment

Anders Schepelern

Join Date: Feb 2024

Posts: 11
#4

09 Mar 2024, 08:20

Thanks, Nick, ‘extremes’ gives me the five highest values, but since the first 500 values are identical, it doesn't produce the desired result:

+------------------------------------+
| 39760. 8.6 Lyngby-Taarbæk |
| 39761. 8.6 Lyngby-Taarbæk |
| 39762. 8.6 Lyngby-Taarbæk |
| 39763. 8.6 Lyngby-Taarbæk |
| 39764. 8.6 Lyngby-Taarbæk |
+------------------------------------+

The ‘, frequencies’ option seems to be for one variable only – am I missing something?

Anders
Comment
Anders Schepelern

Join Date: Feb 2024

Posts: 11
#5

09 Mar 2024, 08:30

Thanks, Andrew, works great – and thanks for introducing me to frame!

Anders
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35211
#6

09 Mar 2024, 09:14

See also groups from the Stata Journal

Code:

search st0496, entry

Otherwise I think extremes fits your case too, but I am away from a computer to check.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35211

09 Mar 2024, 10:47

Looking at this more carefully:

I agree that extremes does not handle your desired case. I am confident now why not. It is because it is very likely that there is not a one-to-one correspondence between the top five distinct (a much better word than unique here) values on one variable and the values on any other variable.

See Section 2 of https://journals.sagepub.com/doi/pdf...867X0800800408 on distinct and unique.

Here is an artificial example where there is one-to-one correspondence by construction. groups handles this circumstance quite cleanly.

Code:

. sysuse auto, clear
(1978 automobile data)

. egen rank = rank(mpg)

. groups mpg rank, select(5)

  +------------------------------+
  | mpg   rank   Freq.   Percent |
  |------------------------------|
  |  12    1.5       2      2.70 |
  |  14    5.5       6      8.11 |
  |  15    9.5       2      2.70 |
  |  16   12.5       4      5.41 |
  |  17   16.5       4      5.41 |
  +------------------------------+

. groups mpg rank, select(-5)

  +------------------------------+
  | mpg   rank   Freq.   Percent |
  |------------------------------|
  |  30   68.5       2      2.70 |
  |  31     70       1      1.35 |
  |  34     71       1      1.35 |
  |  35   72.5       2      2.70 |
  |  41     74       1      1.35 |
  +------------------------------+

Last edited by Nick Cox; 09 Mar 2024, 10:49.

Comment

Anders Schepelern

Join Date: Feb 2024

Posts: 11
#8

09 Mar 2024, 13:34

Super helpful, thanks Nick!

Also made a note of distinct and unique.

Anders
Comment

kommune_id	kommune_karaktergennemsnit
1. Lyngby-Taarbæk	8.6
2. Hørsholm	8.5
3. Gentofte	8.3
4. Rudersdal	8.1
5. Allerød	8.0

Announcement