Question regarding calculating percentages of distinct values

Cadence Luchsinger

Join Date: Aug 2021

Posts: 4
#1

Question regarding calculating percentages of distinct values

21 Aug 2021, 12:52

Hello,
Using Stata 16.1 for Mac, I have been doing quite a bit of searching online to try to figure this out and keep getting confused. I have a dataset with a large number of IDs, many of which are repeated multiple times. I want to figure out the percentage of IDs that have only been seen 1 time, that have been seen 2 times, 3 times, etc. so that I can calculate the percentage of people who were seen once, twice, three times, etc. My dataset is about 80,000 observations (which includes the repeated values), so it is impossible to recode any variables by hand.

I have used bysort patient_id : gen patient_id_count = _N, but the frequencies included the duplicate values, which makes my data meaningless.

I have downloaded the Distinct function, but it just shows me how many distinct values there are overall - what I need is the number of times each distinct value is showing up as a frequency and as a percentage.
I have also tried:
by patient_id, sort: gen nvals = _n == 1
tab nvals

which shows me the number of distinct values overall as well.

The only thing that got me close to what I want is Contract, but that eliminates the rest of my data, so it is difficult to use that.
contract patient_id
tab _freq

Any suggestions?
Thank you kindly.

Last edited by Cadence Luchsinger; 21 Aug 2021, 13:13.
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#2

21 Aug 2021, 13:12

no data example (please read the FAQ), by my guess is:

Code:

bysort patient_id: gen nvals=_N

this will give you the same value for each observation with the same patient_id but you can then do something like:

Code:

bys patient_id: gen pct=nvals/denominator if _n==_N

just replace "denominator" with what you want as the divisor (distinct patient_ids????); if you want each observation to have a "pct", just leave out the "if _n==_N" part
Comment
Cadence Luchsinger

Join Date: Aug 2021

Posts: 4
#3

21 Aug 2021, 13:15

I did not include a data sample because I am uncertain about whether or not I can share this data, apologies.

Last edited by Cadence Luchsinger; 21 Aug 2021, 13:58.
Comment
Cadence Luchsinger

Join Date: Aug 2021

Posts: 4
#4

21 Aug 2021, 13:22

.

Last edited by Cadence Luchsinger; 21 Aug 2021, 13:58.
Comment
Cadence Luchsinger

Join Date: Aug 2021

Posts: 4
#5

21 Aug 2021, 13:40

That ended up giving me not what I was looking for. This is still showing me all the duplicate values added up, and then when I divided by patient_id_num (I updated patient_id so it was no longer a string variable), it gave me tiny decimal points. Any other ideas? I'm really at a loss.

Here is the output:
bysort patient_id_num: gen nvals=_N

. tab nvals

nvals | Freq. Percent Cum.
------------+-----------------------------------
1 | 4,311 5.27 5.27
2 | 10,950 13.38 18.65
3 | 14,448 17.65 36.30
4 | 15,316 18.71 55.01
5 | 9,560 11.68 66.69
6 | 7,314 8.94 75.63
7 | 5,768 7.05 82.68
8 | 4,632 5.66 88.34
9 | 3,447 4.21 92.55
10 | 2,150 2.63 95.18
11 | 1,562 1.91 97.08
12 | 792 0.97 98.05
13 | 585 0.71 98.77
14 | 322 0.39 99.16
15 | 180 0.22 99.38
16 | 128 0.16 99.54
17 | 187 0.23 99.76
18 | 18 0.02 99.79
19 | 38 0.05 99.83
20 | 20 0.02 99.86
21 | 21 0.03 99.88
22 | 22 0.03 99.91
23 | 23 0.03 99.94
25 | 25 0.03 99.97
26 | 26 0.03 100.00
------------+-----------------------------------
Total | 81,845 100.00

. bys patient_id_num: gen pct=nvals/patient_id_num

. codebook pct

-------------------------------------------------------------------------------
pct (unlabeled)
-------------------------------------------------------------------------------

type: numeric (float)

range: [.00004188,3] units: 1.000e-12
unique values: 21,373 missing .: 0/81,845

mean: .002052
std. dev: .033188

percentiles: 10% 25% 50% 75% 90%
.000136 .000219 .000401 .000821 .001994

.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

21 Aug 2021, 14:30

This code will limit your tabulation of nvals to once per patient_id

Code:

bysort patient_id_num: gen nvals=_N egen oneobs = tag(patient_id) tab nvals if oneobs==1
1 like
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#7

21 Aug 2021, 19:10

again, please read the FAQ which specifically discusses the issue of data confidentiality and providing mock data which has the same structure, but different data of course, as your actual data
1 like
Comment

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2402

21 Aug 2021, 20:14

Here's a very minimal example of what I imagine your data looks like. I create 10 subject IDs with a range of records from 1 to 3 times. The code is a condensed version of what William proposes in #6, and is attractive because it keeps the structure of the starting data.

Code:

clear *
cls

set obs 10
gen byte id = _n
gen byte freq = mod(_n, 3) + 1
expand freq
drop freq
tab id

* Start here
bys id: gen byte n_recs = _N if _n==1
tab n_recs

Result

Code:

. tab n_recs

     n_recs |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          3       30.00       30.00
          2 |          4       40.00       70.00
          3 |          3       30.00      100.00
------------+-----------------------------------
      Total |         10      100.00

If you only need the counts and percentages, and don't mind discarding the data, you can go a step further to store these as their own data.

Code:

preserve
contract id
tab _fr
contract _freq, freq(num) percent(pct)
list
restore

Result

Code:

     +---------------------+
     | _freq   num     pct |
     |---------------------|
  1. |     1     3   30.00 |
  2. |     2     4   40.00 |
  3. |     3     3   30.00 |
     +---------------------+

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#9

22 Aug 2021, 04:08

On #1

I have downloaded the Distinct function, but it just shows me how many distinct values there are overall

The allusion is to the distinct command (not function) from the Stata Journal. That's what it was intended to do all along.

The principles behind the answers here were, however, discussed in https://www.stata-journal.com/articl...article=dm0042
1 like
Comment

Announcement

Question regarding calculating percentages of distinct values

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment