Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Question regarding calculating percentages of distinct values

    Hello,
    Using Stata 16.1 for Mac, I have been doing quite a bit of searching online to try to figure this out and keep getting confused. I have a dataset with a large number of IDs, many of which are repeated multiple times. I want to figure out the percentage of IDs that have only been seen 1 time, that have been seen 2 times, 3 times, etc. so that I can calculate the percentage of people who were seen once, twice, three times, etc. My dataset is about 80,000 observations (which includes the repeated values), so it is impossible to recode any variables by hand.

    I have used bysort patient_id : gen patient_id_count = _N, but the frequencies included the duplicate values, which makes my data meaningless.

    I have downloaded the Distinct function, but it just shows me how many distinct values there are overall - what I need is the number of times each distinct value is showing up as a frequency and as a percentage.
    I have also tried:
    by patient_id, sort: gen nvals = _n == 1
    tab nvals

    which shows me the number of distinct values overall as well.

    The only thing that got me close to what I want is Contract, but that eliminates the rest of my data, so it is difficult to use that.
    contract patient_id
    tab _freq

    Any suggestions?
    Thank you kindly.
    Last edited by Cadence Luchsinger; 21 Aug 2021, 13:13.

  • #2
    no data example (please read the FAQ), by my guess is:
    Code:
    bysort patient_id: gen nvals=_N
    this will give you the same value for each observation with the same patient_id but you can then do something like:
    Code:
    bys patient_id: gen pct=nvals/denominator if _n==_N
    just replace "denominator" with what you want as the divisor (distinct patient_ids????); if you want each observation to have a "pct", just leave out the "if _n==_N" part

    Comment


    • #3
      I did not include a data sample because I am uncertain about whether or not I can share this data, apologies.
      Last edited by Cadence Luchsinger; 21 Aug 2021, 13:58.

      Comment


      • #4
        .
        Last edited by Cadence Luchsinger; 21 Aug 2021, 13:58.

        Comment


        • #5
          That ended up giving me not what I was looking for. This is still showing me all the duplicate values added up, and then when I divided by patient_id_num (I updated patient_id so it was no longer a string variable), it gave me tiny decimal points. Any other ideas? I'm really at a loss.


          Here is the output:
          bysort patient_id_num: gen nvals=_N

          . tab nvals

          nvals | Freq. Percent Cum.
          ------------+-----------------------------------
          1 | 4,311 5.27 5.27
          2 | 10,950 13.38 18.65
          3 | 14,448 17.65 36.30
          4 | 15,316 18.71 55.01
          5 | 9,560 11.68 66.69
          6 | 7,314 8.94 75.63
          7 | 5,768 7.05 82.68
          8 | 4,632 5.66 88.34
          9 | 3,447 4.21 92.55
          10 | 2,150 2.63 95.18
          11 | 1,562 1.91 97.08
          12 | 792 0.97 98.05
          13 | 585 0.71 98.77
          14 | 322 0.39 99.16
          15 | 180 0.22 99.38
          16 | 128 0.16 99.54
          17 | 187 0.23 99.76
          18 | 18 0.02 99.79
          19 | 38 0.05 99.83
          20 | 20 0.02 99.86
          21 | 21 0.03 99.88
          22 | 22 0.03 99.91
          23 | 23 0.03 99.94
          25 | 25 0.03 99.97
          26 | 26 0.03 100.00
          ------------+-----------------------------------
          Total | 81,845 100.00

          . bys patient_id_num: gen pct=nvals/patient_id_num

          . codebook pct

          -------------------------------------------------------------------------------
          pct (unlabeled)
          -------------------------------------------------------------------------------

          type: numeric (float)

          range: [.00004188,3] units: 1.000e-12
          unique values: 21,373 missing .: 0/81,845

          mean: .002052
          std. dev: .033188

          percentiles: 10% 25% 50% 75% 90%
          .000136 .000219 .000401 .000821 .001994

          .

          Comment


          • #6
            This code will limit your tabulation of nvals to once per patient_id
            Code:
            bysort patient_id_num: gen nvals=_N
            egen oneobs = tag(patient_id)
            tab nvals if oneobs==1

            Comment


            • #7
              again, please read the FAQ which specifically discusses the issue of data confidentiality and providing mock data which has the same structure, but different data of course, as your actual data

              Comment


              • #8
                Here's a very minimal example of what I imagine your data looks like. I create 10 subject IDs with a range of records from 1 to 3 times. The code is a condensed version of what William proposes in #6, and is attractive because it keeps the structure of the starting data.

                Code:
                clear *
                cls
                
                set obs 10
                gen byte id = _n
                gen byte freq = mod(_n, 3) + 1
                expand freq
                drop freq
                tab id
                
                * Start here
                bys id: gen byte n_recs = _N if _n==1
                tab n_recs
                Result

                Code:
                . tab n_recs
                
                     n_recs |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          1 |          3       30.00       30.00
                          2 |          4       40.00       70.00
                          3 |          3       30.00      100.00
                ------------+-----------------------------------
                      Total |         10      100.00
                If you only need the counts and percentages, and don't mind discarding the data, you can go a step further to store these as their own data.

                Code:
                preserve
                contract id
                tab _fr
                contract _freq, freq(num) percent(pct)
                list
                restore
                Result

                Code:
                     +---------------------+
                     | _freq   num     pct |
                     |---------------------|
                  1. |     1     3   30.00 |
                  2. |     2     4   40.00 |
                  3. |     3     3   30.00 |
                     +---------------------+

                Comment


                • #9
                  On #1

                  I have downloaded the Distinct function, but it just shows me how many distinct values there are overall
                  The allusion is to the distinct command (not function) from the Stata Journal. That's what it was intended to do all along.

                  The principles behind the answers here were, however, discussed in https://www.stata-journal.com/articl...article=dm0042

                  Comment

                  Working...
                  X