Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Count observations by group

    Hello,
    I have a number of variables. One of them is an ID and another one is Place.
    I want to generate a variable that tells me how many ID´s each place has.

    I have tried this:
    egen Counter = count(ID), by (place)

    But I get a type mismatch.

    I figured it has something to do with string variables, but I could not find a solution.

    Thank you!

  • #2
    Frauke:
    perhaps the following toy-example may help:
    Code:
    use "C:\Program Files (x86)\Stata15\ado\base\a\auto.dta"
    g id=_n
    bysort id: egen counter=count(make)
    Kind regards,
    Carlo
    (StataNow 18.5)

    Comment


    • #3
      If you want to count distinct IDs, as I guess you do, count() is irrelevant, as it counts non-missing values regardless of how many distinct values that means.

      Here is a demonstration. The data are 666 values of 42, but count() returns 666, not 1.

      Code:
      . clear
      
      . set obs 666
      number of observations (_N) was 0, now 666
      
      . gen answer = 42
      
      . egen count = count(answer)
      
      . su count
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
             count |        666         666           0        666        666
      I don't see where the type mismatch comes from in your code. count() will accept string variables. You say place and Place in different places, but you don't give us a data example to make clear what is going on. Please do read and act on FAQ Advice #12.

      Counting distinct values: there was a survey of the terrain by Gary Longton and myself in

      SJ-8-4 dm0042 . . . . . . . . . . . . Speaking Stata: Distinct observations
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q4/08 SJ 8(4):557--568
      shows how to answer questions about distinct observations
      from first principles; provides a convenience command

      which is accessible at http://www.stata-journal.com/article...article=dm0042

      The software introduced there was updated in

      SJ-15-3 dm0042_2 . . . . . . . . . . . . . . . . Software update for distinct
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q3/15 SJ 15(3):899
      improved table format and display of large numbers of
      observations

      SJ-12-2 dm0042_1 . . . . . . . . . . . . . . . . Software update for distinct
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q2/12 SJ 12(2):352
      options added to restrict output to variables with a minimum
      or maximum of distinct values

      but I guess this is what you most need:

      Code:
      egen tag = tag(ID Place)
      egen distinct = total(tag), by(Place) 
      and that's explained in the 2008 paper. See p.563.



      Comment


      • #4
        Thank you very much!

        The following approach worked:
        egen tag = tag(ID Place)
        egen distinct = total(tag), by(Place)
        Last edited by Frauke Hildebrand; 19 Jan 2018, 07:13.

        Comment


        • #5
          What version of Stata are you using?

          Comment


          • #6
            Version 13.1

            Comment


            • #7
              Version 13.1. OK, so you should be telling us that in all future questions, as explained here

              https://www.statalist.org/forums/help#version
              11. What should I say about the version of Stata I use?

              The current version of Stata is 15.1. Please specify if you are using an earlier version; otherwise, the answer to your question may refer to commands or features unavailable to you. Moreover, as bug fixes and new features are issued frequently by StataCorp, make sure that you update your Stata before posting a query, as your problem may already have been solved.
              Specifically, I was wondering whether count() wouldn't take string variables in earlier versions, but I think that it did in 13.1.

              So, I don't know yet where your type mismatch error arises.

              Comment


              • #8
                Hello all,

                Using Stata/MP 14.1 for Mac, I have the same issue as Frauke: I get a type mismatch when attempting to count a string variable through egen.

                The command egen newvar = count(stringVar), by(groups) does not work (type mismatch r(109);). Removing the by(groups) doesn't solve the issue: the problem seems to come from the string variable to be counted. Trying the same command with an integer variable yields expected results, and no error occurs.

                Trying Carlo's code, I get the same error, i.e. type mismatch.

                Moreover, my colleagues on Windows do not experience this issue (using Stata 13.1).

                Any idea what is the issue here?

                Best

                Comment


                • #9
                  Vincent:
                  welcome to this forum.
                  The first question in this case is: is your copy of Stata full updated?
                  Kind regards,
                  Carlo
                  (StataNow 18.5)

                  Comment


                  • #10
                    This was a bug fixed within the lifetime of Stata 14:

                    ------- update 03mar2016 -------------------------------------------------------------------

                    6. egen with function count(exp) incorrectly exited with error message "type
                    mismatch" when exp contained a string variable. This has been fixed.

                    You should be able to read about this at https://www.stata.com/help.cgi?whatsnew14 and -- I agree with Carlo -- the diagnosis is that you need to update your Stata, as also flagged in #13.





                    Comment


                    • #11
                      Carlo, thanks for the welcome and the advice. Should have thought about it earlier, obviously.

                      Thank you Nick for the technical details.

                      Kind regards,
                      Vincent

                      Comment


                      • #12
                        Dear Nick,

                        first of all thank you very much for the provided codes:
                        Code:
                         egen tag = tag(ID Place) egen distinct = total(tag), by(Place) 
                        I had the same problem as Frauke and could solve it using your advices. There is just one small problem left which gives me a headache:

                        In the created "distinct" variable (in my case "no_cluster"), the numbers of interest are displayed as the answer possibilities (see Example 1 below).

                        Going through the journal article you have recommended I found an additional command "tabdisp" which could solve this problem and show me the answer possibilites "place" (in my case "country) and the belonging number of IDs (in my case "no_cluster") (see Example 2). Since "tabdisp" is only a command for the visualisation, I was wondering if I could modify the variable somehow.

                        Code:
                        *count number of cluster per country
                        
                        egen tag = tag (id_cluster country)
                        egen no_cluster = total(tag), by (country)
                        tabdisp country, cell(no_cluster)

                        Example 1:

                        no_cluster | Freq. Percent Cum.
                        ------------+-----------------------------------
                        252 | 5,329 0.70 0.70
                        274 | 4,987 0.65 1.35
                        330 | 9,480 1.24 2.59
                        336 | 8,422 1.10 3.69
                        351 | 10,060 1.32 5.01
                        392 | 24,553 3.21 8.23
                        573 | 17,087 2.24 10.46


                        Example 2:

                        country | no_cluster
                        ----------+-----------
                        AO | 625
                        BF | 573
                        BJ | 1305
                        BU | 930
                        CD | 836
                        CI | 351


                        Thank you very much for your help in advance!

                        Warm greetings, Bianca

                        Comment


                        • #13
                          Sorry, but I don't understand #12. Modify the commands to do what differently?

                          Comment


                          • #14
                            Hello,
                            I am trying to run a similar line of code to the one discussed here as solution - I need to print the number of distinct observations (id) in each group- , adapted to my dataset:
                            egen tag = tag(ID Place year)
                            egen distinct = total(tag), by(Place year)
                            However it takes extremely long, left it overnight and didnt run till the end.
                            The dataset is large but still, I am assuming there is an issue as other variable creation function work in a timely manner.
                            I am using stata 15.1. Any idea what can be blocking it? How could I go about it? Thanks!

                            Comment


                            • #15
                              In #14 what does "large" mean exactly in terms of number of observations and number of variables?

                              Comment

                              Working...
                              X