Count observations by group

Frauke Hildebrand

Join Date: Jan 2018

Posts: 18
#1

Count observations by group

19 Jan 2018, 03:31

Hello,
I have a number of variables. One of them is an ID and another one is Place.
I want to generate a variable that tells me how many ID´s each place has.

I have tried this:
egen Counter = count(ID), by (place)

But I get a type mismatch.

I figured it has something to do with string variables, but I could not find a solution.

Thank you!
Tags: None
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#2

19 Jan 2018, 04:18

Frauke:
perhaps the following toy-example may help:

Code:

use "C:\Program Files (x86)\Stata15\ado\base\a\auto.dta" g id=_n bysort id: egen counter=count(make)

Kind regards,
Carlo
(Stata 19.0)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#3

19 Jan 2018, 06:07

If you want to count distinct IDs, as I guess you do, count() is irrelevant, as it counts non-missing values regardless of how many distinct values that means.

Here is a demonstration. The data are 666 values of 42, but count() returns 666, not 1.

Code:

. clear . set obs 666 number of observations (_N) was 0, now 666 . gen answer = 42 . egen count = count(answer) . su count Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- count | 666 666 0 666 666

I don't see where the type mismatch comes from in your code. count() will accept string variables. You say place and Place in different places, but you don't give us a data example to make clear what is going on. Please do read and act on FAQ Advice #12.

Counting distinct values: there was a survey of the terrain by Gary Longton and myself in

SJ-8-4 dm0042 . . . . . . . . . . . . Speaking Stata: Distinct observations
(help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
Q4/08 SJ 8(4):557--568
shows how to answer questions about distinct observations
from first principles; provides a convenience command

which is accessible at http://www.stata-journal.com/article...article=dm0042

The software introduced there was updated in

SJ-15-3 dm0042_2 . . . . . . . . . . . . . . . . Software update for distinct
(help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
Q3/15 SJ 15(3):899
improved table format and display of large numbers of
observations

SJ-12-2 dm0042_1 . . . . . . . . . . . . . . . . Software update for distinct
(help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
Q2/12 SJ 12(2):352
options added to restrict output to variables with a minimum
or maximum of distinct values

but I guess this is what you most need:

Code:

egen tag = tag(ID Place) egen distinct = total(tag), by(Place)

and that's explained in the 2008 paper. See p.563.
2 likes
Comment
Frauke Hildebrand

Join Date: Jan 2018

Posts: 18
#4

19 Jan 2018, 06:07

Thank you very much!

The following approach worked:
egen tag = tag(ID Place)
egen distinct = total(tag), by(Place)

Last edited by Frauke Hildebrand; 19 Jan 2018, 06:13.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#5

19 Jan 2018, 06:08

What version of Stata are you using?
Comment
Frauke Hildebrand

Join Date: Jan 2018

Posts: 18
#6

19 Jan 2018, 07:03

Version 13.1
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#7

19 Jan 2018, 07:32

Version 13.1. OK, so you should be telling us that in all future questions, as explained here

https://www.statalist.org/forums/help#version
11. What should I say about the version of Stata I use?

The current version of Stata is 15.1. Please specify if you are using an earlier version; otherwise, the answer to your question may refer to commands or features unavailable to you. Moreover, as bug fixes and new features are issued frequently by StataCorp, make sure that you update your Stata before posting a query, as your problem may already have been solved.

Specifically, I was wondering whether count() wouldn't take string variables in earlier versions, but I think that it did in 13.1.

So, I don't know yet where your type mismatch error arises.
Comment
Vincent Thorne

Join Date: Oct 2018

Posts: 8
#8

03 Oct 2018, 08:53

Hello all,

Using Stata/MP 14.1 for Mac, I have the same issue as Frauke: I get a type mismatch when attempting to count a string variable through egen.

The command egen newvar = count(stringVar), by(groups) does not work (type mismatch r(109);). Removing the by(groups) doesn't solve the issue: the problem seems to come from the string variable to be counted. Trying the same command with an integer variable yields expected results, and no error occurs.

Trying Carlo's code, I get the same error, i.e. type mismatch.

Moreover, my colleagues on Windows do not experience this issue (using Stata 13.1).

Any idea what is the issue here?

Best
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#9

03 Oct 2018, 08:57

Vincent:
welcome to this forum.
The first question in this case is: is your copy of Stata full updated?

Kind regards,
Carlo
(Stata 19.0)
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#10

03 Oct 2018, 09:17

This was a bug fixed within the lifetime of Stata 14:

------- update 03mar2016 -------------------------------------------------------------------

6. egen with function count(exp) incorrectly exited with error message "type
mismatch" when exp contained a string variable. This has been fixed.

You should be able to read about this at https://www.stata.com/help.cgi?whatsnew14 and -- I agree with Carlo -- the diagnosis is that you need to update your Stata, as also flagged in #13.
Comment
Vincent Thorne

Join Date: Oct 2018

Posts: 8
#11

04 Oct 2018, 02:41

Carlo, thanks for the welcome and the advice. Should have thought about it earlier, obviously.

Thank you Nick for the technical details.

Kind regards,
Vincent
Comment
Bianca Duelken

Join Date: Dec 2020

Posts: 19
#12

30 Dec 2020, 14:00

Dear Nick,

first of all thank you very much for the provided codes:

Code:

egen tag = tag(ID Place) egen distinct = total(tag), by(Place)

I had the same problem as Frauke and could solve it using your advices. There is just one small problem left which gives me a headache:

In the created "distinct" variable (in my case "no_cluster"), the numbers of interest are displayed as the answer possibilities (see Example 1 below).

Going through the journal article you have recommended I found an additional command "tabdisp" which could solve this problem and show me the answer possibilites "place" (in my case "country) and the belonging number of IDs (in my case "no_cluster") (see Example 2). Since "tabdisp" is only a command for the visualisation, I was wondering if I could modify the variable somehow.

Code:

*count number of cluster per country egen tag = tag (id_cluster country) egen no_cluster = total(tag), by (country) tabdisp country, cell(no_cluster)

Example 1:

no_cluster | Freq. Percent Cum.
------------+-----------------------------------
252 | 5,329 0.70 0.70
274 | 4,987 0.65 1.35
330 | 9,480 1.24 2.59
336 | 8,422 1.10 3.69
351 | 10,060 1.32 5.01
392 | 24,553 3.21 8.23
573 | 17,087 2.24 10.46

Example 2:

country | no_cluster
----------+-----------
AO | 625
BF | 573
BJ | 1305
BU | 930
CD | 836
CI | 351

Thank you very much for your help in advance!

Warm greetings, Bianca
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#13

31 Dec 2020, 02:28

Sorry, but I don't understand #12. Modify the commands to do what differently?
Comment
Da GXHI

Join Date: Nov 2020

Posts: 17
#14

25 May 2022, 10:38

Hello,
I am trying to run a similar line of code to the one discussed here as solution - I need to print the number of distinct observations (id) in each group- , adapted to my dataset:
egen tag = tag(ID Place year)
egen distinct = total(tag), by(Place year)
However it takes extremely long, left it overnight and didnt run till the end.
The dataset is large but still, I am assuming there is an issue as other variable creation function work in a timely manner.
I am using stata 15.1. Any idea what can be blocking it? How could I go about it? Thanks!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#15

25 May 2022, 11:08

In #14 what does "large" mean exactly in terms of number of observations and number of variables?
Comment

Announcement

Count observations by group

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment