Understanding the behavior of the group function

Riccardo Valboni

Join Date: Jun 2014
Posts: 123

19 Jun 2014, 11:29

Dear all,

I am a bit puzzled by the behavior of the group function. My dataset looks like the table below (excluding the group column). I ran the following code and expected Stata to produce the 'group' column shown. Instead, the resulting group column contains many different group numbers within the same year, such as 234, 23, 12, 1, etc. Does anyone know the reason for this behavior?

Code:

bysort ID (year): gen group=group(year)

ID	year	var1	group
445678	2006	454	1
445678	2006	788	1
445678	2006	67567	1
445678	2006	546	1
445678	2007	678	2
445678	2007	7868	2
445678	2008	67878	3
445678	2009	6709	4
445678	2009	546	4
…	…	…	…

Many thanks for your help!
Riccardo

FernandoRios

Join Date: Apr 2014

Posts: 2429
#2

19 Jun 2014, 11:35

I don't see anything "weird" in the results. Perhaps the right question is why what you got differs from what you were expecting and, more importantly, what exactly were you expecting?
Fernando

Riccardo Valboni

Join Date: Jun 2014
Posts: 123

19 Jun 2014, 11:38

I was expecting a column like 'group expected' below. Instead, I got something like 'group obtained'. It's as if -group- ignored my by (varlist): prefix.

ID	year	var1	group expected	group obtained
445678	2006	454	1	234
445678	2006	788	1	23
445678	2006	67567	1	344
445678	2006	546	1	21
445678	2007	678	2	12
445678	2007	7868	2	567
445678	2008	67878	3	53
445678	2009	6709	4	29
445678	2009	546	4	78
445679	2007	656	1	435
445679	2007	568	1	12
…	…	…	…	…

Last edited by Riccardo Valboni; 19 Jun 2014, 11:55.


FernandoRios

Join Date: Apr 2014

Posts: 2429
#4

19 Jun 2014, 11:45

Try this:
egen group_ob=group(ID year)
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#5

19 Jun 2014, 11:53

To get what you want, you should use the -egen- function group. (However, it cannot be used with -by-.)

Code:

egen group = group(year)

will produce what you expected. The non-egen group() function you are using is a relic from earlier versions of Stata. It has been kept around so that old code that used it will still run. But it is not recommended for current use, and it disappeared from the documentation long enough ago that I, at least, don't remember what it actually does (or did).

All of that said, what are you trying to accomplish here? You could also have gotten the same result with just

Code:

gen group = year - 2005

And, even so, why do you need a variable that just encodes the year in a slightly different way? Although the -egen group()- function can certainly be run with just one variable, as here, in most applications the idea is to designate combinations of two or more variables. Perhaps that is what you had in mind by putting -bysort ID- in your code. Your example doesn't show us what you expect to get when there is more than one value of the ID variable in the data set. But perhaps what you are really looking for is

Code:

egen group = group(ID year)

If none of this covers what you want, you should provide a more detailed explanation that also includes multiple values for the ID variable.
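
To make the distinction concrete, here is a minimal sketch on made-up rows in the shape of the example. The second ID (445679) and the variable names g_year and g_pair are assumptions for illustration. It shows that neither egen call restarts the numbering within ID:

```stata
clear
input long ID int year
445678 2006
445678 2006
445678 2007
445678 2008
445679 2007
445679 2008
end

* one code per distinct year, numbered in sorted order across the whole dataset
egen g_year = group(year)

* one code per distinct ID-year combination, again across the whole dataset
egen g_pair = group(ID year)

sort ID year
list, sepby(ID)
```

With these rows, g_year is 1, 1, 2, 3 for the first ID and 2, 3 for the second, while g_pair runs 1, 1, 2, 3 and then 4, 5.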
Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#6

19 Jun 2014, 11:53

My example was unclear with respect to what I wanted. I would like -group- to restart counting every time the ID changes. See the edit of my example above. Apologies for the confusion.

Last edited by Riccardo Valboni; 19 Jun 2014, 11:56.
Nick Cox

Join Date: Mar 2014

Posts: 35405
#7

19 Jun 2014, 11:56

As Fernando implies, the group() function invoked by generate is quite different from egen's group() function. It is now undocumented, but see e.g.

http://www.stata.com/statalist/archi.../msg00406.html

and its references.
FernandoRios

Join Date: Apr 2014

Posts: 2429
#8

19 Jun 2014, 11:58

.

Last edited by FernandoRios; 19 Jun 2014, 12:04.
FernandoRios

Join Date: Apr 2014

Posts: 2429
#9

19 Jun 2014, 12:01

Thanks, Nick.
That is something I didn't know, although I have used egen group to create fixed effects for a while.
And Riccardo, if you provide a better example, it will probably be easier to offer more accurate suggestions.

Riccardo Valboni

Join Date: Jun 2014
Posts: 123

#10

19 Jun 2014, 12:05

Thank you for the responses. I extended the table above a bit. In the first year in which an ID appears, I want to give it 1; in the second, 2; and so on. The count restarts from 1 for every new ID. I hope this clarifies things.

ID	year	var1	group expected
445678	2006	454	1
445678	2006	788	1
445678	2006	67567	1
445678	2006	546	1
445678	2007	678	2
445678	2007	7868	2
445678	2008	67878	3
445678	2009	6709	4
445678	2009	546	4
445679	2007	656	1
445679	2007	568	1
445679	2008	453	2
445679	2008	345	2

Last edited by Riccardo Valboni; 19 Jun 2014, 12:11.


Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#11

19 Jun 2014, 12:10

I think I got it. My solution

Code:

use file.dta
keep ID year
duplicates drop
by ID (year): gen gruppo=_n
joinby ID year using file.dta
save file.dta, replace

Thank you all for answering!
Riccardo
Nick Cox

Join Date: Mar 2014

Posts: 35405
#12

19 Jun 2014, 12:17

No need for file choreography:

Code:

bysort ID year : gen gruppo = _n == 1
by ID: replace gruppo = sum(gruppo)
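
A sketch of why this works, on a few made-up rows modeled on the table in #10. The one-step variant and the name gruppo2 are additions for illustration, not part of the suggestion itself. The first command flags the first observation of each ID-year block with 1 and every other observation with 0; the second takes a running sum of those flags within each ID, so the counter goes up by one at each new year and restarts at each new ID.

```stata
clear
input long ID int year long var1
445678 2006 454
445678 2006 788
445678 2007 678
445678 2008 67878
445678 2009 6709
445679 2007 656
445679 2007 568
445679 2008 453
end

* 1 on the first observation of each ID-year block, 0 elsewhere
bysort ID year: gen gruppo = _n == 1

* running sum within ID: increases by 1 whenever a new year starts
by ID: replace gruppo = sum(gruppo)

* one-step variant (an illustration): count changes of year within ID
* under by:, year[_n-1] is missing on the first row of each ID,
* so that row counts as a change and the counter starts at 1
by ID (year): gen gruppo2 = sum(year != year[_n-1])

* both approaches should agree on every observation
assert gruppo == gruppo2
list, sepby(ID)
```

On these rows gruppo is 1, 1, 2, 3, 4 for the first ID and 1, 1, 2 for the second.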
Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#13

19 Jun 2014, 12:19

Wow. That's brilliant.
