Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Workaround for Lack of Egen Group, by()

    I keep running into this issue where I want to be able to assign a group id using the egen var = group(), by() command, but Stata doesn't let you use , by() option with the egen group command. Does anyone happen to know of any workaround for this?

    Basically, I have a bunch of addresses listed in my data set that are identified by a particular id, and I want to assign a new and different id number (starting with 1, 2, etc.) to each different store name that appeared at each address. So I'd like to use the egen var = group(store_name), by(address) command. So for example, if there was an address with id=1 that had data for 2000-2015, I want to be able to identify how many stores there were at that address and be able to assign each of them an id starting with 1, 2, etc. So the first store that was listed at that address would get a "1" the second store would get a "2" etc. I have run into this issue is different contexts over the past year or so and am trying to find a solution. Thanks in advance for your help.

  • #2
    The use of -by()- isn't permitted with -egen, group()- because it is redundant. Think about the problem this way, if you have to count groups, you are counting by the group identifier. If you can define how you count by some further grouping,then you put those varaible(s) into the -group()- function all the same. In your case, this seems like what you want (although without a clear data example and description of the outcome, I cannot be sure).

    Code:
    egen id = group(store_name address)

    Comment


    • #3
      Hi Leonardo,
      Thanks for the response. So the suggestion you propose would not resolve my issue. Let me explain my empirical setting. I have annual data on about 40,000 addresses spanning the period 2000-2019. Each address in the data set is identified by a unique identifier (variable = stdlinxscd) and there are different dollar store chains identified at each address by the variable (sname). Some of the addresses start with a single dollar store chain and then switch to a different dollar store chain at some point in my sample period (some have more than one switch). What I would like to do is for each address, identify the first chain that was at the address in 2000, assign that store an id value of 1 for all the years that store name is associated with that address, then if a different chain entered that address such that the store name effectively changes but the address stays the same, I want to assign the id value 2 to the second store name for all the years it was at that address Then the 3rd, etc. If I use the group(store_name address) command (which would be group(stdlinxscd sname) in my data set), I would not get the id=1, id=2, etc. values for the different stores at each address but would instead get a separate id for each combination of store_name and address, which doesn't tell me which store was the first store, second store, third store, etc at that address, which is why if I were to sort the data by address and year and use the egen var_name = group(sname), by(stdlinxscd) command, that would effectively resolve my issue if that command were allowed. If you know of an alternative way I can do this, I would be grateful for the assistance. Thanks.

      Comment


      • #4
        Does this do what you want?

        Code:
        bysort address group name:  gen wanted = _n == 1
        by address : replace wanted = sum(wanted)

        Comment


        • #5
          I still don't see that -egen, group()- is the correct way to go in your use case, nor does a by() option given the intended use of -egen, group()-. You may want to take advantage of it for a different purpose, but that is clearly not working.

          Nick Cox has given a suggestion. If neither his nor my suggestion are what you are after, then you should post back with a simplified reproducible data example that doesn't have to be your real data, but should reproduce the essential features and data types (store name, address, chain, year).

          Instead, I think the concept of spells might be more appropriate here. The concept was elaborated by Nick in the Stata Journal (Tip #123). Here I've extended those ideas. I made up some fake data that I think matches your description, but I might have confused parts of your requirements. I create a number of locations, which are observed for illustration sake for 5 years from 2000-2004. Each location has a corporate chain associated with it for each year, and these change over time from one chain to another.

          Beginning with the "Begin here" comment, the start of the observation period for each location (the panel) is marked in a variable called -start-. Then, a flag is created for each time the chain changes at a specific location (-switch-). Lastly, within each panel, these two flags are summed in such a way to identify the number of change overs in ownership or occupation (-id-). Then you can take the observation from the last value of -id- in each panel as the total number of chains associated with that store location.

          Code:
          clear *
          cls
          
          // create some fake data
          set seed 17
          set obs 10
          gen byte location = _n
          expand 5
          bys location: gen int year = 2000 + _n - 1
          gen byte chain = runiformint(1, 3)
          sort location year
          
          // begin here
          bys location (year) : gen byte start = _n==1
          bys location (year) : gen byte switch = cond(_n==1, 0, ///
                                                  cond(chain==chain[_n-1], 0, 1))
          bys location (year) : gen id_start = sum(start)
          bys location (year) : gen id_switch = sum(switch)
          bys location (year) : gen id = id_start + id_switch
          drop id_start id_switch
          list
          Result for first and last location

          Code:
          . list if inlist(location, 1,10)
          
               +-----------------------------------------------+
               | location   year   chain   start   switch   id |
               |-----------------------------------------------|
            1. |        1   2000       3       1        0    1 |
            2. |        1   2001       1       0        1    2 |
            3. |        1   2002       3       0        1    3 |
            4. |        1   2003       3       0        0    3 |
            5. |        1   2004       2       0        1    4 |
               |-----------------------------------------------|
           46. |       10   2000       3       1        0    1 |
           47. |       10   2001       3       0        0    1 |
           48. |       10   2002       2       0        1    2 |
           49. |       10   2003       2       0        0    2 |
           50. |       10   2004       2       0        0    2 |
               +-----------------------------------------------+

          Comment

          Working...
          X