Egen group across two datasets with unequally sized groups

Henry Strawforrd

Join Date: Sep 2021

Posts: 228
#1

Egen group across two datasets with unequally sized groups

04 Feb 2022, 02:08

I presume the answer is no, but I want to double-check.

I have two datasets with a string identifier. In one dataset there are more than in the other.

Is it possible to do egen = group(), such that the same string values would have the same group in both datasets?
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

04 Feb 2022, 03:09

Short answer is No.

If exactly the same set of identifiers occurs in each dataset, you are good. There are imaginable cases in which different sets are nor problematic. If frog and newt are identifiers in one set and frog newt toad in another then the mappings to 1 2 and to 1 2 3 respectively are compatible. But broadly it's simple: each mapping to integers can't know about the other.

The context could help here. It may be that the real need is to combine the datasets first.
1 like
Comment

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

04 Feb 2022, 03:09

yes, if the sort order of the string variable ensures that values that are common to both datasets are sorted first,

Code:

clear
input str2 v1
"A1"
"A2"
end

egen group = group(v1)

tempfile f1
save `f1'

clear
input str2 v2
"A1"
"A2"
"B1"
"B2"
end

egen group = group(v2)

tempfile f2
save `f2'

use `f1', clear
merge 1:1 group using `f2'

list v1 v2 group _merge

     +----------------------------------+
     | v1   v2   group           _merge |
     |----------------------------------|
  1. | A1   A1       1      Matched (3) |
  2. | A2   A2       2      Matched (3) |
  3. |      B1       3   Using only (2) |
  4. |      B2       4   Using only (2) |
     +----------------------------------+

Comment

Henry Strawforrd

Join Date: Sep 2021

Posts: 228
#4

08 Feb 2022, 03:02

Instead of using egen, group, I replaced the letters in the string with numbers.

Code:

gen id = idStr local i = 1 foreach k in `c(ALPHA)' { replace id = subinstr(id, "`k'", "`i'", .) local +ii } destring id, replace

I do this in both datasets and get a numeric identifier that is consistent.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

08 Feb 2022, 07:47

As indicated in post #3, the approach there will not work if the additional identifiers in the second dataset do not sort to the end.

Code:

clear input str2 v1 "A1" "B2" end egen group = group(v1) tempfile f1 save `f1' clear input str2 v2 "A1" "A2" "B1" "B2" end egen group = group(v2) tempfile f2 save `f2' use `f1', clear merge 1:1 group using `f2'

Code:

list v1 v2 group _merge +----------------------------------+ | v1 v2 group _merge | |----------------------------------| 1. | A1 A1 1 Matched (3) | 2. | B2 A2 2 Matched (3) | 3. | B1 3 Using only (2) | 4. | B2 4 Using only (2) | +----------------------------------+

The approach in post #4 may accidentally work in some cases, depending on the exact nature of the string identifiers, but in general is a Really Bad Idea, as the following example indicates. That is, the numeric value may be "consistent" - the same in both datasets - but two different string string identifiers may yield the same numeric value, so the numeric values are no longer "distinct" - different cases will have identical values.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str5 idStr "AU000" "LA000" end gen id = idStr local i = 1 foreach k in `c(ALPHA)' { quietly replace id = subinstr(id, "`k'", "`i'", .) local ++i } destring id, replace list

Code:

. list +----------------+ | idStr id | |----------------| 1. | AU000 121000 | 2. | LA000 121000 | +----------------+

You do not tell us why you want to create numeric identifiers separately in the two datasets, but your previous topic

https://www.statalist.org/forums/for...ric-identifier

suggest that you may be trying to speed up a merge of the two datasets. You would be well advised to merge the two datasets reliably using idStr and then encode the values of idStr.

If it is taking to long to develop your code because of time spent merging large datasets, consider creating a well-chosen subset of your data and use it to develop your code.

Last edited by William Lisowski; 08 Feb 2022, 07:50.
3 likes
Comment
Henry Strawforrd

Join Date: Sep 2021

Posts: 228
#6

08 Feb 2022, 13:12

Thank you very much for saving me from this bad idea! I did not consider that the identifier will no longer be distinct...
Comment

Announcement

Egen group across two datasets with unequally sized groups

Comment

Comment

Comment

Comment

Comment