Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Egen group across two datasets with unequally sized groups

    I presume the answer is no, but I want to double-check.

    I have two datasets with a string identifier. In one dataset there are more than in the other.

    Is it possible to do egen = group(), such that the same string values would have the same group in both datasets?

  • #2
    Short answer is No.

    If exactly the same set of identifiers occurs in each dataset, you are good. There are imaginable cases in which different sets are nor problematic. If frog and newt are identifiers in one set and frog newt toad in another then the mappings to 1 2 and to 1 2 3 respectively are compatible. But broadly it's simple: each mapping to integers can't know about the other.

    The context could help here. It may be that the real need is to combine the datasets first.

    Comment


    • #3
      yes, if the sort order of the string variable ensures that values that are common to both datasets are sorted first,
      Code:
      clear
      input str2 v1
      "A1"
      "A2"
      end
      
      egen group = group(v1)
      
      tempfile f1
      save `f1'
      
      clear
      input str2 v2
      "A1"
      "A2"
      "B1"
      "B2"
      end
      
      egen group = group(v2)
      
      tempfile f2
      save `f2'
      
      use `f1', clear
      merge 1:1 group using `f2'
      
      list v1 v2 group _merge
      
           +----------------------------------+
           | v1   v2   group           _merge |
           |----------------------------------|
        1. | A1   A1       1      Matched (3) |
        2. | A2   A2       2      Matched (3) |
        3. |      B1       3   Using only (2) |
        4. |      B2       4   Using only (2) |
           +----------------------------------+

      Comment


      • #4
        Instead of using egen, group, I replaced the letters in the string with numbers.

        Code:
        gen id = idStr  
        
        local i = 1
        foreach k in `c(ALPHA)' {    
             replace id = subinstr(id, "`k'", "`i'", .)    
             local +ii
        }
        
        destring id, replace
        I do this in both datasets and get a numeric identifier that is consistent.

        Comment


        • #5
          As indicated in post #3, the approach there will not work if the additional identifiers in the second dataset do not sort to the end.
          Code:
          clear
          input str2 v1
          "A1"
          "B2"
          end
          
          egen group = group(v1)
          
          tempfile f1
          save `f1'
          
          clear
          input str2 v2
          "A1"
          "A2"
          "B1"
          "B2"
          end
          
          egen group = group(v2)
          
          tempfile f2
          save `f2'
          
          use `f1', clear
          merge 1:1 group using `f2'
          Code:
           list v1 v2 group _merge
          
               +----------------------------------+
               | v1   v2   group           _merge |
               |----------------------------------|
            1. | A1   A1       1      Matched (3) |
            2. | B2   A2       2      Matched (3) |
            3. |      B1       3   Using only (2) |
            4. |      B2       4   Using only (2) |
               +----------------------------------+
          The approach in post #4 may accidentally work in some cases, depending on the exact nature of the string identifiers, but in general is a Really Bad Idea, as the following example indicates. That is, the numeric value may be "consistent" - the same in both datasets - but two different string string identifiers may yield the same numeric value, so the numeric values are no longer "distinct" - different cases will have identical values.
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str5 idStr
          "AU000"
          "LA000"
          end
          gen id = idStr  
          
          local i = 1
          foreach k in `c(ALPHA)' {    
               quietly replace id = subinstr(id, "`k'", "`i'", .)    
               local ++i
          }
          
          destring id, replace
          list
          Code:
          . list
          
               +----------------+
               | idStr       id |
               |----------------|
            1. | AU000   121000 |
            2. | LA000   121000 |
               +----------------+
          You do not tell us why you want to create numeric identifiers separately in the two datasets, but your previous topic

          https://www.statalist.org/forums/for...ric-identifier

          suggest that you may be trying to speed up a merge of the two datasets. You would be well advised to merge the two datasets reliably using idStr and then encode the values of idStr.

          If it is taking to long to develop your code because of time spent merging large datasets, consider creating a well-chosen subset of your data and use it to develop your code.
          Last edited by William Lisowski; 08 Feb 2022, 08:50.

          Comment


          • #6
            Thank you very much for saving me from this bad idea! I did not consider that the identifier will no longer be distinct...

            Comment

            Working...
            X