Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Concatenation of variables without duplicates

    Hello everyone,
    I have a small question of data management, impossible to find the answer in "Data Management Using Stata - a practical handbook" of Michael N. Mitchell. I would like to concatenate several variables but concatenating the distinct values.
    For example :
    Var1 Var2 Var3 Var4
    A A B B
    I want to create a variable Var5 which is worth "A,B" and not "A,A,B,B".

    Can someone have an idea?
    Thank you in advance !
    Joséphine

  • #2
    Does order matter? For example, is ABC the same as CAB? If not, you can do something like below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 var1 str2 var2 str1(var3 var4)
    "A" "A" "B" "B"
    "B" "C" "A" "A"
    "E" "E" "E" "E"
    end
    
    gen concat=var1
    local N=_N
    forval i=1/`N'{
    forval v=2/4{
         local j= `v'-1
         if var`v'!= var`j' in `i'{
         replace concat= concat+var`v' in `i'
    }
    }
    }
    This exploits the fact that your variables are numbered in consecutive order, i.e., var1, var2,.., var4 - hence "forval v=2/4{". You will need to amend the code if this is not the case.

    Res.:

    Code:
    . l
    
         +------------------------------------+
         | var1   var2   var3   var4   concat |
         |------------------------------------|
      1. |    A      A      B      B       AB |
      2. |    B      C      A      A      BCA |
      3. |    E      E      E      E        E |
         +------------------------------------+
    Last edited by Andrew Musau; 20 Nov 2019, 09:09.

    Comment


    • #3
      I like Andrew Musau's strategy but suggest a simplification to

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str1 var1 str2 var2 str1(var3 var4)
      "A" "A" "B" "B"
      "B" "C" "A" "A"
      "E" "E" "E" "E"
      end
      
      gen concat = var1
      
      forval v = 2/4 {
          local V = `v' - 1 
          replace concat = concat + var`v' if var`v' != var`V' 
      }
      
      list

      Comment


      • #4
        Hello everyone,
        Thank you very much for your help, it fully meets my request. I'm not used to working with loops, I would have to work harder!

        Again thank you it's great!

        Comment


        • #5
          Hello everyone,
          I allow myself to go back on this subject because finally the answers given do not bring me entirely the solution. Indeed, this loop works perfectly in the case where the identical modalities follow each other. On the other hand, in the following case:
          var1 var2 var3 var4
          A A C A
          A B A B
          I get this with the loop:
          var5
          A,C,A
          A,B,A,B
          I do not see how to adapt the loop to reach my goals?
          I can replace the "-1" by "-2", "-3" etc but that does not seem optimal...

          Thank you in advance for your help
          Best regards

          Comment


          • #6
            What are your goals that you are not reaching? Do you want just AC and AB in those cases?

            Here is some technique:

            Code:
            clear 
            input str1 (var1    var2    var3    var4) 
            A    A    C    A
            A    B    A    B
            A    A    C    A
            A    B    A    B
            C   B   C   B 
            end 
            
            * sorted A B C if any occur 
            
            gen wanted1 = "" 
            
            foreach x in A B C { 
                replace wanted1 = wanted1 + "`x'" if inlist("`x'", var1, var2, var3, var4)
            }
            
            * sorted by order of first occurrence 
            
            gen wanted2 = var1 
            
            foreach v in var2 var3 var4 { 
                replace wanted2 = wanted2 + `v' if !strpos(wanted2, `v') 
            } 
            
            list 
            
                 +-----------------------------------------------+
                 | var1   var2   var3   var4   wanted1   wanted2 |
                 |-----------------------------------------------|
              1. |    A      A      C      A        AC        AC |
              2. |    A      B      A      B        AB        AB |
              3. |    A      A      C      A        AC        AC |
              4. |    A      B      A      B        AB        AB |
              5. |    C      B      C      B        BC        CB |
                 +-----------------------------------------------+
            Last edited by Nick Cox; 28 Nov 2019, 07:54.

            Comment


            • #7
              Yes that's it, I do not want any duplicates in my concatenation.

              Comment


              • #8
                Your second proposal responds well to my request thank you very much (the first implies to know the different possible values, here A, B or C but it is not always the case, in my data there is a lot of possible values).
                I did not know the strpos function.

                Thank you very much for your help, it's great!

                Comment

                Working...
                X