Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Search for common values in two columns

    Hi everyone,

    First I would like to appologize in case this question was posted before.

    I have data in the form:
    Var1 Var2 Var3
    A A 1
    A B 2
    A C 3
    B D 4
    B E 5
    ... ... ...

    and I want to generate a new variable that picks Var1, search for the same value in Var2 and equals Var3. It should end up like this:
    Var1 Var2 Var3 Var4
    A A 1 1
    A B 2 1
    A C 3 1
    B D 4 2
    B E 5 2
    ... ... ... ...

    Is there any way to do this?
    Thanks in advance
    Best Regards,

    Pedro
    (StataMP 16 user)

  • #2
    Maybe. In order for this to be a solvable problem (in general, not just in Stata) your data must satisfy this condition: each value of Var2 is always associated with the same value of Var3 whenever it appears. The code below will check for that. This is easily done with -rangestat-, except that var1 and var2 are string variables. But that is overcome by -encoding- them with a common label:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1(var1 var2) byte var3
    "A" "A" 1
    "A" "B" 2
    "A" "C" 3
    "B" "D" 4
    "B" "E" 5
    end
    
    
    //    VERIFY CONSISTENCY OF VAR3 IN ALL OCCURRENCES OF VAR2
    by var2 (var3), sort: assert var3[1] == var3[_N]
    
    //    NUMERICALLY ENCODE VAR2 AND VAR1 WITH THE SAME LABEL
    encode var2, gen(n_var2) label(common)
    encode var1, gen(n_var1) label(common)
    
    //    NOW CREATE VAR4
    rangestat (first) var4 = var3, interval(n_var2 n_var1 n_var1)
    This requires that you install the wonderful -rangestat- program written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

    In the future, please show example data using the -dataex- command, as I have done before. Run -ssc install dataex- to get the command and then run -help dataex- to read the simple instructions for use. With -dataex- you enable those who want to help you to create a complete and faithful replica of your Stata example with a simple copy/paste operation, so that they can try out their code and assure that it will work with your data.

    Comment


    • #3
      Thank you very much for your quick answer.

      It partially worked but I might not give all the information needed, because it doesn't fill all the observations of RIC.
      Here is part of my data:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str53 Name str11(RIC Type) int sic4
      "ACCOR SA - MARKET VAL BY CO."                       "ACCP.PA" "ACCP.PA" 9999
      "ACCOR SA - NET SALES OR REVENUES"                   "ACCP.PA" "ACN.F"   8742
      "ACCOR SA - NET INC BEFORE EXTRA/PFD DIVS"           "ACCP.PA" "ADP.PA"  4581
      "ACCOR SA - RECEIVABLES(NET)"                        "ACCP.PA" "ADSGn.F" 3949
      "ACCOR SA - PROPERTY, PLANT & EQUIP-GROSS"           "ACCP.PA" "AIRF.PA" 4512
      "ACCOR SA - TOTAL ASSETS"                            "ACCP.PA" "ALVG.F"  9999
      "ACCOR SA - TOTAL DEBT"                              "ACCP.PA" "ANA.MC"  6531
      "ACCOR SA - NET CASH FLOW-OPERATING ACTIVS"          "ACCP.PA" "ATL.MI"  4231
      "ACCENTURE PLC - MARKET VAL BY CO."                  "ACN.F"   "BASFn.F" 2879
      "ACCENTURE PLC - NET SALES OR REVENUES"              "ACN.F"   "BMWG.F"  9999
      "ACCENTURE PLC - NET INC BEFORE EXTRA/PFD DIVS"      "ACN.F"   "CAGR.PA" 6163
      "ACCENTURE PLC - RECEIVABLES(NET)"                   "ACN.F"   "DAIGn.F" 3799
      "ACCENTURE PLC - PROPERTY, PLANT & EQUIP-GROSS"      "ACN.F"   "DANO.PA" 2099
      "ACCENTURE PLC - TOTAL ASSETS"                       "ACN.F"   "DAST.PA" 7379
      "ACCENTURE PLC - TOTAL DEBT"                         "ACN.F"   "DB1Gn.F" 6231
      "ACCENTURE PLC - NET CASH FLOW-OPERATING ACTIVS"     "ACN.F"   "DPWGn.F" 8742
      "ASML HOLDING NV - MARKET VAL BY CO."                 "ASML.AS" "WLSNc.AS"    7372
      "ASML HOLDING NV - NET SALES OR REVENUES"             "ASML.AS" "WRT1V.HE"    8299
      "ASML HOLDING NV - NET INC BEFORE EXTRA/PFD DIVS"     "ASML.AS" ""               ""
      "ASML HOLDING NV - RECEIVABLES(NET)"                  "ASML.AS" ""               ""
      "ASML HOLDING NV - PROPERTY, PLANT & EQUIP-GROSS"     "ASML.AS" ""               ""
      "ASML HOLDING NV - TOTAL ASSETS"                      "ASML.AS" ""               ""
      "ASML HOLDING NV - TOTAL DEBT"                        "ASML.AS" ""               ""
      "ASML HOLDING NV - COMMON SHAREHOLDERS' EQUITY"       "ASML.AS" ""               ""
      "ASML HOLDING NV - NET CASH FLOW-OPERATING ACTIVS"    "ASML.AS" ""              ""
      end
      The number of observation in "RIC" and "Type" is different. Maybe this is why ir stops fill the obseravtions when Type (and sic4) ends.
      Best Regards,

      Pedro
      (StataMP 16 user)

      Comment


      • #4
        I'm sorry, but now I don't understand the question at all. The data you show now has 4 variables, but the original problem started with 3. I don't know which of these new variables corresponds to the original var1, var2, and var3. I guess var3 is sic4 because it is the only one that's numeric, but the others are unclear.

        I should also point out that your -dataex- output is not valid and cannot be read in because there are no values specified for sic4 in the last 7 observations. I am quite confident that you did not actually create what you show with -dataex- but that you edited the material from actual dataex output. The result is unusable.

        Please rephrase your question, and show an actual example of your data . In rephrasing the question use the actual variable names appearing in the example.

        Comment


        • #5
          I wonder if Type and sic4 are like a lookup table (with only 18 values) and the idea is to look up RIC to find the corresponding sic4. Problem with this theory is that RIC value ASML.AS is not a value of Type and would be unmatched. Anyway, to do this one could store Name and RIC in one dataset sorted by RIC, store Type and sic4 in another dataset, rename Type to RIC and sort by that, go back to the first dataset and merge m:1 (many-to-one) on RIC. The variable _merge will be 3 for RIC values that appear on both datasets (like ACCP.PA), and 1 or 2 for those appearing only in one (like ASML.AS in 1 and ADP.PA) on 2. Happy to provide code if this is the aim. Below are the first 20 observations in the merged sample data
          Code:
                                                           Name       RIC   sic4            _merge  
            1.                     ACCOR SA - MARKET VAL BY CO.   ACCP.PA   9999       matched (3)  
            2.                 ACCOR SA - NET SALES OR REVENUES   ACCP.PA   9999       matched (3)  
            3.         ACCOR SA - NET INC BEFORE EXTRA/PFD DIVS   ACCP.PA   9999       matched (3)  
            4.                      ACCOR SA - RECEIVABLES(NET)   ACCP.PA   9999       matched (3)  
            5.         ACCOR SA - PROPERTY, PLANT & EQUIP-GROSS   ACCP.PA   9999       matched (3)  
            6.                          ACCOR SA - TOTAL ASSETS   ACCP.PA   9999       matched (3)  
            7.                            ACCOR SA - TOTAL DEBT   ACCP.PA   9999       matched (3)  
            8.        ACCOR SA - NET CASH FLOW-OPERATING ACTIVS   ACCP.PA   9999       matched (3)  
            9.                ACCENTURE PLC - MARKET VAL BY CO.     ACN.F   8742       matched (3)  
           10.            ACCENTURE PLC - NET SALES OR REVENUES     ACN.F   8742       matched (3)  
           11.    ACCENTURE PLC - NET INC BEFORE EXTRA/PFD DIVS     ACN.F   8742       matched (3)  
           12.                 ACCENTURE PLC - RECEIVABLES(NET)     ACN.F   8742       matched (3)  
           13.    ACCENTURE PLC - PROPERTY, PLANT & EQUIP-GROSS     ACN.F   8742       matched (3)  
           14.                     ACCENTURE PLC - TOTAL ASSETS     ACN.F   8742       matched (3)  
           15.                       ACCENTURE PLC - TOTAL DEBT     ACN.F   8742       matched (3)  
           16.   ACCENTURE PLC - NET CASH FLOW-OPERATING ACTIVS     ACN.F   8742       matched (3)  
           17.              ASML HOLDING NV - MARKET VAL BY CO.   ASML.AS      .   master only (1)  
           18.          ASML HOLDING NV - NET SALES OR REVENUES   ASML.AS      .   master only (1)  
           19.                                                     ADP.PA   4581    using only (2)  
           20.                                                    ADSGn.F   3949    using only (2)
          Added: Just saw Clyde's reply and he is right regarding the lack of sic4 values for the last 7 observations. I assume they were meant to be missing.
          Last edited by German Rodriguez; 20 Oct 2017, 12:42.

          Comment


          • #6
            Clyde you are right: I did create what I posted with -dataex- but I did some editing to show that the length of RIC and Type were not the same but it would become very long if I placed all the obseravtions. Sorry for my mistake, didn“t mean to "cheat".
            The "Name" variable is not relevant and I should have not used it in the post.
            The best description was the one I made in the first post, where Var1 is RIC, Var2 is Type and Var3 is sic4. So, I want to generate a new variable that "looks" at RIC, search for the same value in Type and equals the corresponding sic4.


            Here is the exact dataset sample (variable RIC consinues and have all and each Type value 8 times):

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str11(RIC Type) str4 sic4
            "ACCP.PA" "ACCP.PA"     "9999"
            "ACCP.PA" "ACN.F"       "8742"
            "ACCP.PA" "ADP.PA"      "4581"
            "ACCP.PA" "ADSGn.F"     "3949"
            "ACCP.PA" "AIRF.PA"     "4512"
            "ACCP.PA" "ALVG.F"      "9999"
            "ACCP.PA" "ANA.MC"      "6531"
            "ACCP.PA" "ATL.MI"      "4231"
            "ACN.F"   "BASFn.F"     "2879"
            "ACN.F"   "BMWG.F"      "9999"
            "ACN.F"   "CAGR.PA"     "6163"
            "ACN.F"   "DAIGn.F"     "3799"
            "ACN.F"   "DANO.PA"     "2099"
            "ACN.F"   "DAST.PA"     "7379"
            "ACN.F"   "DB1Gn.F"     "6231"
            "ACN.F"   "DPWGn.F"     "8742"
            "ADP.PA"  "DTEGn.F"     "7379"
            "ADP.PA"  "ENAG.MC"     "9999"
            "ADP.PA"  "ESSI.PA"     "9999"
            "ADP.PA"  "FER.MC"      "4581"
            "ADP.PA"  "FMEG.F"      "8071"
            "ADP.PA"  "GALP.LS"     "4932"
            "ADSGn.F" "HNKG.F"      "2891"
            "ADSGn.F" "HOTG.F"      "9999"
            "ADSGn.F" "IBE.MC"      "6531"
            "ADSGn.F" "INGA.AS"     "6082"
            "ADSGn.F" "ISP.MI"      "6029"
            "ADSGn.F" "ITX.MC"      "9999"
            "ADSGn.F" "LVMH.PA"     "2844"
            "AIRF.PA" "MICP.PA"     "7549"
            "AIRF.PA" "MUVGn.F"     "6311"
            "AIRF.PA" "NESTE.HE"    "2999"
            "AIRF.PA" "NOKIA.HE"    "3669"
            "AIRF.PA" "OMVV.VI"     "1311"
            "AIRF.PA" "OREP.PA"     "9999"
            "AIRF.PA" "OTE1V.HE"    "9999"
            "AIRF.PA" "PHG.AS"      "3845"
            "ALVG.F"  "RDSa.AS"     "9999"
            "ALVG.F"  "RENA.PA"     "5521"
            "ALVG.F"  "REP.MC"      "9999"
            "ALVG.F"  "SAPG.F"      "8243"
            "ALVG.F"  "SCHN.PA"     "7371"
            "ALVG.F"  "SGOB.PA"     "5039"
            "ALVG.F"  "SIEGn.F"     "7373"
            "ANA.MC"  "SOGN.PA"     "6211"
            "ANA.MC"  "SPMI.MI"     "1381"
            "ANA.MC"  "STERV.HE"    "2671"
            "ANA.MC"  "TIE1V.HE"    "7379"
            "ANA.MC"  "TNTE.AS^G16" "9999"
            "ANA.MC"  "UMI.BR"      "3356"
            "ANA.MC"  "VIV.PA"      "4833"
            "ASML.AS" "WLSNc.AS"    "7372"
            "ASML.AS" "WRT1V.HE"    "8299"
            "ASML.AS" ""            ""    
            "ASML.AS" ""            ""    
            "ASML.AS" ""            ""    
            "ASML.AS" ""            ""    
            "ASML.AS" ""            ""    
            "ASML.AS" ""            ""    
            "ASML.AS" ""            ""    
            end
            Best Regards,

            Pedro
            (StataMP 16 user)

            Comment


            • #7
              So your data is actually a rather bizarre arrangement because the values of the variables Type and sic4 do not actually correspond to the values of RIC in the same observations. It is as if you smashed two separate data sets together. I've seen this kind of thing done in spreadsheets, but in a Stata data set it is just a recipe for trouble. (To be honest, I don't think it's a good data practice in spreadsheets either.) Anyway, the thing to do is to separate this into the two separate data sets they need to be and then merge them. The results from this code eliminates the spurious values of Type and sic4 found in your original data, and rather gives, what I think is really what you were after: each RIC is now paired with its corresponding sic4 value.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str11(RIC Type) str4 sic4
              "ACCP.PA" "ACCP.PA"     "9999"
              "ACCP.PA" "ACN.F"       "8742"
              "ACCP.PA" "ADP.PA"      "4581"
              "ACCP.PA" "ADSGn.F"     "3949"
              "ACCP.PA" "AIRF.PA"     "4512"
              "ACCP.PA" "ALVG.F"      "9999"
              "ACCP.PA" "ANA.MC"      "6531"
              "ACCP.PA" "ATL.MI"      "4231"
              "ACN.F"   "BASFn.F"     "2879"
              "ACN.F"   "BMWG.F"      "9999"
              "ACN.F"   "CAGR.PA"     "6163"
              "ACN.F"   "DAIGn.F"     "3799"
              "ACN.F"   "DANO.PA"     "2099"
              "ACN.F"   "DAST.PA"     "7379"
              "ACN.F"   "DB1Gn.F"     "6231"
              "ACN.F"   "DPWGn.F"     "8742"
              "ADP.PA"  "DTEGn.F"     "7379"
              "ADP.PA"  "ENAG.MC"     "9999"
              "ADP.PA"  "ESSI.PA"     "9999"
              "ADP.PA"  "FER.MC"      "4581"
              "ADP.PA"  "FMEG.F"      "8071"
              "ADP.PA"  "GALP.LS"     "4932"
              "ADSGn.F" "HNKG.F"      "2891"
              "ADSGn.F" "HOTG.F"      "9999"
              "ADSGn.F" "IBE.MC"      "6531"
              "ADSGn.F" "INGA.AS"     "6082"
              "ADSGn.F" "ISP.MI"      "6029"
              "ADSGn.F" "ITX.MC"      "9999"
              "ADSGn.F" "LVMH.PA"     "2844"
              "AIRF.PA" "MICP.PA"     "7549"
              "AIRF.PA" "MUVGn.F"     "6311"
              "AIRF.PA" "NESTE.HE"    "2999"
              "AIRF.PA" "NOKIA.HE"    "3669"
              "AIRF.PA" "OMVV.VI"     "1311"
              "AIRF.PA" "OREP.PA"     "9999"
              "AIRF.PA" "OTE1V.HE"    "9999"
              "AIRF.PA" "PHG.AS"      "3845"
              "ALVG.F"  "RDSa.AS"     "9999"
              "ALVG.F"  "RENA.PA"     "5521"
              "ALVG.F"  "REP.MC"      "9999"
              "ALVG.F"  "SAPG.F"      "8243"
              "ALVG.F"  "SCHN.PA"     "7371"
              "ALVG.F"  "SGOB.PA"     "5039"
              "ALVG.F"  "SIEGn.F"     "7373"
              "ANA.MC"  "SOGN.PA"     "6211"
              "ANA.MC"  "SPMI.MI"     "1381"
              "ANA.MC"  "STERV.HE"    "2671"
              "ANA.MC"  "TIE1V.HE"    "7379"
              "ANA.MC"  "TNTE.AS^G16" "9999"
              "ANA.MC"  "UMI.BR"      "3356"
              "ANA.MC"  "VIV.PA"      "4833"
              "ASML.AS" "WLSNc.AS"    "7372"
              "ASML.AS" "WRT1V.HE"    "8299"
              "ASML.AS" ""            ""    
              "ASML.AS" ""            ""    
              "ASML.AS" ""            ""    
              "ASML.AS" ""            ""    
              "ASML.AS" ""            ""    
              "ASML.AS" ""            ""    
              "ASML.AS" ""            ""  
              end
              
              
              preserve
              
              // CREATE A TEMPFILE WITH A CROSSWALK
              //    BETWEEN Type AND sic4
              tempfile crosswalk
              keep Type sic4
              assert missing(Type) == missing(sic4)
              drop if missing(Type)
              by Type (sic4), sort: assert sic4[1] == sic4[_N] 
              isid Type, sort
              rename Type RIC
              save `crosswalk'
              
              restore
              
              drop Type sic4
              merge m:1 RIC using `crosswalk', keep(match master) nogenerate
              So I have given you what I think is what you really wanted, though it is not what you asked for. If that's not helpful, please post back with a clearer explanation.

              By the way, I noted that in this example, RIC value ASML.AS is not found at all in Type, so it gets no matching value for sic4.

              This solution is along the lines that German Rodriguez proposed in #5. Full code is provided here.

              Comment


              • #8
                Thank you Clyde and German. By following your latest instructions I managed to solve the problem. I should have, right from the beginning, worked two datasets.
                With a single topic I learned much more than what I asked for. Thank you both.
                Best Regards,

                Pedro
                (StataMP 16 user)

                Comment


                • #9
                  Backtracking to #1 and #2: encoding two or more string variables consistently is quite a common need.

                  multencode (SSC) is one solution. Originally announced at https://www.stata.com/statalist/arch.../msg00729.html

                  Not a big deal here, as it saves only one line, but for the record here is Clyde's solution tweaked:

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str1(var1 var2) byte var3
                  "A" "A" 1
                  "A" "B" 2
                  "A" "C" 3
                  "B" "D" 4
                  "B" "E" 5
                  end
                  
                  
                  //    VERIFY CONSISTENCY OF VAR3 IN ALL OCCURRENCES OF VAR2
                  by var2 (var3), sort: assert var3[1] == var3[_N]
                  
                  //    NUMERICALLY ENCODE VAR2 AND VAR1 WITH THE SAME LABEL
                  multencode var1 var2, gen(n_var1 n_var2) 
                  
                  //    NOW CREATE VAR4
                  rangestat (first) var4 = var3, interval(n_var2 n_var1 n_var1)

                  Comment

                  Working...
                  X