Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Joining data that has no unique identifier

    Dear members,

    I am using Stata 12.1 for Mac.

    My research requires a lot of data management before getting to the statistical part. And this is where I am stuck.

    I have two datasets: (-describe- shown at end of message)

    Dataset 1 has the following variables:
    - Ticker
    - Group
    - Year
    - this dataset has companies (ticker) which belong to different risk factor groups (group) each year. (The companies can possibly change from a group to another each year.)
    - # of obs.: 4082 observations

    Dataset 2 has:
    - Group
    - data (YMD)
    - returns
    - this dataset has daily returns from 02Jan2000 thru 29Dec2012
    - # of obs.: 17712 observations

    I have tried a few things:
    1) -merge-
    But have learned that to use this function i would need a unique identifier. Given that it is inherent to the data that no one field is uniquely identified, I did not find a way to use this command.

    2) -joinby-
    joinby group using [dataset 2]
    result: the command copied data without respect to date/year

    3) used these line of code with dataset 1 in the memory:
    quietly levelsof year, local(levs)
    quietly foreach lev of local levs {
    joinby grupo using fipe_ajustes
    }
    result: Stata stopped the operation and issue the following message:
    "sum of expand values exceed 2,147,483,647
    The dataset may not contain more than 2,147,483,647 observations."

    Here, I am not sure if this 3rd code is working, and if the problem is really size of dataset.

    I read the guide for posting, and I hope to have been clear enough.

    Thanks,
    Clarice

    ------------------------------------------------

    Dataset 1:

    . describe

    Contains data from stocks_ajustes.dta
    obs: 4,082
    vars: 3 20 Apr 2014 18:20
    size: 81,640
    -----------------------------------------------------------------------------------------------------------------------------
    storage display value
    variable name type format label variable label
    -----------------------------------------------------------------------------------------------------------------------------
    ticker str16 %16s ticker
    year int %10.0g year
    group str2 %9s
    -----------------------------------------------------------------------------------------------------------------------------

    Dataset 2:

    . describe

    Contains data from fipe_ajustes.dta
    obs: 17,712
    vars: 3 20 Apr 2014 17:55
    size: 247,968
    -----------------------------------------------------------------------------------------------------------------------------
    storage display value
    variable name type format label variable label
    -----------------------------------------------------------------------------------------------------------------------------
    grupo str2 %9s grupo
    ret_ajust double %10.0g
    period float %td
    -----------------------------------------------------------------------------------------------------------------------------
    Sorted by:
    Note: dataset has changed since last saved


  • #2
    Clarice,

    Before deciding between merge and joinby, we would need a better idea of what you are trying to accomplish (i.e., what variables do you want to match observations between the two data sets? Are there any combinations of variables that uniquely identify observations in either data set?).

    From what I can tell so far, it looks like data set 1 has variables group and year which appear to uniquely identify observations, so that's a good start. Data set 2 has variables grupo and data which could be modified to accomplish the merge. Accordingly, this is my best guess as to what you might want to do:

    Code:
    use fipe_ajustes, clear
    rename grupo group   // Assuming the grupo variable is defining the same thing as the group variable in the other data set
    gen year=year(period)
    
    merge m:1 group year using stocks_ajustes
    This will match every observation in data set 2 with the observation in data set 1 that has the same year and the same group. If this is not what you want, we will probably need more information.

    Regards,
    Joe

    Comment


    • #3
      I assume your return data is captured at the firm level and not on an industry level? And you want to merge some company data with returns data, right? Then you definitely need an identifier variable for your "fipe_ajustes.dta" dataset what seem to be missing. In this case you need to go back to the source of your data and add this variable. Further your second dataset seems to be lacking a time variable (What is the time period where the returns were generated?).

      In case you actually want to merge the datasets on the group level you need to rename the variable first. As it stands now in one dataset the variable is called "grupo" and in the other one "group". If you further have a time variable in your second dataset you can easily match one return observation per year group with the data in your first mentioned dataset using the merge command:

      merge 1:1 group year using fipe_ajustes.dta

      HTH,
      Roberto

      Comment


      • #4
        Thank you all... I believe I need to explain myself better. My apologies.

        Nevertheless, before doing that I tried both suggestions and still got the error message:
        "variables group year do not uniquely identify observations in the using data
        r(459);"
        Dataset 1 - stocks_ajustes:
        - Data is at the firm level, but the list of firms repeat each year (2000-2012)
        - Each year a firm can be placed in a different risk group.
        - Variables: ticker, group, year
        - Sample data:

        ticker group year
        PETR4 BH 2000
        ABEV3 BL 2000
        USIM5 SM 2000
        PETR4 BM 2001
        ABEV3 BH 2001
        USIM5 SM 2001

        Dataset 2 - fipe_ajustes
        - These are daily returns separated by group.
        - There are 6 groups which are the same groups used in Dataset 1
        - Variables: group, period (YDM), ret_ajust
        - Sample data:

        group period ret_ajust
        BH 1aug2000 -0.01
        BL 1aug2000 -0.02
        BM 1aug2000 -0.03
        SM 1aug2000 -0.04
        BH 2aug2000 0.05
        BL 2aug2000 0.02
        BM 2aug2000 0.03
        SM 2aug2000 0.02
        BH 2jan2001 -0.03
        BL 2jan2001 0.02
        BM 2jan2001 -0.03
        SM 2jan2001 0.01
        BH 3jan2001 -0.001
        BL 3jan2001 0.03
        BM 3jan2001 -0.07
        SM 3jan2001 0.08

        All the different daily returns need to be copied to one master database with the Firms and their corresponding group each year.

        For instance in 2000, PETR4 belongs to the group BH, but on year 2001, PETR4 changed to group BM.

        So, the final result should like this (sample of desired result):

        ticker group year period ret_ajust
        PETR4 BH 2000 1aug2000 -0.01 <== from dataset 2, data corresponding to returns for firms in BH category for 01aug2000
        PETR4 BH 2000 2aug2000 0.05 <== from dataset 2, data corresponding to returns for firms in BH category for 02aug2000
        PETR4 BM 2001 2jan2001 -0.03 <== from dataset 2, data corresponding to returns for firms in BM category for 02jan2001
        PETR4 BM 2001 3jan2001 -0.07 <== from dataset 2, data corresponding to returns for firms in BM category for 03jan2001


        I hope this clarifies.

        Thank you for your comments!

        Regards,
        Clarice

        Comment


        • #5
          Clarice,

          Thanks; that helps. Based on your clarification, I think joinby might indeed be what is required, but with group and year, not just group:

          Code:
          use fipe_ajustes, clear
          rename grupo group    
          gen year=year(period)  
          
          joinby group year using stocks_ajustes
          This may help avoid the error you got previously with joinby. If you still get that error, then you may want to re-think your strategy, or you may have to split your data sets into smaller pieces in order to make it work.

          Regards,
          Joe

          Comment


          • #6
            There is more than one possible observation for - group year -. For example, BH 1aug2000 - 0.01 BH 2aug2000 0.05 After extracting the year you end up with: BH 2000 - 0.01 BH 2000 0.05 Stata doesn't know which one to use. Hence those two variables do not uniquely identify observations.
            You should:

            1. Read the FAQ carefully.

            2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

            3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

            4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

            Comment


            • #7
              Something wrong with formatting and my smartphone. Can't even get line breaks. Sorry.
              You should:

              1. Read the FAQ carefully.

              2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

              3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

              4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

              Comment


              • #8
                Joe,

                Thank you!! It worked perfectly...!!!

                I was so close... but, didn't try the use of -joinby- with the two variables. Fantastic!


                Thank you all for comments.

                Regards,
                Clarice

                Comment


                • #9
                  I don't see why Joe Canner's first solution (post #2) doesn't work for you. A simple merge m:1 works in the following example.

                  Code:
                  clear all
                  set more off
                  
                  input ///
                  str6 ticker str2 group year
                  PETR4 BH 2000
                  ABEV3 BL 2000
                  USIM5 SM 2000
                  PETR4 BM 2001
                  ABEV3 BH 2001
                  USIM5 SM 2001
                  end
                  
                  tempfile stocks
                  save "`stocks'"
                  
                  clear all
                  
                  input ///
                  str2 group str11 period ret
                  BH 1aug2000 -0.01
                  BL 1aug2000 -0.02
                  BM 1aug2000 -0.03
                  SM 1aug2000 -0.04
                  BH 2aug2000 0.05
                  BL 2aug2000 0.02
                  BM 2aug2000 0.03
                  SM 2aug2000 0.02
                  BH 2jan2001 -0.03
                  BL 2jan2001 0.02
                  BM 2jan2001 -0.03
                  SM 2jan2001 0.01
                  BH 3jan2001 -0.001
                  BL 3jan2001 0.03
                  BM 3jan2001 -0.07
                  SM 3jan2001 0.08
                  end
                  
                  gen period2 = date(period,"DMY")
                  format period2 %td
                  gen year=year(period2)
                  
                  merge m:1 group year using "`stocks'"
                  list if ticker == "PETR4"
                  Maybe you were using merge 1:1 which gives the error

                  variables group year do not uniquely identify observations in the master data
                  r(459);

                  You should:

                  1. Read the FAQ carefully.

                  2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

                  3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

                  4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

                  Comment


                  • #10
                    Hello Roberto,

                    sorry, I didn't see your comment. So, I went back to double check.

                    Well, it seems that, with -joinby- suggeted above by Joe, Stata knew to get all the entries with the year BH 2000 ticker into the correct place.

                    Here is a snapshot:

                    ticker group year ret_ajust period
                    ABCB4 BZ Equity BH 2007 .0076035 02jan2007
                    ABCB4 BZ Equity BH 2007 -.0107174 03jan2007
                    ABCB4 BZ Equity BH 2007 -.0040778 04jan2007
                    ABCB4 BZ Equity BH 2007 -.0327368 05jan2007
                    ABCB4 BZ Equity BH 2007 .0061015 08jan2007
                    ABCB4 BZ Equity BH 2007 -.017616 09jan2007
                    ABCB4 BZ Equity BH 2007 .0110438 10jan2007
                    ABCB4 BZ Equity BH 2007 .0100308 11jan2007
                    ABCB4 BZ Equity BH 2007 .0016244 12jan2007
                    ABCB4 BZ Equity BH 2007 -.0038603 15jan2007
                    ABCB4 BZ Equity BH 2007 -.0076374 16jan2007
                    ABCB4 BZ Equity BH 2007 .0071772 17jan2007
                    ABCB4 BZ Equity BH 2007 .0048582 18jan2007
                    ABCB4 BZ Equity BH 2007 .01307 19jan2007
                    ABCB4 BZ Equity BH 2007 .0031272 22jan2007
                    ABCB4 BZ Equity BH 2007 .0099545 23jan2007
                    ABCB4 BZ Equity BH 2007 .0106728 24jan2007
                    ABCB4 BZ Equity BH 2007 .0035398 26jan2007
                    ABCB4 BZ Equity BH 2007 -.0148391 29jan2007
                    ABCB4 BZ Equity BH 2007 .0027485 30jan2007
                    ABCB4 BZ Equity BH 2007 .0082529 31jan2007


                    It seems to me it is ok, unless I am missing something.

                    Thanks for the comment.

                    Regards,
                    Clarice

                    Comment


                    • #11
                      Roberto,

                      I think the disconnect is that Clarice's example didn't reveal that there are multiple entries in the first data set for each group in a given year. The example gives the impression that there is a 1:1 correspondence between firm and group, and that was my assumption when I first suggested merge. However, based on Clarice's explanation and upon further reflection, I realized that there would be multiple firms in a given group in any given year. There are also multiple records per year for a given group (one per day) in the second data set, thus requiring joinby.

                      All's well that end's well!

                      Regards,
                      Joe

                      Comment


                      • #12
                        Hello, Roberto!
                        Sorry, it seems I am always missing your comments.

                        I tried now your suggestion with -merge- :

                        1) with the sample code you suggested, I am having trouble with "`stocks'"

                        Code:
                        . merge m:1 group year using "`stocks'"
                        invalid file specification
                        r(198);
                        I don't know if you notice (I mentioned on original post) I am using Stata 12.1 for Mac.

                        Maybe it has to do with the version?


                        2) I tried -merge- as you suggested, but with the file name in place of "`stocks'"

                        Code:
                        . merge m:1 group year using stocks_ajustes 
                        variables group year do not uniquely identify observations in the using data
                        r(459);
                        As you see I still get the error message.

                        Am I doing something really dumb and not seeing it??

                        Thanks again!

                        These exchanges are very valuable to my learning with Stata.

                        Comment


                        • #13
                          Originally posted by Clarice Martins View Post
                          1) with the sample code you suggested, I am having trouble with "`stocks'"

                          Code:
                          . merge m:1 group year using "`stocks'"
                          invalid file specification
                          r(198);
                          This probably has to do with you running the code by parts. The file name is saved to a tempfile, which is just a local macro. Local macros have local scope so you need to run the complete code for it to work.


                          Originally posted by Clarice Martins View Post
                          2) I tried -merge- as you suggested, but with the file name in place of "`stocks'"

                          Code:
                          . merge m:1 group year using stocks_ajustes
                          variables group year do not uniquely identify observations in the using data
                          r(459);
                          This error is slightly different than the one you would get changing the merge m:1 to merge 1:1 in my previous example. You have trouble with the using data while I reported a possible error with the master data.

                          If I were to modify the using data in my previous example adding a NEW firm:

                          Code:
                          input ///
                          str6 ticker str2 group year
                          PETR4 BH 2000
                          ABEV3 BL 2000
                          USIM5 SM 2000
                          PETR4 BM 2001
                          ABEV3 BH 2001
                          USIM5 SM 2001
                          NEW BH 2000
                          NEW BL 2001
                          end
                          and then use merge m:1, then I would get the same error as you. It finally hit me with Joe's clarification in post #11. That speaks of the importance of trying our best when posting example databases. We always want them to be as representative as possible.

                          Originally posted by Clarice Martins View Post
                          These exchanges are very valuable to my learning with Stata.
                          Indeed they are, for many of us. My advice is you keep experimenting with the different versions of merge to get a better understanding.
                          You should:

                          1. Read the FAQ carefully.

                          2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

                          3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

                          4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

                          Comment


                          • #14
                            My apologies for the sample not being as representative, I am really, really a beginner and certain things I still don't have "an eye" for it. ( I really thought my sample was so good!)

                            But, thank you for the advice and... Yes, learned a few nuances that did not catch my eye before. I will indeed keep experimenting away.

                            Comment

                            Working...
                            X