Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merging panel data

    dear,

    I have 2 panel datasets which I want to merge. One dataset contains values from 2009 to 2015 and the other dataset contains values from 2016 to 2018 from variables that were already in the first dataset, in addition to new variables from 2010 to 2018. I am struggling to merge both, since I think I cannot simply append, because there are variables in my second dataset which are not in the first dataset. I was wondering of the merge 1:1 was a correct way to do this? I also had to force because turnover was string in the first, and double in the second. Does this incur problems?

    kind regards,
    Timea De Wispelaere

  • #2
    Welcome to Statalist.

    It would be helpful if you were to copy from Stata's Results window the merge command you ran and the messages the command printed, and then paste that into a reply on this topic. Without knowing what command you ran and what Stata told you, it is difficult for readers to apply their experience to tell you if you incurred problems or not.

    To present these results readably, please paste them into a code block in the Forum editor, as explained in the Statalist FAQ linked to at the top of the page. For example, if this is how a post looks when it is being edited

    [CODE]
    // sample code
    sysuse auto, clear
    describe
    [/CODE]

    then will be presented in the form as the following:
    Code:
    // sample code
    sysuse auto, clear
    describe
    which will make a lot of difference when results are supposed to line up in columns.

    Comment


    • #3
      thank you for your feedback!

      Comment


      • #4
        So the code and output is this. I used a do file to merge 3 files together.
        Code:
        merge 1:1 numeric_BVD year using "C:\Users\timea\OneDrive\Documents\master\stata\deel2goed.dta", force
        (note: variable turnover was str11 in the using data, but will be double now)
        
        Result # of obs.
        -----------------------------------------
        not matched 915
        from master 254 (_merge==1)
        from using 661 (_merge==2)
        
        matched 1,292 (_merge==3)
        -----------------------------------------
        
        . drop _merge
        
        .
        . *merge2
        . merge 1:1 numeric_BVD year using "C:\Users\timea\OneDrive\Documents\master\stata\deel3goed.dta", force
        
        Result # of obs.
        -----------------------------------------
        not matched 254
        from master 254 (_merge==1)
        from using 0 (_merge==2)
        
        matched 1,953 (_merge==3)
        -----------------------------------------
        
        . drop _merge
        Last edited by Timea De Wispelaere; 07 Mar 2020, 09:20.

        Comment


        • #5
          In post #1 you talked about two datasets being merged, in post #4 we see three datasets being merged, so I cannot be sure I understand how the description of two datasets from post #1 applies to the three in post #4, nor how the third dataset would be described. But that won't stop me from seeing what I can figure out.

          Your dataset that is in memory I will call your master dataset. Your two using datasets are deel2goed and deel3goed.

          The master dataset has 1546 observations, while deel2goed and deel3goed each have 1953 observations.

          You are merging by numeric_BVD and year.

          In your first merge, there were 1292 observations in both the master dataset and in deel2goed that had matching values of numeric_BVD and year. For those observations, any variable that was in both datasets (for example, turnover) will retain the value it had in the master dataset. There were 254 observations in the master dataset that were not matched in deel2goed. In the merged datset, these 254 observations will have missing values for any variables that are only in the deel2goed. Similarly, there were 661 observations in deel2goed that were not in the master dataset. In the merged dataset, these 661 observations will have missing values for any variables that are only in the master dataset.

          Now your master dataset has 1292+254+661 = 2207 observations and deel3goed has 1953 observations. For this second merge, ever observation in deel3goed matches an observation in the master dataset. For these observations, any variable that was in both dataset will retain the value it had in the master dataset. Again, 254 observations in the master dataset were not matched in the deel3goed dataset. In the merged dataset, these 254 observations will having missing values for any variables that are only in deel3goed.

          At the end, the master dataset has 2207 observations.

          The merge 1:1 seems likely to have given you the results you expected. As long as you didn't expect the variables in deel2goed that were already in the master dataset to change the values from the master dataset, or the values in deel3goed that were already in deel2goed or the original master dataset to change those values.

          But if for example your original master dataset had missing values that were not missing in deel2goed or deel3goed, perhaps you wanted the
          update option on your merges. Or if the deel2goed and deel3goed had new or corrected values, perhaps you wanted the update replace options on your merges.

          The thing I am worried about is the variable turnover that was string in deel2goed. Letting merge force the string variable to become numeric risks overlooking real problems in your data. Look at the following example, where x_str is a string variable that I convert to a numeric variable using the real() function.
          Code:
          . generate float x_num = real(x_str)
          (2 missing values generated)
          
          . 
          . list, clean noobs
          
              x_str   x_num  
               n.a.       .  
                  1       1  
                 12      12  
                123     123  
               1234    1234  
              1,234       .
          Certainly the "n.a." (usually meanig "not available" or "not applicable") does not have a numeric value, so assigning it a missing value is appropriate. But "1,234" is not missing - it just has a comma that Stata does not expect. Having it become missing is a real problem.

          You should go back to your deel2goed dataset and find and ifx any values of turnover that should be numeric but cannot be converted successfully to numeric values.
          Code:
          tab turnover if missing(real(turnover))
          will show you the values that cannot be successfully converted. Then you should convert turnover to numeric and save a new copy of the dataset. And then you should remove the force option from both merges. If Stata tells you something is not as it should be, the better response is to correct the problem, not force Stata to deal with it blindly.


          Comment


          • #6
            thank you very much! this is a very useful response!

            Comment


            • #7
              The only thing I do not understand is how I should fix the values of turnover that should be numeric but cannot be converted successfully to numeric values. Should I replace a comma with a dot so the 1,234 becomes 1.234?

              Comment


              • #8
                How you fix the values of the turnover variable depends on what they are. You will likely want to use a combination of replace - to correct codes like "n.a." - and then destring - to convert the strings to numeric values ignoring characters that should be ignored. Here's an extension to my earlier example.
                Code:
                . list, clean noobs
                
                    x_str  
                     n.a.  
                        1  
                       12  
                      123  
                     1234  
                    1,234  
                
                . replace x_str = "" if x_str=="n.a."
                (1 real change made)
                
                . destring x_str, generate(x_num)
                x_str: contains nonnumeric characters; no generate
                
                . destring x_str, generate(x_num) ignore(",")
                x_str: character , removed; x_num generated as int
                (1 missing value generated)
                
                . list, clean noobs
                
                    x_str   x_num  
                                .  
                        1       1  
                       12      12  
                      123     123  
                     1234    1234  
                    1,234    1234
                You will want to read the output of help destring to learn how to use it effectively.

                In particular, you suggested changing "1,234" to 1.234" as a possible solution. In the notation I'm familiar with in the US, the first string represents the whole number "one thousand two hundred thirty four" while the second string represents the fraction "one and two hundred thirty four thousandths". I understand that in other countries, the comma is used where we use the decimal point, so "1,234" would represent the fraction "one and two hundred thirty four thousandths". If that is the correct interpretation for your data, the dpcomma option on destring causes it to take that interpretation of the comma.

                Comment

                Working...
                X