Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Convert Variables to another Dataset

    Hi everyone,

    I need help with transferring specific variables to another dataset. The variables I need to move are "teducation," "teductuition," "teducpreprimary," and "geducpreprimary." I tried using the merge code, but it says that the variable does not uniquely identify observations in the master data. Is there another way to extract these variables and add them to another dataset? Here is the variable table from which I need to get the variables.

    Thank you!
    Attached Files

  • #2
    I'm afraid this information is rather incomplete. If you are attempting a merge, please tell us a bit more about the structure of the two datasets -- does for instance, an observation (i.e. a row) in this dataset map one-to-one with an observation in the other? Does one observation map to multiple observations in the other? What variables can serve as identifiers to match the observation(s) in one that match with observation(s) in the other?

    It might also be helpful to provide a short data extract (a few rows of the identifier variables and maybe a couple others) of each dataset. You can use the dataex command to do this.

    Comment


    • #3
      The two datasets map one-to-one with observations in each other. The variable that can be used as an identifier is sequence_no. Below is a short data extract of both datasets. The only variables I need to extract from one dataset are teductuition, teducpreprimary, teducprimary, teducsecondary, teducpostsec, and teductertiary. thank you

      clear
      input long(sequence_no teductuition teducpreprimary teducprimary teducsecondary teducpostsec teductertiary)
      1 0 0 0 0 0 0
      2 0 0 0 0 0 0
      3 0 0 0 0 0 0
      4 10000 0 0 0 0 10000
      5 0 0 0 0 0 0
      6 0 0 0 0 0 0
      7 0 0 0 0 0 0
      8 0 0 0 0 0 0
      9 0 0 0 0 0 0
      10 0 0 0 0 0 0
      11 0 0 0 0 0 0
      12 0 0 0 0 0 0
      13 0 0 0 0 0 0
      14 0 0 0 0 0 0
      15 78000 0 0 0 0 78000
      16 0 0 0 0 0 0
      17 12000 0 12000 0 0 0
      18 0 0 0 0 0 0
      19 0 0 0 0 0 0
      20 0 0 0 0 0 0
      21 9980 0 0 0 0 9980
      22 0 0 0 0 0 0
      23 0 0 0 0 0 0
      24 0 0 0 0 0 0
      25 0 0 0 0 0 0
      26 0 0 0 0 0 0
      27 0 0 0 0 0 0
      28 0 0 0 0 0 0
      29 0 0 0 0 0 0
      30 0 0 0 0 0 0
      31 0 0 0 0 0 0
      32 0 0 0 0 0 0
      33 1550 0 0 1550 0 0
      34 0 0 0 0 0 0
      35 0 0 0 0 0 0
      36 2300 1000 0 1300 0 0
      37 0 0 0 0 0 0
      38 560 0 0 560 0 0
      39 0 0 0 0 0 0
      40 0 0 0 0 0 0
      41 0 0 0 0 0 0
      42 0 0 0 0 0 0
      43 0 0 0 0 0 0
      44 0 0 0 0 0 0
      45 0 0 0 0 0 0
      46 0 0 0 0 0 0
      47 1500 0 0 0 0 1500
      48 0 0 0 0 0 0
      49 0 0 0 0 0 0
      50 600 600 0 0 0 0
      51 0 0 0 0 0 0
      52 0 0 0 0 0 0
      53 1500 0 0 0 0 1500
      54 0 0 0 0 0 0
      55 0 0 0 0 0 0
      56 0 0 0 0 0 0
      57 700 0 0 700 0 0
      58 0 0 0 0 0 0
      59 0 0 0 0 0 0
      60 0 0 0 0 0 0
      61 0 0 0 0 0 0
      62 0 0 0 0 0 0
      63 0 0 0 0 0 0
      64 1800 0 0 1800 0 0
      65 0 0 0 0 0 0
      66 11750 0 0 1750 0 10000
      67 0 0 0 0 0 0
      68 0 0 0 0 0 0
      69 0 0 0 0 0 0
      70 6100 0 0 6100 0 0
      71 0 0 0 0 0 0
      72 0 0 0 0 0 0
      73 0 0 0 0 0 0
      74 0 0 0 0 0 0
      75 0 0 0 0 0 0
      76 0 0 0 0 0 0
      77 0 0 0 0 0 0
      78 0 0 0 0 0 0
      79 30795 0 17310 13485 0 0
      80 0 0 0 0 0 0
      81 0 0 0 0 0 0
      82 0 0 0 0 0 0
      83 0 0 0 0 0 0
      84 2100 0 0 2100 0 0
      85 0 0 0 0 0 0
      86 0 0 0 0 0 0
      87 0 0 0 0 0 0
      88 0 0 0 0 0 0
      89 9000 0 0 9000 0 0
      90 0 0 0 0 0 0
      91 0 0 0 0 0 0
      92 0 0 0 0 0 0
      93 0 0 0 0 0 0
      94 1200 0 0 1200 0 0
      95 0 0 0 0 0 0
      96 0 0 0 0 0 0
      97 0 0 0 0 0 0
      98 0 0 0 0 0 0
      99 0 0 0 0 0 0
      100 45600 12600 0 7500 0 25500

      here is the other dataset
      clear
      input long(sequence_no lc07_grade) byte lc08_cursch
      1 0 .
      2 10020 .
      2 10020 .
      2 10010 1
      2 10010 1
      2 10010 1
      2 1000 1
      3 10020 .
      3 10010 .
      3 24020 .
      4 24020 .
      4 51013 2
      4 . .
      4 60111 2
      5 24020 .
      5 60010 .
      5 60010 2
      6 10010 .
      6 60111 .
      6 1000 1
      6 60613 .
      7 60111 .
      7 60413 .
      7 . .
      8 24020 .
      8 10020 .
      8 60111 .
      8 60111 .
      8 24020 .
      8 24010 .
      8 10010 1
      9 10020 .
      10 24010 .
      10 24020 .
      11 24010 .
      11 60010 .
      11 2000 1
      12 10010 .
      12 24010 .
      12 24010 1
      12 24010 1
      12 10010 1
      12 10010 1
      13 24010 .
      13 24010 .
      13 10010 1
      13 10010 1
      14 24020 .
      15 24010 .
      15 24020 .
      15 60010 1
      15 60010 1
      15 34012 1
      15 60111 2
      16 24020 .
      16 60413 .
      16 10010 1
      16 10020 .
      16 24020 .
      16 24020 2
      16 24020 .
      17 24020 .
      17 10020 .
      17 60010 .
      17 60010 .
      17 10010 1
      18 60010 .
      18 10010 1
      18 . .
      18 60010 .
      19 24020 .
      19 60111 .
      19 24020 1
      19 10010 1
      20 60010 .
      20 24020 .
      20 24010 1
      21 60010 .
      21 24020 .
      21 60010 1
      21 24020 2
      22 60411 .
      22 24020 .
      22 24010 1
      23 24020 .
      24 24020 .
      24 24020 .
      24 40716 2
      24 24020 2
      24 60413 2
      24 . .
      25 0 .
      25 60010 .
      26 60413 .
      26 60010 .
      27 10020 .
      27 24010 .
      27 40716 .
      28 24020 .
      28 10020 1
      end

      Comment


      • #4
        If I'm not mistaken, your observations actually do not map one-to-one, but multiple observations in your second dataset map to one in the first. So with your second dataset in memory, you need something like:

        Code:
        merge m:1 sequence_no using `"/path/to/first/dataset/firstfile.dta"', keepusing(teductuition teducpreprimary teducprimary teducsecondary teducpostsec teductertiary)
        Also, thanks for the data extract. One further improvement would be to enclose it in delimiters specifying it is code (press the # sign on the edit toolbar above the text box), which would format it the way my code above is.
        Last edited by Hemanshu Kumar; 30 May 2024, 04:44.

        Comment


        • #5

          I already tried that, but after I did, the number of observations increased.

          Comment


          • #6
            Yes, that is entirely possible, if you have values of sequence_no in your first dataset that do not appear in the second. For instance, when I run the merge on the data extracts you provided, the observations increase from 100 to 172. The output of the merge command gives us the clue:

            Code:
                Result                      Number of obs
                -----------------------------------------
                Not matched                            72
                    from master                         0  (_merge==1)
                    from using                         72  (_merge==2)
            
                Matched                               100  (_merge==3)
                -----------------------------------------
            There are 72 observations in the first ("using") dataset that have sequence_no that do not match with the second ("master") -- the numbers 29 through 100. By default, Stata keeps both the matched and unmatched observations in memory after merging. You can either drop the unmatched observations using the _merge variable that has been created:

            Code:
            . tab _merge
            
               Matching result from |
                              merge |      Freq.     Percent        Cum.
            ------------------------+-----------------------------------
                     Using only (2) |         72       41.86       41.86
                        Matched (3) |        100       58.14      100.00
            ------------------------+-----------------------------------
                              Total |        172      100.00
            
            . drop if _merge == 2
            or alternatively, you can tell Stata what to do in the merge command itself:

            Code:
            merge m:1 sequence_no using `"/path/to/first/dataset/firstfile.dta"', ///
                keepusing(teductuition teducpreprimary teducprimary teducsecondary teducpostsec teductertiary) ///
                keep(match)
            Last edited by Hemanshu Kumar; 30 May 2024, 05:00.

            Comment


            • #7
              I tried doing it, here is the results
              Result Number of obs
              -----------------------------------------
              Not matched 0
              Matched 675,473 (_merge==3)

              the observations still increased, and it states that it is all matched, my observations should only be 147,717
              Last edited by Jorge Martin; 31 May 2024, 03:40.

              Comment


              • #8
                Which data did you have in memory, and which were you trying to merge with? If the second dataset with lc07 lc08 etc is in memory (i.e. the "master") and the first one with education levels is what you are trying to merge with (the "using") dataset, the you need m:1 in your merge command. If the reverse, then you need 1:m instead. If you get the error despite this, then perhaps your data with education levels does not have sequence_no as a unique identifier for each observation. To confirm this is the case, can you run
                Code:
                isid sequence_no
                with that dataset in memory and tell us the result?

                Comment


                • #9
                  it says, variable sequence_no does not uniquely identify the observations

                  Comment

                  Working...
                  X