Convert Variables to another Dataset

Jorge Martin

Join Date: Apr 2024

Posts: 19
#1

Convert Variables to another Dataset

29 May 2024, 20:58

Hi everyone,

I need help with transferring specific variables to another dataset. The variables I need to move are "teducation," "teductuition," "teducpreprimary," and "geducpreprimary." I tried using the merge code, but it says that the variable does not uniquely identify observations in the master data. Is there another way to extract these variables and add them to another dataset? Here is the variable table from which I need to get the variables.

Thank you!
Attached Files
Tags: None
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1400
#2

30 May 2024, 01:24

I'm afraid this information is rather incomplete. If you are attempting a merge, please tell us a bit more about the structure of the two datasets -- does for instance, an observation (i.e. a row) in this dataset map one-to-one with an observation in the other? Does one observation map to multiple observations in the other? What variables can serve as identifiers to match the observation(s) in one that match with observation(s) in the other?

It might also be helpful to provide a short data extract (a few rows of the identifier variables and maybe a couple others) of each dataset. You can use the dataex command to do this.
Comment
Jorge Martin

Join Date: Apr 2024

Posts: 19
#3

30 May 2024, 03:29

The two datasets map one-to-one with observations in each other. The variable that can be used as an identifier is sequence_no. Below is a short data extract of both datasets. The only variables I need to extract from one dataset are teductuition, teducpreprimary, teducprimary, teducsecondary, teducpostsec, and teductertiary. thank you

clear
input long(sequence_no teductuition teducpreprimary teducprimary teducsecondary teducpostsec teductertiary)
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 10000 0 0 0 0 10000
5 0 0 0 0 0 0
6 0 0 0 0 0 0
7 0 0 0 0 0 0
8 0 0 0 0 0 0
9 0 0 0 0 0 0
10 0 0 0 0 0 0
11 0 0 0 0 0 0
12 0 0 0 0 0 0
13 0 0 0 0 0 0
14 0 0 0 0 0 0
15 78000 0 0 0 0 78000
16 0 0 0 0 0 0
17 12000 0 12000 0 0 0
18 0 0 0 0 0 0
19 0 0 0 0 0 0
20 0 0 0 0 0 0
21 9980 0 0 0 0 9980
22 0 0 0 0 0 0
23 0 0 0 0 0 0
24 0 0 0 0 0 0
25 0 0 0 0 0 0
26 0 0 0 0 0 0
27 0 0 0 0 0 0
28 0 0 0 0 0 0
29 0 0 0 0 0 0
30 0 0 0 0 0 0
31 0 0 0 0 0 0
32 0 0 0 0 0 0
33 1550 0 0 1550 0 0
34 0 0 0 0 0 0
35 0 0 0 0 0 0
36 2300 1000 0 1300 0 0
37 0 0 0 0 0 0
38 560 0 0 560 0 0
39 0 0 0 0 0 0
40 0 0 0 0 0 0
41 0 0 0 0 0 0
42 0 0 0 0 0 0
43 0 0 0 0 0 0
44 0 0 0 0 0 0
45 0 0 0 0 0 0
46 0 0 0 0 0 0
47 1500 0 0 0 0 1500
48 0 0 0 0 0 0
49 0 0 0 0 0 0
50 600 600 0 0 0 0
51 0 0 0 0 0 0
52 0 0 0 0 0 0
53 1500 0 0 0 0 1500
54 0 0 0 0 0 0
55 0 0 0 0 0 0
56 0 0 0 0 0 0
57 700 0 0 700 0 0
58 0 0 0 0 0 0
59 0 0 0 0 0 0
60 0 0 0 0 0 0
61 0 0 0 0 0 0
62 0 0 0 0 0 0
63 0 0 0 0 0 0
64 1800 0 0 1800 0 0
65 0 0 0 0 0 0
66 11750 0 0 1750 0 10000
67 0 0 0 0 0 0
68 0 0 0 0 0 0
69 0 0 0 0 0 0
70 6100 0 0 6100 0 0
71 0 0 0 0 0 0
72 0 0 0 0 0 0
73 0 0 0 0 0 0
74 0 0 0 0 0 0
75 0 0 0 0 0 0
76 0 0 0 0 0 0
77 0 0 0 0 0 0
78 0 0 0 0 0 0
79 30795 0 17310 13485 0 0
80 0 0 0 0 0 0
81 0 0 0 0 0 0
82 0 0 0 0 0 0
83 0 0 0 0 0 0
84 2100 0 0 2100 0 0
85 0 0 0 0 0 0
86 0 0 0 0 0 0
87 0 0 0 0 0 0
88 0 0 0 0 0 0
89 9000 0 0 9000 0 0
90 0 0 0 0 0 0
91 0 0 0 0 0 0
92 0 0 0 0 0 0
93 0 0 0 0 0 0
94 1200 0 0 1200 0 0
95 0 0 0 0 0 0
96 0 0 0 0 0 0
97 0 0 0 0 0 0
98 0 0 0 0 0 0
99 0 0 0 0 0 0
100 45600 12600 0 7500 0 25500

here is the other dataset
clear
input long(sequence_no lc07_grade) byte lc08_cursch
1 0 .
2 10020 .
2 10020 .
2 10010 1
2 10010 1
2 10010 1
2 1000 1
3 10020 .
3 10010 .
3 24020 .
4 24020 .
4 51013 2
4 . .
4 60111 2
5 24020 .
5 60010 .
5 60010 2
6 10010 .
6 60111 .
6 1000 1
6 60613 .
7 60111 .
7 60413 .
7 . .
8 24020 .
8 10020 .
8 60111 .
8 60111 .
8 24020 .
8 24010 .
8 10010 1
9 10020 .
10 24010 .
10 24020 .
11 24010 .
11 60010 .
11 2000 1
12 10010 .
12 24010 .
12 24010 1
12 24010 1
12 10010 1
12 10010 1
13 24010 .
13 24010 .
13 10010 1
13 10010 1
14 24020 .
15 24010 .
15 24020 .
15 60010 1
15 60010 1
15 34012 1
15 60111 2
16 24020 .
16 60413 .
16 10010 1
16 10020 .
16 24020 .
16 24020 2
16 24020 .
17 24020 .
17 10020 .
17 60010 .
17 60010 .
17 10010 1
18 60010 .
18 10010 1
18 . .
18 60010 .
19 24020 .
19 60111 .
19 24020 1
19 10010 1
20 60010 .
20 24020 .
20 24010 1
21 60010 .
21 24020 .
21 60010 1
21 24020 2
22 60411 .
22 24020 .
22 24010 1
23 24020 .
24 24020 .
24 24020 .
24 40716 2
24 24020 2
24 60413 2
24 . .
25 0 .
25 60010 .
26 60413 .
26 60010 .
27 10020 .
27 24010 .
27 40716 .
28 24020 .
28 10020 1
end
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1400
#4

30 May 2024, 03:39

If I'm not mistaken, your observations actually do not map one-to-one, but multiple observations in your second dataset map to one in the first. So with your second dataset in memory, you need something like:

Code:

merge m:1 sequence_no using `"/path/to/first/dataset/firstfile.dta"', keepusing(teductuition teducpreprimary teducprimary teducsecondary teducpostsec teductertiary)

Also, thanks for the data extract. One further improvement would be to enclose it in delimiters specifying it is code (press the # sign on the edit toolbar above the text box), which would format it the way my code above is.

Last edited by Hemanshu Kumar; 30 May 2024, 03:44.
Comment
Jorge Martin

Join Date: Apr 2024

Posts: 19
#5

30 May 2024, 03:45

I already tried that, but after I did, the number of observations increased.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1400

30 May 2024, 03:55

Yes, that is entirely possible, if you have values of sequence_no in your first dataset that do not appear in the second. For instance, when I run the merge on the data extracts you provided, the observations increase from 100 to 172. The output of the merge command gives us the clue:

Code:

    Result                      Number of obs
    -----------------------------------------
    Not matched                            72
        from master                         0  (_merge==1)
        from using                         72  (_merge==2)

    Matched                               100  (_merge==3)
    -----------------------------------------

There are 72 observations in the first ("using") dataset that have sequence_no that do not match with the second ("master") -- the numbers 29 through 100. By default, Stata keeps both the matched and unmatched observations in memory after merging. You can either drop the unmatched observations using the _merge variable that has been created:

Code:

. tab _merge

   Matching result from |
                  merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
         Using only (2) |         72       41.86       41.86
            Matched (3) |        100       58.14      100.00
------------------------+-----------------------------------
                  Total |        172      100.00

. drop if _merge == 2

or alternatively, you can tell Stata what to do in the merge command itself:

Code:

merge m:1 sequence_no using `"/path/to/first/dataset/firstfile.dta"', ///
    keepusing(teductuition teducpreprimary teducprimary teducsecondary teducpostsec teductertiary) ///
    keep(match)

Last edited by Hemanshu Kumar; 30 May 2024, 04:00.

Comment

Jorge Martin

Join Date: Apr 2024

Posts: 19
#7

31 May 2024, 02:24

I tried doing it, here is the results
Result Number of obs
-----------------------------------------
Not matched 0
Matched 675,473 (_merge==3)

the observations still increased, and it states that it is all matched, my observations should only be 147,717

Last edited by Jorge Martin; 31 May 2024, 02:40.
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1400
#8

31 May 2024, 02:40

Which data did you have in memory, and which were you trying to merge with? If the second dataset with lc07 lc08 etc is in memory (i.e. the "master") and the first one with education levels is what you are trying to merge with (the "using") dataset, the you need m:1 in your merge command. If the reverse, then you need 1:m instead. If you get the error despite this, then perhaps your data with education levels does not have sequence_no as a unique identifier for each observation. To confirm this is the case, can you run

Code:

isid sequence_no

with that dataset in memory and tell us the result?
Comment
Jorge Martin

Join Date: Apr 2024

Posts: 19
#9

31 May 2024, 02:55

it says, variable sequence_no does not uniquely identify the observations
Comment

Announcement