Merging two data sets based on ranges and assign a number

Anoush Khachatryan

Join Date: Sep 2021

Posts: 56
#1

Merging two data sets based on ranges and assign a number

24 Feb 2022, 19:40

I would like to merge two data sets where the first dataset contains probabilities. The second data set contains a column indicating the number of weeks and two more columns showing a minimum and maximum range. The probability in dataset 1 should be assigned a "weeks" value from dataset 2 based on the range in which it falls. There can be more than one "weeks" value assigned to each row. For example, in row 3, there would be a "weeks" value assigned to each 0.72, 0.79, 0.84, 0.43, and 0.33. Additionally, "weeks" values are only assigned to numbers following a "." or numbers preceding "." when there are no values following ".". For instance, in row 1, a "weeks" value would only be assigned to 0.12 and in row 4 a "weeks" value would only be assigned to 0.71.

I would appreciate any assistance with this.

Thanks,
A

Dataset 1
V1 V2 V3 V4 V5 V6

0.61 0.29 . . . 0.12

0.52 0.63 . . . 0.11

0.72 0.79 0.84 0.43 0.33 .

0.57 0.37 0.22 . 0.71 .

Dataset 2
Weeks Min Max

1 0 0.3

2 0.3 0.6

3 0.6 1
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

24 Feb 2022, 20:46

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(v1 v2 v3 v4 v5 v6) .61 .29 . . . .12 .52 .63 . . . .11 .72 .79 .84 .43 .33 . .57 .37 .22 . .71 . end tempfile dataset1 save `dataset1' * Example generated by -dataex-. For more info, type help dataex clear input byte weeks float(min max) 1 0 .3 2 .3 .6 3 .6 1 end tempfile dataset2 save `dataset2' use `dataset1', clear gen long obs_no = _n reshape long v, i(obs_no) j(order) tempfile to_pair save `to_pair' use `dataset2', clear rangejoin v min max using `to_pair' by obs_no order (min), sort: keep if _n == 1 merge 1:1 obs_no order using `to_pair', nogenerate by obs_no (order), sort: gen n_miss = sum(missing(v[_n-1]))-1 by obs_no (order): replace weeks = . if n_miss == 0 & n_miss[_N] > 0 drop min max n_miss reshape wide v weeks, i(obs_no) j(order)

-rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Notes:

1. The ranges for the different values of weeks in dataset 2 overlap at the endpoints. As it happens, none of the values of v in the example dataset 1 fall exactly on any endpoint. But if that were to happen in the real data, the code above will favor the lower number of weeks.

2. Dataset 1 is in wide layout. I do not know how you will be using this data, but most things in Stata are more easily done, or only possible, with long layout of series of variables like this. Indeed, you can see that the first thing the code above does is reshape the data to long layout. The final command in the code shown restores the original wide layout to the data. But unless you have a clear and convincing reason to do that, I recommend you skip the final -reshape wide- command and leave the data in long layout. It is likely that whatever you want to do next with the data will also be best done with the data long, so you might as well leave it that way.

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
1 like
Comment
Anoush Khachatryan

Join Date: Sep 2021

Posts: 56
#3

24 Feb 2022, 21:19

Thank you, Clyde!
Comment
Anoush Khachatryan

Join Date: Sep 2021

Posts: 56
#4

25 Feb 2022, 20:08

Clyde Schechter I have another question regarding this example. Is there a way to complete the above code by group. For instance, dataset 2 has a column named "group" and each group contains three weeks. Therefore, I could combine the two datasets with -rangejoin- per individual group (each group would have its own combined dataset based on datasets 1 and 2).

Thanks,
A
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#5

25 Feb 2022, 22:19

I don't understand what you're asking. I can visualize a dataset2 with groups each having three weeks and corresponding min and max values. But I don't get how you want this to relate to dataset 1. Showing an example might be more helpful than explaining in words.
Comment
Anoush Khachatryan

Join Date: Sep 2021

Posts: 56
#6

26 Feb 2022, 12:48

Clyde Schechter Thanks for your response. I hope to clarify a bit more. In dataset 2, I inserted a group_id variable. There are three weeks per group_id. Following the same code you provided in an earlier reply, I would like to assign a "weeks" value to each "v" observation in dataset 1 based on the range in which it falls between "min" and "max" in dataset 2. I would like to do this for each group. In the end, I should have three different merged tables (one for each group). I would appreciate any assistance with this. Anoush

[clear]
[input float(v1 v2 v3 v4 v5 v6)]
[.61 .29 . . . .12 ]
[.52 .63 . . . .11 ]
[.72 .79 .84 .43 .33 .]
[ .57 .37 .22 . .71 . ]
[end]
[tempfile dataset1]
[save `dataset1']

[clear ]
[input byte weeks group_id float(min max) ]
[1 1 0 .1 ]
[2 1 .2 .3]
[ 3 1 .3 .4]
[ 1 2 .4 .5]
[ 2 2 .5 .6]
[ 3 2 .6 .7]
[ 1 3 .7 .8 ]
[2 3 .8 .9[
[ 3 3 .9 1]
[ end ]
[tempfile dataset2]
[ save `dataset2']

Last edited by Anoush Khachatryan; 26 Feb 2022, 12:51.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29796

26 Feb 2022, 13:28

For some reason, each line of your -dataex- output is wrapped in brackets, sometimes mismatched brackets. Please always post -dataex- output exactly the way it comes out from -dataex-. Do not editin any way. It wastes people's time to have to clean up mangled -dataex- output to make it usable.

Code:

clear
input float(v1 v2 v3 v4 v5 v6)
.61 .29 . . . .12
.52 .63 . . . .11
.72 .79 .84 .43 .33 .
 .57 .37 .22 . .71 .
end
tempfile dataset1
save `dataset1'

clear
input byte weeks group_id float(min max)
1 1 0 .1
2 1 .2 .3
 3 1 .3 .4
 1 2 .4 .5
 2 2 .5 .6
 3 2 .6 .7
 1 3 .7 .8
2 3 .8 .9
 3 3 .9 1
 end
tempfile dataset2
 save `dataset2'
 
forvalues i = 1/3{
    use `dataset1', clear
    gen long obs_no = _n
    reshape long v, i(obs_no) j(order)
    tempfile to_pair
    save `to_pair'

    use `dataset2' if group_id == `i', clear
    list, noobs clean
    rangejoin v min max using `to_pair'
    drop if missing(obs_no)
    by obs_no order (min), sort: keep if _n == 1
    merge 1:1 obs_no order using `to_pair', nogenerate
    replace group_id = `i'

    by obs_no (order), sort: gen n_miss = sum(missing(v[_n-1]))-1
    by obs_no (order): replace weeks = . if n_miss == 0 & n_miss[_N] > 0

    drop min max n_miss

    reshape wide v weeks, i(obs_no) j(order)
    tempfile results`i'
    save `results`i''
}

clear
append using `results1' `results2' `results3'

I do not know what you mean by "three different tables." The above code does the matching separately for each group_id and stores the results in separate tempfiles, `results1', `results2', and `results3'. You can then do whatever you like with those tempfiles. The final two lines of the code appends them all together in a single data set in memory--but that may not be what you wanted.

Comment

Anoush Khachatryan

Join Date: Sep 2021

Posts: 56
#8

26 Feb 2022, 13:37

Clyde, thank you so much! This is exactly what I was looking for!

Many thanks,
A
Comment

V1	V2	V3	V4	V5	V6
0.61	0.29	.	.	.	0.12
0.52	0.63	.	.	.	0.11
0.72	0.79	0.84	0.43	0.33	.
0.57	0.37	0.22	.	0.71	.

Weeks	Min	Max
1	0	0.3
2	0.3	0.6
3	0.6	1

Announcement