Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two data sets based on ranges and assign a number

    I would like to merge two data sets where the first dataset contains probabilities. The second data set contains a column indicating the number of weeks and two more columns showing a minimum and maximum range. The probability in dataset 1 should be assigned a "weeks" value from dataset 2 based on the range in which it falls. There can be more than one "weeks" value assigned to each row. For example, in row 3, there would be a "weeks" value assigned to each 0.72, 0.79, 0.84, 0.43, and 0.33. Additionally, "weeks" values are only assigned to numbers following a "." or numbers preceding "." when there are no values following ".". For instance, in row 1, a "weeks" value would only be assigned to 0.12 and in row 4 a "weeks" value would only be assigned to 0.71.

    I would appreciate any assistance with this.

    Thanks,
    A

    Dataset 1
    V1 V2 V3 V4 V5 V6
    0.61 0.29 . . . 0.12
    0.52 0.63 . . . 0.11
    0.72 0.79 0.84 0.43 0.33 .
    0.57 0.37 0.22 . 0.71 .
    Dataset 2
    Weeks Min Max
    1 0 0.3
    2 0.3 0.6
    3 0.6 1

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(v1 v2 v3 v4 v5 v6)
    .61 .29   .   .   . .12
    .52 .63   .   .   . .11
    .72 .79 .84 .43 .33   .
    .57 .37 .22   . .71   .
    end
    tempfile dataset1
    save `dataset1'
    
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte weeks float(min max)
    1  0 .3
    2 .3 .6
    3 .6  1
    end
    tempfile dataset2
    save `dataset2'
    
    use `dataset1', clear
    gen long obs_no = _n
    reshape long v, i(obs_no) j(order)
    tempfile to_pair
    save `to_pair'
    
    use `dataset2', clear
    rangejoin v min max using `to_pair'
    by obs_no order (min), sort: keep if _n == 1
    merge 1:1 obs_no order using `to_pair', nogenerate
    
    by obs_no (order), sort: gen n_miss = sum(missing(v[_n-1]))-1
    by obs_no (order): replace weeks = . if n_miss == 0 & n_miss[_N] > 0
    
    drop min max n_miss
    
    reshape wide v weeks, i(obs_no) j(order)
    -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

    Notes:

    1. The ranges for the different values of weeks in dataset 2 overlap at the endpoints. As it happens, none of the values of v in the example dataset 1 fall exactly on any endpoint. But if that were to happen in the real data, the code above will favor the lower number of weeks.

    2. Dataset 1 is in wide layout. I do not know how you will be using this data, but most things in Stata are more easily done, or only possible, with long layout of series of variables like this. Indeed, you can see that the first thing the code above does is reshape the data to long layout. The final command in the code shown restores the original wide layout to the data. But unless you have a clear and convincing reason to do that, I recommend you skip the final -reshape wide- command and leave the data in long layout. It is likely that whatever you want to do next with the data will also be best done with the data long, so you might as well leave it that way.

    In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Comment


    • #3
      Thank you, Clyde!

      Comment


      • #4
        Clyde Schechter I have another question regarding this example. Is there a way to complete the above code by group. For instance, dataset 2 has a column named "group" and each group contains three weeks. Therefore, I could combine the two datasets with -rangejoin- per individual group (each group would have its own combined dataset based on datasets 1 and 2).

        Thanks,
        A

        Comment


        • #5
          I don't understand what you're asking. I can visualize a dataset2 with groups each having three weeks and corresponding min and max values. But I don't get how you want this to relate to dataset 1. Showing an example might be more helpful than explaining in words.

          Comment


          • #6

            Clyde Schechter Thanks for your response. I hope to clarify a bit more. In dataset 2, I inserted a group_id variable. There are three weeks per group_id. Following the same code you provided in an earlier reply, I would like to assign a "weeks" value to each "v" observation in dataset 1 based on the range in which it falls between "min" and "max" in dataset 2. I would like to do this for each group. In the end, I should have three different merged tables (one for each group). I would appreciate any assistance with this. Anoush

            [
            clear]
            [input float(v1 v2 v3 v4 v5 v6)]
            [.61 .29 . . . .12 ]
            [.52 .63 . . . .11 ]
            [.72 .79 .84 .43 .33 .]
            [ .57 .37 .22 . .71 . ]
            [end]
            [tempfile dataset1]
            [save `dataset1']

            [clear ]
            [input byte weeks group_id float(min max) ]
            [1 1 0 .1 ]
            [2 1 .2 .3]
            [ 3 1 .3 .4]
            [ 1 2 .4 .5]
            [ 2 2 .5 .6]
            [ 3 2 .6 .7]
            [ 1 3 .7 .8 ]
            [2 3 .8 .9[
            [ 3 3 .9 1]
            [ end ]
            [tempfile dataset2]
            [ save `dataset2']
            Last edited by Anoush Khachatryan; 26 Feb 2022, 12:51.

            Comment


            • #7
              For some reason, each line of your -dataex- output is wrapped in brackets, sometimes mismatched brackets. Please always post -dataex- output exactly the way it comes out from -dataex-. Do not editin any way. It wastes people's time to have to clean up mangled -dataex- output to make it usable.

              Code:
              clear
              input float(v1 v2 v3 v4 v5 v6)
              .61 .29 . . . .12
              .52 .63 . . . .11
              .72 .79 .84 .43 .33 .
               .57 .37 .22 . .71 .
              end
              tempfile dataset1
              save `dataset1'
              
              clear
              input byte weeks group_id float(min max)
              1 1 0 .1
              2 1 .2 .3
               3 1 .3 .4
               1 2 .4 .5
               2 2 .5 .6
               3 2 .6 .7
               1 3 .7 .8
              2 3 .8 .9
               3 3 .9 1
               end
              tempfile dataset2
               save `dataset2'
               
              forvalues i = 1/3{
                  use `dataset1', clear
                  gen long obs_no = _n
                  reshape long v, i(obs_no) j(order)
                  tempfile to_pair
                  save `to_pair'
              
                  use `dataset2' if group_id == `i', clear
                  list, noobs clean
                  rangejoin v min max using `to_pair'
                  drop if missing(obs_no)
                  by obs_no order (min), sort: keep if _n == 1
                  merge 1:1 obs_no order using `to_pair', nogenerate
                  replace group_id = `i'
              
                  by obs_no (order), sort: gen n_miss = sum(missing(v[_n-1]))-1
                  by obs_no (order): replace weeks = . if n_miss == 0 & n_miss[_N] > 0
              
                  drop min max n_miss
              
                  reshape wide v weeks, i(obs_no) j(order)
                  tempfile results`i'
                  save `results`i''
              }
              
              clear
              append using `results1' `results2' `results3'
              I do not know what you mean by "three different tables." The above code does the matching separately for each group_id and stores the results in separate tempfiles, `results1', `results2', and `results3'. You can then do whatever you like with those tempfiles. The final two lines of the code appends them all together in a single data set in memory--but that may not be what you wanted.

              Comment


              • #8
                Clyde, thank you so much! This is exactly what I was looking for!

                Many thanks,
                A

                Comment

                Working...
                X