  • Matching with deviation

    Dear Statalisters,


    I'm a little confused about a matching step in my research. In particular, I want to match numeric values with a deviation from the exact matching value. I'm trying to use the rangejoin command but it doesn't work. Do you know if there is any other way?
    Any help or suggestion would be greatly appreciated.

    Thank you in advance.
    Angeliki

    (Stata 16.0)
    Last edited by Angeliki Skoura; 04 Mar 2022, 11:17.

  • #2
    There are countless ways code might "not work". Your question really isn't clear without more detail, or at a minimum it is too difficult to guess at a good answer from what little you have shared. Please help us help you. Show example data. Show your code. Show us what Stata told you. Tell us what precisely is wrong. The Statalist FAQ provides advice on effectively posing your questions, posting data, and sharing Stata output.

    Added in edit: after reading your post and writing the above, I found your most recent previous topic at

    https://www.statalist.org/forums/for...eric-variables

    which seems to describe this problem. Did you give up, and are now asking for another approach? Because there's also very little information in that topic, so it's not clear that rangejoin cannot be made to work.

    The more you help others understand your problem, the more likely others are to be able to help you solve your problem or find an alternative approach that doesn't lead to the same failure.
    Last edited by William Lisowski; 04 Mar 2022, 13:16.


    • #3
      Of course, @William Lisowski you are absolutely right and I regret that I did not position myself correctly.

      As for my problem, the rangejoin command keeps giving the error "op. sys. refuses to provide memory" no matter how much I shrink my sample. I tried to find the cause, but to no avail. More specifically, I would like to match the sales values with the EBT values, and if no exact match exists, the match should allow some deviation, e.g. -1000 to +1000. I am trying to describe it in a better way.

      My code is:
      rangejoin sales -1000 1000 using "C:\Users\skour\OneDrive\Υπολογιστής\skoura research\Diff Databases\dataset 1.dta"


      Thank you very much in advance,
      Angeliki
      Last edited by Angeliki Skoura; 05 Mar 2022, 03:15.


      • #4
        Please run the command
        Code:
        summarize sales, detail
        on the dataset that is in memory and on the using dataset, and present the results of both commands here.


        • #5
          Of course, but I'm using a dataset in memory and the two variables I want to match are in the same database. Please let me know if I need to clarify any further.
          The results are:
          summarize sales, detail
          [Image: Screenshot_1.png, not reproduced here]

          Best wishes,
          Angeliki
          Last edited by Angeliki Skoura; 07 Mar 2022, 11:47.


          • #6
            Apparently you have two variables - sales and EBT. But for the rangejoin command you showed in post #3, the variable "sales" in your dataset in memory will have to match to a variable "sales" in "...\skoura research\Diff Databases\dataset 1.dta". I expected that you had created "dataset 1" from your original dataset by renaming EBT to sales.

            And since in the past you have been comparing "treated" to "control" observations, I expected that your data in memory contained only "treated" observations of "sales", while your "dataset 1" contained only "control" observations of a variable named "sales" but apparently originally named EBT.

            That is why I asked for a separate summary of "sales" for each of the datasets that you are matching using the rangejoin command you showed in post #3.


            • #7
              I find the following confusing: "the two variables I want to match are in the same database". Please start with a -dataex- example (described in the FAQ or via "help dataex") and a discussion of what you mean by this.


              • #8
                First of all, thanks for your feedback. I appreciate your help.

                After all, I have created two datasets, each with different sales and EBT values. If I apply the command rangejoin sales -1000 1000 using "C:\Users\skour\OneDrive\Υπολογιστής\skoura research\Diff Databases\dataset 2.dta", is it going to match the sales of one dataset with those of the other, within the deviation I set?


                • #9
                  For each of the two datasets you want to match using rangejoin, run the following commands and post the results here.
                  Code:
                  summarize sales, detail
                  count if sales==0
                  count if inrange(sales,-1000,1000)


                  • #10
                    Ok, below are the results:

                    Dataset 1:
                    [Image: dataset1.png, not reproduced here]

                    Dataset 2:
                    [Image: dataset2.png, not reproduced here]

                    Thank you for the advice!


                    • #11
                      OK, above you show us statistics about dataset 1 and dataset 2.

                      If you run something like the following, based on your code in post #8
                      Code:
                      use "C:\Users\skour\OneDrive\Υπολογιστής\skoura research\Diff Databases\dataset 1.dta", clear
                      rangejoin sales -1000 1000 using "C:\Users\skour\OneDrive\Υπολογιστής\skoura research\Diff Databases\dataset 2.dta"
                      then for each of the 233,646 observations in dataset 1, it will create one new observation for each observation in dataset 2 for which sales is within +/-1000 of the value of sales on the observation from dataset 1.

                      But what are the implications of this? You have 42,622 observations in dataset 1 for which sales is 0, and 38 observations in dataset 2 for which sales is between -1000 and 1000. So each of these 42,622 observations in dataset 1 will match 38 observations in dataset 2, and the result will be 1,619,636 observations in the output dataset.
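To see this multiplication on a small scale, here is a minimal sketch with made-up data, not the datasets from this thread. It assumes rangestat and rangejoin are installed from SSC; the values and the variables id_master and id_using are hypothetical and exist only for illustration.

```stata
* Assumes "ssc install rangestat" and "ssc install rangejoin" have been run
clear
* Hypothetical using data: three values near zero and one far away
input double sales
-500
0
800
250000
end
generate long id_using = _n
tempfile controls
save `controls'

* Hypothetical data in memory: five zeros and one large value
clear
input double sales
0
0
0
0
0
1000000
end
generate long id_master = _n

* Pair each observation in memory with every using observation within +/-1000
rangejoin sales -1000 1000 using `controls'

* 5 zeros in memory x 3 using values inside [-1000, 1000] = 15 matched pairs
count if !missing(id_using)
```

Run as a do-file so that the temporary file survives until rangejoin is called.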

                      We can think about this in the opposite order. If you run
                      Code:
                      use "C:\Users\skour\OneDrive\Υπολογιστής\skoura research\Diff Databases\dataset 2.dta", clear
                      rangejoin sales -1000 1000 using "C:\Users\skour\OneDrive\Υπολογιστής\skoura research\Diff Databases\dataset 1.dta"
                      then each of the 29 observations in dataset 2 for which sales is 0 will match to 61,493 observations in dataset 1 for which sales is between -1000 and 1000. So the result will be about 1.8 million observations in the output dataset just from the 29 observations in dataset 2 for which sales is 0.

                      The difference between these two approaches is that rangejoin keeps every observation from the dataset in memory - giving missing values if there is no match - so if your dataset 2 is treatment and your dataset 1 is controls, you probably want the second approach, because you want to keep all your treatment data, not keep controls for which there is no match.
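A small sketch of that behavior, again with made-up values and hypothetical id variables, and assuming rangestat and rangejoin are installed from SSC. The second observation in memory has no using observation within +/-1000, so if rangejoin behaves as described above it should be kept with missing values for the using variables.

```stata
* Assumes "ssc install rangestat" and "ssc install rangejoin" have been run
clear
* Hypothetical using data
input double sales
0
250000
end
generate long id_using = _n
tempfile controls
save `controls'

* Hypothetical data in memory: one value with a match, one without
clear
input double sales
100
1000000
end
generate long id_master = _n

rangejoin sales -1000 1000 using `controls'

* The unmatched observation in memory is retained; its id_using is missing
list id_master sales id_using, sepby(id_master)
count if missing(id_using)
```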

                      But also, let us note that your two largest values in dataset 2 are approximately 2,650,000,000 and 2,430,000,000, a difference of 220,000,000. And for dataset 1 the difference between the two largest values is 70,000,000. It seems unlikely that the largest observations in each dataset are going to find matching values within +/-1000.

                      So what are the implications for your work?

                      First, rangejoin can in theory produce the matched dataset you specify, but as a consequence of your data, there will be vastly more observations in the result than in the two input datasets. Even if you required exact matches, 29 observations of 0 in your dataset 2 will match to 42,622 observations of 0 in your dataset 1 giving 1,236,038 observations in the output dataset. You haven't told us what you are trying to accomplish with this matching, but it is hard to see what use you could make of these matches.

                      So to address your question in post #1

                      I want to match numeric values with a deviation from the exact matching value. I'm trying to use the rangejoin command but it doesn't work. Do you know if there is any other way?
                      the answer is that it's not that rangejoin doesn't work, it's that the dataset you seek to produce is too large to produce on your computer, so any other approach to producing that dataset is likely doomed to fail. This is a consequence of producing many, many matches - not just a single "best" match - for each observation in your dataset in memory.
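One way to gauge the size of the result before attempting the join is to count the pairs without creating them. The sketch below is only an illustration under stated assumptions: the two datasets have been stacked (for example with append) and marked with a hypothetical 0/1 flag is_using, rangestat is installed from SSC, and the values are made up.

```stata
* Assumes "ssc install rangestat" has been run
clear
* Hypothetical stacked data: is_using = 0 for rows from the data in memory,
* is_using = 1 for rows from the using dataset
input double sales byte is_using
0       0
0       0
0       0
0       0
0       0
1000000 0
-500    1
0       1
800     1
250000  1
end

* For every row, add up is_using over all rows whose sales lies within
* +/-1000 of its own value, i.e. count the using rows in range
rangestat (sum) is_using, interval(sales -1000 1000)

* Total over the in-memory rows only; rows with no using row in range add 0
summarize is_using_sum if is_using == 0, meanonly
display "matched pairs: " r(sum)
```

Because this keeps one row per original observation, it should be feasible even when the joined dataset itself would not fit in memory.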

                      And beyond that, were you able to produce the dataset that rangejoin would create, it is not clear what use it could be put to effectively, when tens of thousands of matches are possible.


                      • #12
                        First of all your help is valuable and I thank you for it.


                        So there is no other way that produces only one "best" match, with some deviation of +/-10000, for each observation in my dataset?
                        Or do you suggest I cut the dataset?

                        Thank you in advance.


                        • #13
                          Your attempt to use rangejoin as an alternative to the matching techniques you have discussed throughout your earlier topics and summarized at

                          https://www.statalist.org/forums/for...l-observations

                          was apparently based on a misunderstanding of rangejoin.

                          The bottom line is, when you are matching only on the value of one variable, and there are 60,000 observations in the control group with a value of zero, which single one of them is "best"?

                          The techniques you discuss in the linked topic are designed to narrow the choice by considering the values of other variables to determine the matches. Those techniques are where you should be looking for matching controls to your treated cases. That is not what rangejoin was intended to do.
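The tie problem can be made concrete with a toy example. This is a sketch with made-up values, assuming rangestat and rangejoin are installed from SSC; sales_control, gap, is_best, n_best and the id variables are hypothetical names. With a single matching variable, several controls can be exactly equally close, so nothing in the data singles one of them out.

```stata
* Assumes "ssc install rangestat" and "ssc install rangejoin" have been run
clear
* Hypothetical controls: three exact zeros and one value 400 away
input double sales_control
0
0
0
400
end
* rangejoin matches on a variable named sales in the using data
generate double sales = sales_control
generate long id_using = _n
tempfile controls
save `controls'

* One hypothetical treated observation with sales of 0
clear
input double sales
0
end
generate long id_master = _n

rangejoin sales -1000 1000 using `controls'

* Distance of each candidate from the treated value
generate double gap = abs(sales - sales_control)

* Flag every candidate that ties for the smallest distance, then count them
bysort id_master (gap): generate byte is_best = (gap == gap[1])
by id_master: egen n_best = total(is_best)
list id_master id_using sales_control gap is_best n_best
```

Here three controls tie at a gap of zero, so picking one of them would be arbitrary unless other variables are brought in.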


                          • #14
                            OK, thanks a lot for your guidance and help. I will focus on those techniques! I really appreciate it.

                            Best wishes,
                            Angeliki
