Hi!
I am looking to get some assistance with dealing with data in stata as I seem to be a bit stuck.
1) Creating final dataset: I have two datasets data1 with some internal obs and data2 with external obs. I would like to merge the two on the common variable D_id. But, in both sets there a multiple observations for each unique Id (D_id) as they refer to a person and the grants received. I have already tried sorting, looking for duplicates, isid before merging but a constantly get and error (r(459): variable D_id does not uniquely identify the observations). Further, I have tried joinby but seems as if there are then created more variables and then I do not know what to drop and what to keep. How can I "glue" these datasets together without merge m:m (which I can understand is not a really good tool) as in reality there a more obs in both master and using dataset.
--> The end goal of the merge/joinby is to have a complete dataset where I can look up how many grants from dataset1 (id like g.1,g.2,g.3, etc.) there is for each of the internal applicants and get full info on application date, demographics, etc. (but they also might appear several times meaning they have done several projects with the company).
2) Creating new variables: From the final data set I will then need to create new variables for total grants (count of unique id per Mainid), amount of grants prior/post (unique ids in dataset1) gotten before and after the project data (dataset2) in accordance to the projects data/ status of Yes/No or maybe score, so a count of some kind that consider the Appid for each Mainid or D_id (those are corresponding) on the date (start, end, year).
This is to then analyze if a person have better changes of getting grants after engaging in projects (that will then be the treatment) --> because these variables and observations should ideally help me perform a discontinuity regression with those who didn't do project vs. those who did (that is based on the score maybe like 1-3 Yes/3-6 No). How can i make new variables that are needed to perform this discontinuity regression and analyzing the effect of a treatment and what other descriptive statistics or causality are important to consider in such an analysis?
3) Forecasting for new project applicants: further, as there is a time limit included in this analysis and with some observations are from 2022 and 2023 it can be hard to measure an outcome already, would it be possible to create a forecast model for upcoming applicants and how would one go about it with the data available?
4) Deleting observations in one variable: lastly a rather simple questions, in the data i have some variables that I can drop (not included in the dataex) but further also some observations within a variable, how can you delete those or filter without having to type a lot in a drop command?
I really hope you can help! I have inserted dataex for dataset1 and dataset2 below. Thanks a lot in advance!
dataset1:
dataset 2:
I am looking to get some assistance with dealing with data in stata as I seem to be a bit stuck.
1) Creating final dataset: I have two datasets data1 with some internal obs and data2 with external obs. I would like to merge the two on the common variable D_id. But, in both sets there a multiple observations for each unique Id (D_id) as they refer to a person and the grants received. I have already tried sorting, looking for duplicates, isid before merging but a constantly get and error (r(459): variable D_id does not uniquely identify the observations). Further, I have tried joinby but seems as if there are then created more variables and then I do not know what to drop and what to keep. How can I "glue" these datasets together without merge m:m (which I can understand is not a really good tool) as in reality there a more obs in both master and using dataset.
--> The end goal of the merge/joinby is to have a complete dataset where I can look up how many grants from dataset1 (id like g.1,g.2,g.3, etc.) there is for each of the internal applicants and get full info on application date, demographics, etc. (but they also might appear several times meaning they have done several projects with the company).
2) Creating new variables: From the final data set I will then need to create new variables for total grants (count of unique id per Mainid), amount of grants prior/post (unique ids in dataset1) gotten before and after the project data (dataset2) in accordance to the projects data/ status of Yes/No or maybe score, so a count of some kind that consider the Appid for each Mainid or D_id (those are corresponding) on the date (start, end, year).
This is to then analyze if a person have better changes of getting grants after engaging in projects (that will then be the treatment) --> because these variables and observations should ideally help me perform a discontinuity regression with those who didn't do project vs. those who did (that is based on the score maybe like 1-3 Yes/3-6 No). How can i make new variables that are needed to perform this discontinuity regression and analyzing the effect of a treatment and what other descriptive statistics or causality are important to consider in such an analysis?
3) Forecasting for new project applicants: further, as there is a time limit included in this analysis and with some observations are from 2022 and 2023 it can be hard to measure an outcome already, would it be possible to create a forecast model for upcoming applicants and how would one go about it with the data available?
4) Deleting observations in one variable: lastly a rather simple questions, in the data i have some variables that I can drop (not included in the dataex) but further also some observations within a variable, how can you delete those or filter without having to type a lot in a drop command?
I really hope you can help! I have inserted dataex for dataset1 and dataset2 below. Thanks a lot in advance!

dataset1:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str3 id str6 title str2 language str3 ref str10(start_date end_date) int(start_year end_year) str19 active_years str6 org int amount str3 valuta int eur str8 D_id "g.1" "titl" "no" "1" "1/1/2014" "12/31/2017" 2014 2017 "2014;2015;2016;2017" "grid.2" . "" . "ur.02.1" "g.2" "tit1" "da" "OP2" "1/1/2018" "1/1/2019" 2018 2019 "2018;2019" "grid.3" 600 "DKK" 100 "ur.01" "g.3" "tit1" "en" "N/A" "5/4/2017" "" 2017 . "2017" "grid.4" 500 "DKK" 200 "ur.01" "g.4" "tit2" "da" "OP3" "10/1/2017" "9/30/2018" 2017 2018 "2017;2018" "grid.1" 400 "DKK" 300 "ur.01" "g.5" "tit2" "da" "OP4" "10/1/2016" "10/1/2017" 2016 2017 "2016;2017" "grid.1" 300 "DKK" 400 "ur.01" "g.6" "tit3" "en" "N/A" "12/13/2010" "" 2010 . "2010" "grid.5" 200 "DKK" 500 "ur.01" "g.7" "tit" "no" "HT1" "1/1/2019" "" 2019 . "2019" "grid.6" . "" . "ur.03.01" "g.8" "title4" "se" "L/U" "1/1/2022" "12/31/2024" 2022 2024 "2022;2023;2024" "grid.7" 100 "EUR" 600 "ur.03.02" "g.9" "titles" "en" "10" "1/1/2015" "12/31/2017" 2015 2017 "2015;2016;2017" "grid.8" . "" . "ur.044" end
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str3 name str6 Mainid str5 D_id str12 Mainname str5 Appid byte App str2 Officer str3 Status double period int(data start end) byte age str5 group str1 Gen str3 Classification1 byte Score int Year "Pio" "id: 12" "ur.01" "Jack Sparrow" "OP: 1" 1 "PJ" "Yes" 1.828224e+12 21166 21194 21823 45 "40-49" "X" "BB" 1 2015 "Ilm" "id: 12" "ur.01" "Jack Sparrow" "OP: 2" 2 "DJ" "No" 1.7646336e+12 20430 20606 20940 42 "40-49" "X" "BB" 2 2022 "Xp" "id: 12" "ur.01" "Jack Sparrow" "NO: 1" 3 "SJ" "No" 1.9852128e+12 22862 . . 50 "50-59" "X" "C" 3 2023 "OPP" "id: 12" "ur.01" "Jack Sparrow" "NO: 2" 4 "LJ" "No" 2.007072e+12 23100 . . 51 "50-59" "X" "" 4 2022 "HU" "id: 12" "ur.01" "Jack Sparrow" "NO: 3" 5 "PJ" "No" 1.9859904e+12 22912 . . 50 "50-59" "X" "I" 5 2014 "KLM" "id: 12" "ur.01" "Jack Sparrow" "OP: 3" 6 "SJ" "No" 1.7333568e+12 20071 20098 20466 42 "40-49" "X" "BB" 6 2019 "SAS" "id: 12" "ur.01" "Jack Sparrow" "NO: 4" 7 "PJ" "No" 1.8635616e+12 21192 . . 47 "40-49" "X" "I N" 1 2017 "Pio" "id: 14" "ur.07" "Mia Name" "OP: 4" 8 "PJ" "Yes" 1.8924192e+12 21557 21559 21562 47 "40-49" "Y" "" 2 2019 "Xp" "id: 14" "ur.07" "Mia Name" "NO: 7" 9 "PJ" "No" 1.9852128e+12 22862 . . 50 "50-59" "Y" "" 3 2022 end format %tcnn/dd/ccYY_hh:MM period format %tdnn/dd/CCYY data format %tdnn/dd/CCYY start format %tdnn/dd/CCYY end