How to keep data only for randomly selected sample from my master dataset for further append?

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

How to keep data only for randomly selected sample from my master dataset for further append?

08 Jan 2024, 08:25

Hi everyone,

I have a master dataset of the following form:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long id str19 id_date long idcontrato double(date_contract_start date_contract_end)
1001 "1001_18887_21700"    1001 18887 21700
1001 "1001_21701_22431"  451697 21701 22431
1001 "1001_22432_22645" 1236132 22432 22645
1001 "1001_22646_22676" 1730454 22646 22676
1001 "1001_22677_22735" 2082075 22677 22735
1001 "1001_22736_23010" 2172904 22736 23010
1001 "1001_23011_23069" 2872183 23011 23069
1001 "1001_23070_."     3107888 23070     .
1005 "1005_18800_21639"    1005 18800 21639
1005 "1005_21640_21651"  420392 21640 21651
1005 "1005_21652_22066"  432684 21652 22066
1005 "1005_22067_22431"  720923 22067 22431
1005 "1005_22432_22456" 1124767 22432 22456
1005 "1005_22457_22645" 1288758 22457 22645
1005 "1005_22646_22676" 1742918 22646 22676
1005 "1005_22677_22735" 2036693 22677 22735
1005 "1005_22736_22888" 2322897 22736 22888
1005 "1005_22889_23010" 2598018 22889 23010
1005 "1005_23011_23041" 2728124 23011 23041
1005 "1005_23042_23130" 2991589 23042 23130
end
format %td date_contract_start
format %td date_contract_end

I have created a random sample of 2,000 unique households, following some advices given on Statalist previously:

Code:

keep if inrange(date_contract_end, td(01jan2021), td(13nov2023)) | missing(date_contract_end)


// {
// --- This code is replicated from Statalist, with a few local adaptations. All credits go to Nick Cox --- //
 
egen tag = tag(id)

* set seed : random sample for better reproducibility

set seed 09012024 // today's date

gen shuffle = .  
gen sampled = .

    qui replace shuffle = runiform() & missing(fecha_operativa_pv) // households without Solar Panels
    sort tag shuffle

   * the tagged values are at the end; we select 2000 of them
    qui replace sampled = inrange(_n, _N-1999, _N)

    * spread selection
    qui bysort id (sampled) : replace sampled = sampled[_N]

    levelsof id if sampled

// }

keep if sampled
sort id date_contract_start

I have monthly files about household consumption from January 2021 to July 2023. Below is a -dataex- of January 2021:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long id str19 id_date long idcontrato double(date_contract_start date_contract_end)
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
1001 "1001_21701_22431" 451697 21701 22431
end
format %td date_contract_start
format %td date_contract_end

What I want, please:

Read in one monthly data at a time (as my dataset is huge), and keep only the data for those 2,000 households,
Append each month to have a dataset with 2,000 households, and their full data.

One condition to be fullfilled: we should have IDs that have data from January 2021 to November 2023 without interruption (i.e. no drop-outs or entries).
Thank you in advance for your help!

Michael

Last edited by Michael Duarte Goncalves; 08 Jan 2024, 09:06.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

08 Jan 2024, 09:36

While this can, in principle, be done, as a practical matter it is going to be unwieldy at best. The problem is that when you do a random sample in any given month, with no data from the other months in play, there is no way to know which of the observations will ultimately pass the test of having uninterrupted data for the period Jan 21 to Nov 2023.

So I think you have to start with the entire data set (all of the months combined) and then drop the id's which lack uninterrupted data from Jan 21 to Nov 2023, and then pick your random sample of 2000.

How to drop the id's which lack uninterrupted data is dealt with in your other post from earlier today.
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#3

09 Jan 2024, 00:41

Hi Clyde Schechter,

Thank you very much for your time and suggestion!
However, I have a small issue:
A single monthly dataset represents around 4,000,000 observations. I've tried to combine all the months, but due to limited memory capacity (and time-consuming procedure), stata crashes. As a result, it's very difficult to do this with my computer.
Would you know how I could overcome this problem, please?

Thank you again for your help.
Best,

Michael
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#4

09 Jan 2024, 03:31

Hi again,

Does anyone have a solution for #3, please?
Thank you!

Michael
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

09 Jan 2024, 09:19

Well, this may or may not even be possible. Here's an approach that might work.

First, I notice that in the example data you show, there are many observations that are all identical. These are taking up space but conveying no additional information (at least none that is relevant to the task at hand.) Next, if I understand the situation correctly, in order to qualify for inclusion in your sample it is necessary (though not sufficient) that an id occur in every one of your monthly files. So we may be able to economize a bit by cutting down to those.

Code:

clear*
local filelist // REPLACE THIS BY A COMMAND TO CREATE A LOCAL MACRO WITH THE NAMES OF ALL OF THE MONTHLY FILES

local n_files: wordcount of `filelist'

use id using `"`:word 1 of `filelist''"'
duplicates drop
sort id

forvalues i = 2/`n_files' {
    frame create next_file
    frame next_file {
        use id using `"`:word `i' of `filelist''"'
        duplicates drop
        sort id
    }
    frlink 1:1 id , frame(next_file)
    drop if missing(next_file)
    drop next_file
    frame drop next_file
}

//    AT THIS POINT MEMORY CONTAINS A LIST OF ALL THE ID'S THAT OCCUR IN EVERY
//    MONTHLY FILE. 
//    NOW BRING IN ALL OF THEIR DATA
forvalues i = 1/`n_files' {
    merge 1:m id using `"`:word `i' of `filelist''"', keep(match) nogenerate
}

//    NOW APPLY CODE TO KEEP ONLY THOSE WITH UNINTERRUPTED SERVICE, AS SHOWN IN OTHER THREAD
//    THEN PICK RANDOM SAMPLE OF 2000, AS SHOWN EARLIER

Comment

Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#6

09 Jan 2024, 09:50

Hi Clyde Schechter,

Yes, you understood correctly. Thank you.
I'll try out code in #5 and keep you posted on its development! Thank you so much for your help!

Could you please explain me some new features that I haven't had a chance to see yet? What are:
-frame-

-frlink 1:1-

Furthermore, could you give me some insights about why you are using `"` and'"' in:

Code:

use id using `"`:word 1 of `filelist''"'

And what's the main difference with the usual ` ' signs?

Thank you again for your help! I'm really happy to learn so much with you!
Michael
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

09 Jan 2024, 10:24

-frame-s allow Stata to have two or more different data sets in active memory at the same time. They were introduced in version 16. -frlink- is a command that enables you to link observations in one frame to (partially) matching observations in another frame--it is somewhat similar to -merge- in that regard.

Furthermore, could you give me some insights about why you are using `"` and'"' in:

Code:

use id using `"`:word 1 of `filelist''"'

The outermost `" "' are just the usual compound double quotes that are used around string literals in Stata. I am using them here because the contents of that string, namely `:word 1 of `filelist'' is going to be a filename, and it may contain embedded blanks. If I knew for certain that none of the filenames has embedded blanks, I could omit the outer `" "'. Actually, I could have done it slightly more simply as "`:word 1 of `filelist''". The ordinary double quotes, " ", would serve the same function of dealing with embedded blanks. The compound double quotes, `" "', also deal with the possibility of quotes themselves occurring within the string--but that is not really a worry here because filenames cannot contain quotes.

As for `:word 1 of `filelist'', that is a macro function that picks out the first filename from the local macro filename that contains a list of all of them.
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#8

10 Jan 2024, 01:04

Hi again Clyde Schechter,

It's all very clear, as usual! Thank you for your excellent explanations and the passion you put into them.
Best,

Michael
Comment

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

10 Jan 2024, 02:10

Hi,

I need some help about #5. I tried the code provided, but I hit a snag:

Code:

clear all
version 17
set more off
set varabbrev off


global path "C:/Users/miduarte/Desktop/Ongoing_Projects/test_HolaLuz_Data/New_dataset_15112023/stata/2_export_telemedida/data/temp"

cd "${path}" // change that to your local/dynamic path

use "export_contratos_cleaned_merge_monthlyf.dta", clear

cd "${path}/.."

// { --------------------------------------------------------------
// Code provided by Clyde Schechter from statalist. All credits to him.
clear*
local filelist "clean_export_telemedida_202101.dta clean_export_telemedida_202102.dta clean_export_telemedida_202103.dta clean_export_telemedida_202104.dta clean_export_telemedida_202105.dta clean_export_telemedida_202106.dta clean_export_telemedida_202107.dta clean_export_telemedida_202108.dta clean_export_telemedida_202109.dta clean_export_telemedida_202110.dta clean_export_telemedida_202111.dta clean_export_telemedida_202112.dta clean_export_telemedida_202201.dta clean_export_telemedida_202202.dta clean_export_telemedida_202203.dta clean_export_telemedida_202204.dta clean_export_telemedida_202205.dta clean_export_telemedida_202206.dta clean_export_telemedida_202207.dta clean_export_telemedida_202208.dta clean_export_telemedida_202209.dta clean_export_telemedida_202210.dta clean_export_telemedida_202211.dta clean_export_telemedida_202212.dta clean_export_telemedida_202301.dta clean_export_telemedida_202302.dta clean_export_telemedida_202303.dta clean_export_telemedida_202304.dta clean_export_telemedida_202305.dta clean_export_telemedida_202306.dta clean_export_telemedida_202307.dta"
// REPLACE THIS BY A COMMAND TO CREATE A LOCAL MACRO WITH THE NAMES OF ALL OF THE MONTHLY FILES

local n_files: word count of `filelist'
display `n_files'
use id using `"`:word 1 of `filelist''"'
duplicates drop
sort id

forvalues i = 2/`n_files' {
    frame create next_file
    frame next_file {
        use id using `"`:word `i' of `filelist''"'
        duplicates drop
        sort id
    }
    frlink 1:1 id , frame(next_file)
    drop if missing(next_file)
    drop next_file
    frame drop next_file
}

//    AT THIS POINT MEMORY CONTAINS A LIST OF ALL THE ID'S THAT OCCUR IN EVERY
//    MONTHLY FILE.
//    NOW BRING IN ALL OF THEIR DATA
forvalues i = 1/`n_files' {
    merge 1:m id using `"`:word `i' of `filelist''"', keep(match) nogenerate
}

//    NOW APPLY CODE TO KEEP ONLY THOSE WITH UNINTERRUPTED SERVICE, AS SHOWN IN OTHER THREAD
//    THEN PICK RANDOM SAMPLE OF 2000, AS SHOWN EARLIER

// -------------------------------------------------------------------- }

I obtained the following error:

Code:

invalid file specification
r(198);

end of do-file

r(198);

Could anyone give me some insights why I obtain this error, please?
Thank you so much in advance.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

10 Jan 2024, 08:33

As I do not have a set up of files like yours to test this on, I cannot really tell.

At what point in the code did you get this error message? If you cannot tell where the message came from, do

Code:

set tracedepth 1 set trace on

just after the clear* command, and you will then see where the error message is coming from. And if it isn't clear to you at that point what the problem is, post back the output from the error message itself back to the last command that executed successfully.
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#11

11 Jan 2024, 03:56

Hi Clyde Schechter,

Once again, thank you for your time and for your unfailing willingness to help.
I will try:

Code:

set tracedepth 1 set trace on

and get back to you if I can't find or understand my mistake. Thank you for your help!
Michael
Comment

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

#12

11 Jan 2024, 04:07

after setting -trace on-, here is where I obtain the error message:

Code:

- frame create next_file
  --------------------------------------------------------------------------------------------------------------------------------------------------------------------- begin frame ---
  - version 16
  - gettoken subcmd 0 : 0
  - if ("`subcmd'" == "put") {
  = if ("create" == "put") {
    frame_put `"`_byvars'"' `0'
    exit
    }
  - if ("`_byvars'" != "") {
  = if ("" != "") {
    error 190
    }
  - if ("`subcmd'" == "post") {
  = if ("create" == "post") {
    frame_post `0'
    exit
    }
  - if ("`subcmd'" == "create") {
  = if ("create" == "create") {
  - frame_create `0'
  = frame_create  next_file
  - exit
  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- end frame ---
- frame next_file {
- use id using `"`:word `i' of `filelist''"'
= use id using `""'
invalid file specification
  duplicates drop
  sort id
  }
  frlink 1:1 id , frame(next_file)
  drop if missing(next_file)
  drop next_file
  frame drop next_file
  }
r(198);

end of do-file

r(198);

Michael

Last edited by Michael Duarte Goncalves; 11 Jan 2024, 04:16.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#13

11 Jan 2024, 09:09

I don't think this is a code problem. I think you are not running it correctly.

From the trace output, it seems that Stata thinks that `"`:word `i' of `filelist''"' reduces to the empty string `""'. This can only happen if either local macro filelist doesn't contain as many as `i' filenames, or if local macro filelist doesn't exist at all. The code clearly creates local macro filelist at the top, and since `i' is constrained to run between 2 and `n_files', where `n_files' is the number of files in it.

However, if you are running the code line-by-line or in chunks, you can cause local macro filelist to disappear. It is important to remember that a chunk of code that is run by itself is considered a program by Stata, and local macros exist only in the programs within which they are defined. So if there is any interruption in execution between the place where local macro filelist is defined and where its contents are later accessed, it will go out of existence and will be interpreted as an empty string. I think this is what is happening. Whenever you run code that uses local macros, it is crucial that all of the code starting from where a local macro is defined until the last time it is accessed must be run uninterrupted, in one fell swoop. Try that, and I think it will run for you.
Comment

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

#14

11 Jan 2024, 09:42

Thank you for your advice! Indeed, I used to use this in pieces. But when I do ctrl+d again on the all chunk, taking into account the created locals, I still get the error.

It's probably a silly mistake on my part, and I apologize if that's the case. Here is the used code:

Code:

clear all
version 17
set more off
set varabbrev off

global path "C:/Users/miduarte/Desktop/Ongoing_Projects/test_HolaLuz_Data/New_dataset_15112023/stata/2_export_telemedida/data/temp"

cd "${path}/.."

// { --------------------------------------------------------------
// Code provided Clyde Schechter from statalist. All credits to him.
// ALL ERRORS ARE MINE!



clear*
local filelist "clean_export_telemedida_202101.dta clean_export_telemedida_202102.dta clean_export_telemedida_202103.dta clean_export_telemedida_202104.dta clean_export_telemedida_202105.dta clean_export_telemedida_202106.dta clean_export_telemedida_202107.dta clean_export_telemedida_202108.dta clean_export_telemedida_202109.dta clean_export_telemedida_202110.dta clean_export_telemedida_202111.dta clean_export_telemedida_202112.dta clean_export_telemedida_202201.dta clean_export_telemedida_202202.dta clean_export_telemedida_202203.dta clean_export_telemedida_202204.dta clean_export_telemedida_202205.dta clean_export_telemedida_202206.dta clean_export_telemedida_202207.dta clean_export_telemedida_202208.dta clean_export_telemedida_202209.dta clean_export_telemedida_202210.dta clean_export_telemedida_202211.dta clean_export_telemedida_202212.dta clean_export_telemedida_202301.dta clean_export_telemedida_202302.dta clean_export_telemedida_202303.dta clean_export_telemedida_202304.dta clean_export_telemedida_202305.dta clean_export_telemedida_202306.dta clean_export_telemedida_202307.dta"

local n_files: word count of `filelist'
display `n_files'
use id using `"`:word 1 of `filelist''"'
duplicates drop
sort id

forvalues i = 2/`n_files' {
    frame create next_file
    frame next_file {
        use id using `"`:word `i' of `filelist''"'
        duplicates drop
        sort id
    }
    frlink 1:1 id , frame(next_file)
    drop if missing(next_file)
    drop next_file
    frame drop next_file
}


forvalues i = 1/`n_files' {
    merge 1:m id using `"`:word `i' of `filelist''"', keep(match) nogenerate
}
// ----------------------------------------------------------}

Thanks again for your help, really!
Best,

Michael

Last edited by Michael Duarte Goncalves; 11 Jan 2024, 09:44.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

#15

11 Jan 2024, 10:09

OK, my apologies. I created a setup of files similar to yours and was able to reproduce your error. In addition, after fixing that, I discovered another more serious problem further down in the code. But I think with these repairs, you will get what was intended. (Whether it will solve your overlying problem of not being able to fit all of this in memory remains to be seen.)

Code:

clear*
tempfile building
save `building', emptyok

local filelist "clean_export_telemedida_202101.dta clean_export_telemedida_202102.dta clean_export_telemedida_202103.dta clean_export_telemedida_202104.dta clean_export_telemedida_202105.dta clean_export_telemedida_202106.dta clean_export_telemedida_202107.dta clean_export_telemedida_202108.dta clean_export_telemedida_202109.dta clean_export_telemedida_202110.dta"

local n_files: word count `filelist' // N.B.  NO "of" IN THIS COMMAND
display `n_files'
use id using `"`:word 1 of `filelist''"'
duplicates drop
sort id

forvalues i = 2/`n_files' {
    frame create next_file
    frame next_file {
        use id using `"`:word `i' of `filelist''"'
        duplicates drop
        sort id
    }
    frlink 1:1 id , frame(next_file)
    drop if missing(next_file)
    drop next_file
    frame drop next_file
}

tempfile matchlist
save `matchlist'


forvalues i = 1/`n_files' {
    use `"`:word `i' of `filelist''"', clear
    merge m:1 id using `matchlist', keep(match)
    append using `building'
    save `"`building'"', replace
// ----------------------------------------------------------
}

Announcement

How to keep data only for randomly selected sample from my master dataset for further append?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment