Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to keep data only for randomly selected sample from my master dataset for further append?

    Hi everyone,

    I have a master dataset of the following form:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id str19 id_date long idcontrato double(date_contract_start date_contract_end)
    1001 "1001_18887_21700"    1001 18887 21700
    1001 "1001_21701_22431"  451697 21701 22431
    1001 "1001_22432_22645" 1236132 22432 22645
    1001 "1001_22646_22676" 1730454 22646 22676
    1001 "1001_22677_22735" 2082075 22677 22735
    1001 "1001_22736_23010" 2172904 22736 23010
    1001 "1001_23011_23069" 2872183 23011 23069
    1001 "1001_23070_."     3107888 23070     .
    1005 "1005_18800_21639"    1005 18800 21639
    1005 "1005_21640_21651"  420392 21640 21651
    1005 "1005_21652_22066"  432684 21652 22066
    1005 "1005_22067_22431"  720923 22067 22431
    1005 "1005_22432_22456" 1124767 22432 22456
    1005 "1005_22457_22645" 1288758 22457 22645
    1005 "1005_22646_22676" 1742918 22646 22676
    1005 "1005_22677_22735" 2036693 22677 22735
    1005 "1005_22736_22888" 2322897 22736 22888
    1005 "1005_22889_23010" 2598018 22889 23010
    1005 "1005_23011_23041" 2728124 23011 23041
    1005 "1005_23042_23130" 2991589 23042 23130
    end
    format %td date_contract_start
    format %td date_contract_end

    I have created a random sample of 2,000 unique households, following some advices given on Statalist previously:

    Code:
    keep if inrange(date_contract_end, td(01jan2021), td(13nov2023)) | missing(date_contract_end)
    
    
    // {
    // --- This code is replicated from Statalist, with a few local adaptations. All credits go to Nick Cox --- //
     
    egen tag = tag(id)
    
    * set seed : random sample for better reproducibility
    
    set seed 09012024 // today's date
    
    gen shuffle = .  
    gen sampled = .
    
        qui replace shuffle = runiform() & missing(fecha_operativa_pv) // households without Solar Panels
        sort tag shuffle
    
       * the tagged values are at the end; we select 2000 of them
        qui replace sampled = inrange(_n, _N-1999, _N)
    
        * spread selection
        qui bysort id (sampled) : replace sampled = sampled[_N]
    
        levelsof id if sampled
    
    // }
    
    keep if sampled
    sort id date_contract_start

    I have monthly files about household consumption from January 2021 to July 2023. Below is a -dataex- of January 2021:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id str19 id_date long idcontrato double(date_contract_start date_contract_end)
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    1001 "1001_21701_22431" 451697 21701 22431
    end
    format %td date_contract_start
    format %td date_contract_end
    What I want, please:
    1. Read in one monthly data at a time (as my dataset is huge), and keep only the data for those 2,000 households,
    2. Append each month to have a dataset with 2,000 households, and their full data.
    One condition to be fullfilled: we should have IDs that have data from January 2021 to November 2023 without interruption (i.e. no drop-outs or entries).
    Thank you in advance for your help!

    Michael
    Last edited by Michael Duarte Goncalves; 08 Jan 2024, 10:06.

  • #2
    While this can, in principle, be done, as a practical matter it is going to be unwieldy at best. The problem is that when you do a random sample in any given month, with no data from the other months in play, there is no way to know which of the observations will ultimately pass the test of having uninterrupted data for the period Jan 21 to Nov 2023.

    So I think you have to start with the entire data set (all of the months combined) and then drop the id's which lack uninterrupted data from Jan 21 to Nov 2023, and then pick your random sample of 2000.

    How to drop the id's which lack uninterrupted data is dealt with in your other post from earlier today.

    Comment


    • #3
      Hi Clyde Schechter,

      Thank you very much for your time and suggestion!
      However, I have a small issue:
      A single monthly dataset represents around 4,000,000 observations. I've tried to combine all the months, but due to limited memory capacity (and time-consuming procedure), stata crashes. As a result, it's very difficult to do this with my computer.
      Would you know how I could overcome this problem, please?

      Thank you again for your help.
      Best,

      Michael

      Comment


      • #4
        Hi again,

        Does anyone have a solution for #3, please?
        Thank you!

        Michael

        Comment


        • #5
          Well, this may or may not even be possible. Here's an approach that might work.

          First, I notice that in the example data you show, there are many observations that are all identical. These are taking up space but conveying no additional information (at least none that is relevant to the task at hand.) Next, if I understand the situation correctly, in order to qualify for inclusion in your sample it is necessary (though not sufficient) that an id occur in every one of your monthly files. So we may be able to economize a bit by cutting down to those.

          Code:
          clear*
          local filelist // REPLACE THIS BY A COMMAND TO CREATE A LOCAL MACRO WITH THE NAMES OF ALL OF THE MONTHLY FILES
          
          local n_files: wordcount of `filelist'
          
          use id using `"`:word 1 of `filelist''"'
          duplicates drop
          sort id
          
          forvalues i = 2/`n_files' {
              frame create next_file
              frame next_file {
                  use id using `"`:word `i' of `filelist''"'
                  duplicates drop
                  sort id
              }
              frlink 1:1 id , frame(next_file)
              drop if missing(next_file)
              drop next_file
              frame drop next_file
          }
          
          //    AT THIS POINT MEMORY CONTAINS A LIST OF ALL THE ID'S THAT OCCUR IN EVERY
          //    MONTHLY FILE. 
          //    NOW BRING IN ALL OF THEIR DATA
          forvalues i = 1/`n_files' {
              merge 1:m id using `"`:word `i' of `filelist''"', keep(match) nogenerate
          }
          
          //    NOW APPLY CODE TO KEEP ONLY THOSE WITH UNINTERRUPTED SERVICE, AS SHOWN IN OTHER THREAD
          //    THEN PICK RANDOM SAMPLE OF 2000, AS SHOWN EARLIER

          Comment


          • #6
            Hi Clyde Schechter,

            Yes, you understood correctly. Thank you.
            I'll try out code in #5 and keep you posted on its development! Thank you so much for your help!

            Could you please explain me some new features that I haven't had a chance to see yet? What are:
            1. -frame-
            2. -frlink 1:1-
            Furthermore, could you give me some insights about why you are using `"` and'"' in:

            Code:
             
             use id using `"`:word 1 of `filelist''"'
            And what's the main difference with the usual ` ' signs?

            Thank you again for your help! I'm really happy to learn so much with you!
            Michael

            Comment


            • #7
              -frame-s allow Stata to have two or more different data sets in active memory at the same time. They were introduced in version 16. -frlink- is a command that enables you to link observations in one frame to (partially) matching observations in another frame--it is somewhat similar to -merge- in that regard.

              Furthermore, could you give me some insights about why you are using `"` and'"' in:

              Code:

              use id using `"`:word 1 of `filelist''"'
              The outermost `" "' are just the usual compound double quotes that are used around string literals in Stata. I am using them here because the contents of that string, namely `:word 1 of `filelist'' is going to be a filename, and it may contain embedded blanks. If I knew for certain that none of the filenames has embedded blanks, I could omit the outer `" "'. Actually, I could have done it slightly more simply as "`:word 1 of `filelist''". The ordinary double quotes, " ", would serve the same function of dealing with embedded blanks. The compound double quotes, `" "', also deal with the possibility of quotes themselves occurring within the string--but that is not really a worry here because filenames cannot contain quotes.

              As for `:word 1 of `filelist'', that is a macro function that picks out the first filename from the local macro filename that contains a list of all of them.

              Comment


              • #8
                Hi again Clyde Schechter,

                It's all very clear, as usual! Thank you for your excellent explanations and the passion you put into them.
                Best,

                Michael

                Comment


                • #9
                  Hi,

                  I need some help about #5. I tried the code provided, but I hit a snag:

                  Code:
                  clear all
                  version 17
                  set more off
                  set varabbrev off
                  
                  
                  global path "C:/Users/miduarte/Desktop/Ongoing_Projects/test_HolaLuz_Data/New_dataset_15112023/stata/2_export_telemedida/data/temp"
                  
                  cd "${path}" // change that to your local/dynamic path
                  
                  use "export_contratos_cleaned_merge_monthlyf.dta", clear
                  
                  cd "${path}/.."
                  
                  // { --------------------------------------------------------------
                  // Code provided by Clyde Schechter from statalist. All credits to him.
                  clear*
                  local filelist "clean_export_telemedida_202101.dta clean_export_telemedida_202102.dta clean_export_telemedida_202103.dta clean_export_telemedida_202104.dta clean_export_telemedida_202105.dta clean_export_telemedida_202106.dta clean_export_telemedida_202107.dta clean_export_telemedida_202108.dta clean_export_telemedida_202109.dta clean_export_telemedida_202110.dta clean_export_telemedida_202111.dta clean_export_telemedida_202112.dta clean_export_telemedida_202201.dta clean_export_telemedida_202202.dta clean_export_telemedida_202203.dta clean_export_telemedida_202204.dta clean_export_telemedida_202205.dta clean_export_telemedida_202206.dta clean_export_telemedida_202207.dta clean_export_telemedida_202208.dta clean_export_telemedida_202209.dta clean_export_telemedida_202210.dta clean_export_telemedida_202211.dta clean_export_telemedida_202212.dta clean_export_telemedida_202301.dta clean_export_telemedida_202302.dta clean_export_telemedida_202303.dta clean_export_telemedida_202304.dta clean_export_telemedida_202305.dta clean_export_telemedida_202306.dta clean_export_telemedida_202307.dta"
                  // REPLACE THIS BY A COMMAND TO CREATE A LOCAL MACRO WITH THE NAMES OF ALL OF THE MONTHLY FILES
                  
                  local n_files: word count of `filelist'
                  display `n_files'
                  use id using `"`:word 1 of `filelist''"'
                  duplicates drop
                  sort id
                  
                  forvalues i = 2/`n_files' {
                      frame create next_file
                      frame next_file {
                          use id using `"`:word `i' of `filelist''"'
                          duplicates drop
                          sort id
                      }
                      frlink 1:1 id , frame(next_file)
                      drop if missing(next_file)
                      drop next_file
                      frame drop next_file
                  }
                  
                  //    AT THIS POINT MEMORY CONTAINS A LIST OF ALL THE ID'S THAT OCCUR IN EVERY
                  //    MONTHLY FILE.
                  //    NOW BRING IN ALL OF THEIR DATA
                  forvalues i = 1/`n_files' {
                      merge 1:m id using `"`:word `i' of `filelist''"', keep(match) nogenerate
                  }
                  
                  //    NOW APPLY CODE TO KEEP ONLY THOSE WITH UNINTERRUPTED SERVICE, AS SHOWN IN OTHER THREAD
                  //    THEN PICK RANDOM SAMPLE OF 2000, AS SHOWN EARLIER
                  
                  // -------------------------------------------------------------------- }
                  I obtained the following error:

                  Code:
                  invalid file specification
                  r(198);
                  
                  end of do-file
                  
                  r(198);
                  Could anyone give me some insights why I obtain this error, please?
                  Thank you so much in advance.

                  Comment


                  • #10
                    As I do not have a set up of files like yours to test this on, I cannot really tell.

                    At what point in the code did you get this error message? If you cannot tell where the message came from, do
                    Code:
                    set tracedepth 1
                    set trace on
                    just after the clear* command, and you will then see where the error message is coming from. And if it isn't clear to you at that point what the problem is, post back the output from the error message itself back to the last command that executed successfully.

                    Comment


                    • #11
                      Hi Clyde Schechter,

                      Once again, thank you for your time and for your unfailing willingness to help.
                      I will try:

                      Code:
                      set tracedepth 1
                      set trace on
                      and get back to you if I can't find or understand my mistake. Thank you for your help!
                      Michael

                      Comment


                      • #12
                        after setting -trace on-, here is where I obtain the error message:

                        Code:
                        - frame create next_file
                          --------------------------------------------------------------------------------------------------------------------------------------------------------------------- begin frame ---
                          - version 16
                          - gettoken subcmd 0 : 0
                          - if ("`subcmd'" == "put") {
                          = if ("create" == "put") {
                            frame_put `"`_byvars'"' `0'
                            exit
                            }
                          - if ("`_byvars'" != "") {
                          = if ("" != "") {
                            error 190
                            }
                          - if ("`subcmd'" == "post") {
                          = if ("create" == "post") {
                            frame_post `0'
                            exit
                            }
                          - if ("`subcmd'" == "create") {
                          = if ("create" == "create") {
                          - frame_create `0'
                          = frame_create  next_file
                          - exit
                          ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- end frame ---
                        - frame next_file {
                        - use id using `"`:word `i' of `filelist''"'
                        = use id using `""'
                        invalid file specification
                          duplicates drop
                          sort id
                          }
                          frlink 1:1 id , frame(next_file)
                          drop if missing(next_file)
                          drop next_file
                          frame drop next_file
                          }
                        r(198);
                        
                        end of do-file
                        
                        r(198);
                        Michael
                        Last edited by Michael Duarte Goncalves; 11 Jan 2024, 05:16.

                        Comment


                        • #13
                          I don't think this is a code problem. I think you are not running it correctly.

                          From the trace output, it seems that Stata thinks that `"`:word `i' of `filelist''"' reduces to the empty string `""'. This can only happen if either local macro filelist doesn't contain as many as `i' filenames, or if local macro filelist doesn't exist at all. The code clearly creates local macro filelist at the top, and since `i' is constrained to run between 2 and `n_files', where `n_files' is the number of files in it.

                          However, if you are running the code line-by-line or in chunks, you can cause local macro filelist to disappear. It is important to remember that a chunk of code that is run by itself is considered a program by Stata, and local macros exist only in the programs within which they are defined. So if there is any interruption in execution between the place where local macro filelist is defined and where its contents are later accessed, it will go out of existence and will be interpreted as an empty string. I think this is what is happening. Whenever you run code that uses local macros, it is crucial that all of the code starting from where a local macro is defined until the last time it is accessed must be run uninterrupted, in one fell swoop. Try that, and I think it will run for you.

                          Comment


                          • #14
                            Thank you for your advice! Indeed, I used to use this in pieces. But when I do ctrl+d again on the all chunk, taking into account the created locals, I still get the error.

                            It's probably a silly mistake on my part, and I apologize if that's the case. Here is the used code:

                            Code:
                            clear all
                            version 17
                            set more off
                            set varabbrev off
                            
                            global path "C:/Users/miduarte/Desktop/Ongoing_Projects/test_HolaLuz_Data/New_dataset_15112023/stata/2_export_telemedida/data/temp"
                            
                            cd "${path}/.."
                            
                            // { --------------------------------------------------------------
                            // Code provided Clyde Schechter from statalist. All credits to him.
                            // ALL ERRORS ARE MINE!
                            
                            
                            
                            clear*
                            local filelist "clean_export_telemedida_202101.dta clean_export_telemedida_202102.dta clean_export_telemedida_202103.dta clean_export_telemedida_202104.dta clean_export_telemedida_202105.dta clean_export_telemedida_202106.dta clean_export_telemedida_202107.dta clean_export_telemedida_202108.dta clean_export_telemedida_202109.dta clean_export_telemedida_202110.dta clean_export_telemedida_202111.dta clean_export_telemedida_202112.dta clean_export_telemedida_202201.dta clean_export_telemedida_202202.dta clean_export_telemedida_202203.dta clean_export_telemedida_202204.dta clean_export_telemedida_202205.dta clean_export_telemedida_202206.dta clean_export_telemedida_202207.dta clean_export_telemedida_202208.dta clean_export_telemedida_202209.dta clean_export_telemedida_202210.dta clean_export_telemedida_202211.dta clean_export_telemedida_202212.dta clean_export_telemedida_202301.dta clean_export_telemedida_202302.dta clean_export_telemedida_202303.dta clean_export_telemedida_202304.dta clean_export_telemedida_202305.dta clean_export_telemedida_202306.dta clean_export_telemedida_202307.dta"
                            
                            local n_files: word count of `filelist'
                            display `n_files'
                            use id using `"`:word 1 of `filelist''"'
                            duplicates drop
                            sort id
                            
                            forvalues i = 2/`n_files' {
                                frame create next_file
                                frame next_file {
                                    use id using `"`:word `i' of `filelist''"'
                                    duplicates drop
                                    sort id
                                }
                                frlink 1:1 id , frame(next_file)
                                drop if missing(next_file)
                                drop next_file
                                frame drop next_file
                            }
                            
                            
                            forvalues i = 1/`n_files' {
                                merge 1:m id using `"`:word `i' of `filelist''"', keep(match) nogenerate
                            }
                            // ----------------------------------------------------------}
                            Thanks again for your help, really!
                            Best,


                            Michael
                            Last edited by Michael Duarte Goncalves; 11 Jan 2024, 10:44.

                            Comment


                            • #15
                              OK, my apologies. I created a setup of files similar to yours and was able to reproduce your error. In addition, after fixing that, I discovered another more serious problem further down in the code. But I think with these repairs, you will get what was intended. (Whether it will solve your overlying problem of not being able to fit all of this in memory remains to be seen.)

                              Code:
                              clear*
                              tempfile building
                              save `building', emptyok
                              
                              local filelist "clean_export_telemedida_202101.dta clean_export_telemedida_202102.dta clean_export_telemedida_202103.dta clean_export_telemedida_202104.dta clean_export_telemedida_202105.dta clean_export_telemedida_202106.dta clean_export_telemedida_202107.dta clean_export_telemedida_202108.dta clean_export_telemedida_202109.dta clean_export_telemedida_202110.dta"
                              
                              local n_files: word count `filelist' // N.B.  NO "of" IN THIS COMMAND
                              display `n_files'
                              use id using `"`:word 1 of `filelist''"'
                              duplicates drop
                              sort id
                              
                              forvalues i = 2/`n_files' {
                                  frame create next_file
                                  frame next_file {
                                      use id using `"`:word `i' of `filelist''"'
                                      duplicates drop
                                      sort id
                                  }
                                  frlink 1:1 id , frame(next_file)
                                  drop if missing(next_file)
                                  drop next_file
                                  frame drop next_file
                              }
                              
                              tempfile matchlist
                              save `matchlist'
                              
                              
                              forvalues i = 1/`n_files' {
                                  use `"`:word `i' of `filelist''"', clear
                                  merge m:1 id using `matchlist', keep(match)
                                  append using `building'
                                  save `"`building'"', replace
                              // ----------------------------------------------------------
                              }

                              Comment

                              Working...
                              X