  • Is there a way to use -collapse- in Stata, make some data changes, and then return to the full data with all changes?

    Hi everyone,
    Is there a way to use -collapse- to aggregate data, make some modifications, and then return to the full data with those changes applied?
    I am cleaning the car brands in my dataset one at a time.
    I have to do the following for each of 69 car brands:

    Code:
    keep if description == "PEUGEOT"
    replace model = itrim(trim(model))
    gen num = 1
    collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
    tab model [aw=num]
    
    strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force
    
    replace model = "1007" if inrange(similar_model1, 8, 9)
    
    replace model = "106" if inrange(similar_model1, 10, 13)
    
    replace model = "107" if inrange(similar_model1, 14, 15)
    
    replace model = "108" if inrange(similar_model1, 16, 70) ///
                            | (similar_model1 == 1 & ustrpos(model, "108")>0)
    ...
    Is there a way to get back to the original (uncollapsed) data while keeping the changes made to my -model- variable?
    Here is a small -dataex- example from my collapsed data:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str23 description str31 model str1 COD_PROPULSION long cilindrada double potenciafiscal long weight_max double num long similar_model1
    "PEUGEOT" "108" "0" 1199 8.73 1240  1 26
    "PEUGEOT" "108" "0" 1199 8.73 1240  1 27
    "PEUGEOT" "108" "0" 1199 8.73 1240  1  1
    "PEUGEOT" "108" "0" 1199 8.73 1240 25  1
    "PEUGEOT" "108" "0"  998 7.82 1240  1  1
    "PEUGEOT" "108" "0"  998 7.82 1240  1 24
    end

    Thank you in advance for your help!
    Michael

  • #2
    I'm not sure I fully understand what is going on here, but I think what you want to do is:
    1. -preserve- your data before the code you showed (or save it as a file)
    2. -save- the results of the collapse/other data manipulation as a tempfile
    3. -restore- your original data (or -use- it from the previously saved file)
    4. -merge- with the tempfile you just saved using the -update- and -replace- options
    5. go back to step 1 with the next value of variable description
    Added: I don't quite understand what purpose the -collapse- command serves in this. It seems to me that the modifications you make thereafter could all be done on the uncollapsed data, with one exception. But that exception, the calculation of variable num, does not bite. You can get the same thing with -by description model COD_PROPULSION cilindrada potenciafiscal weight_max, sort: gen num = _N-.
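    As a minimal sketch of that last point, here is the group count computed without -collapse-. The rows below are made-up toy values, not the poster's data, and only three of the grouping variables are used to keep it short.

```stata
* Toy data (hypothetical values, for illustration only)
clear
input str10 description str10 model cilindrada
"PEUGEOT" "108" 1199
"PEUGEOT" "108" 1199
"PEUGEOT" "108" 998
"PEUGEOT" "107" 998
end

* Attach the group size to every row; the data stay uncollapsed
bysort description model cilindrada: gen num = _N

* Flag one row per group to inspect the same counts -collapse (sum)- would give
egen byte tag = tag(description model cilindrada)
list description model cilindrada num if tag, noobs
```

    Every original observation is still present afterwards, so nothing has to be merged back.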

    Or perhaps the -strgroup- command will not work with the uncollapsed data. I'm not familiar with that command.

    But apart from that possibility, it looks to me like you could simply take your code, remove the -collapse- command, wrap the whole thing in a loop over the values of variable description, adding -& description == the_looping_parameter- to all the if-conditions of the commands in the loop, and be done with it. That would be simpler, and much faster in a large data set, than repeatedly -collapse-ing the data and then re-inflating it with -merge-.
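    A rough sketch of that loop structure, assuming toy data and showing only the one step that is identical for every brand. The brand-specific -replace- rules from #1 differ per brand and are left as a placeholder comment; the local macro names brands and b are arbitrary.

```stata
* Toy data (hypothetical values, for illustration only)
clear
input str10 description str20 model
"PEUGEOT" "  108   ACTIVE"
"PEUGEOT" "108 ACTIVE "
"RENAULT" "CLIO   IV"
end

* Collect the distinct brands, then work on each one in place
levelsof description, local(brands)
foreach b of local brands {
    * each command carries the extra condition on the looping parameter
    replace model = itrim(trim(model)) if description == `"`b'"'
    * brand-specific -replace- rules would go here, each with
    * "& description == `"`b'"'" appended to its if-condition
}
```

    Because the data are never subset or collapsed, there is nothing to -preserve-, -restore-, or -merge- inside the loop.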
    Last edited by Clyde Schechter; 11 Jan 2024, 10:29.



    • #3
      Hi Clyde Schechter,

      Yes, exactly, I want 1. to 5., but I don't know how to implement it correctly.

      To be honest, I don't know why I use -collapse- either, I just follow what my PI tells me without asking too many questions... It could be that we are using that because we will then later on do a merge. But I am not sure about what I am saying.
      Thank you again for your help Clyde! Your help is really appreciated!

      Michael

      P.S.: What is the -update- option in -merge-, please? I mean, what is its purpose?



      • #4
        Code:
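        preserve
        keep if description == "PEUGEOT"
        replace model = itrim(trim(model))
        gen num = 1
        collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
        tab model [aw=num]
        
        * keep the pre-edit (trimmed) spelling: it serves as the merge key below
        gen model_orig = model
        
        strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force
        
        replace model = "1007" if inrange(similar_model1, 8, 9)
        
        replace model = "106" if inrange(similar_model1, 10, 13)
        
        replace model = "107" if inrange(similar_model1, 14, 15)
        
        replace model = "108" if inrange(similar_model1, 16, 70) ///
                                | (similar_model1 == 1 & ustrpos(model, "108")>0)
        ...
        
        tempfile holding
        save `holding'
        
        restore
        * model itself is edited above, so it cannot also be a merge key;
        * match on the trimmed original spelling and let -update replace- overwrite model
        gen model_orig = itrim(trim(model))
        merge m:1 description model_orig COD_PROPULSION cilindrada potenciafiscal weight_max ///
            using `holding', update replace keepusing(model)
        * drop helper variables so the next brand can be merged the same way
        drop model_orig _merge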
        I'm not going to try to manage your relationship with your PI, who I don't know, from afar. But really, if you are doing this for 69 different brands, you will be doing 69 preserves, 69 restores, 69 saves, and 69 merges. That's thrashing the disk 276 times. And unless there is something about strgroup that really requires the collapsing, it is a lot of wasted effort. (OK, the -preserve-s and -restore-s might be being done in memory with -frame-s behind the scenes. But even 138 disk operations that are unnecessary is a waste.) Most PIs would appreciate a politely phrased suggestion that might improve the performance of the code; most would see it as a sign of initiative and understanding. If you are working for one of those few PIs who cannot tolerate being questioned or the thought that somebody else might be able to improve on their work, well, you have my sympathy. In that case, grin and bear it until you are out from under him or her, and for your next position try to find someone better to work with.
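        On the P.S. in #3: by default -merge- leaves the master data's values untouched for any variable that exists in both files. With -update-, missing values in the master are filled in from the using file; adding -replace- also lets nonmissing using values overwrite master values that differ. A small sketch with made-up data (the variable names id and x and the tempfile names are arbitrary):

```stata
* Master data: x is missing for id 2 (hypothetical values)
clear
input id x
1 10
2 .
3 30
end
tempfile master
save `master'

* Using data: a different x for id 1, a value for id 2
clear
input id x
1 11
2 22
end
tempfile patch
save `patch'

* -update- alone: only the missing master value (id 2) is filled in
use `master', clear
merge 1:1 id using `patch', update
list, noobs
drop _merge

* -update replace-: the conflicting master value (id 1) is overwritten too
use `master', clear
merge 1:1 id using `patch', update replace
list, noobs
```

        In both runs _merge is 4 where a missing value was filled in and 5 where nonmissing values conflicted; only with -replace- does the using value win the conflict.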



        • #5
          Clyde Schechter:

          Thank you so much for your answer and support!
          Thanks also for your feedback. I am far from having your expertise, and it's always nice to learn a little more about Stata thanks to you.

          I'll try to convince him of that.
          Thank you again for the shared code.

          Have a lovely day.
          Michael

          P.S.: -tempfile- is useful for some purposes. I didn't know that command. Thanks!
          Last edited by Michael Duarte Goncalves; 11 Jan 2024, 10:53.

