Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating conditional mean

    Hi,

    I hope someone can help me out with the following:
    Given a data set like this:
    HTML Code:
    CompanyID Capex   Post
    1          100      0
    1          200      1
    2          300      0
    2          200      1
    where Post equals 0 or 1 if an observation is before or after a certain point in time, respectively.

    How do I calculate mean (Capex) only of those companies where a company's mean (Capex) & Post == 1 is bigger than mean (Capex) & Post == 0?
    Using the example data, I want mean (Capex) to be 150.

    Thanks in advance.
    Last edited by Arne VS; 06 May 2020, 08:56.

  • #2
    Does this give you what you want?
    Code:
    bysort CompanyID (Post): gen diff = Capex[2]-Capex[1]
    bysort CompanyID: egen mean = mean(Capex) if diff>0

    Comment


    • #3
      Not quite, I will elaborate.
      I need to check first if a company's mean(Capex) for the years where Post==1 is bigger than a company's mean(Capex) for the years where Post==0.
      If this is true, then calculate mean(Capex) of all company observations where the above rule is satisfied (so calculate mean for Capex values both Post==0 and Post==1 cases).

      HTML Code:
      CompanyID Year Capex Post
      1         2012  100   0
      1         2013  120   0
      1         2014  200   1
      1         2015  300   1
      
      2         2012  200   0
      2         2013  140   0
      2         2014  100   1
      2         2015  200   1
      Using company 1 from the above example: calculate mean capex for company 1 where Post==0: 220/2=110, calculate mean capex for company 1 where Post==1: 500/2=250.
      Since mean where Post==1 is bigger than mean where Post==0 for company 1, calculate mean capex for every value of capex for that company: 720/4=180.
      Iterate this for all companies and display mean(Capex) of all companies that satisfied the constraint mentioned. I am only interested in the output of the end mean, not the mean per company.

      Comment


      • #4
        Arne,

        I changed the data for company 2 so that its Post=1 Capex mean would be lower than its Post=0 Capex mean. If I understand what you're asking, this should do it.
        Code:
        clear
        input CompanyID Year Capex Post
        1         2012  100   0
        1         2013  120   0
        1         2014  200   1
        1         2015  300   1
        2         2012  200   0
        2         2013  140   0
        2         2014  100   1
        2         2015  140   1
        end
        
        preserve
            collapse Capex, by(CompanyID Post)
            bysort CompanyID: gen diff = Capex[2]-Capex[1]
            levelsof CompanyID if diff>0, local(comp) separate(,)
        restore
        
        bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp')
        egen mean = mean(Capex)
        The variable mean is a constant that is the mean of companies that have higher Post=1 than Post=0 means.

        Lance

        Comment


        • #5
          The new variable mean does not seem to show the correct value. Using the provided data, the mean should be 180. 162,5 is the mean which includes values of company 2, which does not satisfy the constraint.
          Also, when I try running the code on my full dataset, I get the following error message when running:
          HTML Code:
          . bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp')
          expression too long
          r(130);
          Last edited by Arne VS; 06 May 2020, 10:52.

          Comment


          • #6
            I can't reproduce the error you are getting. Are you running the full set of code at the same time? The local macro comp needs to still be in memory for the besort command after the restore.

            The comp_mean variable contains the value of 180. Do you want that value to be a constant for all observations? This would do that.

            Code:
            clear
            input CompanyID Year Capex Post
            1         2012  100   0
            1         2013  120   0
            1         2014  200   1
            1         2015  300   1
            2         2012  200   0
            2         2013  140   0
            2         2014  100   1
            2         2015  140   1
            end
            
            preserve
                collapse Capex, by(CompanyID Post)
                bysort CompanyID: gen diff = Capex[2]-Capex[1]
                levelsof CompanyID if diff>0, local(comp) separate(,)
            restore
            
            bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp')
            gen mean = comp_mean[1]
            Of course, that assumes that the first observation satisfied your criteria.

            Comment


            • #7
              Originally posted by Lance Erickson View Post
              Arne,

              I changed the data for company 2 so that its Post=1 Capex mean would be lower than its Post=0 Capex mean. If I understand what you're asking, this should do it.
              Code:
              clear
              input CompanyID Year Capex Post
              1 2012 100 0
              1 2013 120 0
              1 2014 200 1
              1 2015 300 1
              2 2012 200 0
              2 2013 140 0
              2 2014 100 1
              2 2015 140 1
              end
              
              preserve
              collapse Capex, by(CompanyID Post)
              bysort CompanyID: gen diff = Capex[2]-Capex[1]
              levelsof CompanyID if diff>0, local(comp) separate(,)
              restore
              
              bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp')
              egen mean = mean(Capex)
              The variable mean is a constant that is the mean of companies that have higher Post=1 than Post=0 means.

              Lance
              Using this set of code, comp_mean is calculated for companies which meet the constraint. This is correct.
              Then, I need to have a new mean calculated for all comp_mean.
              Basically, mean(comp_mean). But of course, only one comp_mean for a given CompanyID should be used in calculating the new mean, since for some observations, I have only 1 Post==0 observation, while for others I have 2 or more.
              HTML Code:
              comp_mean    mean
              180         162.5
              180         162.5
              180         162.5
              180         162.5
                          162.5
                          162.5
                          162.5
                          162.5
              I would pick the comp_mean for the relevant CompanyID and calculate the new mean for those values of comp_mean. Example: comp_mean for CompanyID1 + comp_mean for CompanyIDx... divided by the amount of companies that have a value for comp_mean.

              Comment


              • #8
                ok...I think this gets you where you want now...assuming I haven't still misunderstood something. I added another company that satisfies the criterion to make is easier to see if the code does what is intended.
                Code:
                clear
                input CompanyID Year Capex Post
                1         2012  100   0
                1         2013  120   0
                1         2014  200   1
                1         2015  300   1
                2         2012  200   0
                2         2013  140   0
                2         2014  100   1
                2         2015  140   1
                3         2012  100   0
                3         2013  120   0
                3         2014  500   1
                3         2015  800   1
                end
                
                preserve
                    collapse Capex, by(CompanyID Post)
                    bysort CompanyID: gen diff = Capex[2]-Capex[1]
                    levelsof CompanyID if diff>0, local(comp) separate(,)
                restore
                
                bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp')
                
                preserve
                    duplicates drop CompanyID, force
                    egen mean = mean(comp_mean) if comp_mean!=.
                    replace mean = mean[_n-1] if mean==.
                    tempfile mean
                    save "`mean'"
                restore
                
                merge m:1 CompanyID using "`mean'", keepusing(mean)
                Although there might be a more efficient way to do it, it seems this should work fine as long as the first company satisfies the criterion.

                Comment


                • #9
                  I would like to thank you for all the help you have provided! This displays the mean I am interested in.
                  However, for my actual (large) dataset, stata returns an error because there are too many ID's in inlist. Is there any way to get around this?

                  Comment


                  • #10
                    The inlist function is limited to 250 arguments, and the name of the variable counts as one of the arguments. One idea would be to loop through sets of companies. I got this to work with the toy dataset and I *think* it will work with your larger dataset.
                    Code:
                    clear
                    input CompanyID Year Capex Post
                    1         2012  100   0
                    1         2013  120   0
                    1         2014  200   1
                    1         2015  300   1
                    2         2012  200   0
                    2         2013  140   0
                    2         2014  100   1
                    2         2015  140   1
                    3         2012  100   0
                    3         2013  120   0
                    3         2014  500   1
                    3         2015  800   1
                    end
                    
                    preserve
                        collapse Capex, by(CompanyID Post)
                        bysort CompanyID: gen diff = Capex[2]-Capex[1]
                        local l = 0
                        forvalues i = 1/`=ceil(_N/249)' {
                            local f = 1+`l'
                            if "`i'"=="1" local l = `f'+`=_N-1'
                            if "`i'"!="1" local l = `f'+248
                            levelsof CompanyID if diff>0 in `f'/`l', local(comp`i') separate(,)
                            if "`i'"=="1" local inlist inlist(CompanyID,`comp`i'')
                            if "`i'"!="1" local inlist `inlist' | inlist(CompanyID,`comp`i'')
                        }
                    restore
                    
                    bysort CompanyID: egen comp_mean = mean(Capex) if `inlist'
                    
                    preserve
                        duplicates drop CompanyID, force
                        egen mean = mean(comp_mean) if comp_mean!=.
                        replace mean = mean[_n-1] if mean==.
                        tempfile mean
                        save "`mean'"
                    restore
                    
                    merge m:1 CompanyID using "`mean'", keepusing(mean)
                    I have to imagine there is a more efficient way to do this...perhaps someone else will see the thread and have a better solution.

                    Comment


                    • #11
                      When running this I get the error: "observation numbers out of range r(198)" after the loop.
                      After the line:
                      HTML Code:
                      bysort CompanyID: egen comp_mean = mean(Capex) if `inlist'
                      I get the same error
                      "expression too long r(130);"

                      Comment


                      • #12
                        I found the problem causing the out of range error. The code below fixes it. To figure it out, I had to create a dataset that was larger. The first part of the code does that.

                        This approach runs into an internal limit if there are too many companies. The code will work for 550 companies. I didn't fine tune the limit but 600 companies trips the limit. If you have more companies than that, one idea is to break the data into smaller datasets, run the code and then append them back together again. Again, though, I imagine there is a more efficient approach to do this that others might identify for you.
                        Code:
                        clear 
                        set obs 550
                        gen CompanyID = _n
                        expand 4
                        egen Year = seq(), from(2012) to(2015)
                        gen Capex = round(uniform()*999)
                        gen Post=Year>=2014
                        
                        preserve
                            collapse Capex, by(CompanyID Post)
                            bysort CompanyID: gen diff = Capex[2]-Capex[1]
                            local l = 0
                            forvalues i = 1/`=ceil(_N/249)' {
                                local f = 1+`l'
                                local l = `f'+248
                                if "`i'"=="`=ceil(_N/249)'" local l = _N
                                levelsof CompanyID if diff>0 in `f'/`l', local(comp`i') separate(,)
                                if "`i'"=="1" local inlist inlist(CompanyID,`comp`i'')
                                if "`i'"!="1" local inlist `inlist' | inlist(CompanyID,`comp`i'')
                            }
                        restore
                        
                        bysort CompanyID: egen comp_mean = mean(Capex) if `inlist'
                        
                        preserve
                            duplicates drop CompanyID, force
                            egen mean = mean(comp_mean) if comp_mean!=.
                            replace mean = mean[_n-1] if mean==.
                            tempfile mean
                            save "`mean'"
                        restore
                        
                        merge m:1 CompanyID using "`mean'", keepusing(mean)

                        Comment


                        • #13
                          Thank you for the help!
                          My dataset consists of around 3500 companies, so I will try breaking my dataset into several smaller sets and try running the code.

                          Comment

                          Working...
                          X