  • #16
    Hi Lucien,
    I don't know why this problem is occurring; initialize is a standard parallel command. My last suggestion on this topic is to check that you have the latest version of parallel installed.
    As for coefficients, Stata certainly does calculate them: your program above saves them to a dataset, and you can graph the coefficient estimates contained in that dataset. I don't know exactly what kind of graph you want, but I suspect you may be looking for a combination of twoway scatter (for the coefficient estimates) and twoway rspike (for the CI estimates).
    Something like
    Code:
    graph twoway rspike ub_90 lb_90 i_pos || scatter beta i_pos
    All the best,
    Matt
    Last edited by Matthew Alexander; 26 Dec 2021, 11:29.

    • #17
      Matthew, you are right. I reinstalled the latest version of the program as you suggested.
      And a big thank you for the graph code; it works, and I think it's better than what I was doing.
      The program now seems to run without errors, but there is a problem with the output.

      Normally, with 10 iterations of the loop, this is the command I run and the result I should get.
      Code:
      preserve
                      
                      drop beta se i_pos lb_* ub_*
                      g i_pos = .
                      g beta = .
                      g se = .
                      g lb_90 = .
                      g lb_95 = .
                      g lb_99 = .
                      g ub_90 = .
                      g ub_95 = .
                      g ub_99 = .
                      
                      qui {
                          sum rp_partner_rank, d
                          
                          forvalues i = `r(min)' (1) 10 {
                                  
                                      xi: reg rp_avg_pc_epi_gap_abs lag_fdi_in_all $contrlsorder i.year i.pccountry i.rpcountry if rp_partner_rank <= `i', vce(cl id)
                                      *eststo
                                      lincom _b[lag_fdi_in_all], l(95)
                                      replace beta = r(estimate) in `i'
                                      replace se = r(se) in `i'
                                      replace i_pos = `i' in `i'
                                      replace lb_95 = r(lb) in `i'
                                      replace ub_95 = r(ub) in `i'
                                      
                                      lincom _b[lag_fdi_in_all], l(90)
                                      replace lb_90 = r(lb) in `i'
                                      replace ub_90 = r(ub) in `i'
                                      
                                      lincom _b[lag_fdi_in_all], l(99)
                                      replace lb_99 = r(lb) in `i'
                                      replace ub_99 = r(ub) in `i'
                                  
                          
                          }
                      } 
                  
                      
                      *twoway rarea ub_90 lb_90 i_pos , astyle(ci) || ///
                      *line beta i_pos
                      
                      graph twoway rspike ub_90 lb_90 i_pos || scatter beta i_pos
      
                      save "$DataCreated\asup", replace
                  restore
      [Figure: Graph3.png]

      Normally the variable i_pos should have length 10.

      With the parallel program (without preserve) and 4 clusters, the length is 40: i_pos runs 1 ... 10, 1 ... 10, 1 ... 10, 1 ... 10. Moreover, the betas differ for the same i_pos value (1, 1, 1, 1). I have the impression that Stata creates 4 subsamples (one per cluster) and runs the 10 iterations within each subsample.

      Code:
      *parallel setclusters 4
                  parallel initialize 4, force s("C:\Stata 16 MP\StataMP-64.exe")
                  
                  capture program drop savereg
                  program define savereg 
                      qui {
                          sum rp_partner_rank, d
                          
                          forvalues i = `r(min)' (1) 10 {
                                  
                                      xi: reg rp_avg_pc_epi_gap_abs lag_fdi_in_all $contrlsorder i.year i.pccountry i.rpcountry if rp_partner_rank <= `i', vce(cl id)
                                      *eststo
                                      lincom _b[lag_fdi_in_all], l(95)
                                      replace beta = r(estimate) in `i'
                                      replace se = r(se) in `i'
                                      replace i_pos = `i' in `i'
                                      replace lb_95 = r(lb) in `i'
                                      replace ub_95 = r(ub) in `i'
                                      
                                      lincom _b[lag_fdi_in_all], l(90)
                                      replace lb_90 = r(lb) in `i'
                                      replace ub_90 = r(ub) in `i'
                                      
                                      lincom _b[lag_fdi_in_all], l(99)
                                      replace lb_99 = r(lb) in `i'
                                      replace ub_99 = r(ub) in `i'
                                  
                          
                          }
                      } 
                  
                  end
                  
                  parallel, prog(savereg): savereg
      [Figure: Graph4.png]

      • #18
        Hi Lucien,
        Happy to help.
        Personally, I do not use the setclusters syntax. This does not mean it is inappropriate in your case, though I note it is not included in the help file here: https://github.com/gvegayon/parallel
        If you are having issues then I advise you to drop
        Code:
        setclusters 4
        and simply use the following
        Code:
         parallel initialize 4, force s("C:\Stata 16 MP\StataMP-64.exe")
        If this new problem still persists, try explaining it again in as clear a manner as possible - I didn't fully understand the explanation above.
        All the best,
        Matt
        Last edited by Matthew Alexander; 26 Dec 2021, 12:42.

        • #19
          I'll add too that you are correct about how parallel works.
          It splits the dataset into as many subsets as the number of clusters specified, and then runs the command on each of these subsets simultaneously.
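
          A minimal sketch for seeing this split directly, assuming the parallel package is installed and using Stata's bundled auto dataset (74 observations) rather than the data from this thread. The program name tagchunk and the variables chunk and n_in_chunk are made up for the illustration.

          ```stata
          sysuse auto, clear
          parallel initialize 2, force

          capture program drop tagchunk
          program define tagchunk
              // $pll_instance is set by parallel inside each child process
              gen chunk = $pll_instance
              // _N here is the size of the chunk this child received,
              // not the size of the full dataset
              gen n_in_chunk = _N
          end

          parallel, prog(tagchunk): tagchunk

          // Once parallel has appended the chunks back together, each
          // observation carries the id and size of the chunk it was in
          tabulate chunk
          summarize n_in_chunk
          ```

          If n_in_chunk is about half of 74 in every row, any estimation command placed inside the program would likewise have seen only its own chunk.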

          • #20
            Hi Matthew,
            I tried both versions (setclusters and initialize); both give the same result.
            I'll give up and try to find an alternative in R. It should do the job without trouble, although my supervisor is more familiar with Stata.
            By the way, I exchanged messages with Mr. George Vega Yon, the author of the parallel command, and sent him this discussion page so that he can follow it and step in. I think he is offline at the moment, so let's wait and see what he thinks.
            Anyway, I want to thank you very much for all your help. It is really wonderful to have people like you here. I wish you success in your projects, and I hope that Stata will improve its parallelization commands in the future. Thanks a lot
            Lucien

            • #21
              Hi Lucien,
              It is somewhat disappointing to hear that you have decided to give up on Stata for this particular analysis.
              I am certain that you are but one step away from achieving what you want.
              If you explain one last time this new problem related to the range of the variable - i_pos - then I do believe we can "crack" the case.
              If not, then I'm happy to have helped regardless.
              All the very best,
              Matt

              • #22
                In fact, I've gone over the above. And the issue, I think, is that you are not telling parallel to do what you want.
                The issue is that you are running the regression within parallel.
                Parallel splits the datasets into 4 subsets in the first instance, thus when you include an estimation command within parallel you are estimating 4 separate regressions at each value between 1 and 10 within each subset of your dataset.
                I assume that you want to run the regression on the full population - to do this you will need to insert the regression command outside of the parallel program.
                For your needs, one option is to write a separate regression loop and save the estimates.
                That is
                Code:
                forvalues i = 1/10 {
                     xi: reg rp_avg_pc_epi_gap_abs lag_fdi_in_all $contrlsorder i.year i.pccountry i.rpcountry if rp_partner_rank <= `i', vce(cl id)  
                     estimates save est`i', replace
                }
                Then you should set your parallel program to run after the initial regression loop. Of course, I suspect that the whole reason you want to use parallel is to speed up the computation of the regressions rather than the post-estimation via lincom etc., which is no longer the case with the above code.
                A broader point is that it is somewhat unusual to estimate regressions by restricting the sample using if i_pos < i. Assuming i_pos is some kind of group identifier, estimates from such regressions are not directly comparable between groups - rather estimates represent the effect within that particular group. A more standard method would be to estimate the full model just once and then get marginal predictions/effects by values of i_pos using the excellent and in-built - margins - command and its - by - option. The margins documentation is very clear and helpful, and you can easily save the results and thus produce the sort of plots seen above.
                Hope this helps,
                Matt
                Last edited by Matthew Alexander; 26 Dec 2021, 14:15.

                • #23
                  I didn't understand this part of post #21: "If you explain one last time this new problem related to the range of the variable - i_pos -".
                  In any case, the variable i_pos gives the position of a partner for a country: 1 is the first partner, 2 the second, and n the n-th partner.
                  I want to see the effect of the lag_fdi_in_all variable on the dependent variable, depending on the partners considered.


                  Yes, I understand what you mean. But what I'm trying to do is much more complicated. There are several other regressions that I have to run, between 14 and 20 (all very slow), and for each variable of interest entering the regression (e.g. lag_fdi_in_all in the code), I have to compare its coefficient according to the position of the investment partners (rp_partner_rank), trade partners (rp_trade_rank), and other partners.
                  For a single regression, the partner position can go from 1 to 200, so I was hoping above all to save time through parallelization with 12 cores, if possible.
                  Alas, it looks like this time saving will be complicated to get.
                  I have never used the command for marginal predictions/effects. I will read its documentation and try it to see what happens.
                  Thanks again for the suggestion.
                  Many thanks Matthew. I'll keep you posted on what happens next.
                  Thanks again.
                  Last edited by Lucien AHOUANGBE; 26 Dec 2021, 16:34.

                  • #24
                    Well, Lucien, that certainly does sound like a rather complicated endeavour. Make sure that what you are doing is really what you want to do before investing what I imagine will be a lot of time.
                    I assume that when you say you want to estimate "the effect depending on the partners considered", you mean that you want to estimate the effect for each partner. If so, it seems to me that a more efficient, natural and accessible method would be to estimate a single regression model and then calculate Average Marginal Effects for each predictor using - margins - at each value of i_pos, i_com and so forth, using option - by - or - over -.
                    Specifically, the syntax would be something like
                    Code:
                    reg rp_avg_pc_epi_gap_abs lag_fdi_in_all $contrlsorder i.year i.pccountry i.rpcountry, vce(cl id)
                    
                    margins, dydx(lag_fdi_in_all) over(rp_partner_rank) atmeans
                    Food for thought at the very least.
                    I wish you all the best, feel free to update me on the project.
                    Matt
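
                    One point that may be worth checking before relying on the code above: in a linear model where lag_fdi_in_all enters without interactions, dydx(lag_fdi_in_all) is the same coefficient for every observation, so over(rp_partner_rank) would report an identical effect at each rank. A hedged sketch of one way to let the effect differ by rank is to interact the two variables. It assumes rp_partner_rank takes a manageable number of non-negative integer values and that $contrlsorder contains neither variable; the thread confirms neither assumption.

                    ```stata
                    // Interact the variable of interest with rank so that its
                    // slope is allowed to vary by rank
                    reg rp_avg_pc_epi_gap_abs c.lag_fdi_in_all##i.rp_partner_rank ///
                        $contrlsorder i.year i.pccountry i.rpcountry, vce(cl id)

                    // Average marginal effect of lag_fdi_in_all within each rank
                    margins, dydx(lag_fdi_in_all) over(rp_partner_rank)
                    ```

                    This yields rank-by-rank effects, which is not the same quantity as re-estimating the model on cumulative subsamples (rp_partner_rank <= i) as in the loop earlier in the thread.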
                    Last edited by Matthew Alexander; 26 Dec 2021, 17:07.

                    • #25
                      But by the way the variable i-pos gives the position of the partners for a country. If it is 1, it is the first partner; 2, the second; and n the n-th partner.
                      No, sorry, I made a mistake above.
                      In fact, we want to take into account the size of partner countries, not individually as in the code you sent me, but according to their importance in trade with the host country.
                      The variable i_pos gives the position of a partner for a country: 1 is the first and largest partner, 2 the second-largest, and n the n-th largest.

                      So, for example, i_pos <= 15 means that we want to consider the 15 largest partners, and i_pos <= 100 the 100 largest.

                      I was trying to understand your code, and I think it only takes the effect of each partner individually, according to its position in the variable i_pos.
                      I was trying to see whether we can take the first 15 into account at the same time. I think this must exist in Stata; I'll continue searching too.
                      I don't know if this is clear; if not, I will try to explain better.

                      I don't know if it will work, but I have an idea based on your code: combine the subpop option with a loop, like this

                      Code:
                       reg rp_avg_pc_epi_gap_abs lag_fdi_in_all $contrlsorder i.year i.pccountry i.rpcountry, vce(cl id)
                      
                      forvalues i = 1/10 {
                      margins, dydx(lag_fdi_in_all) subpop(if rp_partner_rank <= `i') atmeans
                      // store the result
                      }

                      But I have started another job for the moment and the memory is saturated. I will wait for memory to free up a bit and then run code like this.
                      Thanks a lot, Matthew
                      Last edited by Lucien AHOUANGBE; 26 Dec 2021, 18:56.

                      • #26
                        Hi Lucien,
                        I now understand what you want to do, and why you want to use parallel. This will certainly be a computationally costly analysis.
                        That said, I believe we were very close to achieving what you want with the earlier parallel program. The only problem was that parallel produced an estimate for a given i_pos value within each cluster/subset, hence 40 estimates when there should have been just 10.
                        For the final time, I now believe I know what the problem was. In short, parallel splits the dataset into four subsets. Therefore, when you replace the ith observation in each subset with the ith coefficient value, what effectively happens is that for three of the four subsets the ith observation does not correspond to the ith coefficient. For example, say you used 5 cores/clusters and your loop went from 1/10: the fifth parallel subset will estimate the coefficient where i_pos is a) <= 9 and b) <= 10, but the 9th and 10th observations in the fifth subset will not correspond to i_pos = 9 and i_pos = 10.
                        The solution, I believe, is to use a postfile rather than directly replacing the values in the subset. Try the following

                        Code:
                        
                        parallel initialize 4, force s("C:\Stata 16 MP\StataMP-64.exe")
                                    
                                    capture program drop savereg
                                    program define savereg
                        
                                    tempname tempf
                                    postfile `tempf' beta se i_pos lb_95 ub_95 lb_90 ub_90 lb_99 ub_99 using mypostfile, replace
                        
                                  
                                        qui {
                                            sum rp_partner_rank, d
                                            
                                            forvalues i = `r(min)' (1) 10 {
                                                    
                                                        xi: reg rp_avg_pc_epi_gap_abs lag_fdi_in_all $contrlsorder i.year i.pccountry i.rpcountry if rp_partner_rank <= `i', vce(cl id)
                                                        *eststo
                                                        lincom _b[lag_fdi_in_all], l(95)
                                                        local beta = r(estimate)
                                                        local se = r(se)
                                                        local i_pos = `i'
                                                        local lb_95 = r(lb)
                                                        local ub_95 = r(ub)
                                                        
                                                        lincom _b[lag_fdi_in_all], l(90)
                                                        local lb_90 = r(lb)
                                                        local ub_90 = r(ub)
                                                        
                                                        lincom _b[lag_fdi_in_all], l(99)
                                                        local lb_99 = r(lb)
                                                        local ub_99 = r(ub)
                        
                                                        post `tempf' (`beta') (`se') (`i_pos') (`lb_95') (`ub_95') (`lb_90') (`ub_90') (`lb_99') (`ub_99')
                                                    
                                            
                                            }
                                        }
                                    
                              postclose `tempf'
                                    
                         end
                          
                         parallel, prog(savereg): savereg
                        Then, open the file mypostfile, which is saved within your working directory. Inspect the results. I think they will be what you are looking for.
                        Matt
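
                        For inspecting the results, a short sketch that assumes the program above has finished and that mypostfile.dta sits in the current working directory. The variable names are those declared on the postfile line.

                        ```stata
                        use mypostfile, clear
                        sort i_pos
                        list i_pos beta se lb_90 ub_90, sepby(i_pos)

                        // Same plot as earlier in the thread: 90% CI spikes
                        // with the point estimates on top
                        graph twoway rspike ub_90 lb_90 i_pos || scatter beta i_pos
                        ```

                        If several rows share the same i_pos, the estimates are still being produced once per subset rather than once on the full sample.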
                        Last edited by Matthew Alexander; 26 Dec 2021, 19:27.

                        • #27
                          If this is still not giving you what you want, my very final suggestion is to run the same program above, but with the parallel -by- option like so
                          Code:
                           
                            parallel, prog(savereg) by(rp_partner_rank): savereg
                          or by(i_pos), I'm not sure which you want.
                          Best,
                          Matt

                          • #28
                            Hello Matthew,
                            Sorry for the delay; I had been logged out since my last message.
                            Thank you for the code. It taught me a lot of new things.
                            Unfortunately, it still doesn't work.
                            I extended the code to record the number of observations entering each subsample.
                            I normally have 105254 observations, but with parallelization I end up with about 26314 in each subsample and still 40 betas instead of 10.
                            Code:
                            . sum
                            
                                Variable |        Obs        Mean    Std. Dev.       Min        Max
                            -------------+---------------------------------------------------------
                                    beta |         40     .001144    .0018185  -.0019402   .0059244
                                      se |         40    .0014964    .0007972   .0005478   .0038555
                                   i_pos |         40         5.5    2.908872          1         10
                                   lb_95 |         40   -.0018164    .0012459  -.0044984   .0004264
                                   ub_95 |         40    .0041044    .0031552   .0004887   .0124325
                            -------------+---------------------------------------------------------
                                   lb_90 |         40    -.001335    .0012267  -.0040737   .0011571
                                   ub_90 |         40     .003623    .0029247   .0001881   .0112049
                                   lb_99 |         40   -.0027679    .0014181  -.0059822  -.0004713
                                   ub_99 |         40    .0050559    .0036146   .0010598    .014845
                                       N |         40     26313.5    .5063697      26313      26314
                            For this:
                            parallel, prog(savereg) by(rp_partner_rank): savereg
                            There is an error when I run it.
                            Code:
                            .  parallel, prog(savereg) by(rp_partner_rank): savereg
                            Data not sorted
                            r(5);
                            Thank you very much for the support.
                            For the moment I'm going to leave this part, I'm going to try to move forward on the results I have now and try to find a solution. I will get back to you as soon as possible.
                            Lucien
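
                            On the r(5) error: that is Stata's standard "not sorted" return code, so one thing to try is sorting on the by() variable before the call. This is an untested assumption about what parallel checks, not something confirmed in this thread.

                            ```stata
                            // Sort on the by() variable first so that a sortedness
                            // check on rp_partner_rank can pass
                            sort rp_partner_rank
                            parallel, prog(savereg) by(rp_partner_rank): savereg
                            ```

                            Even if this runs, by() only keeps observations with the same rp_partner_rank together in one chunk. Each child process would still see a subset of the data, so it would not by itself produce full-sample regressions.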
                            Last edited by Lucien AHOUANGBE; 27 Dec 2021, 18:44.

                            • #29
                              Hi Lucien,
                              I looked into the issue further. In short, parallel splits the dataset into n subsets, and so I do not think it is possible to run regressions within parallel on the full sample as you want to do.
                              Perhaps there is some kind of workaround, though I do not know it. The best solution may be simply to allow your PC time to compute the many regressions you want to run. I know you have a large number of observations, but it is still feasible. I also noted that you cluster your standard errors by id (vce(cl id)). If your data is panel data (that is, repeated observations of clusters), then I strongly advise you to use xtreg to account for unobserved differences between clusters (unobserved heterogeneity). Option - re - will estimate a random effects model, and - fe - will estimate a fixed effects model.
                              Let me know if you would like to know more,
                              Matt
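
                              As a hedged illustration of the xtreg suggestion, the sketch below assumes that id identifies the panel unit (for example a country pair) and that pccountry and rpcountry do not vary within id. The thread does not confirm either point.

                              ```stata
                              // Declare the panel unit (assumed to be id); xtreg only
                              // needs the panel variable, not a time variable
                              xtset id

                              // Fixed effects: regressors that are constant within id,
                              // such as the country dummies, are absorbed by the unit
                              // effects and are therefore left out here
                              xtreg rp_avg_pc_epi_gap_abs lag_fdi_in_all $contrlsorder i.year, ///
                                  fe vce(cluster id)

                              // Random effects alternative, which can keep the country dummies
                              xtreg rp_avg_pc_epi_gap_abs lag_fdi_in_all $contrlsorder ///
                                  i.year i.pccountry i.rpcountry, re vce(cluster id)
                              ```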

                              • #30
                                Hi all,

                                Lucien, as Matthew points out, the default behavior of parallel is to split the dataset into as many chunks as the number of threads you are using. Nonetheless, parallel is perfectly capable of doing what you are trying to do. What you need to do is:
                                1. Write a program that loads the dataset with use ...
                                2. Within that program, make use of the parallel macros $PLL_CHILDREN (global, number of threads) and $pll_instance (global, takes values 1 through $PLL_CHILDREN) to control what chunk of the loop is done per thread.
                                3. At the end of the program, you can do something like save "iteration_$pll_instance`'.dat", replace to make each thread save a different version of the file.
                                4. Then you can use parallel append, do(yourprogram) e("iteration_%g.dat, 1/$PLL_CHILDREN").
                                More examples on how to use the parallel macros here and here.
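
                                The four steps above can be sketched roughly as follows. This is an untested sketch, not George's code. The dataset path mydata.dta is hypothetical, the round-robin assignment of loop values to threads is only one possible choice, and the sketch swaps in the nodata option plus a plain append loop in place of parallel append. It also assumes that child processes share the parent's working directory and inherit its globals (such as $contrlsorder).

                                ```stata
                                parallel initialize 4, force s("C:\Stata 16 MP\StataMP-64.exe")

                                capture program drop savereg
                                program define savereg
                                    // Step 1: every child loads the FULL dataset itself
                                    use "mydata.dta", clear              // hypothetical path

                                    // Step 3: one results file per child, named by $pll_instance
                                    tempname tempf
                                    postfile `tempf' i_pos beta se lb_90 ub_90 ///
                                        using "iteration_${pll_instance}.dta", replace

                                    forvalues i = 1/10 {
                                        // Step 2: round-robin split of the loop; child k takes
                                        // i = k, k + PLL_CHILDREN, k + 2*PLL_CHILDREN, ...
                                        if mod(`i' - 1, $PLL_CHILDREN) != $pll_instance - 1 {
                                            continue
                                        }
                                        quietly xi: reg rp_avg_pc_epi_gap_abs lag_fdi_in_all ///
                                            $contrlsorder i.year i.pccountry i.rpcountry ///
                                            if rp_partner_rank <= `i', vce(cl id)
                                        quietly lincom _b[lag_fdi_in_all], l(90)
                                        post `tempf' (`i') (r(estimate)) (r(se)) (r(lb)) (r(ub))
                                    }
                                    postclose `tempf'
                                end

                                // nodata: do not split or append the data currently in memory
                                parallel, nodata prog(savereg): savereg

                                // Step 4 (done here with a plain loop): stack the per-child files
                                clear
                                forvalues k = 1/$PLL_CHILDREN {
                                    append using "iteration_`k'.dta"
                                }
                                sort i_pos
                                ```

                                With 4 children and a loop over 1/10, child 1 would handle i = 1, 5, 9, child 2 would handle i = 2, 6, 10, and so on, so the combined file should hold one row per i_pos, each estimated on the full sample. Only the 90% interval is stored to keep the sketch short; the 95% and 99% intervals could be posted the same way.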

                                HIH

                                George
                                Last edited by George Vega; 30 Dec 2021, 12:29. Reason: didn't like how [CODE] looked like
