  • Takes forever to run regression

    I searched this forum and found several posts about regressions that take forever to run. I am now facing the same old problem with a logistic regression.

    My code:
    Code:
    quietly logistic UP sys_rtsenetload_ramp_absGW lba_rtseload_ramp_absGW temperature windspeed PL_status ibn.cpname_cat ibn.lbaname_cat ibn.mpname_cat ibn.MarketDate ibn.hour_of_week_cat
    My dataset: ~45 million observations

    The dependent variable UP is dichotomous (0/1), so I chose logistic regression as my baseline before trying other methods. I also need to include several sets of dummies to control for different fixed effects. The regression has been running for two weeks and I have no idea when it may complete.

    I would appreciate any suggestions, or should I simply keep waiting until the regression completes?

    Best,
    CHT




  • #2
    Chen:
    the trivial fix is to run your regression code on a subsample of your dataset (say, 100,000 observations) and see whether any weird behaviour shows up.
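
    A minimal sketch of that check, assuming the variable names from the first post. The seed and the iteration cap are arbitrary choices made for illustration, and `i.` is used in place of `ibn.` because, with a constant in the model, one level of each factor has to be omitted anyway.

    ```stata
    * Sketch only: fit the same model on a random 100,000-observation subsample
    set seed 2718                 // arbitrary seed, so the subsample is reproducible
    preserve                      // keep a copy of the full dataset
    sample 100000, count          // keep exactly 100,000 randomly chosen observations
    * no -quietly- here, so the iteration log and notes on omitted terms are visible
    logistic UP sys_rtsenetload_ramp_absGW lba_rtseload_ramp_absGW temperature ///
        windspeed PL_status i.cpname_cat i.lbaname_cat i.mpname_cat            ///
        i.MarketDate i.hour_of_week_cat, iterate(50)   // iterate(50): arbitrary cap
    restore                       // bring back the full dataset
    ```

    Notes about omitted levels, perfectly predicted outcomes, or a log likelihood that stops improving would be the kind of weird behaviour to look for.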
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #3
      45 million observations is not the most important measure here. What matters more for convergence is the number of parameters you are trying to estimate, which is a measure of model complexity. Your predictors are

      Code:
      lba_rtseload_ramp_absGW
      temperature
      windspeed
      PL_status
      ibn.cpname_cat
      ibn.lbaname_cat
      ibn.mpname_cat
      ibn.MarketDate
      ibn.hour_of_week_cat
      I am concerned about the large number of factor variables there. Hour of week? Is that 7 x 24 = 168 distinct values? Market date? How many distinct values?
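
      One way to answer those questions is to count the levels of each factor variable before fitting anything. A rough sketch, assuming the variable names in the original command; `tag_*` is a throwaway name introduced here for illustration.

      ```stata
      * Sketch: count distinct values of each factor variable
      foreach v of varlist cpname_cat lbaname_cat mpname_cat MarketDate hour_of_week_cat {
          quietly egen byte tag_`v' = tag(`v')   // 1 for one observation per distinct value
          quietly count if tag_`v'               // number of tagged observations = number of levels
          display "`v': " r(N) " distinct values"
          drop tag_`v'
      }
      ```

      The sum of those counts is roughly the number of indicator parameters the model is being asked to estimate.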

      Conversely, what kind of dependence do you expect for temperature and windspeed? For example, energy consumption for buildings can go up as it gets colder (heating) and as it gets warmer (air conditioning), but not symmetrically. I have no idea what your response indicates, but dependence on climatic variables can reasonably be nonlinear and even non-monotonic.
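
      As an illustration only, curvature in the climatic variables could be allowed with squared terms or with a restricted cubic spline. The choice of four knots is arbitrary and `temp_rcs` is a hypothetical stub name; neither comes from the original post.

      ```stata
      * Option A: quadratic terms via factor-variable notation
      logistic UP c.temperature##c.temperature c.windspeed##c.windspeed PL_status

      * Option B: restricted cubic spline in temperature (4 knots is an assumption)
      mkspline temp_rcs = temperature, cubic nknots(4)   // creates temp_rcs1-temp_rcs3
      logistic UP temp_rcs* c.windspeed##c.windspeed PL_status
      ```

      Either form lets the fitted relationship rise at both ends or bend asymmetrically, at the cost of only a few extra parameters.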

      In short, extreme difficulties in fitting a model don't reflect dataset size so much as a poorly specified model. I'd start simpler and see where the problems set in.
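
      A sketch of what starting simpler might look like: fit the continuous predictors first, then add one set of indicators at a time and record convergence and run time. The variable names are taken from the original command; the order in which the factors are added is arbitrary, and the date and hour-of-week indicators are deliberately left out here.

      ```stata
      * Sketch: build the model up step by step and watch where trouble starts
      local base sys_rtsenetload_ramp_absGW lba_rtseload_ramp_absGW temperature windspeed PL_status
      local extra
      foreach f in cpname_cat lbaname_cat mpname_cat {
          local extra `extra' i.`f'          // add one more set of indicators
          timer clear 1
          timer on 1
          quietly logistic UP `base' `extra'
          timer off 1
          quietly timer list 1               // leaves elapsed seconds in r(t1)
          display "added `f': converged = " e(converged) ///
              ", rank of e(V) = " e(rank) ", seconds = " %9.1f r(t1)
      }
      ```

      Running this on a subsample first keeps each step cheap.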

      • #4
        Thanks for the replies.

        I am analyzing the performance of wind generation, and given the "unpredictable" nature of wind, you can imagine the reason for including both hour-of-week and market date dummies. My dataset covers 2014 to 2018, hence there are over 1,800 distinct values for market date and over 7,000 distinct values for hour-of-week (by year, month, week, hour).

        If I am to start simpler, how would you suggest simplifying the time dummies that I currently have? Thank you.

        • #5
          So you're trying to fit a model with of the order of 10,000 parameters!

          Often you can try fitting a trend in year together with sinusoids in month of year, day of week, and hour of day. That could be perhaps 2 parameters for trend and 4 parameters for each sinusoid (two pairs of sine and cosine terms often suffice).
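
          A rough sketch of that idea, under several assumptions that are not in the thread: `MarketDate` is a Stata daily date, `hour_of_day` is a hypothetical variable holding the hour (0-23), the trend is taken as linear plus quadratic in year (one reading of "2 parameters"), and centring at 2016 is an arbitrary choice.

          ```stata
          * Sketch: replace thousands of time dummies with a trend and sinusoids
          gen yr  = year(MarketDate) - 2016      // centred year, for the trend
          gen moy = month(MarketDate)            // month of year, 1-12
          gen dow = dow(MarketDate)              // day of week, 0-6 (0 = Sunday)

          forvalues k = 1/2 {                    // two sine-cosine pairs per cycle
              gen sin_moy`k' = sin(2 * _pi * `k' * moy / 12)
              gen cos_moy`k' = cos(2 * _pi * `k' * moy / 12)
              gen sin_dow`k' = sin(2 * _pi * `k' * dow / 7)
              gen cos_dow`k' = cos(2 * _pi * `k' * dow / 7)
              gen sin_hod`k' = sin(2 * _pi * `k' * hour_of_day / 24)
              gen cos_hod`k' = cos(2 * _pi * `k' * hour_of_day / 24)
          }

          * 2 trend terms + 3 cycles x 4 terms = 14 time parameters
          logistic UP sys_rtsenetload_ramp_absGW lba_rtseload_ramp_absGW temperature ///
              windspeed PL_status c.yr##c.yr sin_* cos_*
          ```

          That is 14 time parameters in place of roughly 9,000 indicators, so it should be far quicker to fit, and a more flexible version can be compared against it later.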
