
  • Multilevel model with -mixed-: unable to allocate matrix (Stata/MP)

    Hi,

    I am a novice Stata user, and I am trying to fit a multilevel model with the -mixed- command on panel data. I am using Stata/MP 16.0.

    There are 3 stores selling 25 product categories to customers living in various zip codes over a period of 26 months. Some customers live within a store's area and some do not. I first implemented a zipcode and time fixed-effects specification with the -reghdfe- command to analyze the effect of store openings on total sales. The unit of analysis is zipcode-month. I successfully ran the following:

    Code:
    reghdfe log_sales store_open_dummies, absorb(ibn.zipcode i.month) vce(cluster zipcode)

    Now I am trying to expand the model to product categories so that I can account for heterogeneous effects across categories: some products may benefit from the store opening more than others. The unit of analysis becomes zipcode-month-category. The equation I am trying to implement should have a fixed-effect intercept for each category across months (month-category fixed effect), a random intercept for each category across zip codes (zipcode-category random intercept), and random slopes (coefficients) on store_open_dummies for each store across categories. Stores are independent of each other, and within each store there is dependency across categories, since the categories all come from the same distribution. The dataset has ~240,000 observations.

    I tried the following 4-level model. Is this the correct way to do this?

    Code:
    mixed log_sales c.month, || store: || category: store_open_dummies, cov(uns) || zipcode:

    Unfortunately, I got the following error:

    " unable to allocate matrix;
    You have attempted to create a matrix with too many rows or columns or attempted to fit a model with too many variables.

    You are using Stata/MP which supports matrices with up to 65534 rows or columns. This is the maximum matrix size.

    If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the
    required matrix size; see help set emptycells.

    If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of
    categories. Use the c. operator on such variables."
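
    A minimal illustration of the error message's first suggestion (not something tried in the thread yet) -- the setting is issued once, before estimation:

    ```stata
    * Tell Stata to drop empty cells of factor-variable interactions,
    * which can shrink the matrix the estimator has to allocate
    set emptycells drop
    ```
    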


    Is there anything I can do to avoid this error so that the model runs? I tried some other versions of this command and got the same error, except for the following, which worked OK:

    Code:
    mixed log_sum_bask_rev store_open_dummies || category: || zipcode:


    Thank you in advance for any help, input, feedback...


  • #2
    The problem is most likely arising from the -cov(uns)- specification, exacerbated by a large (undisclosed) number of store_open_dummies. If there are N random effects (intercepts and slopes) and you specify -cov(uns)-, the covariance matrix requires O(N^2) entries. You can probably keep the store_open_dummies random slopes if you eliminate -cov(uns)-. This amounts to assuming that the random slopes for the store-open dummies are independent--is there a basis for thinking this is unreasonable? Using an unstructured covariance rarely makes much of an impact on the coefficients of the main variables in the analysis. When it runs, it makes estimation slow, and, as here, it sometimes breaks limits.
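
    To make that scaling concrete, here is a sketch of the parameter counts, assuming (purely for illustration, since the actual number was not disclosed in the thread) 3 store_open_dummies plus the random intercept at the category level:

    ```stata
    * N = 4 random effects: 3 hypothetical store_open_dummies slopes + 1 intercept
    local N = 4
    display "cov(unstructured): " `N' * (`N' + 1) / 2 " (co)variance parameters"
    display "cov(independent):  " `N' " variance parameters"
    ```
    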

    By the way, you should not specify random slopes for the store_open_dummies without also including them in the fixed-effects part of the model. To do that is to constrain the average slopes of those dummies to be zero--a far more restrictive assumption than, say, -cov(independent)-, and probably a lot less reasonable, if not outlandish.

    Comment


    • #3
      The reasoning is as follows: some categories, such as televisions and headphones, may benefit similarly from being in the store (e.g., they are both products that you may like to test before purchasing), so the model assumes that category random slopes for the open_store_dummies within the same store can be correlated. We are mainly interested in the average effect size of various product categories being present in a store, so -- though I am not sure -- specifying -cov(uns)- may also help tighten the confidence intervals of the mean estimates of open_store_dummies for each product category. What do you think?

      Per your feedback, I will try this command:

      Code:
      mixed log_sales c.month store_open_dummies , || store: || category: store_open_dummies || zipcode:

      But I think it assumes that months are nested within zip codes. In the specification, I need a month-category fixed effect, a zipcode-category random intercept, and random slopes for each store across categories of store_open_dummies. So should it instead be the following command?

      Code:
      mixed log_sales c.month c.zipcode store_open_dummies , || store: || category: store_open_dummies

      Thank you Clyde.


      Comment


      • #4
        Using c.zipcode makes no sense. There is no way in which zipcodes are quantitative information that can be ordered, added, subtracted etc. They are just numerical labels for categories. It has to be i.zipcode. Month can be treated either way. i.month would specify a separate effect on sales each month--monthly shocks. c.month models a continuing trend over time: monotonically increasing (or monotonically decreasing). Which of those makes sense in your context, I wouldn't know.

        From your description, I am now thinking that in fact what you have are crossed store and category effects nested within zipcode. Is that right? Does each store have its own separate selection of products, or do all the stores carry the same (or most of the same) products? If the latter, these effects are crossed. But stores would be nested within zip-code. This is a very complex design, and I think that when properly modeled it will prove even more memory intensive than what you have tried so far. But it seems to me you are looking at something like:

        Code:
        mixed log_sales c.month /* or i.month ? */ i.(store_open_dummies) || zipcode: || _all: R.store || category:
        Note: An equivalent formulation of the random effects here would be -|| zipcode: || _all: R.category || store:-. You should pick the version that has -_all: R.- attached to whichever of store and category has the fewer distinct values. That will save memory and also reduce compute time.

        Added: One other thought that would simplify this further. Since you apparently are giving up on the store_open_dummies random slopes, do you really need category: to be a random effect? Why not just make it a fixed effect? Then you avoid the problem of crossed random effects with store.

        Code:
        mixed log_sales c.month /* or i.month ? */ i.(store_open_dummies) i.category || zipcode: || store:
        Last edited by Clyde Schechter; 17 Mar 2022, 12:56.

        Comment


        • #5
          I don't know if you're married to using the mixed-effects estimator, but what question are you investigating in the first place? You say you're looking at the effect of store openings on product sales? What are we really interested in estimating here?

          Comment


          • #6
            I have not yet been able to overcome the memory problem. The code below runs for a day and then breaks with the following error. My understanding is that the matrix with 80548 rows and 469 columns cannot be allocated. I think I have a good setup: Intel i7-7600U, 64-bit OS, 32 GB RAM, and Stata/MP 16.0. I could not find a way to allocate more memory to Stata. My understanding is that Stata uses what is available, so there is no way to increase memory. Is this correct?


            Code:
            mixed log_sales i.month store_open_dummies , || store: || category: store_open_dummies || zipcode:
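
            For what it's worth, modern Stata manages memory automatically; the current settings and usage can be inspected (though they rarely need changing) with:

            ```stata
            * Show the memory-related settings and a breakdown of current usage
            query memory
            memory
            ```
            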
            [Attached image: stata_error.PNG -- screenshot of the "unable to allocate matrix" error]

            Comment


            • #7
              If I recall, setting memory was something we had to do in the Elder Days of Stata. I of course was a little kid back then, but my point is that I don't think memory, or your setting of it, is the issue.

              Comment


              • #8
                Thanks Jared. Maybe Clyde has some insights? What could the potential issues be? Is it really a limitation of Stata/MP? I'd prefer to stay in Stata, but does it make sense to implement this in R, maybe?

                Comment


                • #9
                  My bigger question for you, though, which I asked in my previous post, is: what's the design of your paper? You say you're looking at the effect of stores opening, which implies to me that you're interested in some kind of treatment-effects analysis, right? Before we talk about the statistical estimator to use, I'm concerned with how you're even setting up your paper to begin with. Because these are the really important details: even if your estimator did what you want it to do, having your paper rooted in one (or a blend) of econometric frameworks for estimating effects is crucial, if the causal effect of these stores is what you're really after.

                  Comment


                  • #10
                    There is very little you can do to manage memory allocation in modern Stata, and usually, when you try those things, you end up making matters worse, not better. Stata has very good memory management at this point in its evolution. Also, memory management is not entirely up to Stata: the OS gets a vote, too, as they say. And the OS's vote is going to depend on what else is running. If something else running in the background is hogging a huge amount of memory, then rerunning the model later when that application has finished (or terminating it) may resolve the problem. Or not.

                    The only surefire solution is to simplify the model, which means eliminating variables. You have conveniently abbreviated "store_open_dummies," but I can only infer that there are an enormous number of them. And I'm having a hard time imagining how that can be. I suppose there are some store chains that have stores located every few blocks in major metropolitan areas. Anyway, my thought would be to see if there is some more compact way to represent this information. Rather than having a separate indicator for each store opening, maybe you could get by with a variable that counts the number of store openings within the past [whatever] time interval within [whatever] geographic distance of the store referenced in the current observation. That would be just a single variable! Or maybe you need a few such variables, one for each of several categories of stores (e.g. stand-alone store vs. part of a mall, or size categories, or something like that). I don't know anything about this subject matter, so I can't offer specific advice. But if you are going to continue this in Stata, I think you will have to get creative about this information.
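
                    A hypothetical sketch of the counting idea, assuming one row per zipcode-month, a store_opened indicator, and a 6-month window (the variable names and the window are illustrative, not from the thread):

                    ```stata
                    * Rolling count of store openings in the current and previous 5 months,
                    * within each zip code; assumes contiguous monthly observations
                    bysort zipcode (month): gen cum_open = sum(store_opened)
                    bysort zipcode (month): gen n_recent_openings = ///
                        cum_open - cond(_n > 6, cum_open[_n - 6], 0)
                    ```
                    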

                    I should add that I am not trying to suggest that you discard useful information here. Unless this really is one of those chains that has a store every few blocks (e.g. Starbucks), it is hard for me to imagine that consumer purchasing behavior is measurably influenced by thousands of different store opening events, most of which they probably have no possible awareness of. I'm suggesting that you are over-detailing your model, not just in terms of computational issues, but in terms of real-world plausibility. So think about it. As I say, this is way out of my area and I can't offer specific advice. It's just hard for me to believe that a model of a consequence of consumer behavior can really involve enough variables to breach memory limits of Stata on a modern computer.

                    Comment
