  • statsby and storing predicted values

    I have done some searching for a solution to this problem but I cannot find the exact answer.

    I have used statsby and saved the coefficients of a linear regression as a new data set.

    I want to know if the predict command can be used with statsby to also save predicted values of the linear regression?

    My alternative idea is to use the data set with the saved coefficients to manually code the regression and get predicted values - but this seems a little messy especially as I am running the regression for 888000 groups.

    Thank you

  • #2
    I don't believe statsby would provide predictions, but maybe someone will correct me.

    Your question is somewhat unclear. Predict works on individual observations, statsby on groups. So you could implement your alternative idea by collapsing original data on groups and multiplying coefficients with variable values. The code to do so would be exactly as elaborate for 2 groups as for 800,000 groups, so I don't see an issue there.
    I do wonder what your purpose with 880,000 groups is. Maybe some more explanation on the point of your exercise would help give better suggestions.


    • #3
      Short answer is No. statsby saves one observation for each by-group, but you want a dataset with the same number of observations as the original data. What you could do is merge the coefficients dataset back with the original data and run the calculation.
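
      A minimal sketch of that merge route, assuming the grunfeld data and a regression of invest on mvalue within each company (both chosen here only for illustration). The tempfile name coefs is arbitrary. statsby _b should name the saved coefficients _b_mvalue and _b_cons.

      Code:
      webuse grunfeld, clear
      tempfile coefs
      * one regression per company; coefficients are saved to a temporary file
      statsby _b, by(company) saving(`coefs', replace): regress invest mvalue
      * put each company's coefficients alongside every original observation
      merge m:1 company using `coefs', nogenerate
      * the prediction is then an ordinary calculation
      gen double predicted = _b_cons + _b_mvalue * mvalue
      The result has as many observations as the original data, at the cost of an extra file and a merge.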

      Alternatively, you could use a command like rangestat (SSC) which isn't a reduction command. Here is a dopey example.

      Code:
      . webuse grunfeld, clear
      
      . rangestat (reg) invest mvalue, int(year 0 0)
      
      . gen predicted = b_cons + b_mvalue * mvalue
      rangestat ran a series of regressions and left the coefficients in the dataset as new variables. Then it's just a matter of doing all the calculations for all the regressions, and (white magic: see how educating some at Hogwarts helps the community) that takes just one command.
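
      One way to convince yourself is to compare with regress and predict for a single year. This is only a sketch: the year 1935, the variable name check and the tolerance are arbitrary choices made here, not part of the original example.

      Code:
      * requires rangestat from SSC
      webuse grunfeld, clear
      rangestat (reg) invest mvalue, interval(year 0 0)
      gen double predicted = b_cons + b_mvalue * mvalue
      * the same regression for one year, done the conventional way
      regress invest mvalue if year == 1935
      predict double check if e(sample)
      * should pass silently if the two sets of predictions agree
      assert reldif(predicted, check) < 1e-6 if year == 1935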

      There are other commands in this territory with loosely similar goals, but I am most familiar with rangestat. You need to download it first with

      Code:
      ssc install rangestat
      and then read the help.


      • #4
        Thanks for clarifying my first thoughts - that statsby is not appropriate for storing predictions.

        The point of the exercise is to have a linear regression for each age (0-110+), sex and year (4 points in time) group to identify the extent to which place of residence (5 places) affects a health outcome. I now want the predicted values for the 5 places of residence.

        I think the solution is to get the predicted results for the individual observations and, if need be, to merge with the coefficient data set.

        Thanks for your quick response.


        • #5
          "I think the solution is to get the predicted results for the individual observations and, if need be, to merge with the coefficient data set."

          No; that's a misunderstanding of the advice.

          You have a choice. (Dopey) You can use statsby, then merge with the original and then do the calculation. I really don't recommend that, because (smart) you can do it all in place without any merge at all.

          The example in #3 is deliberately one you can imitate yourself so that you can see what happens. Use edit to look at the data before and after to understand what is going on.


          • #6
            Thanks Nick - this seems to be the exact solution.

            I am now struggling to put rangestat into a loop or use it with bysort.

            For example with the 'webuse grunfeld' example data set how would I make the regression run within each company?


            • #7
              Code:
               
               rangestat (reg) invest mvalue, int(company 0 0)
              The 0 0 are offsets. The syntax in this case is that the interval runs from company+0 to company+0. Here it's a long-winded way to say for each distinct company.

              Otherwise put, interval() specifies the interval that defines a subset or subseries for regression. If you want the interval to be a single integer, that's fine. The program doesn't mind or care. The only rule is that the interval is numeric. So string identifiers need to be mapped to distinct integers.
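
              A sketch of that mapping, which also covers groups defined by several variables at once (such as age, sex and year). The variables firmname and decade are invented here purely to have a string identifier and a second grouping variable in the grunfeld data. egen's group() function does the mapping to integers.

              Code:
              * requires rangestat from SSC
              webuse grunfeld, clear
              * hypothetical string identifier, made up for illustration
              gen firmname = "firm " + string(company)
              * hypothetical second grouping variable
              gen decade = 10 * floor(year / 10)
              * one integer code per distinct combination of firmname and decade
              egen long id = group(firmname decade)
              rangestat (reg) invest mvalue, interval(id 0 0)
              gen double predicted = b_cons + b_mvalue * mvalue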

              The leading use case for rangestat is statistics within ranges, possibly overlapping, for time series, but the syntax allows several other applications too.
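
              For comparison, a sketch of that time series use: a regression over the current and previous four years, separately for each company. The window length is an arbitrary choice for illustration.

              Code:
              * requires rangestat from SSC
              webuse grunfeld, clear
              * interval runs from year-4 to year+0, evaluated within each company
              rangestat (reg) invest mvalue, interval(year -4 0) by(company)
              gen double predicted = b_cons + b_mvalue * mvalue
              Here the windows overlap, so each observation gets its own set of coefficients. Windows with too few observations for the regression should give missing values.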


              • #8
                Thanks so much Nick this solves the problem perfectly!
