Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • -egen, mean- in Mata: Can somebody explain this Mata code by William Gould, and make it work?

    Good afternoon,

    I found the following code, supposed to do what -egen, mean()- does but in Mata, by William Gould on this thread here: https://www.stata.com/statalist/arch.../msg00582.html

    1. Can you make the code work? (It currently does not work.)

    2. Can you explain what the lines that I have marked and numbered do in 1-2 sentences?

    Code:
    clear mata
    
    
    
    mata:
           void function mean_by2(string scalar var, string scalar groupid)
            {
                    real scalar     i, j, j0, j1, sum, mean
                    real colvector  id, y, p, y2
                    //transmoprhic        info    // 1) What is this and why does it generate an error? I changed it to:
                    real matrix info
    
                    st_view(id, ., groupid)
                    st_view(y, ., var)
                    p = order(id, 1)    // 2) Is this something like "sort id" in Stata? 
    
                    info=panelsetup(my_y, 1)    // 3) Where did this my_y popped up from? It is not defined anywhere before? 
                    st_view(y2, ., st_addvar("double", "_y2"))
    
                    for (i=1; i<=rows(info); i++) {
                            j0 = info[i, 1]
                            j1 = info[i, 2]
                            sum = 0
                            mean = mean(my_y[|j0\j1|])    // 4) What does |j0\j1| do? 
                            for (j=j0; j<=j1; j++) y2[p[j]] = mean    // 5) What does this thing do? 
                    }
            }
    end
    
    
    
    clear 
    
    set obs 1000000
    
    set more off
    
    egen id=seq(), block(100)
    
    gen y=rnormal()
    
    timer on 1
    
    egen double x1=mean(y), by(id)
    
    timer off 1
    
    timer on 2
    
    mata: mean_by2("y","id")
    
    timer off 2
    
    timer list
    How the code does not work:

    Firstly, around my comment 1) when the declaration is
    transmoprhic info
    I get the following error:
    Code:
    . mata:
    ------------------------------------------------- mata (type end to exit) --------------------------------
    :        void function mean_by2(string scalar var, string scalar groupid)
    >         {
    >                 real scalar     i, j, j0, j1, sum, mean
    >                 real colvector  id, y, p, y2
    >                 transmoprhic        info        // 1) What is this and why does it generate an error? I 
    > changed it to:
    invalid expression
    (17 lines skipped)
    ----------------------------------------------------------------------------------------------------------
    r(3000);
    
    end of do-file
    
    r(3000);
    When I change the declaration to
    real matrix info
    the code goes through, but

    Second, the code does not seem to be doing anything. If I understand correctly the Mata code is supposed to generate a variable y2 which is identical to the variable x1 generated by -egen, mean()-.




  • #2
    First, here is the working code:

    Code:
    clear mata
    mata:
            void function mean_by2(string scalar var, string scalar groupid)
            {
                    real scalar     i, j, j0, j1, sum, mean
                    real colvector  id, y, p, y2
                    transmorphic info
    
                    st_view(id, ., groupid)
                    st_view(y, ., var)
                    p = order(id, 1)
                    
                    info=panelsetup(id, 1)
                    st_view(y2, ., st_addvar("double", "_y2"))
    
                    for (i=1; i<=rows(info); i++) {
                            j0 = info[i, 1]
                            j1 = info[i, 2]
                            sum = 0
                            mean = mean(y[|j0\j1|])
                            for (j=j0; j<=j1; j++) y2[p[j]] = mean
                    }
            }
    end
    clear
    set obs 1000000
    set more off
    egen id=seq(), block(100)
    gen y=rnormal()
    timer on 1
    egen double x1=mean(y), by(id)
    timer off 1
    timer on 2
    mata: mean_by2("y","id")
    timer off 2
    Resulting in:

    Code:
    . timer list
       1:      7.03 /       49 =       0.1434
       2:    209.68 /       40 =       5.2421
    
    . count if x1-_y2!=0
      0
    Some answers to your questions:

    1 transmoprhic is a typo, which should be transmorphic.

    2 The order() function returns a permutation function which will later be used to place values of y2 within a given id in the correct observation numbers (see help mf_order and help m1_permutation)

    3 I believe my_y is a typo or a placeholder which was accidentally left in the final code. I have replaced it with the id matrix in the panelsetup() function and with the y matrix in the mean() function

    4 j0 is the first observation number for the id number currently being iterated on, and j1 is the last. So for id = 1, j0 is equal to 1 and j1 is equal to 99. This code calculates the mean of y within that range of observation numbers.

    5 The last line creates a nested loop within the range of j0 and j1 for the j0 and j1 created in the parent loop, iterating by 1, and replaces y2 with the mean calculated in the previous line for each observation within that range.

    Hope this is useful. I mostly figured out the various functions and their behavior by briefly reviewing the Mata documentation, which is a great resource.

    NB: While the above code does what it is supposed to do, my amendments may not accord with Bill Gould (StataCorp)'s original intention. I leave it to him (or anyone else) to propose alternative amendments or provide more detailed answers to your questions.
    Last edited by Ali Atia; 14 Jun 2021, 15:42.

    Comment


    • #3
      Thank you very much, Ali Atia. Things are mostly clear now. However, can you share further expertise on the following points:

      2. Is there any reason to use order() function instead of sorting either in Mata, or sorting in Stata before we pass the objects to Mata? (This Mata code is not particularly fast, and these orders and permutations are giving me headache when I read the explanation in the manual...)

      4. The subscripting x[|beginobs\endobs|] is clear, do you know what the subscripting x[1\4] does? As the following example shows it does something, but I cannot figure out what:
      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . mata: x= st_data(.,"price")
      
      . mata: mean = mean(x[1\4])
      
      . mata: mean
        4457.5
      
      . mata: mean = mean(x[|1\4|])
      
      . mata: mean
        4365.75
      
      . summ price in 1/4
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
             price |          4     4365.75     497.315       3799       4816
      
      . summ price
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
             price |         74    6165.257    2949.496       3291      15906



      5. Is there not any more elegant syntax to assign the same quantity to multiple rows in Mata, while avoiding the loop in 5.? For example if I want to assign the number 7 to rows from 1 to 4 of the vector x that we created from price above, the following seem natural but do not work:

      Code:
      . mata: x[|1\4|] = 7
                       <istmt>:  3301  subscript invalid [1]
      r(3301);
      
      . mata: x[1..4,1] = 7
                       <istmt>:  3301  subscript invalid [1]
      r(3301);


      Comment


      • #4
        2 - as long as the data is sorted by id before using the mata function, you do not need to use order(). So with this amended code:

        Code:
        mata:
                void function mean_by2(string scalar var, string scalar groupid)
                {
                        real scalar     i, j, j0, j1, sum, mean
                        real colvector  id, y, y2
                        transmorphic info
        
                        st_view(id, ., groupid)
                        st_view(y, ., var)
        
                        info=panelsetup(id, 1)
                        st_view(y2, ., st_addvar("double", "_y2"))
        
                        for (i=1; i<=rows(info); i++) {
                                j0 = info[i, 1]
                                j1 = info[i, 2]
                                sum = 0
                                mean = mean(y[|j0\j1|])
                                for (j=j0; j<=j1; j++) y2[j] = mean
                        }
                }
        end
        This will work:
        Code:
        clear
        set obs 1000000
        set more off
        egen id=seq(), block(100)
        gen y=rnormal()
        timer on 1
        egen double x1=mean(y), by(id)
        timer off 1
        timer on 2
        sort id
        mata: mean_by2("y","id")
        timer off 2
        . count if x1-_y2!=0
          0
        But this won't:

        Code:
        sort y
        mata: mean_by2("y","id")
        . count if x1-_y2!=0
          1,000,000
        4. The subscripting x[1\4] refers to row 1 and row 4 of x (see help m2 subscript):

        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . mata: x= st_data(.,"price")
        
        . mata: x[1\4]
                  1
            +--------+
          1 |  4099  |
          2 |  4816  |
            +--------+
        
        . mata: x[|1\4|]
                  1
            +--------+
          1 |  4099  |
          2 |  4749  |
          3 |  3799  |
          4 |  4816  |
            +--------+
        
        . sum price if _n==1 | _n==4
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
               price |          2      4457.5    506.9956       4099       4816
        
        . mata: mean(x[1\4])
          4457.5
        So the mean you found is (4099+4816)/2.

        5. I am not familiar with any way to do what you ask, though I agree that a more elegant solution than looping through rows would be nice to have if it doesn't already exist unbeknownst to me.

        On a final note, you could also do the sorting and filling operations in Stata in order to avoid the nested loop and order() function, though speed gains from doing that are very minor:

        Code:
        clear mata
        cap program drop mean_by2
        mata:
                void function mean_by2(string scalar var, string scalar groupid)
                {
                        real scalar     i, j0, j1
                        real colvector  id, y, y2
                        transmorphic info
        
                        st_view(id, ., groupid)
                        st_view(y, ., var)
                        
                        info=panelsetup(id, 1)
                        st_view(y2, ., st_addvar("double", "_y2"))
        
                        for (i=1; i<=rows(info); i++) {
                                j0 = info[i, 1]
                                j1 = info[i, 2]
                                y2[j0] = mean(y[|j0\j1|])
                        }
                }
        end
        program define mean_by2
        sort `4'
        mata: mean_by2`0'
        by `4' :replace _y2=_y2[1]
        end
        
        clear
        set obs 1000000
        set more off
        egen id=seq(), block(100)
        gen y=rnormal()
        timer on 1
        egen double x1=mean(y), by(id)
        timer off 1
        timer on 2
        mean_by2("y","id")
        timer off 2
        Resulting in:

        Code:
        . timer list
           1:      4.74 /       30 =       0.1578
           2:    180.51 /       21 =       8.5959
        
        . count if x1-_y2!=0
          0
        Last edited by Ali Atia; 14 Jun 2021, 17:32.

        Comment


        • #5
          Originally posted by Joro Kolev View Post
          2. Is there any reason to use order() function instead of sorting either in Mata, or sorting in Stata before we pass the objects to Mata?
          I have not really read the details in this thread but perhaps the idea is not to change the sort order of the underlying data (which then perhaps would need to be changed back).


          Originally posted by Joro Kolev View Post
          5. Is there not any more elegant syntax to assign the same quantity to multiple rows in Mata, while avoiding the loop in 5.? For example if I want to assign the number 7 to rows from 1 to 4 of the vector x
          The term "elegant" is obviously subjective here. Personally, I would not want to see a scalar being "magically" repeated to fit the length of a vector. You can do that manually, though:

          Code:
          x[|1\4|] = J(4, 1, 7)

          Comment


          • #6
            The question which I had in mind was slightly more general -- "how to assign a vector to multiple rows of a matrix in Mata" -- and doing a Google search did not return anything useful.

            Given that the elegant solution with subscripting that you do not find so elegant did not work, I came up with a slightly more general version to what you propose. E.g., lets say that I have the matrix Matrix and the vector row

            Code:
            . mata: Matrix = J(7,3,1)
            
            . mata: row = (1,2,3)
            and I want to assign the vector row to the rows 1,2, and 3 of matrix Matrix. Then I can use the Kronecker product in combination with the matrix J:

            Code:
            . mata: Matrix[|1,1\3,3|] = J(3,1,1)#row
            
            . mata: Matrix
                   1   2   3
                +-------------+
              1 |  1   2   3  |
              2 |  1   2   3  |
              3 |  1   2   3  |
              4 |  1   1   1  |
              5 |  1   1   1  |
              6 |  1   1   1  |
              7 |  1   1   1  |
                +-------------+
            Originally posted by daniel klein View Post


            The term "elegant" is obviously subjective here. Personally, I would not want to see a scalar being "magically" repeated to fit the length of a vector. You can do that manually, though:

            Code:
            x[|1\4|] = J(4, 1, 7)

            Comment


            • #7
              Kronecker products are not needed here; J() accepts matrices as input.

              Code:
              Matrix[1::3,] = J(3, 1, row)
              will do.

              Comment

              Working...
              X