  • How to count values the matrix way

    Hi
    I'd like to share the following code with you.
    I like one-liners, or near one-liners, in code.
    And I like trying to write them as matrix code.

    So here is how to generate a count table in Mata the matrix way:
    Code:
    : uniformseed(1234)
    : x = round(uniform(20,1) * 10, 1)    // The data
    
    : y = uniqrows(x)    // Get the unique values of x
    : y, rowsum(J(rows(y), 1, x') :== y)    // And now the count table
           1   2
        +---------+
      1 |  1   5  |
      2 |  3   5  |
      3 |  4   2  |
      4 |  5   2  |
      5 |  6   1  |
      6 |  7   2  |
      7 |  9   3  |
        +---------+
    I haven't tested it against for-loops, but simplicity in code counts too.
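
    For comparison, here is a minimal sketch of what a for-loop version could look like. The function name counttable() is hypothetical (it is not part of Mata), and nothing here has been timed. It makes one pass over x per unique value instead of building the rows(y) x rows(x) comparison matrix:

    ```mata
    mata:
    // Hypothetical helper: count table via an explicit loop over the unique values
    real matrix counttable(real colvector x)
    {
        real colvector y, n
        real scalar    i

        y = uniqrows(x)                 // sorted unique values of x
        n = J(rows(y), 1, 0)            // one counter per unique value
        for (i = 1; i <= rows(y); i++) {
            n[i] = sum(x :== y[i])      // number of rows of x equal to y[i]
        }
        return((y, n))
    }
    end
    ```

    Calling counttable(x) should give the same two columns as the one-liner above.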

    And actually
    Code:
    (J(rows(y), 1, x') :== y)'
    is a simple way of generating the dummy (indicator) columns for x.
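
    To make the shape of that result concrete, a small sketch wrapping the same expression in a function. The name dummies() is hypothetical; the result has one row per element of x and one column per unique value:

    ```mata
    mata:
    // Hypothetical helper: indicator matrix of x, one column per unique value
    real matrix dummies(real colvector x)
    {
        real colvector y

        y = uniqrows(x)
        // J() stacks x' once per unique value; :== compares each row with its y[i];
        // the transpose gives rows(x) x rows(y)
        return((J(rows(y), 1, x') :== y)')
    }
    end
    ```

    For example, dummies((1 \ 2 \ 1)) has rows (1, 0), (0, 1) and (1, 0).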
    Have fun!
    Kind regards

    nhb

  • #2
    My solution requires sorting x beforehand, but Niels' solution can be quite slow if y has many distinct values and x is big. Transposing x is also costly if x is big. Note that panelsum() is undocumented.

    Code:
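    mata:
    _sort(x,1)              // sort x in place
    info=panelsetup(x,1)    // first and last row of each run of equal values
    y,panelsum(x,info)      // sum of x within each run, next to the unique values
    end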


    • #3
      Your code is doing sums, not counts.

      But actually I can get both counts and sums with:
      Code:
      : y, rowsum(J(rows(y), 1, x') :== y), (J(rows(y), 1, x') :== y) * x
              1    2    3
          +----------------+
        1 |   1    5    5  |
        2 |   3    5   15  |
        3 |   4    2    8  |
        4 |   5    2   10  |
        5 |   6    1    6  |
        6 |   7    2   14  |
        7 |   9    3   27  |
          +----------------+
      Interesting! I'll be back with some speed tests on this.
      Last edited by Niels Henrik Bruun; 05 Aug 2015, 00:34.
      Kind regards

      nhb


      • #4
        Hi again
        I hadn't expected the following result, but running the code:
        Code:
        cls
        clear
        mata mata clear
        
        mata
            uniformseed(1234)
            x = round(uniform(1e3, 1) * 100, 1)    // 1000 observations of 100 different values
            y = uniqrows(x)    // y is necessary for both methods
        end
        
        timer on 1    // The matrix way to get the sums of unique values
        mata: y, (J(rows(y), 1, x') :== y) * x
        timer off 1
        
        timer on 2    // Getting the sums by panelsum
        mata:
            _sort(x,1)
            info = panelsetup(x,1)
            y, panelsum(x, info)
        end
        timer off 2
        
        timer list
        gives:
        Code:
        . timer list
           1:    115.17 /        4 =      28.7930
           2:   2028.33 /        1 =    2028.3260
        So the reason for using panelsum() isn't speed.
        And transposing isn't as bad as sorting.

        On second thought, I can't think of a case where I need that sort of output.
        So what remains is that the matrix way is surprisingly fast compared to using the built-in panelsum().
        Last edited by Niels Henrik Bruun; 05 Aug 2015, 02:09.
        Kind regards

        nhb


        • #5
          First off, to get counts it should of course have been the following (it was late, and there is no internet access where Stata is located...):

          Code:
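          mata:
          _sort(x,1)                         // sort x in place
          info=panelsetup(x,1)               // first and last row of each run of equal values
          y,panelsum(J(rows(x),1,1),info)    // summing a column of ones gives the counts
          end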
          I did some tests myself and found that panelsum() was faster. I also ran your code on my machine and did not get such a big difference in timing. Which version of Stata do you have? I have Stata IC 14.0 (64-bit) running on Windows Server 2008. Finally, you should try a larger x with more distinct values to see the difference in timing.
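
          As a side note, the counts can also be read straight off the matrix returned by panelsetup(), whose two columns hold the first and last row of each panel. A minimal sketch, assuming x is a real column vector; panelcounts() is a hypothetical name, and it sorts a copy rather than x itself:

          ```mata
          mata:
          // Hypothetical helper: unique values and counts from panelsetup() alone
          real matrix panelcounts(real colvector x)
          {
              real colvector xs
              real matrix    info

              xs   = sort(x, 1)             // sorted copy; x is left untouched
              info = panelsetup(xs, 1)      // row i: first and last row of panel i
              // panel value = xs at the panel's first row; count = last - first + 1
              return((xs[info[., 1], 1], info[., 2] - info[., 1] :+ 1))
          }
          end
          ```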



          • #6
            Sorry!
            Redesigning my code a bit:
            Code:
            cls
            clear
            mata mata clear
            
            capture program drop test
            program define test
                args ne ve
                mata: uniformseed(1234)
                mata: x = round(uniform(`=1e`ne'', 1) * `=1e`ve'', 1)
                mata: y = uniqrows(x)
            
                timer clear
                timer on 1    // The matrix way
                    quietly mata: y, rowsum(J(rows(y), 1, x') :== y)
                timer off 1
            
                timer on 2    // "The panel way"
                    mata: _sort(x,1)
                    mata: info=panelsetup(x, 1)
                    quietly mata: y, panelsum(J(rows(x),1,1), info)
                timer off 2
            
                timer list
            end
            
            test 5 1
            test 7 1
            test 6 2
            test 6 3
            gives:
            Code:
            . test 5 1    // 100000 observations of 10 values
               1:      0.01 /        1 =       0.0120
               2:      0.06 /        1 =       0.0570
            . test 7 1    // 10000000 observations of 10 values
               1:      1.29 /        1 =       1.2950
               2:     11.99 /        1 =      11.9870
            . test 6 2    // 1000000 observations of 100 values
               1:      1.12 /        1 =       1.1190
               2:      0.91 /        1 =       0.9120
            . test 6 3    // 1000000 observations of 1000 values
               1:     11.03 /        1 =      11.0290
               2:      1.00 /        1 =       1.0010
            So you're absolutely right, Christophe.
            But as long as I keep the number of values low (e.g. around 10), as is usually the case with categorical variables, the matrix way is the fastest.

            PS: I've re-edited this post because I had forgotten to reset the timer; the results above are the corrected ones. Sorry!
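
            The same comparison can be sketched entirely inside Mata, using Mata's own timer functions and checking that both methods return identical counts. The function name time_counts() and its arguments are hypothetical, and the timings will of course depend on machine and Stata version:

            ```mata
            mata:
            // Hypothetical helper: time both methods on n draws scaled by k.
            // Returns (seconds for the matrix way, seconds for the panel way).
            real rowvector time_counts(real scalar n, real scalar k)
            {
                real colvector x, xs, y, c1, c2
                real matrix    info
                real rowvector t1, t2

                x = round(uniform(n, 1) * k, 1)
                y = uniqrows(x)

                timer_clear()                             // reset all timers first

                timer_on(1)                               // the matrix way
                c1 = rowsum(J(rows(y), 1, x') :== y)
                timer_off(1)

                timer_on(2)                               // the panel way, sort included
                xs   = sort(x, 1)
                info = panelsetup(xs, 1)
                c2   = panelsum(J(rows(xs), 1, 1), info)
                timer_off(2)

                assert(c1 == c2)                          // both methods must agree

                t1 = timer_value(1)
                t2 = timer_value(2)
                return((t1[1], t2[1]))
            }
            end
            ```

            For example, time_counts(1e6, 10) and time_counts(1e6, 1000) correspond roughly to the cases tested above.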
            Last edited by Niels Henrik Bruun; 05 Aug 2015, 03:31. Reason: timer clear moved to program test
            Kind regards

            nhb
