
  • using _n var in sum directly

    Hi,

    I was wondering whether anyone knows if the system variable _n can be used directly in a command such as summarize.

    Ideally, I would like to do
    Code:
    summ _n if myvar==1
    I am trying to make an index.

    I realise I can do
    Code:
    gen index = _n
    summ index if myvar==1
    drop index
    but I would rather not create and drop a variable.

    Grateful for any suggestions.


  • #2
    I don't know a way to do this. I would say that if the order of your observations is worth summarizing, it's worth making it a variable (insert emoticon of your choice, or not).



    • #3
      Thanks Nick,

      I am going to put this down as "if you don't know, it's probably not possible".

      This is my final solution, which makes everything quicker.


      Code:
      cap program drop myindex
      program define myindex , rclass
          syntax if , sort(varlist)
          
          sort `sort'
          tempvar i
          gen long `i' = _n
          su `i' `if' , mean
          
          local first= `r(min)'
          local last= `r(max)'
          local obs= `last' - `first'
          di "_________________________________________"
          di ""
          di "First observation of sub index:" `first'
          di "Last observation of sub index: " `last'
          di "Observations in subindex:      " `obs'
          di "_________________________________________"
          
          return local first `first'
          return local last `last'
          return local obs `obs'
      end
      .

      Code:
      myindex if myvar==5, sort(myvar)
      local b =r(first)
      local e = r(last)
      Then I am using

      Code:
      gen mynewvar = .
      replace mynewvar = 1 in `b'/`e'
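
      A possible variant of the program above, as a minimal sketch. It takes the count from r(N), so both endpoints of the range are included (last - first on its own is one short of the number of observations), and it stops if the matching observations do not form one contiguous block, since only then does an in range select the same observations as the if qualifier. The name myindex2 and the use of the auto dataset are illustrative assumptions, not part of the original post.

      ```stata
      capture program drop myindex2
      program define myindex2, rclass
          syntax if, sort(varlist)

          sort `sort'
          tempvar i
          generate long `i' = _n
          summarize `i' `if', meanonly

          * Nothing matched the if qualifier: stop with "no observations"
          if r(N) == 0 {
              error 2000
          }

          local first = r(min)
          local last  = r(max)
          local obs   = r(N)

          * The in range first/last only matches the if qualifier when the
          * selected observations are contiguous after sorting
          if `obs' != `last' - `first' + 1 {
              display as error "matching observations are not contiguous after sorting"
              exit 459
          }

          return scalar first = `first'
          return scalar last  = `last'
          return scalar obs   = `obs'
      end

      * Illustration on a shipped dataset (an assumption for this sketch)
      sysuse auto, clear
      myindex2 if foreign == 1, sort(foreign)
      display r(first) " to " r(last) " (" r(obs) " observations)"
      ```

      The results are returned as scalars here, but local b = r(first) and local e = r(last) work exactly as in the calls above.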



      • #4
        I see. The problem has some overlap with that in https://www.statalist.org/forums/for...6832-listfirst



        • #5
          Isn't

          Originally posted by Adrian Sayers

          Code:
          cap program drop myindex
          program define myindex , rclass
          syntax if , sort(varlist)
          
          sort `sort'
          tempvar i
          gen long `i' = _n
          su `i' `if' , mean
          
          local first= `r(min)'
          local last= `r(max)'
          local obs= `last' - `first'
          di "_________________________________________"
          di ""
          di "First observation of sub index:" `first'
          di "Last observation of sub index: " `last'
          di "Observations in subindex: " `obs'
          di "_________________________________________"
          
          return local first `first'
          return local last `last'
          return local obs `obs'
          end
          .

          Code:
          myindex if myvar==5, sort(myvar)
          local b =r(first)
          local e = r(last)
          Then I am using

          Code:
          gen mynewvar = .
          replace mynewvar = 1 in `b'/`e'
          unnecessarily complicated for

          Code:
          generate mynewvar = 1 if myvar == 5
          What am I missing here?
          Last edited by daniel klein; 15 May 2024, 12:11.



          • #6
            Hi Dan,
            unnecessarily complicated for
            Quite possibly.

            When you have big datasets, the
            Code:
            if myvar==5
            qualifier is evaluated over the entire dataset, which can be really slow if you have a big dataset and lots of things to change.

            whereas
            Code:
            replace mynewvar = 1 in `b'/`e'
            is only evaluated over the specific range of interest in the dataset.

            The time cost is creating a variable and summarising it.

            But the strategy can be hundreds of times faster in big datasets.

            Then, if you have hundreds of variables to fix, the speed benefits really stack up.



            • #7
              How exactly is

              Code:
              if myvar==5
              any slower than

              Code:
              su `i' `if' , mean
              It's the exact same if qualifier evaluated over exactly the same dataset, isn't it?



              • #8
                You run
                Code:
                su `i' `if' , mean
                once and then use the results repeatedly.

                The if used in summarize seems to be quicker than the if used in generate or replace.

                It's kind of equivalent to indexes in SQL.

                It's probably not worth the effort if you have a dataset smaller than 500K observations.



                • #9
                  There is a difference between if and in. If you use if, Stata needs to look at each observation and evaluate whether the condition is true for that observation. With in, you directly index which observations you want to use, so you avoid that loop over all observations.
                  ---------------------------------
                  Maarten L. Buis
                  University of Konstanz
                  Department of history and sociology
                  box 40
                  78457 Konstanz
                  Germany
                  http://www.maartenbuis.nl
                  ---------------------------------
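
                  A small sketch of that difference, assuming the auto dataset as a stand-in and variable names (flag_if, flag_in) chosen only for illustration. Once the data are sorted so that the observations of interest form one contiguous block, the if qualifier and the in range pick out the same observations; only the if version has to test the condition on every observation.

                  ```stata
                  sysuse auto, clear
                  sort foreign

                  * foreign has no missing values here, so after sorting the
                  * foreign == 1 observations are the last block of the data
                  count if foreign == 1
                  local b = _N - r(N) + 1

                  * if: the condition is evaluated for every observation
                  generate byte flag_if = 1 if foreign == 1

                  * in: only observations b through the last one are touched
                  generate byte flag_in = 1 in `b'/l

                  * Both approaches flag exactly the same observations
                  assert flag_if == flag_in
                  ```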



                  • #10
                    Originally posted by Adrian Sayers
                    you run
                    Code:
                    su `i' `if' , mean
                    once and then use the results repeatedly.
                    Oh, I see. So you are going to have something like

                    Code:
                    replace newvarname1 = exp in `b'/`e'
                    replace newvarname2 = exp in `b'/`e'
                    ...
                    replace newvarname3 = exp in `b'/`e'
                    using the same range repeatedly. I can see how that is indeed faster than repeating the if qualifier.

                    Must be a really big dataset and lots of variables to make a substantive difference, though. Here is what I get for 100,000,000 observations:

                    Code:
                    . clear
                    
                    . set obs 100000000
                    Number of observations (_N) was 0, now 100,000,000.
                    
                    . generate r100 = runiformint(0,100)
                    
                    .
                    . timer clear
                    
                    .
                    . timer on 1
                    
                    . myindex if r100 == 42 , sort(r100)
                    _________________________________________
                    
                    First observation of sub index:41590659
                    Last observation of sub index: 42583182
                    Observations in subindex: 992523
                    _________________________________________
                    
                    . timer off 1
                    
                    . local b = r(first)
                    
                    . local e = r(last)
                    
                    . timer on 2
                    
                    . generate mynewvar = .
                    (100,000,000 missing values generated)
                    
                    . replace mynewvar = 1 in `b'/`e'
                    (992,524 real changes made)
                    
                    . timer off 2
                    
                    . timer on 3
                    
                    . generate mynewvar2 = 1 if r100 == 42
                    (99,007,476 missing values generated)
                    
                    . timer off 3
                    
                    .
                    . timer list
                       1:     39.53 /        1 =      39.5280
                       2:      2.44 /        1 =       2.4390
                       3:      3.09 /        1 =       3.0910
                    The in range is roughly 0.65 seconds faster than if (2.44 versus 3.09 seconds), give or take. But setting up the index takes almost 40 seconds. Thus, you need to change roughly 60 variables in a 100,000,000 observation dataset to break even. We are then talking about a total running time of less than 5 minutes either way.


                    EDIT

                    The comparison above is misleading. I should have generated -- not additionally replaced -- the variable in both approaches:

                    Code:
                    generate mynewvar = .
                    replace mynewvar = 1 in `b'/`e'
                    should just be

                    Code:
                    generate mynewvar = 1 in `b'/`e'
                    With this modification, the in approach is indeed much faster than if:

                    Code:
                    . timer list
                       1:     40.10 /        1 =      40.1000
                       2:      0.92 /        1 =       0.9170
                       3:      3.10 /        1 =       3.0960
                    You would still need to do a lot of work to get the 100-fold speed gains claimed in #6. For a 500K dataset, the differences are trivial even for 100 variables; here is one run on a 500,000-observation dataset:

                    Code:
                    . timer list
                       1:      0.10 /        1 =       0.0950
                       2:      0.01 /        1 =       0.0050
                       3:      0.01 /        1 =       0.0130
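
                    To put the timings from the EDIT into break-even terms, here is a quick calculation, offered as a sketch that simply plugs in the numbers reported above for the 100,000,000 observation run. The local names (setup, in_cost, if_cost, k) are illustrative. With a one-off setup cost of 40.10 seconds and a saving of 3.096 - 0.917 seconds per variable, the index pays for itself at about 19 variables.

                    ```stata
                    * Timings in seconds, copied from the second timer list above
                    local setup   = 40.10
                    local in_cost = 0.917
                    local if_cost = 3.096

                    * Smallest k with setup + k*in_cost <= k*if_cost
                    local k = ceil(`setup' / (`if_cost' - `in_cost'))
                    display "break-even at about `k' variables"
                    ```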

                    I am not saying there is no use case for this; I am just trying to put things into perspective for those interested in this thread.
                    Last edited by daniel klein; 15 May 2024, 14:07.



                    • #11
                      Hi Daniel,

                      I think the speeds depend on the dataset and on the fraction of it that your in range covers.

                      I tend to find that datasets with loads of string variables seem to work more slowly.

                      I don't fully understand why some datasets work faster than others.

                      I also use hashsort, which is quicker than sort, although sort is much faster than it used to be.

                      Anyhow, it's lots faster than splitting and appending datasets.

                      I have previously used indexing with tab to calculate ranges over multiple levels, which saves the time cost of summarizing repeatedly.
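
                      One possible reading of "indexing with tab", sketched under assumptions: after sorting, a single tabulate call with the matcell() and matrow() options gives the count for every level, and cumulating those counts yields the first and last observation of each level without summarizing once per level. The auto dataset and rep78 are stand-ins, the variable must be numeric for matrow(), and the matrix names freq and vals are arbitrary. Missing values sort last and are excluded by tabulate, so the ranges for the nonmissing levels are unaffected.

                      ```stata
                      sysuse auto, clear
                      sort rep78

                      * One pass: freq holds the count per level, vals the level values
                      tabulate rep78, matcell(freq) matrow(vals)
                      local nlevels = rowsof(freq)

                      * Cumulate the counts into first/last observation numbers
                      local first = 1
                      forvalues k = 1/`nlevels' {
                          local last = `first' + freq[`k', 1] - 1
                          display "rep78 == " vals[`k', 1] ": observations `first' to `last'"
                          * a range such as in `first'/`last' could be used here
                          local first = `last' + 1
                      }
                      ```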




