  • Stata version of R's apply command?

    I essentially want to calculate the mean and standard deviation of each column of a matrix. But of course it's not that simple.

    Here is my code so far:

    forval aa=1/100 {
    local bb=`aa'/100
    gen ln_port_`aa'=(`bb'*X) + ((1-`bb')*Y)
    }

    This creates of course 100 vectors of the format ln_port_1, ln_port_2,....,ln_port_100

    I would like to then create two new vectors; vector one would contain the standard deviations of each of the 100 vectors and vector 2 would contain the means of the 100 vectors.

    Thanks for any help.

  • #2
    I'd do something like this in Mata. I'm not sure this is the simplest solution, but here it goes:



    Code:
    mkmat ln_port_*, matrix(B)
    
    
    mata
    
    A = st_matrix("B")
    
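    // column means (1 x k row vector)
    avgs = colsum(A)/rows(A)
    
    // column standard deviations; note this divides by n, whereas Stata's sd divides by n-1
    sdevs = ( colsum( (A-avgs#J(rows(A),1,1)) :* (A-avgs#J(rows(A),1,1)) ) / rows(A) ):^.5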
    
    st_matrix("avgs",avgs)
    st_matrix("sdevs",sdevs)
    
    end
    Or in Stata:

    collapse (mean) ln_port_1-ln_port_100

    gives you a dataset containing the means,

    collapse (sd) ln_port_1-ln_port_100

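    one with the standard deviations. Because -collapse- replaces the data in memory, run each command on the original data (for example, inside -preserve-/-restore-). You can load these into vectors in Stata using -mkmat- if you actually need vectors.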
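
    A minimal sketch of the -collapse- route, assuming a toy dataset: the 50 observations, the seed, and the random X and Y are made up for illustration, as are the matrix names means and sds. -preserve- and -restore- bring the original data back after each -collapse-.

    ```stata
    * toy data standing in for the original X and Y (assumption)
    clear
    set obs 50
    set seed 2014
    generate double X = rnormal()
    generate double Y = rnormal()
    forvalues aa = 1/100 {
        generate double ln_port_`aa' = (`aa'/100)*X + (1 - `aa'/100)*Y
    }

    * means: collapse to one observation, copy it into a 1 x 100 matrix
    preserve
    collapse (mean) ln_port_1-ln_port_100
    mkmat ln_port_1-ln_port_100, matrix(means)
    restore

    * standard deviations, starting again from the restored data
    preserve
    collapse (sd) ln_port_1-ln_port_100
    mkmat ln_port_1-ln_port_100, matrix(sds)
    restore

    matrix list means
    matrix list sds
    ```

    The matrices survive -restore-, so both row vectors are available alongside the untouched dataset.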



    • #3
      The request is confusing due to the terminology used. In Stata, the -browse- command will show a matrix-like window with the loaded database. We refer to rows as observations and columns as variables. You are generating variables with the -gen- command, so I presume you want variables when you say "vectors". Additionally, you say you want two vectors (I presume variables, again), and each observation of these two variables is to contain the mean and standard deviation of the already present 100 variables. Is this all correct? Is something missing?

      Notice that Stata can work with matrices also (different than a loaded database); and Mata (Stata's matrix programming language) with matrices and vectors, which makes the request even more confusing. Maybe you should explain your ultimate goal with all this. That could get you some advice on the general procedure to follow. Trying to map some function/command from another software to some Stata function/command is likely not the best strategy.

      See -help summarize-, -help saved results-, -help matrix-, -help mata-.
      Last edited by Roberto Ferrer; 14 Sep 2014, 09:10.
      You should:

      1. Read the FAQ carefully.

      2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

      3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

      4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.



      • #4
        If you are working in Stata, not Mata, -egen- is your friend.

        Code:
        egen mean_ln_port = rowmean(ln_port_*)
        egen sd_ln_port = rowsd(ln_port_*)



        • #5
          alternative solution in Mata


          Code:
          cap drop mean_ln_por
          cap drop sd_ln_por
          
          mata:
          
          // load the 100 variables, one column per variable
          ln_por = st_data(.,"ln_port_*")
          // column means and standard deviations, as column vectors
          M = mean(ln_por)'
          SD = sqrt(diagonal(quadvariance(ln_por)))
          
          // create two new variables and fill their first rows(M) observations
          (void) st_addvar("double", tokens("mean_ln_por sd_ln_por"))
          st_store((1,rows(M)),tokens("mean_ln_por sd_ln_por"),(M,SD)) // you can also use getmata and putmata instead of st_data/st_addvar/st_store
          
          end
          
          list mean_ln_por sd_ln_por in 1/100
          Otherwise you can loop and store the results in two new variables (or in a matrix; I chose two variables):

          Code:
          // needs at least 100 observations, since result i goes into observation i
          gen double mean_ln_por = .
          gen double sd_ln_por   = .
          forval i = 1/100 {
                       qui sum ln_port_`i'
                       qui replace mean_ln_por = `r(mean)' in `i'
                       qui replace sd_ln_por   = `r(sd)'   in `i'
          }
          Best
          Christophe




          • #6
            Clyde
            I had the same reaction as you, but I think what is wanted is not the means and standard deviations across the rows. Rather, the mean and standard deviation of each variable are to be stored in two new variables, each with 100 nonmissing observations.



            • #7
              Originally posted by Roberto Ferrer View Post
              We refer to rows as observations and columns as variables. You are generating variables with the -gen- command, so I presume you want variables when you say "vectors".
              I don't think this is the case here. R operates on data structures, not datasets, and a vector is the simplest data structure, holding a collection of numbers or other values. A vector does not have to correspond to a variable or to data in the traditional sense; it can be empty, logical, and so on.

              Originally posted by bennfine View Post
              I would like to then create two new vectors; vector one would contain the standard deviations of each of the 100 vectors and vector 2 would contain the means of the 100 vectors.
              To my mind, the closest solution in Stata can be achieved with the -scalar- command. Scalars differ from variables in that they do not correspond to observations; they can hold strings or the results of arithmetic expressions, very broadly along the lines of R's vector concept. In practice, you could also run your loops and use stored results to push the desired values into a macro. If you intend to use the vector as you would in R, then presumably your option is to evaluate your scalar when needed (a discussion on a similar problem is available here).


              Kind regards,
              Konrad
              Version: Stata/IC 13.1



              • #8
                Originally posted by Clyde Schechter View Post
                If you are working in Stata, not Mata, -egen- is your friend.

                Code:
                egen mean_ln_port = rowmean(ln_port_*)
                egen sd_ln_port = rowsd(ln_port_*)
                Very close! Thanks for the other answers, but I didn't realize the question was ill-posed or required a matrix package.

                This solution is essentially what I want, but I want the summary on the columns, not on the rows.

                So essentially egen mean_ln_port = colmean(ln_port_*), but there doesn't seem to be a colmean command?


                ............

                Now let's examine this paragraph that another person posted:

                The request is confusing due to the terminology used. In Stata, the -browse- command will show a matrix-like window with the loaded database. We refer to rows as observations and columns as variables. You are generating variables with the -gen- command, so I presume you want variables when you say "vectors". Additionally, you say you want two vectors (I presume variables, again), and each observation of these two variables is to contain the mean and standard deviation of the already present 100 variables. Is this all correct? Is something missing?

                Variables, vectors... I do find this confusing, so yes, I guess I am intermixing the two. Your summary is correct. I have 100 variables, each one of length n. I want to create two new variables. variable1 would be of length n and contain the means of the 100 other variables. variable2 would be of length n and contain the standard deviations of the 100 variables.



                • #9
                  -egen- also has functions mean() and sd() that provide "column" (in Stata we call them variables) summary statistics. But why would you want to create a whole new variable that just contains the same number in every observation? Perhaps what you want is to create 200 scalars, two for each of your ln_port_ variables.

                  Code:
                  forvalues j = 1/100 {
                       sum ln_port_`j'
                       scalar mean_`j' = r(mean)
                       scalar sd_`j' = r(sd)
                  }
                  Is that what you're trying to do?
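
                  To see what the -egen- route mentioned above produces, here is a small sketch on the auto dataset shipped with Stata (chosen here only as example data; the variable names mean_price and sd_price are made up). Each new variable holds one number repeated in every observation.

                  ```stata
                  sysuse auto, clear

                  * mean() and sd() summarize down a variable, not across a row
                  egen double mean_price = mean(price)
                  egen double sd_price   = sd(price)

                  * both new variables are constants, so min equals max for each
                  summarize mean_price sd_price
                  ```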



                  • #10
                    Originally posted by bennfine View Post
                    Variables, vectors... I do find this confusing, so yes, I guess I am intermixing the two. Your summary is correct. I have 100 variables, each one of length n. I want to create two new variables. variable1 would be of length n and contain the means of the 100 other variables. variable2 would be of length n and contain the standard deviations of the 100 variables.
                    Clyde has already mentioned the usual approach to this problem which is not saving to a variable, but to a scalar or a macro. That is why in my original post I suggested -help summarize- and -help saved results-. -help foreach- and -help forvalues- should have been suggested too. If you insist on having those results (mean and sd) in variables, you can check -help post-. With -post- you can save those results to a different dataset, each in its own variable.

                    The idea of saving results like yours to the same database is sometimes convenient, sometimes not. Suppose your n < 100 (observations). Then you have to expand the database to fit in the results; this can be done, without much difficulty, but I don't think it's the cleanest way to work with data. -post- simply puts them in a different database.
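
                    A rough sketch of the -post- approach, assuming the auto dataset as stand-in data. The handle results, the temporary file stats, and the variable names varname, mean and sd are illustrative choices, not anything from the thread. For the data discussed here, the loop would instead run -forvalues- over ln_port_1 to ln_port_100.

                    ```stata
                    sysuse auto, clear

                    * open a results file with one string and two numeric variables
                    tempname results
                    tempfile stats
                    postfile `results' str32 varname double(mean sd) using "`stats'", replace

                    * one posted observation per summarized variable
                    foreach v of varlist price mpg weight length {
                        quietly summarize `v'
                        post `results' ("`v'") (r(mean)) (r(sd))
                    }
                    postclose `results'

                    * load the results; this replaces the data in memory
                    use "`stats'", clear
                    list
                    ```

                    The results dataset has as many observations as summarized variables, regardless of how many observations the original data had.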

                    I strongly recommend reading at least the introductory chapters of the Stata User's Manual. This will get you going with the basic terminology. Solving problems on your own, and asking others for help will be greatly facilitated.
                    Last edited by Roberto Ferrer; 14 Sep 2014, 12:38.


                    • #11
                      I guess I will use "post" more often in the future.



                      • #12
                        It seems that there is a potentially easier solution that has been overlooked:

                        Code:
                        clear
                        
                        // Create data set with 1,000 observations/records in it
                        set obs 1000
                        
                        // Set the random number seed
                        set seed 7779311
                        
                        // Loop over the values 1-100
                        forv i = 1/100 {
                             
                            // Generate 100 random variables x1-x100 from normal distributions
                            qui: g double x`i' = rnormal()
                        
                        }
                        
                        // Compute the means and standard deviations for the 100 variables and store the results
                        tabstat x1-x100, s(mean sd) c(s) save
                        
                        // See what was stored
                        return list
                        
                        // If you want to save the means and standard deviations in a matrix you can use this
                        mat meansds = r(StatTotal)
                        
                        // And you can view the matrix as well
                        mat li meansds
                        In other words, tabstat does any/everything that is required. The closest thing to the apply functions in R in the Stata environment is probably the by prefix. The difference is that not all commands/functions are byable in Stata.
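
                        As a small illustration of the by prefix, here is a sketch on the auto dataset (example data chosen for convenience). The -statsby- line is an addition not mentioned in the thread; it gathers the per-group results into a new dataset, with mean and sd as illustrative variable names.

                        ```stata
                        sysuse auto, clear

                        * run the same command once per group of foreign
                        bysort foreign: summarize price mpg

                        * collect the per-group mean and sd of price into a new dataset
                        statsby mean=r(mean) sd=r(sd), by(foreign) clear: summarize price
                        list
                        ```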



                        • #13
                          wbuchanan's solution is a nice one. If you are sure you want to put that information in variables, you can use -svmat-:

                          Code:
                          <snip>
                          
                          tabstat x1-x100, s(mean sd) c(s) save
                          
                          mat meansds = r(StatTotal)'
                          
                          svmat meansds, names(col)
                          
                          list mean sd in 1/101
                          Note I have used the transpose of -r(StatTotal)-.


                          • #14
                            Another possible Stata solution is to use collapse if you are just looking for the column means and sds. Collapse can also be used with "by", which could produce subgroup means and sds, similar to -tapply- in R.

                            Example:

                            sysuse auto
                            keep price mpg weight length
                            collapse (mean) price mpg weight length (sd) sdprice=price sdmpg=mpg sdweight=weight sdlength=length


                            Tim



                            • #15
                              As Roberto writes, wbuchanan's solution is a nice one. Its drawback is that tabstat only allows fweights and aweights. If you use other kinds of weights, collapse or summarize within a loop (combined with a postfile or not, depending on your goal) are good alternatives.
                              Christophe
