Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Adding observations based on combination of existing two variables


    Dear Statalisters,
    I have industry pair data, the example below shows that I have 2 row industries (Industry_row, 11A0, 11B0), and 5 column industries (Industry_column 1121, 11A0, 3122, 1129, 2111).
    Industry_row Industry_column xvar
    11A0 1121 2
    11A0 11A0 1
    11A0 3122 4
    11B0 1129 2
    11B0 2111 3
    I want to add industry pair observations for the combinations that are not currently in the dataset and assign zero to xvars, that is I want to add
    11A0 11B0 0
    11A0 1129 0
    11A0 2111 0
    11B0 1121 0
    11B0 11A0 0
    11B0 3122 0
    I need this for creating a square matrix of Industry_row* Industry_column , My real data has 70,000 records.
    If you could help with some programming tips, that will be greatly appreciated.

    Thanks,

    Rochelle

  • #2
    It might take a while with 70,000 rows, but fillin is probably what you want.

    Code:
    version 13.1
    
    clear *
    set more off
    
    input str4(Industry_row Industry_column) byte xvar
    11A0 1121 2
    11A0 11A0 1
    11A0 3122 4
    11B0 1129 2
    11B0 2111 3
    end
    
    *
    * Begin here
    *
    fillin Industry_row Industry_column
    replace xvar = 0 if missing(xvar)
    
    list, noobs separator(0) abbreviate(30)
    
    exit

    Comment


    • #3
      What method are you planning to use to convert your dataset to a matrix? If you initialize it with J(i ,j,0) you shouldn't have to add observations in advance to fill in the gaps.

      Comment


      • #4
        fillin will give you almost what you want. For your example dataset, you will not get 11A0 11B0 0.

        Another approach is to get all pairwise combinations using cross. This is consistent with the square matrix you say you want; that is, you want all industries to appear in both rows and columns (according to my interpretation). The issue is that the result will have more observations (i.e. combinations) than your example shows (but that can probably be fixed if needed). So maybe there's a contradiction in your original question. Anyway, below some code:
        Code:
        clear all
        set more off
        
        *----- original dataset -----
        
        input str4(irow icol) byte xvar
        11A0 1121 2
        11A0 11A0 1
        11A0 3122 4
        11B0 1129 2
        11B0 2111 3
        end
        
        sort irow icol
        tempfile orig
        save "`orig'"
        list
        
        *----- stacked dataset -----
        
        stack irow icol, into(iall) clear
        drop _stack
        list
        
        bysort iall: keep if _n == 1
        list
        
        tempfile all
        save "`all'"
        
        *----- all pairwise combinations -----
        
        rename iall iall0
        cross using "`all'"
        
        rename (iall0 iall) (irow icol)
        list
        
        *----- merge in known values of -xvar- -----
        
        merge 1:1 irow icol using "`orig'"
        drop _merge
        
        replace xvar = 0 if missing(xvar)
        list, sepby(irow)
        Last edited by Roberto Ferrer; 19 Apr 2014, 10:32.
        You should:

        1. Read the FAQ carefully.

        2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

        3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

        4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

        Comment


        • #5
          You could also use:

          Code:
          <snip>
          
          *----- all pairwise combinations -----
          
          gen iall2 = iall
          fillin iall iall2
          drop _fillin
          
          rename (iall iall2) (irow icol)
          
          <snip>
          but I haven't tested speed.

          See

          The Stata Journal (2005)
          5, Number 1, pp. 135–136
          Stata tip 17: Filling in the gaps
          Nicholas J. Cox
          http://www.stata-journal.com/sjpdf.h...iclenum=dm0011
          You should:

          1. Read the FAQ carefully.

          2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

          3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

          4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

          Comment


          • #6
            Many thanks to Joe Coveney, Joe Canner, and Roberto !!! My response to Roberto and Joe are shown below. Thanks for your time !!!

            *********************
            To Roberto,

            I got all pairwise combinations (I have 450 unique irow and 473 unique icol, so the symmetric matrix is 473*473 for all combinations). My next step is to compute eigenvector for this matrix.

            1. after your program " merge 1:1 irow icol using "`orig'"
            drop _merge

            replace xvar = 0 if missing(xvar)

            should I save the data (I never wrote matrix code before),

            should I say save newdata,replace

            2. to compute eigenvector for this matrix, should I use

            matrix symeigen X lambda = newdata

            *********************
            To Joe (Joe Canner)

            concerning your question -What method are you planning to use to convert your dataset to a matrix? If you initialize it with J(i ,j,0) you shouldn't have to add observations in advance to fill in the gaps.

            I plan to use Nick's code to create a symmetric matrix (see below), could you give some reference on how to use J(i ,j,0) (I never used matrix in stata)

            qui tab industry_row
            local nvals = r(r)

            egen i = seq(), block(`nvals')
            egen j = seq(), to(`nvals')

            matrix A=J(`nvals',`nvals',.)

            forval n = 1/`=_N' {
            matrix A[`=i[`n']', `=j[`n']'] = xvar[`n']
            if comio[`n'] != comio[`=`n'-1'] {
            local rownames `rownames' `=comio[`n']'
            }
            }
            matrix rownames A = `rownames'
            matrix colnames A = `rownames'

            *********************

            Comment


            • #7
              Try: matrix A=J(`nvals',`nvals',0)

              instead of: matrix A=J(`nvals',`nvals',.)

              This will initialize matrix A to all zeros and then the subsequent code will fill in the pairs that have values of xvar. If I understand your problem correctly, this should be what you want without having to do any pre-processing of the data set.

              Comment


              • #8
                Rstata,

                In the FAQ you are asked to use your complete real name in this forum.
                I suggest you try the editing options when posting (made available by pressing the advanced editor option at the top right of the editor window). It makes reading code, quotes, etc. easier.

                The foillowing is complete code up to the eigenvectors. You can maybe modify it to accommodate for Joe Canner's suggestion of not looping over all elements of the matrix.

                Code:
                clear all
                set more off
                
                *----- original dataset -----
                
                input str4(irow icol) byte xvar
                11A0 1121 2
                11A0 11A0 1
                11A0 3122 4
                11B0 1129 2
                11B0 2111 3
                end
                
                sort irow icol
                tempfile orig
                save "`orig'"
                list
                
                *----- stacked dataset -----
                
                stack irow icol, into(iall) clear
                drop _stack
                list                
                
                bysort iall: keep if _n == 1
                list
                
                *----- all pairwise combinations -----
                
                gen iall2 = iall
                fillin iall iall2
                drop _fillin
                
                rename (iall iall2) (irow icol)
                list
                
                *----- merge-in known values of -xvar- -----
                
                merge 1:1 irow icol using "`orig'"
                drop _merge
                
                replace xvar = 0 if missing(xvar)
                list, sepby(irow)
                
                *----- create matrix -----
                
                quietly tab irow
                local nvals = r(r)
                
                egen i = seq(), block(`nvals')
                egen j = seq(), to(`nvals')
                
                matrix A = J(`nvals',`nvals',.)
                
                forval n = 1/`=_N' {
                    matrix A[`=i[`n']', `=j[`n']'] = xvar[`n']
                    if irow[`n'] != irow[`=`n'-1'] {
                        local rownames `rownames' `=irow[`n']'
                    }
                }
                
                matrix rownames A = `rownames'
                matrix colnames A = `rownames'
                
                matrix list A
                
                *----- eigenvalues -----
                
                mata
                
                MA = st_matrix("A")
                X = .
                L = .
                
                eigensystem(MA, X, L)
                X
                L
                
                end
                The eigenvectors are computed within Mata. See help mata, help st_matrix and help mf_eigensystem.

                Are you sure your matrix is symmetric (equal to its transpose)? Is it not some kind of input-output matrix? These have same row and column headers but are not symmetric in the mathematical sense.
                You should:

                1. Read the FAQ carefully.

                2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

                3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

                4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

                Comment


                • #9
                  It seems fair to guess that you are Rochelle Zhang. That, and not Rstata, is an appropriate identifier for this forum.

                  Either way, please explain what you want to do that was not discussed in the long thread starting at http://www.stata.com/statalist/archi.../msg00921.html

                  Comment


                  • #10
                    Nick and Roberto,
                    Yes, I am Rochelle Zhang. I reregistered and will use Rochelle Zhang for future posting. I apologize for this indiscretion on my part.

                    In response to Roberto lastest post,
                    1. yes, I am using input output data. my stata data has just over 70,000 for pairs of irow and icol. My count shows 450 unique irows and 473 unique columns. so after I use your program (4/19/2014 post) , I got a table with 223,729 rows, so it should be a square matrix (473*473).

                    In response to Nick's latest post:

                    In the future I will remember to provide cross reference for my earlier posting http://www.stata.com/statalist/archi.../msg00921.html.

                    You are correct it is a continuation of that post. My thought process is
                    1. create the symmetric matrix (industry_row and industry_col), for the pairs that were not in my data , I assign zero to that pair,
                    2. my logic for doing step 1 is that to generate eigenvector , matrix symeigen - Stata calls for a symmetric matrix.
                    if my thinking is incorrect, please let me know.

                    Thanks very much,

                    Rochelle

                    Comment


                    • #11
                      Rochelle, In my last post I show how to compute the eigenvectors using Mata. My guess is that your matrix is not symmetric, but just square. In input-output jargon, it is a symmetric table, which is different than a symmetric matrix.
                      You should:

                      1. Read the FAQ carefully.

                      2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

                      3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

                      4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

                      Comment


                      • #12
                        Thank you, Roberto.

                        Your comment on my matrix being not symmetric is spot on.

                        1. I want to use input and output dataset from BEA to create a symmetric matrix . My initial post Industry_row is the input industry (453 unique industries), and industry_column is the output industry ((470 unique industries). The xvar is a value I computed for my analysis.

                        2. I am following the eigenvector centrality (Bonacich, 1972) (http://www.tandfonline.com/doi/pdf/1...X.1972.9989806) . My first step is to create the adjacency matrix A based on my existing data, so use my pretend data :

                        +++++++++++
                        input str4(Industry_row Industry_column) byte xvar
                        11A0 1121 2
                        11A0 11A0 1
                        11A0 3122 4
                        11B0 1129 2
                        11B0 2111 3
                        end
                        ++++++++++++++

                        after I run your code, I get a square matrix with all industry pairs (input-output pairs),
                        11A0 11B0 1121 3122 1129 2111
                        11A0 1 0 2 4 0 0
                        11B0 0 0 0 0 2 3
                        1121 0 0 0 0 0 0
                        3122 0 0 0 0 0 0
                        1129 0 0 0 0 0 0
                        2111 0 0 0 0 0 0

                        +++++++++
                        Next I want to make it symmetrical (following another study, the authors took the maximum value of the upper and lower triangles), and the resulting matrix should look like
                        11A0 11B0 1121 3122 1129 2111
                        11A0 1 0 2 4 0 0
                        11B0 0 0 0 0 2 3
                        1121 2 0 0 0 0 0
                        3122 4 0 0 0 0 0
                        1129 0 2 0 0 0 0
                        2111 0 3 0 0 0 0
                        basically, if the original data has 11A0 1121 with xvar =2, we will assign 1121 11A0 pair's xvar a value of 2 (replace 0).

                        next, the eigenvector and eigenvalue will be computed based on this symmetric matrix .

                        My questions:
                        1. How do I make my square matrix symmetric ?

                        2. with Roberto's mata code, how do I save the eigenvector and eigenvalue into stata datasets.

                        thanks a lot,

                        Rochelle


                        Comment


                        • #13
                          Rochelle,

                          1. I think all you would need to do to create a symmetric matrix is to populate xvar into the two corresponding elements of the matrix, (i,j) and (j,i), when you create it:

                          Code:
                           matrix A = J(`nvals',`nvals',0)  
                          
                          forval n = 1/`=_N' {    
                            matrix A[`=i[`n']', `=j[`n']'] = xvar[`n']    
                            matrix A[`=j[`n']', `=i[`n']'] = xvar[`n']    
                            if irow[`n'] != irow[`=`n'-1']  {
                                   local rownames `rownames' `=irow[`n']'    
                            }
                          } 
                          2. To save eigenvalues and eigenvectors into Stata datasets from within Mata (following Roberto's code), check out the Mata st_store() function. The details will depend on how you want to store the results (i.e., as a symmetric matrix or as a list of ordered pairs as in your original dataset). You can also generate eigenvectors and eigenvalues without using Mata and then use the matrix svmat command to convert the matrices to data sets:

                          Code:
                          matrix symeigen X L = A
                          matrix svmat X, names(...)
                          matrix svmat L, names(...)
                          Again, the details (what to put in the names() option) depend on what you are trying to do.

                          Regards,
                          Joe

                          Comment


                          • #14
                            Originally posted by Rochelle Zhang View Post
                            Next I want to make it symmetrical (following another study, the authors took the maximum value of the upper and lower triangles), ...

                            basically, if the original data has 11A0 1121 with xvar =2, we will assign 1121 11A0 pair's xvar a value of 2 (replace 0).
                            You need to be careful here or you might end up overwriting values that were not meant to be overwritten. (I believe your comment on taking the maximum is relevant here.) The rules for replacing values should be very clear.

                            This looks like material for another thread. Try to make your problem as abstract as possible without leaving out details that could be of use to people trying to help (this balancing act may be difficult); context can be very helpful. Instead of keeping the same thread going on to solve incoming problems, consider if a new thread with an appropriate title should be opened.
                            Last edited by Roberto Ferrer; 22 Apr 2014, 12:43.
                            You should:

                            1. Read the FAQ carefully.

                            2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

                            3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

                            4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

                            Comment

                            Working...
                            X