Adding observations based on combination of existing two variables

Rochelle Zhang

Join Date: Feb 2025

Posts: 0
#1

Adding observations based on combination of existing two variables

19 Apr 2014, 07:42

Dear Statalisters,
I have industry pair data, the example below shows that I have 2 row industries (Industry_row, 11A0, 11B0), and 5 column industries (Industry_column 1121, 11A0, 3122, 1129, 2111).
Industry_row Industry_column xvar
11A0 1121 2
11A0 11A0 1
11A0 3122 4
11B0 1129 2
11B0 2111 3
I want to add industry pair observations for the combinations that are not currently in the dataset and assign zero to xvars, that is I want to add
11A0 11B0 0
11A0 1129 0
11A0 2111 0
11B0 1121 0
11B0 11A0 0
11B0 3122 0
I need this for creating a square matrix of Industry_row* Industry_column , My real data has 70,000 records.
If you could help with some programming tips, that will be greatly appreciated.

Thanks,

Rochelle
Tags: None

Joseph Coveney

Join Date: Apr 2014
Posts: 4347

19 Apr 2014, 09:11

It might take a while with 70,000 rows, but fillin is probably what you want.

Code:

version 13.1

clear *
set more off

input str4(Industry_row Industry_column) byte xvar
11A0 1121 2
11A0 11A0 1
11A0 3122 4
11B0 1129 2
11B0 2111 3
end

*
* Begin here
*
fillin Industry_row Industry_column
replace xvar = 0 if missing(xvar)

list, noobs separator(0) abbreviate(30)

exit

Comment

Joe Canner

Join Date: Mar 2014

Posts: 580
#3

19 Apr 2014, 10:23

What method are you planning to use to convert your dataset to a matrix? If you initialize it with J(i ,j,0) you shouldn't have to add observations in advance to fill in the gaps.
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#4

19 Apr 2014, 10:29

fillin will give you almost what you want. For your example dataset, you will not get 11A0 11B0 0.

Another approach is to get all pairwise combinations using cross. This is consistent with the square matrix you say you want; that is, you want all industries to appear in both rows and columns (according to my interpretation). The issue is that the result will have more observations (i.e. combinations) than your example shows (but that can probably be fixed if needed). So maybe there's a contradiction in your original question. Anyway, below some code:

Code:

clear all set more off *----- original dataset ----- input str4(irow icol) byte xvar 11A0 1121 2 11A0 11A0 1 11A0 3122 4 11B0 1129 2 11B0 2111 3 end sort irow icol tempfile orig save "`orig'" list *----- stacked dataset ----- stack irow icol, into(iall) clear drop _stack list bysort iall: keep if _n == 1 list tempfile all save "`all'" *----- all pairwise combinations ----- rename iall iall0 cross using "`all'" rename (iall0 iall) (irow icol) list *----- merge in known values of -xvar- ----- merge 1:1 irow icol using "`orig'" drop _merge replace xvar = 0 if missing(xvar) list, sepby(irow)

Last edited by Roberto Ferrer; 19 Apr 2014, 10:32.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#5

19 Apr 2014, 11:47

You could also use:

Code:

<snip> *----- all pairwise combinations ----- gen iall2 = iall fillin iall iall2 drop _fillin rename (iall iall2) (irow icol) <snip>

but I haven't tested speed.

See

The Stata Journal (2005)
5, Number 1, pp. 135–136
Stata tip 17: Filling in the gaps
Nicholas J. Cox
http://www.stata-journal.com/sjpdf.h...iclenum=dm0011

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Rochelle Zhang

Join Date: Feb 2025

Posts: 0
#6

19 Apr 2014, 15:42

Many thanks to Joe Coveney, Joe Canner, and Roberto !!! My response to Roberto and Joe are shown below. Thanks for your time !!!

*********************
To Roberto,

I got all pairwise combinations (I have 450 unique irow and 473 unique icol, so the symmetric matrix is 473*473 for all combinations). My next step is to compute eigenvector for this matrix.

1. after your program " merge 1:1 irow icol using "`orig'"
drop _merge

replace xvar = 0 if missing(xvar)

should I save the data (I never wrote matrix code before),

should I say save newdata,replace

2. to compute eigenvector for this matrix, should I use

matrix symeigen X lambda = newdata

*********************
To Joe (Joe Canner)

concerning your question -What method are you planning to use to convert your dataset to a matrix? If you initialize it with J(i ,j,0) you shouldn't have to add observations in advance to fill in the gaps.

I plan to use Nick's code to create a symmetric matrix (see below), could you give some reference on how to use J(i ,j,0) (I never used matrix in stata)

qui tab industry_row
local nvals = r(r)

egen i = seq(), block(`nvals')
egen j = seq(), to(`nvals')

matrix A=J(`nvals',`nvals',.)

forval n = 1/`=_N' {
matrix A[`=i[`n']', `=j[`n']'] = xvar[`n']
if comio[`n'] != comio[`=`n'-1'] {
local rownames `rownames' `=comio[`n']'
}
}
matrix rownames A = `rownames'
matrix colnames A = `rownames'

*********************
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#7

19 Apr 2014, 16:24

Try: matrix A=J(`nvals',`nvals',0)

instead of: matrix A=J(`nvals',`nvals',.)

This will initialize matrix A to all zeros and then the subsequent code will fill in the pairs that have values of xvar. If I understand your problem correctly, this should be what you want without having to do any pre-processing of the data set.
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#8

20 Apr 2014, 01:05

Rstata,

In the FAQ you are asked to use your complete real name in this forum.
I suggest you try the editing options when posting (made available by pressing the advanced editor option at the top right of the editor window). It makes reading code, quotes, etc. easier.

The foillowing is complete code up to the eigenvectors. You can maybe modify it to accommodate for Joe Canner's suggestion of not looping over all elements of the matrix.

Code:

clear all set more off *----- original dataset ----- input str4(irow icol) byte xvar 11A0 1121 2 11A0 11A0 1 11A0 3122 4 11B0 1129 2 11B0 2111 3 end sort irow icol tempfile orig save "`orig'" list *----- stacked dataset ----- stack irow icol, into(iall) clear drop _stack list bysort iall: keep if _n == 1 list *----- all pairwise combinations ----- gen iall2 = iall fillin iall iall2 drop _fillin rename (iall iall2) (irow icol) list *----- merge-in known values of -xvar- ----- merge 1:1 irow icol using "`orig'" drop _merge replace xvar = 0 if missing(xvar) list, sepby(irow) *----- create matrix ----- quietly tab irow local nvals = r(r) egen i = seq(), block(`nvals') egen j = seq(), to(`nvals') matrix A = J(`nvals',`nvals',.) forval n = 1/`=_N' { matrix A[`=i[`n']', `=j[`n']'] = xvar[`n'] if irow[`n'] != irow[`=`n'-1'] { local rownames `rownames' `=irow[`n']' } } matrix rownames A = `rownames' matrix colnames A = `rownames' matrix list A *----- eigenvalues ----- mata MA = st_matrix("A") X = . L = . eigensystem(MA, X, L) X L end

The eigenvectors are computed within Mata. See help mata, help st_matrix and help mf_eigensystem.

Are you sure your matrix is symmetric (equal to its transpose)? Is it not some kind of input-output matrix? These have same row and column headers but are not symmetric in the mathematical sense.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35173
#9

20 Apr 2014, 06:10

It seems fair to guess that you are Rochelle Zhang. That, and not Rstata, is an appropriate identifier for this forum.

Either way, please explain what you want to do that was not discussed in the long thread starting at http://www.stata.com/statalist/archi.../msg00921.html
Comment
Rochelle Zhang

Join Date: Feb 2025

Posts: 0
#10

20 Apr 2014, 14:33

Nick and Roberto,
Yes, I am Rochelle Zhang. I reregistered and will use Rochelle Zhang for future posting. I apologize for this indiscretion on my part.

In response to Roberto lastest post,
yes, I am using input output data. my stata data has just over 70,000 for pairs of irow and icol. My count shows 450 unique irows and 473 unique columns. so after I use your program (4/19/2014 post) , I got a table with 223,729 rows, so it should be a square matrix (473*473).

In response to Nick's latest post:

In the future I will remember to provide cross reference for my earlier posting http://www.stata.com/statalist/archi.../msg00921.html.

You are correct it is a continuation of that post. My thought process is
create the symmetric matrix (industry_row and industry_col), for the pairs that were not in my data , I assign zero to that pair,

my logic for doing step 1 is that to generate eigenvector , matrix symeigen - Stata calls for a symmetric matrix.

if my thinking is incorrect, please let me know.

Thanks very much,

Rochelle
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#11

20 Apr 2014, 21:28

Rochelle, In my last post I show how to compute the eigenvectors using Mata. My guess is that your matrix is not symmetric, but just square. In input-output jargon, it is a symmetric table, which is different than a symmetric matrix.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Rochelle Zhang

Join Date: Feb 2025

Posts: 0
#12

21 Apr 2014, 18:57

Thank you, Roberto.

Your comment on my matrix being not symmetric is spot on.

1. I want to use input and output dataset from BEA to create a symmetric matrix . My initial post Industry_row is the input industry (453 unique industries), and industry_column is the output industry ((470 unique industries). The xvar is a value I computed for my analysis.

2. I am following the eigenvector centrality (Bonacich, 1972) (http://www.tandfonline.com/doi/pdf/1...X.1972.9989806) . My first step is to create the adjacency matrix A based on my existing data, so use my pretend data :

+++++++++++
input str4(Industry_row Industry_column) byte xvar
11A0 1121 2
11A0 11A0 1
11A0 3122 4
11B0 1129 2
11B0 2111 3
end
++++++++++++++

after I run your code, I get a square matrix with all industry pairs (input-output pairs),
11A0 11B0 1121 3122 1129 2111

11A0 1 0 2 4 0 0

11B0 0 0 0 0 2 3

1121 0 0 0 0 0 0

3122 0 0 0 0 0 0

1129 0 0 0 0 0 0

2111 0 0 0 0 0 0

+++++++++
Next I want to make it symmetrical (following another study, the authors took the maximum value of the upper and lower triangles), and the resulting matrix should look like

11A0 11B0 1121 3122 1129 2111

11A0 1 0 2 4 0 0

11B0 0 0 0 0 2 3

1121 2 0 0 0 0 0

3122 4 0 0 0 0 0

1129 0 2 0 0 0 0

2111 0 3 0 0 0 0

basically, if the original data has 11A0 1121 with xvar =2, we will assign 1121 11A0 pair's xvar a value of 2 (replace 0).

next, the eigenvector and eigenvalue will be computed based on this symmetric matrix .

My questions:
1. How do I make my square matrix symmetric ?

2. with Roberto's mata code, how do I save the eigenvector and eigenvalue into stata datasets.

thanks a lot,

Rochelle
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#13

22 Apr 2014, 11:34

Rochelle,

1. I think all you would need to do to create a symmetric matrix is to populate xvar into the two corresponding elements of the matrix, (i,j) and (j,i), when you create it:

Code:

matrix A = J(`nvals',`nvals',0) forval n = 1/`=_N' { matrix A[`=i[`n']', `=j[`n']'] = xvar[`n'] matrix A[`=j[`n']', `=i[`n']'] = xvar[`n'] if irow[`n'] != irow[`=`n'-1'] { local rownames `rownames' `=irow[`n']' } }

2. To save eigenvalues and eigenvectors into Stata datasets from within Mata (following Roberto's code), check out the Mata st_store() function. The details will depend on how you want to store the results (i.e., as a symmetric matrix or as a list of ordered pairs as in your original dataset). You can also generate eigenvectors and eigenvalues without using Mata and then use the matrix svmat command to convert the matrices to data sets:

Code:

matrix symeigen X L = A matrix svmat X, names(...) matrix svmat L, names(...)

Again, the details (what to put in the names() option) depend on what you are trying to do.

Regards,
Joe
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#14

22 Apr 2014, 12:35

Originally posted by Rochelle Zhang View Post

Next I want to make it symmetrical (following another study, the authors took the maximum value of the upper and lower triangles), ...

basically, if the original data has 11A0 1121 with xvar =2, we will assign 1121 11A0 pair's xvar a value of 2 (replace 0).

You need to be careful here or you might end up overwriting values that were not meant to be overwritten. (I believe your comment on taking the maximum is relevant here.) The rules for replacing values should be very clear.

This looks like material for another thread. Try to make your problem as abstract as possible without leaving out details that could be of use to people trying to help (this balancing act may be difficult); context can be very helpful. Instead of keeping the same thread going on to solve incoming problems, consider if a new thread with an appropriate title should be opened.

Last edited by Roberto Ferrer; 22 Apr 2014, 12:43.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment

Announcement

Adding observations based on combination of existing two variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment