Basic matrix question from unexperienced user

Willem Vanlaer

Join Date: May 2014

Posts: 52
#1

Basic matrix question from unexperienced user

12 May 2014, 10:05

Hello,

I have only recently started using STATA, so forgive me for what will probably be something very straightforward.

Let's say you have 2 matrices:

A =
a
b
c
d

B = (1,2,3)

Now, I would like to merge these two into following matrix:

C =
a 1
a 2
a 3
b 1
b 2
b 3
c 1
c 2
c 3
d 1
d 2
d 3

So a new matrix, with
- the number of rows = number of rows A * number of columns in B
- number of columns = 2

Which commands do I use to go from A to B?

Thank you in advance,

Willem
Tags: None
James Fiedler

Join Date: Mar 2014

Posts: 29
#2

12 May 2014, 10:37

Are A and B the same type of matrix? In other words, are eltype(A) and eltype(B) equal? I ask because your A seems to be a string matrix. If the matrices are not the same type, then you generally cannot combine them in this way.

I will assume they are the same type. The following code, used interactively, should give you what you want.

Code:

A = (1, 2, 3, 4) B = (1, 2, 3) lenA = length(A) lenB = length(B) C = J(lenA * lenB, 2, .) for (i = 1; i <= lenA; i++) { for (j = 1; j <= lenB; j++) { C[(i-1)*lenB + j, .] = A[i], B[j] } }

You'll probably have to make changes to fit your needs. For example, if you have string matrices, use C = J(lenA * lenB, 2, ""). Also, this code assumes A and B will both be vectors (matrices with a single row or a single column).

Finally, a minor point: it's actually "Stata", not "STATA".
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#3

12 May 2014, 11:40

Originally posted by Willem Vanlaer View Post

So a new matrix, with
- the number of rows = number of rows A * number of columns in B
- number of columns = 2

Which commands do I use to go from A to B?

As James indicated, a matrix in Mata must contain either all numeric or all string values, so we'll assume that both A and B are of the same type. In addition, we'll assume (as in your example) that A is n x 1 and B is 1 x m. Given this, you can obtain C with
vec(J(cols(B),1,A')), J(rows(A),1,B')
I haven't thought about whether this is the most efficient way to do this, but that probably won't matter unless your actual matrices are really large (in which case, optimizing the code would probably also involve considering what you are doing both upstream and downstream).

Last edited by Phil Schumm; 12 May 2014, 11:44.
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#4

13 May 2014, 01:08

If A and B are real, you can also use the Kronecker product (#). Mata: C = A#B' See help mata op_kronecker Christophe
Comment
Willem Vanlaer

Join Date: May 2014

Posts: 52
#5

15 May 2014, 09:26

Dear all,

Thank you for your reply's.
I have gone into more detail on each reply.
I will also say which is my concrete 'problem', but I will do that in the post below.

Originally posted by James Fiedler View Post

Are A and B the same type of matrix? In other words, are eltype(A) and eltype(B) equal? I ask because your A seems to be a string matrix. If the matrices are not the same type, then you generally cannot combine them in this way.

I will assume they are the same type. The following code, used interactively, should give you what you want.

Code:

A = (1, 2, 3, 4) B = (1, 2, 3) lenA = length(A) lenB = length(B) C = J(lenA * lenB, 2, .) for (i = 1; i <= lenA; i++) { for (j = 1; j <= lenB; j++) { C[(i-1)*lenB + j, .] = A[i], B[j] } }

You'll probably have to make changes to fit your needs. For example, if you have string matrices, use C = J(lenA * lenB, 2, ""). Also, this code assumes A and B will both be vectors (matrices with a single row or a single column).

Finally, a minor point: it's actually "Stata", not "STATA".

1. Both are indeed the same type of matrix. I should have stated that more clearly.
2. When I copy your code, it does exactly what I want. However, it does seem a bit complicated for what should be a relatively basic issue.
3. Thanks. I will refer to it as "Stata" from now on.

Originally posted by Phil Schumm

As James indicated, a matrix in Mata must contain either all numeric or all string values, so we'll assume that both A and B are of the same type. In addition, we'll assume (as in your example) that A is n x 1 and B is 1 x m. Given this, you can obtain C with
vec(J(cols(B),1,A')), J(rows(A),1,B')
I haven't thought about whether this is the most efficient way to do this, but that probably won't matter unless your actual matrices are really large (in which case, optimizing the code would probably also involve considering what you are doing both upstream and downstream).

This is an easy way, which helped me get what I want.

Originally posted by Christophe Kolodziejczyk

If A and B are real, you can also use the Kronecker product (#). Mata: C = A#B' See help mata op_kronecker Christophe

This does not seem to work..
Comment
Willem Vanlaer

Join Date: May 2014

Posts: 52
#6

15 May 2014, 09:27

The concrete issue I am facing is the following:

I have data on different variables in Excel.
1 tab holds data one 1 variable.
The way it is formatted is rows are countries and colomns are years.
E.g.
Countries/Years 1961 1962
Belgium 0,44 0,46
France 0,60 0,62

What I want in Stata is to have a dataset for which the:
- 1 row is 1 country for 1 year (e.g. France, 1962)
- a different colomn for a different variable (e.g. 1 colomn for public debt, 1 colomn for private debt).
E.g
Countries Years Public Debt
Belgium 1961 0,44
Belgium 1962 0,46
France 1961 0,60
France 1962 0,62

What I have done thus far is:
1. Used Phil Schumm's suggestion to the colomns for the countries and the years ok
2. To get the data for public debt like I want to I did the following:
2.a. Import Data from Excel (automatically, each variable is public debt data for 1 year, for all the countries)
2.b. putmata *
2.c. mata
2.d. PDMat = (A,B,C,D,etc.) [where A,B,C, etc. are the different variables]
2.e. PubDebt = vec(PDMat)

Now I have all the data in the format I want.

I have 2 remaining questions:

A. Is it really necessary to manually type in the names of all variables in part "2.d" of my code?
B. How do I turn the "PubDebt" vector into a variable (on which I can later perform some regression analyses)?
I want to perform the process I have mentioned above several times (i.e. as many times as I have tabs in Excel, which equals the amount of variables I have)

I have tried to be as clear as possible.
Your help is much appreciated.

Willem

Last edited by Willem Vanlaer; 15 May 2014, 09:55.
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#7

15 May 2014, 14:38

It sounds like you're trying to reinvent the reshape command using Mata—a good programming exercise, perhaps, but probably not how you want to spend your time. Suppose you have an Excel sheet called "Public Debt" that looks like this:

Code:

country year_1961 year_1962 ------- --------- --------- Belgium 0.44 0.46 France 0.6 0.62 etc.

and other similar sheets called "Private Debt", etc. Then the following code should give you what you want:

Code:

clear tempfile mydata save `"`mydata'"', emptyok loc sheets `""Public Debt" "Private Debt""' foreach sheet of local sheets { import excel mydata.xlsx, sheet("`sheet'") clear first gen sheet = subinstr("`sheet'"," ","",.) append using `"`mydata'"' save `"`mydata'"', replace } reshape long year_, i(country sheet) j(year) reshape wide year_, i(country year) j(sheet) string ren year_* *

There are a few housekeeping details WRT variable naming, etc. that you may need to play around with, but in general, a "double reshape" is the way to handle this problem.
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#8

16 May 2014, 03:03

My apology. I misunderstood what you intented to do. I understood that you wanted to make only one column.
Anyway you can still use the kronecker product to produce the first column as demonstrated below

Code:

. . mata: ------------------------------------------------- mata (type end to exit) --------- : : // There are 5 countries and 3 years of data : Country = (1,2,3,4,5)' : year = (1,2,3) : : id = Country#J(cols(year),1,1),J(rows(Country),1,year') : : id 1 2 +---------+ 1 | 1 1 | 2 | 1 2 | 3 | 1 3 | 4 | 2 1 | 5 | 2 2 | 6 | 2 3 | 7 | 3 1 | 8 | 3 2 | 9 | 3 3 | 10 | 4 1 | 11 | 4 2 | 12 | 4 3 | 13 | 5 1 | 14 | 5 2 | 15 | 5 3 | +---------+ : : end -----------------------------------------------------------------------------------

But as Phil points out given your problem at hand the use of the (Stata) reshape command is a better choice. If you want to do it the Mata way you can use colshape() or rowshape() function, but you have to be careful, especially if you have more than one variable you want to reshape. For example if you have two variables public and income in wide format and they are arranged in this manner

public_debt1961, public_debt1962,public_debt1963,income1961,income1 962,income1963.
...

then it will be easier to use rowshape or colshape if the variables are arranged in the following order

public_debt1961,income1961, public_debt1962,income1962,public_debt1963,income1 963.
...

In my opinion and given your problem reshape in Stata is easier to use.

Best
Christophe
Comment
Willem Vanlaer

Join Date: May 2014

Posts: 52
#9

16 May 2014, 10:17

Originally posted by Phil Schumm View Post

It sounds like you're trying to reinvent the reshape command using Mata—a good programming exercise, perhaps, but probably not how you want to spend your time. Suppose you have an Excel sheet called "Public Debt" that looks like this:

Code:

country year_1961 year_1962 ------- --------- --------- Belgium 0.44 0.46 France 0.6 0.62 etc.

and other similar sheets called "Private Debt", etc. Then the following code should give you what you want:

Code:

clear tempfile mydata save `"`mydata'"', emptyok loc sheets `""Public Debt" "Private Debt""' foreach sheet of local sheets { import excel mydata.xlsx, sheet("`sheet'") clear first gen sheet = subinstr("`sheet'"," ","",.) append using `"`mydata'"' save `"`mydata'"', replace } reshape long year_, i(country sheet) j(year) reshape wide year_, i(country year) j(sheet) string ren year_* *

There are a few housekeeping details WRT variable naming, etc. that you may need to play around with, but in general, a "double reshape" is the way to handle this problem.

Phil, thank you for clearing that up. That is indeed what I have been trying to do and it seems the reshape function is exactly what I need.
It renders perfectly the results I am looking for.

I do have one small request left. Would it perhaps be possible to explain "in human language" what happens in the different parts of your code?
Of course, I have googled all the different components and I do understand most of the individual functions. However, I have some trouble comprehending the overall logic.
It does not have to be very elaborate, but it would sure help me a lot.

Originally posted by Christophe Kolodziejczyk

y apology. I misunderstood what you intented to do. I understood that you wanted to make only one column.
Anyway you can still use the kronecker product to produce the first column as demonstrated below Code:

. . mata: ------------------------------------------------- mata (type end to exit) --------- : : // There are 5 countries and 3 years of data : Country = (1,2,3,4,5)' : year = (1,2,3) : : id = Country#J(cols(year),1,1),J(rows(Country),1,year') : : id 1 2 +---------+ 1 | 1 1 | 2 | 1 2 | 3 | 1 3 | 4 | 2 1 | 5 | 2 2 | 6 | 2 3 | 7 | 3 1 | 8 | 3 2 | 9 | 3 3 | 10 | 4 1 | 11 | 4 2 | 12 | 4 3 | 13 | 5 1 | 14 | 5 2 | 15 | 5 3 | +---------+ : : end -----------------------------------------------------------------------------------
But as Phil points out given your problem at hand the use of the (Stata) reshape command is a better choice. If you want to do it the Mata way you can use colshape() or rowshape() function, but you have to be careful, especially if you have more than one variable you want to reshape. For example if you have two variables public and income in wide format and they are arranged in this manner

public_debt1961, public_debt1962,public_debt1963,income1961,income1 962,income1963.
...

then it will be easier to use rowshape or colshape if the variables are arranged in the following order

public_debt1961,income1961, public_debt1962,income1962,public_debt1963,income1 963.
...

In my opinion and given your problem reshape in Stata is easier to use.

Best
Christophe

Christophe, thank you for providing me with your advice. At least I have learnt some Mata functions now. It will probably prove to be useful one day.
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#10

18 May 2014, 19:31

Originally posted by Willem Vanlaer View Post

I do have one small request left. Would it perhaps be possible to explain "in human language" what happens in the different parts of your code?

Code:

1 clear 2 tempfile mydata 3 save `"`mydata'"', emptyok 4 5 loc sheets `""Public Debt" "Private Debt""' 6 foreach sheet of local sheets { 7 import excel mydata.xlsx, sheet("`sheet'") clear first 8 gen sheet = subinstr("`sheet'"," ","",.) 9 append using `"`mydata'"' 10 save `"`mydata'"', replace 11 } 12 13 reshape long year_, i(country sheet) j(year) 14 reshape wide year_, i(country year) j(sheet) string 15 ren year_* *

Lines 1–3 create an empty dataset, saved under the temporary name mydata, which will be used to store the data cumulatively as we read them in from the Excel sheets. Line 5 simply specifies a list of sheet names (you would add the remaining sheet names to this), stored in the local macro named sheets. Lines 6–11 loop through each sheet, read the data in, add a variable containing the sheet name (without spaces), and add these to the data from the previous sheets. Note that at the end of each loop, the cumulative data are again saved. After the loop terminates, we have all of the data from all of the sheets, so we're ready to reshape. Line 13 rearranges the data in long form, with multiple observations for each country for a given variable—one per year. Line 14 then moves the sheets up into columns (wide form with respect to sheet), as you originally specified. Finally, line 15 merely strips off the year_ prefix on each new column, so that the columns are now named according to the original sheet names (again, minus the spaces). As I mentioned in my original post, you may want to handle the variable naming differently.
Comment

Announcement

Basic matrix question from unexperienced user

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment