On the absence of mixed string-numerical matrices in Mata

Charles Vellutini

Join Date: Apr 2014

Posts: 9
#1

06 May 2014, 10:24

Dear all,

One of the recurring difficulties I run into when transferring a dataset from Stata to Mata is that the latter does not handle mixed matrices (string and numerical). This is often a problem for me as many datasets that I use contain a string ID. The workaround I rely on in this situation is to (1) encode the ID using something like

Code:

egen ID_num = group(ID)

(2) save the mapping to the original string IDs into a tempfile; (3) after the Mata analysis, recover those original IDs using a -merge- command.
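For concreteness, the three steps might look roughly like the sketch below. ID and ID_num come from the post; the tempfile name idmap is made up for illustration, and the Mata step is only indicated by a comment.

```stata
* Sketch only: assumes a dataset with a string variable ID is already in memory
egen ID_num = group(ID)

* (2) keep one row per ID with its numeric code and save it to a tempfile
preserve
keep ID ID_num
duplicates drop
tempfile idmap
save `idmap'
restore

* ... drop ID and run the Mata analysis on ID_num and the numeric variables ...

* (3) recover the original string IDs from the saved mapping
merge m:1 ID_num using `idmap', nogenerate
```

The tempfile and its local macro only live for the duration of the do-file or program that created them, so all three steps have to run in the same session.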

This works, but I wonder if there are better solutions? For example, would it be possible to load the original string IDs into Mata as a separate string matrix, alongside the matrix holding the numerical data, and somehow keep track of the observation mapping across the two matrices? This would require, I assume, some sort of primary key holding observations together -- possible in Mata?

Thanks for any suggestions,

Charles

Last edited by Charles Vellutini; 06 May 2014, 10:27.
Roger Newson

Join Date: Apr 2014

Posts: 316
#2

07 May 2014, 04:23

Usually, I would use the -view- command to create 1 matrix for the string variables and 1 matrix for the numeric variables.

Once a data "matrix" starts to have both string columns and numeric columns, we start to call it a dataset. And it is then allowed to have a key (given by the -sortedby- attribute), and even variable labels and variable characteristics. Unfortunately, it is a limitation of Stata that, officially, it is not supposed to have multiple datasets/dataframes co-resident in the memory at one time. In the real world, there are often multiple Stata datasets co-resident in the memory, for instance at some stages of a -merge-, but this is hidden from the user. I have raised the issue of multiple co-resident datasets in the memory at the last few UK Stata User Meetings, and I am assured that StataCorp are working on this problem, but I don't think it will be in Stata Version 14.

Best wishes

Roger
Charles Vellutini

Join Date: Apr 2014

Posts: 9
#3

07 May 2014, 04:57

Dear Roger,
Many thanks for your suggestion and explanations.

I hope I am not misunderstanding you, but I am not worried about Stata's limitation of holding only one dataset in memory at any given time. My concern is rather how to transfer that single dataset into the Mata space (complete with string columns) as one or several Mata matrices. Interestingly, you say you use one Mata matrix for the string variables and another Mata matrix for the numeric variables, correct? That sounds promising, but how do we then keep track of observations across the two matrices? Is it that, because you use the -view- command (or rather would it be -st_view-?), Mata somehow keeps track of the original observations in Stata?

Sorry for sounding confused, but I probably am.
Thanks
Charles
Roger Newson

Join Date: Apr 2014

Posts: 316
#4

08 May 2014, 03:13

Yes, you're right, the command is -st_view-, and the data structure is a view. And the indices of the view (or of 2 views) correspond to the observations in the dataset, so the same index specifies the same observation.
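A minimal sketch of the two-view idea, assuming the auto dataset shipped with Stata (make is a string variable, price and mpg are numeric). Note that views onto string variables are created with st_sview() rather than st_view(); the names X and S are arbitrary.

```stata
* Sketch only: uses the auto dataset as a stand-in for the real data
sysuse auto, clear
mata:
    st_view(X = ., ., ("price", "mpg"))   // numeric view onto price and mpg
    st_sview(S = "", ., "make")           // string view onto make
    // Row 5 of both views refers to observation 5 of the Stata dataset
    S[5], strofreal(X[5, .])
end
```

Because both objects are views onto the same dataset, no separate key is needed as long as neither is reordered independently of the other.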

Best wishes

Roger
Charles Vellutini

Join Date: Apr 2014

Posts: 9
#5

09 May 2014, 02:41

Thanks Roger, very interesting. I will try it.
Phil Schumm

Join Date: Mar 2014

Posts: 169
#6

09 May 2014, 08:51

Originally posted by Charles Vellutini:

to transfer that single dataset into the Mata space (complete with string columns) as one or several Mata matrices.

There are several ways you might proceed here (including Roger's suggestion to use two view matrices—one for the numeric variables and one for the string variables), depending on what you are trying to accomplish. What is your ultimate objective?
Charles Vellutini

Join Date: Apr 2014

Posts: 9
#7

09 May 2014, 09:28

Dear Phil,

My goal is to run some calculations in Mata while preserving string variables (for example: ID variables) as part of a coherent dataset across observations. Importantly (I think), these computations include resorting the data. After the computations, this Mata pseudo-dataset is to be transferred back to Stata for more computations/estimations.
Thanks
Charles
wbuchanan

Join Date: Mar 2014

Posts: 1361
#8

12 Aug 2014, 04:57

Any particular reason for using egen idvar = group(id) rather than encode idvar, gen(id)? One of the nicer features of the latter is the creation of value labels that store the original values and can be used to recover the original string values directly. I regularly have to deal with combinations of string variables that identify either individuals or groups, and this always seems to be a quick and easy way of getting what is needed.
Nick Cox

Join Date: Mar 2014

Posts: 34771
#9

12 Aug 2014, 06:22

As egen's group() function has a label option, the difference is less than you imply. The advantages of egen's group() include its easy applicability to one or more variables, string or numeric, and its use for tidying up an irregular numeric sequence.

More at

SJ-7-4 dm0034 . . . Stata tip 52: Generating composite categorical variables
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q4/07 SJ 7(4):582--583 (no commands)
tip on how to generate categorical variables using
tostring and egen, group()

http://www.stata-journal.com/sjpdf.h...iclenum=dm0034
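To make the comparison in the last two posts concrete, here is a small sketch on toy data. The variable names id, id_enc, id_grp and id_back are invented for illustration.

```stata
* Sketch only: toy data with made-up variable names
clear
input str3 id
"b07"
"a01"
"b07"
"c12"
end
encode id, generate(id_enc)        // numeric codes plus a value label holding the strings
egen id_grp = group(id), label     // same idea via egen, using its label option
decode id_enc, generate(id_back)   // recover the original strings from the value label
assert id_back == id
```

Either route leaves the original strings attached to the numeric codes as value labels, so they can be recovered without a separate mapping file.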
Mike Barker

Join Date: Apr 2014

Posts: 37
#10

12 Aug 2014, 08:25

After doing the work in Mata, are you adding new variables to your original Stata dataset, or are you creating an entirely new dataset?

If you're adding new variables to the original dataset, I don't think you have to keep track of string ids. As long as you resort the Mata matrices back to their original order, you can just resave them back to Stata using st_store().
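A rough sketch of that approach, again assuming the auto dataset: sort inside Mata with a permutation vector, apply the same permutation to the string column so rows stay aligned, then undo it before storing. The new variable name price_per_mpg and the computation itself are placeholders.

```stata
* Sketch only: the computation and the new variable name are placeholders
sysuse auto, clear
mata:
    X = st_data(., ("price", "mpg"))   // numeric copy of the data
    S = st_sdata(., "make")            // string copy, same row order
    p = order(X, 1)                    // permutation that sorts rows by price
    Xs = X[p, .]
    Ss = S[p]                          // same permutation keeps strings aligned
    z = Xs[., 1] :/ Xs[., 2]           // some computation on the sorted data
    // Undo the sort so that row i again matches Stata observation i
    z = z[invorder(p)]
    idx = st_addvar("double", "price_per_mpg")
    st_store(., idx, z)
end
```

Keeping the permutation vector p around is what plays the role of the "primary key" asked about in the first post: any matrix reordered with p can be put back with invorder(p).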

If you're creating an entirely new dataset, and just want to keep the same string ID associations, I think your current method is easier than what you would have to do in Mata. You could also look at the -label save- command, to save value labels in a do-file.

Mike
Robert Picard

Join Date: Mar 2014

Posts: 1536
#11

12 Aug 2014, 16:29

Originally posted by wbuchanan:

Any particular reason for using egen idvar = group(id) rather than encode idvar, gen(id)? One of the nicer features of the latter is the creation of value labels that store the original values and can be used to recover the original string values directly. I regularly have to deal with combinations of string variables that identify either individuals or groups, and this always seems to be a quick and easy way of getting what is needed.

There's a limit to the number of codings within one value label; see help limits. So if you have more than 65,536 different identifiers, encode will not work.
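One way to check in advance whether this limit matters is to count the distinct identifiers before calling encode. The sketch below uses make from the auto dataset as a stand-in for the ID variable; the helper variable first and the local n_ids are made-up names.

```stata
* Sketch only: make stands in for the string ID; 65,536 is the limit quoted above
sysuse auto, clear
bysort make: generate byte first = (_n == 1)   // flag one row per distinct value
quietly count if first
local n_ids = r(N)
display "distinct identifiers: `n_ids'"
if `n_ids' > 65536 {
    display "too many values for one value label"
}
drop first
```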
wbuchanan

Join Date: Mar 2014

Posts: 1361
#12

13 Aug 2014, 04:51

I never ran into the limit on encode, but could see how that could present an issue.