  • On the absence of mixed string-numerical matrices in Mata

    Dear all,

    One of the recurring difficulties I run into when transferring a dataset from Stata to Mata is that the latter does not handle mixed matrices (string and numerical). This is often a problem for me as many datasets that I use contain a string ID. The workaround I rely on in this situation is to (1) encode the ID using something like
    Code:
    egen ID_num = group(ID)
    (2) save the mapping to the original string IDs into a tempfile; (3) after the Mata analysis, recover those original IDs using a -merge- command.
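
    The three steps above can be sketched as follows. This is only an illustration, not necessarily the exact code used: it assumes the auto dataset shipped with Stata, with -make- renamed to stand in for the string ID, and the tempfile name idmap is made up.

    ```stata
    * Illustrative setup: the string variable make plays the role of ID.
    sysuse auto, clear
    rename make ID

    * (1) Numeric code for each distinct string ID.
    egen ID_num = group(ID)

    * (2) Save the mapping between ID_num and ID in a tempfile.
    preserve
    keep ID ID_num
    duplicates drop
    tempfile idmap
    save `idmap'
    restore
    drop ID

    * ... Mata analysis on ID_num and the numeric variables goes here ...

    * (3) Recover the original string IDs.
    merge m:1 ID_num using `idmap', nogenerate
    ```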

    This works, but I wonder if there are better solutions? For example, would it be possible to load the original string IDs into Mata as a separate string matrix, alongside the matrix holding the numerical data, and somehow keep track of the observation mapping across the two matrices? This would require, I assume, some sort of primary key holding observations together -- possible in Mata?

    Thanks for any suggestions,

    Charles
    Last edited by Charles Vellutini; 06 May 2014, 10:27.

  • #2
    Usually, I would use the -view- command to create 1 matrix for the string variables and 1 matrix for the numeric variables.

    Once a data "matrix" starts to have both string columns and numeric columns, we start to call it a dataset. It is then allowed to have a key (given by the -sortedby- attribute), and even variable labels and variable characteristics. Unfortunately, it is a limitation of Stata that, officially, it is not supposed to have multiple datasets/dataframes co-resident in memory at one time. In the real world, there are often multiple Stata datasets co-resident in memory, for instance at some stages of a -merge-, but this is hidden from the user. I have raised the issue of multiple co-resident datasets at the last few UK Stata User Meetings, and I am assured that StataCorp are working on this problem, but I don't think it will be in Stata Version 14.

    Best wishes

    Roger



    • #3
      Dear Roger,
      Many thanks for your suggestion and explanations.

      I hope I am not misunderstanding you, but I am not worried about Stata's limit of one dataset in memory at any given time. My concern is rather how to transfer that single dataset into the Mata space (complete with string columns) as one or several Mata matrices. Interestingly, you say you use one Mata matrix for the string variables and another Mata matrix for the numeric variables, correct? That sounds promising, but how do we then keep track of observations across the two matrices? Is it that, because you use the -view- command (or would it rather be -st_view-?), Mata somehow keeps track of the original observations in Stata?

      Sorry for sounding confused, but I probably am.
      Thanks
      Charles



      • #4
        Yes, you're right, the command is -st_view-, and the data structure is a view. And the indices of the view (or of 2 views) correspond to the observations in the dataset, so the same index specifies the same observation.
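
        A minimal sketch of the two-view idea, assuming the auto dataset shipped with Stata (the variable choice and the names X, S, and i are illustrative). Note that st_view() handles numeric variables only; the string counterpart is st_sview().

        ```stata
        sysuse auto, clear
        mata:
        // Numeric view onto two numeric variables, all observations.
        st_view(X = ., ., ("price", "mpg"))
        // String view: string variables require st_sview().
        st_sview(S = "", ., "make")
        // Row i of X and row i of S both refer to observation i in Stata.
        i = 3
        S[i, 1], strofreal(X[i, 1])
        end
        ```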

        Best wishes

        Roger



        • #5
          Thanks Roger, very interesting. I will try it.



          • #6
            Originally posted by Charles Vellutini:
            to transfer that single dataset into the Mata space (complete with string columns) as one or several Mata matrices.
            There are several ways you might proceed here (including Roger's suggestion to use two view matrices—one for the numeric variables and one for the string variables), depending on what you are trying to accomplish. What is your ultimate objective?



            • #7
              Dear Phil,

              My goal is to run some calculations in Mata while preserving string variables (for example, ID variables) as part of a coherent dataset across observations. Importantly (I think), these computations include re-sorting the data. After the computations, this Mata pseudo-dataset is to be transferred back to Stata for more computations/estimations.
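
              One possible way to re-sort inside Mata without losing the link to the string IDs is to sort through a permutation vector and apply it to both matrices. This is a sketch only, assuming the auto dataset and made-up names X, S, and p; it works on copies (st_data() and st_sdata()) rather than views, so the Stata dataset itself is not reordered.

              ```stata
              sysuse auto, clear
              mata:
              // Copies, not views: sorting here leaves the Stata data untouched.
              X = st_data(., ("price", "mpg"))
              S = st_sdata(., "make")
              // order() returns the permutation that sorts X by column 1 (price).
              p = order(X, 1)
              // Applying the same permutation to both matrices keeps rows aligned.
              X = X[p, .]
              S = S[p, .]
              // make and price of the cheapest car, still matched row for row
              S[1, 1], strofreal(X[1, 1])
              end
              ```

              Here p also records each row's original observation number, so it plays the role of the key holding the two matrices together.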
              Thanks
              Charles



              • #8
                Any particular reason for using -egen idvar = group(id)- vs. -encode idvar, gen(id)-? One of the nicer features of the latter is the creation of value labels that store the original values and can be accessed to recover the original string values directly. I regularly have to deal with combinations of string variables that identify either individuals or groups, and this always seems to be a quick and easy way of getting what is needed.
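
                A short sketch of the round trip through value labels, assuming the auto dataset (the names make_num and make_back are illustrative):

                ```stata
                sysuse auto, clear
                * encode creates a numeric variable whose value label holds the strings.
                encode make, generate(make_num)
                * decode rebuilds a string variable from that value label.
                decode make_num, generate(make_back)
                * The recovered strings match the originals.
                assert make_back == make
                ```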



                • #9
                  As egen's group() function has a label option, the difference is less than you imply. The advantages of egen's group() include its easy applicability to one or more variables, string or numeric, and its use for tidying up an irregular numeric sequence.
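
                  For illustration, a sketch using the auto dataset (the new variable names are made up):

                  ```stata
                  sysuse auto, clear
                  * One string variable; the label option attaches the original strings.
                  egen make_num = group(make), label
                  * Several variables at once, here two numeric ones.
                  egen grp = group(foreign rep78), label
                  ```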

                  More at


                  SJ-7-4 dm0034. Stata tip 52: Generating composite categorical variables. N. J. Cox.
                  Q4/07, SJ 7(4):582--583 (no commands).
                  Tip on how to generate categorical variables using tostring and egen, group().

                  http://www.stata-journal.com/sjpdf.h...iclenum=dm0034



                  • #10
                    After doing the work in Mata, are you adding new variables to your original Stata dataset, or are you creating an entirely new dataset?

                    If you're adding new variables to the original dataset, I don't think you have to keep track of string IDs. As long as you re-sort the Mata matrices back to their original order, you can just write them back to Stata using st_store().
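
                    A sketch of that approach, assuming the auto dataset. The extra index column, the result newvar, and the variable name price_k are all hypothetical, chosen only to show the mechanics.

                    ```stata
                    sysuse auto, clear
                    mata:
                    X = st_data(., "price")
                    n = rows(X)
                    // Carry the original observation number as an extra column.
                    X = X, (1::n)
                    // ... computations that re-sort X, here by price ...
                    X = sort(X, 1)
                    newvar = X[., 1] :/ 1000        // hypothetical result
                    // Restore the original order before writing back.
                    res = sort((X[., 2], newvar), 1)
                    idx = st_addvar("double", "price_k")
                    st_store(., idx, res[., 2])
                    end
                    ```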

                    If you're creating an entirely new dataset, and just want to keep the same string ID associations, I think your current method is easier than what you would have to do in Mata. You could also look at the -label save- command, which saves value labels in a do-file.
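
                    A minimal sketch of -label save-, assuming the auto dataset and a made-up file name make_labels.do. By default, encode names the value label after the new variable.

                    ```stata
                    sysuse auto, clear
                    encode make, generate(make_num)
                    * Write the value label make_num to a do-file of label define commands.
                    label save make_num using make_labels.do, replace
                    ```

                    Running make_labels.do later in another dataset redefines the same value label there.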

                    Mike



                    • #11
                      Originally posted by wbuchanan:
                      Any particular reason for using -egen idvar = group(id)- vs. -encode idvar, gen(id)-? One of the nicer features of the latter is the creation of value labels that store the original values and can be accessed to recover the original string values directly. I regularly have to deal with combinations of string variables that identify either individuals or groups, and this always seems to be a quick and easy way of getting what is needed.
                      There's a limit to the number of codings within one value label, see help limits. So if you have more than 65,536 different identifiers, then encode will not work.
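
                      One way to check how many distinct identifiers there are before choosing encode is sketched below, assuming the auto dataset with -make- standing in for the identifier (id_num is a made-up name).

                      ```stata
                      sysuse auto, clear
                      * group() numbers the distinct values 1, 2, ..., so the maximum is their count.
                      egen long id_num = group(make)
                      summarize id_num, meanonly
                      display r(max)
                      ```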



                      • #12
                        I never ran into the limit on encode, but could see how that could present an issue.
