  • [email protected] what you’re asking should be possible using Mata and the information in -help dta-. What you’re asking for is something that parses everything up to the <data> tag in the dta file spec (https://www.stata.com/help.cgi?dta#map).

    • Make debugging easier with one or more of:

      1) Provide the line number or context for error messages triggered inside a -foreach- loop without requiring the user to rerun the program with -set trace on- and -set tracedepth 1-.

      2) Mark error messages with a distinctive tag so that they can be searched for in an editor.

      3) Provide a way to suppress purely informative messages such as "NN real changes made" or "NN observations deleted" without suppressing actual error messages.

      4) Add the variable name to the informative messages mentioned in (3) so that they can be related to the particular variable when executed in a -foreach- loop.

      5) Add more context to error messages. If an option is improper, what is the string that isn't a proper option? If there is a type mismatch, what are the types that don't match? If there is a syntax error, where did the parsing stop? And so on. Little bits of information can save a lot of time. How long does it take you to understand the following code fragment and error message?

      Code:
         
      . list
      
           +-------+
           | x   y |
           |-------|
        1. | 3   2 |
           +-------+
      
      . gen z=x*y
      type mismatch
      r(109);
      How long for a beginning Stata user?

      6) Provide a -trace- setting that doesn't trace Stata-provided code, only code in the running program and current directory.
      Last edited by Daniel Feenberg; 03 Sep 2022, 09:25.

      • Originally posted by [email protected]
        3) Provide a way to suppress purely informative messages such as "NN real changes made" or "NN observations deleted" without suppressing actual error messages.
        See

        Code:
        help quietly
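
        A minimal sketch of how -quietly- can cover points (3) and (4) with current Stata. The auto dataset, the variable list, and the cutoff of 3000 are illustrative assumptions, not part of the original request. Error messages are still shown under -quietly-; only ordinary output is suppressed.

        ```stata
        sysuse auto, clear

        foreach v of varlist price mpg weight {
            // count the affected observations first, without any output
            quietly count if `v' > 3000 & !missing(`v')
            local n = r(N)

            // suppress the "(NN real changes made)" note; errors would still appear
            quietly replace `v' = 3000 if `v' > 3000 & !missing(`v')

            // report the change together with the variable name
            display as text "`v': " as result `n' as text " real changes made"
        }
        ```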

        Originally posted by [email protected]
        5) Add more context to error message.
        How long does it take you to understand the following code fragment and error message:

        Code:
        . list
        
        +-------+
        | x y |
        |-------|
        1. | 3 2 |
        +-------+
        
        . gen z=x*y
        type mismatch
        r(109);
        How long for a beginning Stata user?
        A click on r(109) yields:

        In an expression, you attempted to combine a string and numeric
        subexpression in a logically impossible way. For instance, you
        attempted to subtract a string from a number or you attempted
        to take the substring of a number.
        Seems like a pretty accurate description of the problem to me.
        Last edited by daniel klein; 03 Sep 2022, 14:36.
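
        For readers who want to reproduce the puzzle, here is a small sketch. It assumes that x in the listing was stored as a string, which is one plausible cause of r(109) and is not stated in the original post. -list- prints a string "3" and a numeric 3 the same way, so the storage types have to be checked with -describe-.

        ```stata
        clear
        input str1 x y
        "3" 2
        end

        list                               // x looks numeric in the listing

        capture noisily generate z = x*y   // shows: type mismatch
        display "return code: " _rc        // 109

        describe x y                       // reveals that x is str1

        destring x, replace                // convert x to numeric
        generate z = x*y                   // now succeeds
        list
        ```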

        • No doubt Stata works better in interactive mode, but I don't.

          • Introduce new data structures (such as lists and tuples). I know, there's Python for that, but still...

            • Originally posted by Federico Bindi
              Introduce new data structures (such as lists and tuples). I know, there's Python for that, but still...
              Do you want those in Mata? I cannot imagine why you would want those in Stata (but that could also have something to do with my imagination...). If you present a use case for those data structures, then your request can become more convincing. Everybody can say they want something, but there is only a limited amount of resources.

              Right now it sounds like a common problem that many people who migrate from language A to language B have: they miss some aspect of language A, but don't know yet that there is some other way that language B does things that makes that aspect unnecessary and/or even inefficient. This is not criticism of you; it is a common process we all go through at some point.
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------

              • re: #470
                Code:
                ssc describe tuples
                While not a new data structure it may still prove useful.

                • Originally posted by Federico Bindi
                  Introduce new data structures (such as lists and tuples). I know, there's Python for that, but still...
                  In many ways, Stata's macros replicate (or can be made to replicate) the functionality of lists -- see here for more: pmacrolists.pdf (stata.com).
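
                  As a brief illustration of the macro list functions documented there (the list contents below are made-up examples):

                  ```stata
                  local a "price mpg weight"
                  local b "mpg length"

                  local both   : list a & b             // intersection
                  local either : list a | b             // union
                  local onlya  : list a - b             // elements of a not in b
                  local n      : list sizeof either     // number of elements
                  local pos    : list posof "weight" in a

                  display "in both:    `both'"
                  display "in either:  `either'"
                  display "only in a:  `onlya'"
                  display "size of union: `n'"
                  display "position of weight in a: `pos'"

                  // a macro list can be iterated like a list in other languages
                  foreach item of local either {
                      display "item: `item'"
                  }
                  ```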

                  • Originally posted by Tom Dietz
                    This isn't a part of the code but a thought on policies. I am moving to emeritus status and so my university will no longer pay for Stata. With the end of perpetual licenses I will have to pay out of pocket for Stata in about two years. So after 30 years of using and teaching Stata almost exclusively I will reluctantly switch to R as I plan to continue doing research. I'm wondering if Stata could have a pricing policy of the sort used by many scientific societies, with a special rate for retirees.
                    Hi Tom! I wanted to reassure you that perpetual licenses are still available. You can upgrade your existing perpetual license online at https://www.stata.com/order. If you would like to purchase a new perpetual license, or if you have any questions, you can contact us at [email protected]. We are happy to go over licensing options, including licensing options for retirees, with you.

                    • [email protected] correction, not possible to do in Mata, currently. That said, if StataCorp is able to allow the buffer functions in Mata to read/write 8-byte unsigned/signed integers it wouldn't be terribly difficult to do what you're asking in Mata and only read in the metadata while skipping the data and strls entirely.

                      • Originally posted by wbuchanan
                        [email protected] correction, not possible to do in Mata, currently.
                        Probably, I misunderstand the request, or there is a bug in one of my routines. I have a clumsy way of reading the variable names from the dta file in my usesome command. I was under the impression that you could, in principle, read the other metadata fields as well.

                        • daniel klein I just looked at the code you referenced and tried an experiment with it to see if it would recover the correct information, but it doesn't seem like it is parsing 8-byte unsigned integers correctly (assuming I was interpreting things correctly). My approach was going to be to read in enough bytes initially to get the <map> element, and then use those 8-byte unsigned integers to quickly locate all of the other elements in the file header. When I used the same method you are using in your hexread function, the result was definitely not correct:

                          Code:
                          . mata
                          ------------------------------------------------- mata (type end to exit) ------------------------
                          : fh = fopen("Sample.dta", "r")
                          
                          : test = fread(fh, 612)
                          
                          : mapelem = ustrregexm(test, ".*(<map>.*</map>).*")
                          
                          : mapstr= ustrregexs(1)
                          
                          : map = ustrregexra(mapstr, "</?map>", "")
                          
                          : x = ascii(substr(map, 9, 16))
                          
                          : y = inbase(16, x)
                          
                          : for(i = 2; i <= cols(y); ++ i) y[1] = y[1] + substr("0" + y[1], -2)
                          
                          : frombase(16, y[1])
                            3.18931e+38 <- This is a really small file, so it seems unlikely that the second value in the map element would be at this byte position
                          
                          : x = ascii(strreverse(substr(map, 9, 16)))
                          
                          : y = inbase(16, x)
                          
                          : for(i = 2; i <= cols(y); ++ i) y[1] = y[1] + substr("0" + y[1], -2)
                          
                          : frombase(16, y[1])
                            0 <- The second value in <map> should be the byte value where the <map> element is located
                          Regardless, if the point of the buffer functions in Mata is to allow the reading/writing of files, it seems like it would be reasonable that they be able to read/write the same types of values used in a .dta file.
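
                          Until 8-byte integer formats are available, one possible workaround is to read each 8-byte unsigned integer as two 4-byte unsigned integers with the existing "%4bu" format and combine them. The sketch below is an untested assumption rather than a confirmed solution: read_u64() is a hypothetical helper name, the byte order must be supplied by the caller (taken from the file's <byteorder> element), and the result is exact only for values below 2^53 because Mata stores numbers as doubles.

                          ```stata
                          mata:
                          // read_u64() is a hypothetical helper, not a built-in Mata function.
                          // fh:  file handle positioned at the first byte of the 8-byte integer
                          // lsf: 1 if the file is least-significant-byte-first (LSF), 0 if MSF
                          real scalar read_u64(real scalar fh, real scalar lsf)
                          {
                              real colvector C
                              real rowvector w

                              C = bufio()
                              bufbyteorder(C, lsf ? 2 : 1)     // 1 = HILO, 2 = LOHI
                              w = fbufget(C, fh, "%4bu", 2)    // two 4-byte unsigned integers

                              // combine the low and high words; exact only below 2^53
                              return(lsf ? w[1] + w[2]*2^32 : w[2] + w[1]*2^32)
                          }
                          end
                          ```

                          To use it, the file would first be positioned with fseek(fh, offset, -1), where offset is the 0-based byte offset of the integer of interest, for example one of the entries that follow the <map> tag.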

                          • I'm unable to edit my previous comment, but there seem to be some inconsistencies in what -ustrregexm/s()- returns when using a different .dta file. I will continue trying to see if I can figure out what is wrong with the way I applied the logic of daniel klein's function.

                            • ustrregexm(), ustrregexs(), and ustrregexra() are not going to work for what you intend here. All three functions are implemented using the ICU library, which works with well-formed text strings in UTF-16 encoding, not binary data. Hence the functions will convert Stata/Mata strings from UTF-8 to UTF-16 encoding before handing them to ICU, and the results will be converted back to UTF-8 encoding. Any invalid Unicode sequence will be changed.


                              Code:
                              mapelem = ustrregexm(test, ".*(<map>.*</map>).*")
                              will not work consistently, depending on whether there is a line feed in test. The following modified regular expression should handle line feeds:

                              Code:
                              mapelem = ustrregexm(test, "(?:.|\n)*(<map>(?:.|\n)*</map>)(?:.|\n)*")
                              After that,

                              Code:
                              mapstr= ustrregexs(1)
                              will convert the byte sequence between <map> and </map> to UTF-16 and then convert it back to UTF-8. Any invalid Unicode sequences will be changed during the conversion, so you will very likely get back something different. Here I believe strpos() and substr() should work:

                              Code:
                                
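                              mata:
                              // read the first 612 bytes of the file as raw bytes
                              fh = fopen("auto.dta", "r")
                              test = fread(fh, 612)
                              fclose(fh)
                              // test = "ab<map>cd</map>ef"
                              // byte positions of the opening and closing tags
                              p1 = strpos(test, "<map>")
                              p1
                              p2 = strpos(test, "</map>")
                              // length and start of the content between the tags
                              len = p2 - p1 - 5
                              len
                              start = p1 + 5
                              b = substr(test, start, len)
                              b
                              end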
                              Last edited by Hua Peng (StataCorp); 08 Sep 2022, 18:12.

                              • Hua Peng (StataCorp)
                                Thanks again as always for the insight. All that said, any chance for the buffer functions in Mata to make it possible to parse 8-byte unsigned integers?
