  • If a datetime function that requires the storage type -double- is used without it, a warning should be printed to flag to users that they are using the wrong storage type and that this will result in a loss of precision. I envision that this feature could be controlled by a -set- option so that experienced programmers can turn it off, but it should default to on to catch easy mistakes made by novice programmers.


    • Modify didregress (xtdidregress) and hdidregress (xthdidregress) to take into account non-binary/categorical treatments


        • The do-file editor should match preserve and restore statements and allow Ctrl+5 jumping from top to bottom and bottom to top
        • Nested multiple preserve and restore should be allowed, something like preserve, level(1), preserve, level(2), restore, level(2), restore, level(1). (Although maybe I should bite the bullet and learn how to use frames)


        • (Although maybe I should bite the bullet and learn how to use frames)
          Yes! It's easy, it's fun, and it's very useful.

          Also, in current Stata, -preserve- and -restore- are often implemented through frames, so #408 could be seen as asking for a more awkward language for working with frames.


          • Originally posted by Bert Lloyd
            Nested multiple preserve and restore should be allowed,
            See [D] snapshot
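
            A rough sketch of how snapshots can stand in for nested preserve and restore, assuming the auto dataset shipped with Stata. The labels and macro names are only illustrative.

            ```stata
            sysuse auto, clear
            * "Level 1": remember the full dataset
            snapshot save, label("level 1")
            local level1 = r(snapshot)
            drop if foreign
            * "Level 2": remember the domestic-only subset
            snapshot save, label("level 2")
            local level2 = r(snapshot)
            keep in 1/5
            * Unwind in reverse order, as nested restores would
            snapshot restore `level2'
            snapshot restore `level1'
            * Caution: this erases every snapshot in memory, not just these two
            snapshot erase _all
            ```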


            • 1. Time series diagnostic plots and statistics like those in R, for example https://rdrr.io/r/stats/tsdiag.html
              2. Advanced meta-analysis, including network meta-analysis
              3. Network modelling
              4. Model diagnostics built into the user interface, not as packages.
              5. Data preservation methods like R for the same project.


              • #406 Leonardo Guizzetti

                If a datetime function that requires the storage type -double- is used without it, a warning should be printed to flag to users that they are using the wrong storage type and that this will result in a loss of precision. I envision that this feature could be controlled by a -set- option so that experienced programmers can turn it off, but it should default to on to catch easy mistakes made by novice programmers.
                As I understand it, this is really hard to ensure in Stata. Your syntax is parsed for being legal, not for being a good idea. Also, no function knows how its results are going to be used, which is implemented by different syntax.

                The issue here is when a user uses generate and forgets, or doesn't know, that double should follow for a date-time variable (* detail later).

                At the time the parser sees (e.g.) a call to clock() there is no cross-reference or checking of what the data are, what user defaults are in play, or what the implications will be. There is only one question: is this syntax legal?

                Consider

                Code:
                . display clock("2025 Feb 8 10:00", "YMD hm")
                2.055e+12
                
                . display %tc clock("2025 Feb 8 10:00", "YMD hm")
                08feb2025 10:00:00
                In the first case display sees a large integer (2 trillion plus some) and just uses a default display format. In the second case display sees a specific date display format. In both cases, the call to clock() is made in complete ignorance of what will happen to the result.

                Now consider (with the information that I use Stata's default default [!] of float as a default numeric storage type)

                Code:
                . clear
                
                . set obs 1
                Number of observations (_N) was 0, now 1.
                
                . gen bad_idea = clock("2025 Feb 8 10:00", "YMD hm")
                
                . gen double better_idea = clock("2025 Feb 8 10:00", "YMD hm")
                
                . format * %tc
                
                . l
                
                     +-----------------------------------------+
                     |           bad_idea          better_idea |
                     |-----------------------------------------|
                  1. | 08feb2025 10:00:48   08feb2025 10:00:00 |
                     +-----------------------------------------+
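
                The 48-second drift in bad_idea can be checked by hand. A float carries a 24-bit significand, so near 2.05e12 milliseconds adjacent floats are 2^17 = 131,072 ms (about 131 seconds) apart. A minimal sketch, assuming only the built-in float() function and a scalar name (t) chosen for illustration:

                ```stata
                * t holds the exact clock value; scalars are stored in double precision
                scalar t = clock("2025 Feb 8 10:00", "YMD hm")
                display %15.0f t
                * float() rounds to the nearest value a float variable could hold
                display %15.0f float(t)
                display %tc float(t)
                * Rounding error in milliseconds
                display %15.0f float(t) - t
                ```

                Worked through by hand, the rounded value comes out 48,896 ms above the exact one, consistent with the 10:00:48 shown in the listing.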

                What Leonardo wants, I think, is that when each reference to clock() is seen, Stata should be aware of, and warn against, the implication of any bad choice. That requires the parser to know much more than it does.

                FWIW, numdate from SSC requires you to specify what kind of date-time variable you want -- and given choice of a clock variable -- will only yield a double as storage type for a new variable. So, you can arrange that bad ideas can't be implemented. generate as a general work-horse can't be given the burden of judging what is a good idea. More at https://www.statalist.org/forums/for...date-variables

                In another example, taking logarithms of zero or negative or missing values will in Stata yield missing values. There is an entire spectrum of possibilities from that behaviour being desired as well as inevitable to the user being sadly confused or ignorant. (I've seen people take logarithms of temperatures in Celsius or Fahrenheit and blithely ignore messages about missing values for zero or negative temperatures.)

                Now no-one should want clock variables to be forced into floats if they hear an explanation of why that is wrong, but it's all part of the same territory.

                (* Promised detail: Anyone who set their default numeric storage type to double won't be bitten here, and many users do that and some of those users think other people should too.)
                Last edited by Nick Cox; 08 Feb 2025, 03:33.


                • Your syntax is parsed for being legal, not for being a good idea. Also, no function knows how its results are going to be used, which is implemented by different syntax.
                  There are three kinds of errors in programming: syntax errors, runtime errors, and silent errors. The first kind of error is usually caught prior to runtime because the author of the code gives syntax that is not legal in the language. This kind of error is often given in the text editor or IDE as text with a red underline (like you might see a misspelled word underlined in red in Microsoft Word). The second happens when syntactically legal code leads to some logical inconsistency at runtime, and usually manifests as an error message and stack trace (depending on the language). The third type of error happens when you provide syntactically legal code that is internally consistent, but the code produces a result other than the one the programmer expects.

                  Right now, forgetting to specify that a date should be a double is a kind of silent error, because the user might mistakenly expect to retain all of the precision, but the (nonetheless legal and internally consistent) code produces an unexpected result. One idea is to take this silent error and try to turn it into a runtime error. The problem (as Nick points out in #412) is that clock returns what it returns, generate takes whatever it's given and tries to put it in a variable with an appropriate type, and neither the function nor the command can (or should) look outside itself to check the context because that would break encapsulation. I just want to point out that, to the extent this is a parser issue, it's because the parser seems to execute code blind until it hits an error or finishes execution.

                  I think there is an alternative: Turn this problem into the first type of error by having the parser look over the code to make sure the syntax is correct before runtime. One would usually implement this kind of thing with regular expressions, which are good for finding patterns in structured text. Usually you want the pattern to be general (e.g., each line must start with a valid command name, else raise an error), but it can also be specific (e.g., generate followed by a call to clock() should be modified by the "double" keyword, else raise a warning). There may be legitimate engineering reasons to avoid handling very specific special cases like this; my point is just that it is possible in principle exactly because the parser can see the syntax in context before runtime.
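
                  A minimal sketch of the specific check, written with Stata's ustrregexm(). The pattern, the macro names (line, pattern), and the wording of the note are all hypothetical. Nothing like this is built into Stata, and a real checker would need to handle far more syntax (prefixes, comments, line continuation).

                  ```stata
                  * Hypothetical lint rule: -generate- (or an abbreviation) followed directly
                  * by a variable name, with no storage type, and clock( on the right-hand side
                  local line `"gen bad_idea = clock("2025 Feb 8 10:00", "YMD hm")"'
                  local pattern "^\s*g(e(n(e(r(a(te?)?)?)?)?)?)?\s+[A-Za-z_][A-Za-z0-9_]*\s*=.*clock\("
                  if ustrregexm(`"`line'"', "`pattern'") {
                      display as text "note: clock() result stored without an explicit double"
                  }
                  ```

                  A line such as gen double better_idea = clock(...) does not match, because the word after generate is then a storage type rather than a name followed by an equals sign. The same is true of any other explicit type, so this sketch would miss gen float.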

                  There is only one question: is this syntax legal?
                  I've made my case elsewhere that I think Stata would benefit from a more powerful system for detecting errors before runtime, so I won't repeat myself here. The crux of the matter is that Nick is exactly right that the parser only cares if the syntax is legal, but I contend that it is possible to have a more detailed accounting of what constitutes legal syntax, and it is even possible to have more severe syntax violations (that result in an error) and less severe syntax violations (that result in a warning).


                  • I agree with the multiple good points raised by Nick and Daniel in #412 and #413. I did know that the interpreter (as it exists) is unlikely to be able to accommodate what amounts to type-checking, at least not without a major redesign. An alternative major change would be broader in scope, which is to modify how Stata manages numeric storage types to operate more like -str#-. Specifically, if a larger type is needed, the variable is automatically recast; in this case, -double- would be the only sensible choice to avoid a loss in precision. This is not without its own problems because sometimes you want control over the storage precision and don't want it to change.

                    A less disruptive change would be to make a change in the do-file editor to add features found in other more full-blown IDEs. The do-file editor could use regular expressions (as suggested by Daniel) to suggest the storage type that might be wanted. In other words, the do-file editor gives you hints at types (or potentially other common programming errors).


                    • Originally posted by Leonardo Guizzetti
                      I did know that the interpreter (as it exists) is unlikely to be able to accommodate what amounts to type-checking. . .
                      Well, it sort of can.

                      . generate str bad_date = clock("2025-02-09 11:11:11", "YMD hms")
                      type mismatch
                      r(109);


                      It gets more complicated with something like
                      Code:
                      generate my_date = clock("2025-02-09 11:11:11", "YMD hms")
                      in that the parser would have to check c(type) to determine whether the user had set the default numeric storage type to double precision, but in principle it seems that the basics are already present.
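
                      For reference, the current default is available as c(type) and can be changed with set type. A minimal sketch of the kind of check described, with the wording of the note purely illustrative:

                      ```stata
                      * Show what an unqualified -generate- would use right now
                      display c(type)
                      if c(type) != "double" {
                          display as text "note: new variables default to " c(type) "; clock values need double"
                      }
                      * Opt in to double as the default for new numeric variables
                      set type double
                      ```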

                      In line with this thread as a wish list, I do agree that embellishments to the Do-file editor would be welcome.


                      • Originally posted by Leonardo Guizzetti
                        An alternative major change would [...] modify how Stata manages numeric storage types to operate more like -str#-. Specifically, if a larger type is needed, the variable is automatically recast,
                        That's actually not how str# works. Here is an example in Stata 18:
                        Code:
                        . clear
                        
                        . set obs 1
                        Number of observations (_N) was 0, now 1.
                        
                        . generate str1 foo = "bar"
                        
                        . list
                        
                             +-----+
                             | foo |
                             |-----|
                          1. |   b |
                             +-----+
                        There’s no warning or error here, but also no automatic expansion of the string length. The expression is simply truncated to fit within str1.

                        One could wish for a generic numeric type, similar to str (without the #); something like
                        Code:
                        generate numeric newvar =exp
                        Or one could think along the lines of c(obs_t) where Stata finds the appropriate storage type for storing an integer as large as _N. However, neither would help here because users still need to be explicit. Also from a more conceptual perspective, there’s a fundamental difference between storage size and numerical precision. Automatically choosing int instead of byte ensures integers larger than 100 are stored and not turned into missing values, whereas choosing double instead of float affects how precisely the same non-missing value is represented. A generic approach to avoiding loss of precision would necessarily (have to) affect all values without finite binary representation.
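
                        The distinction between storage size and precision can be seen in a small sketch (variable names are arbitrary):

                        ```stata
                        clear
                        set obs 1
                        * Size: 101 is outside the byte range, so it becomes missing; int holds it
                        generate byte b = 101
                        generate int i = 101
                        * Precision: both variables are non-missing, but they hold different
                        * approximations of 1.1
                        generate float f = 1.1
                        generate double d = 1.1
                        list
                        display f == d
                        ```

                        The final display shows 0, because the float and double approximations of 1.1 are different numbers.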


                        Originally posted by Joseph Coveney
                        Well, it sort of can.

                        . generate str bad_date = clock("2025-02-09 11:11:11", "YMD hms")
                        type mismatch
                        r(109);
                        Are you sure it’s the syntax parser that throws the type mismatch error? It seems more likely that Stata creates a string variable as instructed, and then throws a type mismatch error when it tries to fill the variable with numeric contents. Obviously, only someone from StataCorp. can clarify these details.

                        Edit/Added: Re-reading and taking seriously the help for generate,
                        If a type is specified, the result returned by = exp must be string or numeric according to whether type is string or numeric.
                        it might indeed be the syntax parser that throws the error before creating any variables.
                        Last edited by daniel klein; 09 Feb 2025, 13:08.


                        • Originally posted by daniel klein
                          That's actually not how str# works. Here is an example in Stata 18:
                          ...
                          There’s no warning or error here, but also no automatic expansion of the string length. The expression is simply truncated to fit within str1.
                          It kind of is how Stata treats strings. Following your example, Stata will expand the -str#- to accommodate, so at the very least, the behaviour is inconsistent. If I only wanted a fixed-size string, Stata respects it upon generation, but not upon modification.

                          Code:
                          . replace foo = "bar"
                          variable foo was str1 now str3
                          (1 real change made)
                          This is documented as much in [U] 12.4.7 str1-str2045 and strL (emphasis mine)

                          Stata commands automatically promote str# storage types when necessary:

                          . replace make = "Mercedes Benz Gullwing" in 1
                          variable make was str17 now str22
                          (1 real change made)

                          In fact, if the string to be stored is longer than 2,045 bytes, generate and replace will even promote to strL. We discuss strLs in the next section.
                          I think this side discussion may have even uncovered a bug. It seems the storage type is respected if one is explicitly given to -generate-, which contradicts the documentation if it is taken literally.

                          Code:
                          . clear
                          . set obs 1
                          
                          . gen str1 test1 = "a"*2048
                          . gen test2 = "a"*2048
                          
                          . desc
                          
                          Contains data
                           Observations:             1                  
                              Variables:             2                  
                          -------------------------------------------------------------
                          Variable      Storage   Display    Value
                              name         type    format    label      Variable label
                          -------------------------------------------------------------
                          test1           str1    %9s                   
                          test2           strL    %9s                   
                          -------------------------------------------------------------


                          • I know this isn't the place for longer discussions so this will be my last post on this topic here.

                            Originally posted by Leonardo Guizzetti

                            It kind of is how Stata treats strings.
                            No, it isn't. Stata does not treat string and numeric variables differently regarding the promotion of storage types. Watch:
                            Code:
                            . clear
                            
                            . set obs 1
                            Number of observations (_N) was 0, now 1.
                            
                            . generate byte foo = 42
                            
                            . replace foo = 4273
                            variable foo was byte now int
                            (1 real change made)
                            To be clear, replace generally promotes variables as needed (unless instructed otherwise; see option nopromote). Conversely, generate always uses the storage type based on explicit specification or its default behavior. The defaults are float (at least that's the factory setting) for numeric expressions and appear to be str (without # or L) for string expressions. Admittedly, the default for strings is not clearly documented as such.

                            Now there is one exception to the behavior of replace: it will not change a variable's storage type from float to double -- ever.
                            Code:
                            . clear
                            
                            . set obs 1
                            Number of observations (_N) was 0, now 1.
                            
                            . generate float myfloat = 1.70141173319*10^38
                            
                            . replace myfloat = 1.8*10^38
                            (1 real change made, 1 to missing)
                            You could argue that's a bug because, in this specific example, a double would be needed to avoid the missing value. However, most users would typically be puzzled to find something like
                            Code:
                            . tabulate foo
                            
                                    foo |      Freq.     Percent        Cum.
                            ------------+-----------------------------------
                                    1.1 |          1       50.00       50.00
                                    1.1 |          1       50.00      100.00
                            ------------+-----------------------------------
                                  Total |          2      100.00
                            as a possible and likely side effect of changing the storage type from float to double.
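
                            One way a table like this can arise (a sketch, not necessarily how the output above was produced) is to recast a float variable to double and then add a value computed in double precision:

                            ```stata
                            clear
                            set obs 2
                            generate float foo = 1.1
                            * The stored float approximation of 1.1 is kept as is
                            recast double foo
                            * Observation 2 now holds the closer double approximation of 1.1
                            replace foo = 1.1 in 2
                            tabulate foo
                            ```

                            The two observations now hold different values, but both display as 1.1 under the default format.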

                            This all goes to my main point: There's a fundamental difference between adjusting the storage type to accommodate larger values (or longer strings) and changing the storage type to accommodate higher precision for the same value.
                            Last edited by daniel klein; 10 Feb 2025, 01:06.


                            • Inspired by this recent thread: Is there any practical benefit to scalars and variables not having separate namespaces? If there is not, could scalars and variables have separate namespaces, please?
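
                              The clash in question, sketched with the auto dataset shipped with Stata: when a scalar and a variable share a name, the variable wins in expressions unless scalar() is used.

                              ```stata
                              sysuse auto, clear
                              scalar price = 1
                              * Refers to the variable, i.e. price in the first observation
                              display price
                              * Refers to the scalar
                              display scalar(price)
                              ```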


                              • Incorporate shufflevar into Stata.

                                Have a command that allows for random sampling, with and without replacement and in blocks, of a single variable.
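
                                For what it's worth, both kinds of resampling of a single variable can be roughly sketched with existing functions. This assumes the auto dataset and made-up variable names, and it does not attempt the block option:

                                ```stata
                                sysuse auto, clear
                                set seed 2025
                                * With replacement: each observation draws price from a random observation
                                generate price_wr = price[runiformint(1, _N)]
                                * Without replacement: build a random permutation of observation numbers
                                generate long id = _n
                                generate double u = runiform()
                                sort u
                                generate long src = _n
                                sort id
                                * Data are back in their original order; src says where to take price from
                                generate price_shuffled = price[src]
                                drop u src id
                                ```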
