
  • What to do if Stata 13.1 MP itself produced corrupted .dta files (no abject failure, but might cause freezes every so often)

    Hi, I am using an up-to-date Stata 13.1 MP on Windows Server 2012. Maybe you will have some advice for me, and maybe this can also serve as a cautionary tale. Stata froze frequently during operations among which I saw no obvious pattern. Resources are plentiful, and though StataCorp tech support reminded me that I can always suspect hardware errors or overheating, I found no fault in a fairly recent server. Today tech support suggested that I check my files with -dtaverify-, because the new format in version 13 could cause problems with third-party software or simply during -merge- etc. I was skeptical, because the freezes did not happen during data management and, more importantly, because no old or third-party software was involved: the data are generated from text files using -import delimited-, and all merges and appends follow from that. (StataCorp also suspected string issues specifically; as you can see, -dtaverify- complains about other things, though it does not finish its checks once it finds this issue.) I now have a Windows command prompt loop that checks all .dta files on my drives with -dtaverify-, and it started finding problems. So the question becomes: how serious can these be, and is there any way to fix them apart from waiting for a future Stata update that will be robust to them or fix its own faults on the fly? I would rather not spend days rerunning the data generation process merely hoping that this time the errors will not arise. The first error message I ran into is:
    12. reading and verifying value label definitions
    verifying construction
    (1 labels in file)
    SERIOUS ERROR: |......... .....| found where |</value_labels>| expected
    Stopping; this error prevents continuing to check for other errors.
    Serious errors detected
    r(459);
    The loop for future reference:
    Code:
    FOR /R S:\ %%f IN (*.dta) DO (
    "c:\Program Files (x86)\Stata13\stataMP-64.exe" /e do S:/dtaverify/dtaverify_all.do %%f
    )
    The simple do file this calls:
    Code:
    dtaverify `1'
    exit

  • #2
    It seems that you're reading in text files and doing a bunch of encode work with them before saving the dataset as a Stata .dta file. Some kind of corruption seems to be happening with the value label storage. Maybe it has to do with some kind of character that's part of the text that's being encoded, and ends up being interpreted by Stata on the re-read as some kind of escape character or internal file delimiter, and is seen to be out-of-place in the dataset file.

    I'm obviously just guessing here, too, but could it be a problem with the code page(s) that your system uses (or that were used to create the text files that are being imported into Stata and then encoded)? For example, which Windows code page (localization) is the server using, and is the machine that you use to access the server (desktop, workstation), and from which you give Stata its instructions, using the same or a compatible code page?



    • #3
      Thanks, I am not sure about the code page. The text files came from a data source, and I don't know in which environment they were made. Stata did not complain about the encoding, delimiters, or line endings when I ran -import delimited- on them. I suspect the data source used Swedish localizations of Windows or Unix, compiling the TSV files with SQL or SAS. My server runs Windows Server 2012, which I use with an English keyboard layout via RDP. Also, many, many files are OK, and so far I have only found problems with more convoluted constructions (appends, merges, transformations), so I am not sure the text files are the root cause. I left all dtaverify logs on Dropbox for anyone curious. Strange things happen in controls.log, panel.log, and analysis_individ.log, though even some of the raw files (ku25.log, e.g.) also produce errors, which may come from -dtaverify- itself rather than the data:
      fseek(): 1458111856 Stata returned
      > error
      read_map_verify(): - function returned error
      read_map(): - function returned error
      readfile(): - function returned error
      verify_dta_file(): - function returned error
      <istmt>: - function returned error
      r(1458111856);

      end of do-file
      r(1458111856);
      In any case, I wonder why Stata complains about this with dtaverify and not otherwise. It is still strange if this leads to weird freezes during calculations on data already in memory (nothing apparent to do with labels or strL variables) later on.



      • #4
        Laszlo,

        1) try use13 to read in the data from the problematic files you have. The code in use13 is a totally independent implementation based on the official dta specification 117. It may produce more informative error messages.

        2) There was at least one known bug in Stata 13, fixed very early; see the use13 page for more. Thus it is important to know exactly which version produced the file. Report the full version of the executable if possible.

        3) Your reference to dtaverify is not clear:
        Code:
        . which dtaverify
        command dtaverify not found as either built-in or ado-file
        r(111);
        4) I feel like something is missing in your email. You start by providing a cautionary tale, rather than describing the problem. Perhaps this is natural if you communicated already about it with StataCorp for a few days, but for a fresh reader, please give some basic introduction. For example:
        a) I have this text file (attach)
        b) I run this code (attach) in this Stata (describe)
        c) I obtain this file (attach) which can't be read in this Stata (describe) because of this error message (describe)
        5) Giving some idea about the files you are working with would be helpful. For example: the file is 2.3 GB in size and uses long strings, which are binary (non-ASCII).

        6) The Statalist forum supports attachments. I find them safer than Dropbox, so I didn't click on those links.

        7) You write that you have a script to validate the dta files you have. Run it on all of them and see what the erroneous files have in common and how they differ from the files deemed valid.

        Best, Sergiy Radyakin



        • #5
          Thanks, Sergiy.
          1) To be clear: you recommend -use13- even on Stata 13? I can try that, but to be clear again: -use- has no problems with any of my data, only -dtaverify- does. I still don't know whether any of this could indeed be the cause of the freezes during calculations.
          2) All the executables are up to date (executable May 6, 2014), and the data files were imported in April.
          3) dtaverify is built into my Stata 13.1, and was recommended by StataCorp, I did not know how long it has been around. -which dtaverify- reports version 1.0.0 was timestamped on February 27, 2014.
          4) I can be more careful with the write-up next time, thanks. The cautionary tale part is that I/you/we might have problems with our data that only -dtaverify- finds but -use- does not. I consider myself Stata savvy, and I still did not know about the command, nor the possibility that lurking data corruption could cause any problems during runs. At least the fact that -dtaverify- can find problems unnoticed otherwise is definitely true. I am not sharing with you a reproducible story of how to raise the error, and I cannot share the data.
          5) The logs have a bit more information; the files vary in size from 100 MB to 40 GB. They do seem to contain strLs, which I did not even expect, but I cannot trace any of the problems to them. I don't know if they are binary.
          6) I would attach three logs. controls.log shows "the break without anyone pressing break," panel.log shows an aborted verification because of label problems, and analysis_individ.log shows a major fseek error, but maybe that's a problem of the fairly recent -dtaverify- code, not the data. The Statalist forum software seems to be broken: it took me an hour to write these posts in four separate attempts, and now I cannot attach the files. Sorry, you'll need to trust me with the links as much as you would need to trust an email attachment or a file attachment here. At least this way you can see the file extensions: https://www.dropbox.com/s/3hnufq1o86...is_individ.log , https://www.dropbox.com/s/ul05hjyxm59clb4/controls.log , https://www.dropbox.com/s/z63bizcej66jz4f/panel.log
          7) I found no patterns across which files are valid and which not, esp. as the verifications don't complete fully but abort on the first error. You see the Windows batch script in my original post, at least, I hope that helps! So I am not just saying I had a script.
          Thanks again!
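
Since each -dtaverify- run stops at its first serious error, one way to look for patterns across many logs is to tally only that first error per log. The following is a minimal Python sketch, not something used in this thread. The function names (`classify`, `tally_logs`) and the assumption that all logs sit in one folder as `*.log` files are hypothetical; the line patterns are taken from the log excerpts quoted in these posts.

```python
import re
from collections import Counter
from pathlib import Path

# Pattern based on the error lines quoted in this thread, for example:
# SERIOUS ERROR: |<stata_dta><heade| found where |<variable_labels>| expected
ERROR_RE = re.compile(r"SERIOUS ERROR: \|(.*?)\| found where \|(.*?)\| expected")


def classify(log_text):
    """Return a short label for the first problem reported in one log."""
    match = ERROR_RE.search(log_text)
    if match:
        # group(2) is the tag dtaverify expected at that position
        return "expected " + match.group(2)
    if "fseek():" in log_text:
        return "fseek() error"
    return "no error line found"


def tally_logs(log_dir):
    """Count how many *.log files in log_dir fall under each label."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob("*.log")):
        counts[classify(path.read_text(errors="replace"))] += 1
    return counts
```

Calling `tally_logs` on the log folder returns a `Counter` keyed by label, which makes it easier to see whether the failures cluster on one section of the file (value labels, the map) or are spread out.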



          • #6
            Hello Laszlo,

            first, thank you for pointing to dtaverify. I was not aware of its existence and I am happy this command was provided by StataCorp.

            I can't say for sure that your files are free of problems, but I can say that I do see a problem with dtaverify, which is consistent with the symptoms you describe: Stata itself opens the data without complaints, but the validator claims that there is a problem with the file.

            If I am correct, my use13 should load your dataset without complaints. No, I do not recommend that Stata 13 users employ use13 for actual work instead of the standard use command (since you asked this question above), but it is good for validation purposes, since it is a completely independent implementation based on the specification only and a lot of testing. (It wouldn't read long strings, since it is intended for earlier versions of Stata, where these strings couldn't be accommodated.)

            This also means that you can convert your datasets with Stat/Transfer to make sure they are error-free.

            This leaves however open the question of freezes, which was your original concern, and which appears to me to be totally independent from the dtaverify story.

            Now, again if my guess is correct, the following is what you can try:
            1. Select one of the files that you are having a problem with. Backup this file. Seriously. Work on a copy only. Things may go wrong and I will not assume any responsibility.
            2. Install use13 as usual.
            3. Download and put somewhere along the Stata's search path the following file: http://radyakin.org/statalist/2014/use13_fix.mo
            4. Fix the file by running in Stata 13: mata use13_fix("C:\...Path...\YourFullFileNameHere.dta" )
            5. Try to open the file in Stata 13 with the standard use command
            6. Try to validate the file with dtaverify
            The file should be opened by Stata 13 without an error message, and it should pass the validation checks without error messages.
            Let me know whether this helps. This is experimental (because this is only my best guess so far since I didn't see the data) and so let's wait for your confirmation.

            Hope it works. Best, Sergiy Radyakin



            • #7
              Thanks, Sergiy, I will get back to you on this as soon as I can. Which problems does your code hope to fix? The issues with labels? Other potential problems, some of which are exemplified in the logs I shared with you?



              • #8
                Oops, sorry, here I am already; I could have waited a minute. As the attached log attests, your fix does not fix the problems -dtaverify- claims to find (a map error and a label error), though the file still works fine with -use-. I am still no wiser as to whether this is a false positive from a buggy -dtaverify- or a real problem that survives -use- but could cause trouble later on. What else can I do for you?



                • #9
                  Laszlo, could you please try the fix with your file analysis_individ ? And could you try just use13 with the data? Sergiy



                  • #10
                    Thanks, Sergiy, I am on it, though that's my biggest file, so it takes time. Also, I don't know what changed since yesterday, but now I am getting -fseek- errors before getting to the value label errors, even for the files which had other problems. The exact same files. This seems to be a really shaky part of -dtaverify-; if only they could fix it ASAP.



                    • #11
                      Hmmm....

                      Code:
                       . dtaverify auto.dta
                        (file "auto.dta" is .dta-format 117 from Stata 13)
                         1. reading and verifying header
                          release is 117
                          byteorder LSF
                          K (# of vars) is 12
                          N (# of obs) is 74
                          label length 20 |1978 Automobile Data|
                          date length 17 |13 Apr 2013 17:45|
                         2. reading and verifying map
                          map[ 1] =                    0
                          map[ 2] =                  173
                          map[ 3] =                  296
                          map[ 4] =                  353
                          map[ 5] =                  770
                          map[ 6] =                  817
                          map[ 7] =                 1424
                          map[ 8] =                    0
                          map[ 9] =                 2866
                          map[10] =                 3099
                          map[11] =                 6294
                          map[12] =                 6309
                          map[13] =                 6430
                          map[14] =                 6442
                          verifying map
                          1 2 3 4 5 6 7 8SERIOUS ERROR: |<stata_dta><heade| found where |<variable_labels>| expected
                       9 10 11 12 13 14
                         3. reading and verifying vartypes
                         4. reading and verifying varnames
                          verifying varnames unique
                         5. reading and verifying sort order
                          verifying contents
                         6. reading and verifying display formats
                          verifying formats
                          verifying formats correspond to variable type
                         7. reading and verifying value-label assignment
                          verifying value-label construction
                          verifying value label and corresponding variable types
                         8. reading and verifying variable-label assignment
                          verifying variable labels construction
                         9. reading and verifying characteristics
                          (2 characteristics in file)
                          verifying construction
                          verifying characteristics unique
                        10. reading and verifying data
                          (dtaverify_117 cannot verify that values are correct)
                          verifying construction
                          (0 strL variables)
                        11. reading and verifying strLs
                          verifying construction
                          (0 strLs expected)
                        12. reading and verifying value label definitions
                          verifying construction
                          (1 labels in file)
                        serious errors detected
                          See errors reported above.
                      r(459);
                      The file in question, auto.dta, is the ubiquitous exemplar file distributed by StataCorp with Stata. (I'm running Stata/MP 13.1 for Windows (64-bit x86-64), revision 06 May 2014.)

                      Something is clearly amiss.
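
The log above shows map[ 8] = 0, and the error is reported right after entry 8 is checked. A rough way to look at the same thing outside Stata is sketched below in Python. It assumes the format-117 layout in which the 14 map entries are 8-byte offsets pointing, in order, at the tags listed in `EXPECTED_TAGS`, with the last entry being the end of the file. The names `check_map` and `EXPECTED_TAGS` are hypothetical, the sketch reads the whole file into memory (so it is only meant for small files such as auto.dta), and it assumes the bytes `</header><map>` do not occur inside the data label.

```python
import struct

# Tags that map entries 1..13 of a format-117 file are expected to point to,
# in order; entry 14 is the end-of-file position rather than a tag.
EXPECTED_TAGS = [
    b"<stata_dta>", b"<map>", b"<variable_types>", b"<varnames>",
    b"<sortlist>", b"<formats>", b"<value_label_names>",
    b"<variable_labels>", b"<characteristics>", b"<data>", b"<strls>",
    b"<value_labels>", b"</stata_dta>",
]


def check_map(blob):
    """Return a list of (entry number, offset, ok) for the 14 map entries."""
    # Byte order is stored as LSF (little-endian) or MSF (big-endian).
    bo_pos = blob.index(b"<byteorder>") + len(b"<byteorder>")
    endian = "<" if blob[bo_pos:bo_pos + 3] == b"LSF" else ">"

    # The map follows the header directly: 14 unsigned 8-byte integers.
    map_pos = blob.index(b"</header><map>") + len(b"</header><map>")
    offsets = struct.unpack(endian + "14Q", blob[map_pos:map_pos + 14 * 8])

    results = []
    for number, offset in enumerate(offsets, start=1):
        if number <= len(EXPECTED_TAGS):
            tag = EXPECTED_TAGS[number - 1]
            ok = blob[offset:offset + len(tag)] == tag
        else:
            ok = offset == len(blob)  # entry 14 should equal the file size
        results.append((number, offset, ok))
    return results
```

With an entry-8 offset of 0, the 17 bytes compared against `<variable_labels>` are the first 17 bytes of the file, `<stata_dta><heade`, which is exactly the text shown in the SERIOUS ERROR line above.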



                      • #12
                        Sergiy, I get the same -fseek- error before and after your fix. But maybe that's most obviously a bug in -dtaverify-? Did you still hope to fix that? In any case, other errors in other files are not fixed either, at least according to -dtaverify-, though sometimes an -fseek- error now precedes the checking of previously faulty pieces (like value labels).



                        • #13
                          Great point, Clyde, I can reproduce the same result with the exact same version of Stata. Now I really want to know whether these are false alarms from -dtaverify- or something is still wrong with the files on 64-bit Windows.

                          Originally posted by Clyde Schechter



                          • #14
                            Laszlo, Clyde, this is exactly the problem that use13_fix() is dealing with:
                            Code:
                            tempfile t
                            copy "http://www.stata-press.com/data/r13/auto.dta" "`t'"
                            mata use13_fix("`t'")
                            dtaverify "`t'"             //passes after fix
                            
                            copy "http://www.stata-press.com/data/r13/auto.dta" "`t'", replace
                            // without fix
                            dtaverify "`t'"             // fails without fix
                            Sergiy
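
The source of use13_fix() is not shown in this thread, so the following is not a description of what it does. It is only a sketch of the arithmetic that tells where `<variable_labels>` would be expected in the auto.dta log quoted earlier, assuming the format-117 layout in which the value-label-names section holds one 33-byte name per variable and each variable label occupies 81 bytes. The function name `expected_variable_labels_offset` is hypothetical; the numbers 1424 (map[ 7]), 12 (K) and 2866 (map[ 9]) come from that log.

```python
def expected_variable_labels_offset(value_label_names_offset, k):
    """Offset at which <variable_labels> should start in a format-117 file."""
    open_tag = len(b"<value_label_names>")    # 19 bytes
    close_tag = len(b"</value_label_names>")  # 20 bytes
    # One 33-byte value-label name per variable sits between the two tags.
    return value_label_names_offset + open_tag + 33 * k + close_tag


# Values from the auto.dta log: map[7] = 1424, K = 12.
offset = expected_variable_labels_offset(1424, 12)

# Cross-check: the variable-labels section (81 bytes per variable plus its
# two tags) should end where <characteristics> begins, i.e. at map[9].
section_end = offset + len(b"<variable_labels>") + 81 * 12 + len(b"</variable_labels>")

print(offset, section_end)  # prints "1859 2866"
```

Under these assumptions the section after `<variable_labels>` ends at 2866, which agrees with map[ 9] in the log, so a nonzero entry 8 of 1859 would be consistent with the rest of the map.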



                            • #15
                              Thanks, Sergiy, but now I am getting confused. So it is not true that there is one correct file format that loads while all others fail? You can change something in the file that matters at least for -dtaverify-, and -use- does not complain about either version? That is troubling, because it is a sign that -dtaverify- might be right and is not only raising false positives. (To be clear: it might still be raising false positives, but it is definitely not the case that the "correct" file format raises an error and that "fixing" the error for -dtaverify- corrupts the file for all other intents and purposes. The file is just as valid for -use-, so maybe it is a good idea to fix those errors…)

