  • -use13- and large files

    Hello,
    I have been trying to open an ACS dataset given to me by a Stata 13 user, but I am a Stata 11 IC user. I have used -use13- before, but only on a much smaller dataset, and I am concerned that this one is simply too big to be converted: the file size is 4.40 GB. Is there a file size limit for -use13-? I could not find one in the documentation, although an older post on this forum seemed to hint at a 2 GB limit. This problem is not critical, as I will likely just ask the person who gave me the data to convert it, but I thought it would be useful to have it on the forum.
    Code:
    . use13 "C:\Users\Sean\Documents\2014 - ACS Dataset V1\2014 - ACS Dataset V1.dta"
     
    Converting file C:\Users\Sean\Documents\2014 - ACS Dataset V1\2014 - ACS Dataset V1.dta
    unexpected text found: ØC ,instead of <value_labels>
           _use13_fskipstr():   610  file format error
            _use13_convert():     -  function returned error
                     use13():     -  function returned error
                     <istmt>:     -  function returned error
    r(610);
    
    . which use13
    c:\ado\plus\u\use13.ado
    *! version 1.0.1  19jun2013
    *! Sergiy Radyakin, 2013
    *! ff83ef0b-3d9d-4b19-a519-90f13b6e8257

  • #2
    Sean Lambert reports an error message produced by the user-written Stata module -use13- while processing a large file produced by a 64-bit version of Stata 13. Sean searched and found an earlier thread in which a problem was reported but not resolved; I assume that it is this post. Sean is aware of the significance of the 2 GB limit, and he is correct that the documentation for -use13- does not restrict such use. Sean has provided enough information about his Stata environment, the exact commands he typed, and Stata's response, which means he has carefully read the Statalist FAQ, the use13 troubleshooting reference, or both.

    It takes a while to reproduce the problem, depending on the actual data and machine performance. In my trial, -use13- crunched the data for 1 hour and 2 minutes before it stopped with an error. Because the error message was different from the one Sean quoted, I decided to investigate in more detail.

    It appears that a bug in Stata prevents -use13- from working correctly, and the exact failure is data-dependent: different files can make the program fail with different error messages. Notably, if the data file is smaller than 2 GB, there is never a problem.

    Because -use13- is a module for Stata, any defects in Stata limit the stability of -use13-. The bug is in the implementation of fseek() in Mata. Specifically, the documentation states:
    fseek(fh,offset,whence) and _fseek(fh,offset,whence) abort with error if offset is outside the range ±2,147,483,647 on 32-bit computers; if offset is outside the range ±9,007,199,254,740,991 on 64-bit computers;
    The hope was that anyone with a large Stata 13 file would be using a proper (read: 64-bit) version of Stata to handle it, so the larger number would apply as the limit for fseek().
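
    For reference, the two documented limits are not arbitrary: 2,147,483,647 is 2^31 - 1, the largest signed 32-bit integer, and 9,007,199,254,740,991 is 2^53 - 1, the largest value up to which a double-precision number (the storage behind Mata's real scalars) represents every integer exactly. A quick check in Mata, offered only as an illustration:

    ```stata
    mata:
    // documented fseek() offset limit on 32-bit computers: 2^31 - 1
    printf("%20.0f\n", 2^31 - 1)
    // documented fseek() offset limit on 64-bit computers: 2^53 - 1
    printf("%20.0f\n", 2^53 - 1)
    end
    ```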

    It appears, however, that the 2 GB limit applies to both 32- and 64-bit versions of Stata: although Stata can read large files, it cannot seek within them past that point.

    The following minimal example illustrates that:
    Code:
    do "http://www.radyakin.org/statalist/statabugs/fseek_bug/fseek_bug.do"
    (64-bit Stata 13 is required for it to run, but you can change the version statement to experiment with other versions of Stata).
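
    For readers who do not want to download the do-file, the kind of check involved can be sketched in Mata. This is not the contents of fseek_bug.do, which is not reproduced here; the function name try_seek() and the file name in the usage line are hypothetical, and the sketch assumes that a file larger than 2 GB already exists at that path.

    ```stata
    mata:
    // Hypothetical helper, not part of -use13- or fseek_bug.do:
    // seek to an absolute offset and report what Mata says happened.
    void try_seek(string scalar fn, real scalar offset)
    {
        real scalar fh, rc

        fh = fopen(fn, "r")            // aborts with error if the file cannot be opened
        rc = _fseek(fh, offset, -1)    // whence = -1: offset is counted from the start of the file
        if (rc < 0) {
            printf("_fseek() returned error code %g\n", rc)
        }
        else {
            printf("requested offset %15.0f, ftell() reports %15.0f\n", offset, ftell(fh))
        }
        fclose(fh)
    }
    end
    ```

    Calling it as mata: try_seek("bigfile.dta", 1e9) and then as mata: try_seek("bigfile.dta", 3e9) compares an offset below 2,147,483,647 with one above it. Going by the description in this post, the second call is where a failure or a mismatch would be expected on an affected version of Stata.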

    Since it takes a while to run, here are the highlights:



    The bug was reproduced in 64-bit versions of Stata 12.1 and 13.0 on Windows; interested parties may try it in Stata 14. Importantly, Stata 12 and earlier are no longer updated, so the bug will remain in exactly those versions where it is most harmful.

    Applying -use13- to such huge datasets is a bit of a stretch, especially because it is not optimized for such volumes of data; even smaller datasets can take hours (depending on the machine). But thank you, I take this trust as a compliment!

    The fix is not easy, since fseek() is used in a few places and the map structure appears to assume that the reader will seek to the necessary locations.

    A workaround is possible, but for now it is probably better to use alternative tools (read: Stat/Transfer) for files larger than 2 GB.
    I will make a note of this and describe it better on the support site, and perhaps introduce a warning for large files.

    Hope this helps.

    Best, Sergiy Radyakin



    • #3
      Sergiy,

      Thank you for your response. I was very interested in this problem, mainly because I really like your approach with -use13-. I think it addresses a major flaw that Stata should have addressed long ago. It seems almost as if it is a strategy of planned obsolescence for older versions of Stata. I don't think that's entirely ethical. Users should upgrade if the features of new versions are superior to old ones and we feel a need for them.

      Regardless, I found it reassuring that people like you have put so much time into fixing problems like this.

      I do agree that a note in the documentation would certainly clear things up for those of us who need to convert larger files.

      Sean



      • #4
        It seems almost as if it is a strategy of planned obsolescence for older versions of Stata. I don't think that's entirely ethical.
        I really disagree strongly with this. You will have to search far and wide in the world to find a person who has a more mistrustful attitude towards the ethics of businesses than I. But this is just way off base. I've been using Stata since version 4. In no instance can I recall a change in the data set format occurring without there being a meaningful improvement in functionality that necessitated it. Moreover, Stata has always provided a -saveold- command that allows users of the new version to save data in formats that are readable from at least the preceding version, and often going several versions back. (The current -saveold- command goes back to version 11.) On top of that, you have user-written software such as Sergiy's -use13-, available as a labor of love, which enables users of the older versions to read the newer data sets. Stata Corp. is very aware of this activity--and it is my understanding that they actively support this kind of thing.

        Granted, if you were running Stata version 6 and somebody sent you a Stata version 14 data set you may have to go through some contortions. But even then, depending on what's in the data set, it might be as simple as asking someone with current Stata to export it to a delimited text file or something like that. You have unfortunately stumbled on a situation where the available tools fail. But to call it built-in obsolescence, puh-lease! How many extra copies of current version Stata do you think can be forced down the public's throat by making it difficult for people running Stata 11 to read version 13 4GB data sets on 32-bit machines in 2015? Especially since you can ask anyone who has Stata 13 or later to read the data for you and then use -saveold- to put it into Stata 11 format. You described the person who furnished the data set as a friend: I suspect he or she will do that for you.
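
        As a sketch of the two routes described above, these are the commands the person holding the data might run. The output file names are hypothetical, the version() option is available only from Stata 14 on, and whether a dataset of this size fits within the memory and variable limits of Stata 11 IC is a separate question that this sketch does not settle.

        ```stata
        * Run in Stata 13 or later, by whoever has the newer Stata
        use "2014 - ACS Dataset V1.dta", clear

        * Route 1: save in an older .dta format
        * (check -help saveold- in the sender's version to confirm which releases can read it)
        saveold "acs_2014_old.dta", replace
        * In Stata 14 the target release can be named explicitly:
        * saveold "acs_2014_old.dta", version(11) replace

        * Route 2: export to delimited text
        export delimited using "acs_2014.csv", replace

        * Then, in Stata 11, one of:
        * use "acs_2014_old.dta", clear
        * insheet using "acs_2014.csv", comma clear
        ```

        Route 1 keeps variable and value labels; route 2 is the fallback when no compatible .dta format is available, at the cost of losing labels and storage types.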

        Disclosure: I have no financial ties to Stata Corp. I do not own any shares, nor do I receive any financial or other compensation from them. My own relationship to Stata Corp. is as a very satisfied customer.



        • #5
          I am very glad Clyde said this. I can't match his disclosure, as I have an explicit role as an Editor of the Stata Journal.

          But in my view he's right.
