  • Bug: spaces in variable names in an R-written .dta file

    I want to report a bug I ran into when downloading an R-written .dta file from a World Bank site (http://wits.worldbank.org/, registration required).

    Some of the variable names have a space in them. This is particularly troublesome when two variables start with the same word: you can neither rename them nor refer to them.
    I re-downloaded the data as CSV and imported that instead, but this bug surprised me a lot.
    I know the file is R-written because Stata displays this when the dataset is opened.

    I can't even show the data using dataex, which also stops on the ambiguous abbreviations.

    So I can only attach a screenshot showing it.

    Were you aware this could happen?

    Charlie
    [Attached screenshot: VarnNameBug.png]

  • #2
    This sounds like a problem to be reported to the World Bank, since they are creating "Stata" datasets with variable names not permitted in Stata. I'd also argue that the R routine should apply some basic checking to ensure the variable names are usable in the target language, but R's attitude is likely to be that it's the duty of the R user to understand any limitations on variable names. I think Stata's attitude would be that Stata is the only supported creator of Stata datasets, and if you have a dataset not created by Stata that fails to read in Stata, that is not Stata's problem, but rather the problem of the software that created the "Stata" dataset. I would be sympathetic to that point of view.

    Some of variables names have a space in it. Which is particularly bothering when two variables start by the same word, you cant neither change their name, neither call those variable.
    Is it possible that a group rename command can handle this? (See help rename group for details.) Perhaps something like:
    Code:
    rename (*Region) (PartnerRegion)
    or
    Code:
    rename (Par*) (foo#), addnumber
    I'm not optimistic, but it could be worth a try.


    • #3
      As a workaround, you can use the undocumented command

      Code:
      renamevarno varnum newname
      where varnum is the number of the variable (its position in the dataset). The command is only available in Stata 14 or 15. For example:

      Code:
      . sysuse auto
      (1978 Automobile Data)
      
      . desc
      
      Contains data from C:\Program Files (x86)\Stata14\ado\base/a/auto.dta
        obs:            74                          1978 Automobile Data
       vars:            12                          13 Apr 2014 17:45
       size:         3,182                          (_dta has notes)
      --------------------------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      --------------------------------------------------------------------------------
      make            str18   %-18s                 Make and Model
      price           int     %8.0gc                Price
      mpg             int     %8.0g                 Mileage (mpg)
      rep78           int     %8.0g                 Repair Record 1978
      headroom        float   %6.1f                 Headroom (in.)
      trunk           int     %8.0g                 Trunk space (cu. ft.)
      weight          int     %8.0gc                Weight (lbs.)
      length          int     %8.0g                 Length (in.)
      turn            int     %8.0g                 Turn Circle (ft.)
      displacement    int     %8.0g                 Displacement (cu. in.)
      gear_ratio      float   %6.2f                 Gear Ratio
      foreign         byte    %8.0g      origin     Car type
      --------------------------------------------------------------------------------
      Sorted by: foreign
      
      . renamevarno 1 mymake
      
      . renamevarno 3 mympg
      
      . desc
      
      Contains data from C:\Program Files (x86)\Stata14\ado\base/a/auto.dta
        obs:            74                          1978 Automobile Data
       vars:            12                          13 Apr 2014 17:45
       size:         3,182                          (_dta has notes)
      --------------------------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      --------------------------------------------------------------------------------
      mymake          str18   %-18s                 Make and Model
      price           int     %8.0gc                Price
      mympg           int     %8.0g                 Mileage (mpg)
      rep78           int     %8.0g                 Repair Record 1978
      headroom        float   %6.1f                 Headroom (in.)
      trunk           int     %8.0g                 Trunk space (cu. ft.)
      weight          int     %8.0gc                Weight (lbs.)
      length          int     %8.0g                 Length (in.)
      turn            int     %8.0g                 Turn Circle (ft.)
      displacement    int     %8.0g                 Displacement (cu. in.)
      gear_ratio      float   %6.2f                 Gear Ratio
      foreign         byte    %8.0g      origin     Car type
      --------------------------------------------------------------------------------
      Sorted by: foreign
           Note: Dataset has changed since last saved.
      Last edited by Hua Peng (StataCorp); 28 Jun 2017, 09:59.


      • #4
        Thanks, William, for the comment.

        You're probably right that this is more of an R routine issue, and I have contacted the World Bank about it.
        As for dealing with it, it's actually not that troublesome, since other formats are available, including CSV. Importing the CSV file into Stata avoids variable names with spaces in them (I had to rename the variables, though).

        The real question was thus: could Stata prevent a file from being saved as .dta when it violates Stata's rules (e.g. a space in a variable name), even if the file is created through R?

        Peng, thanks for the renamevarno trick; unfortunately I use Stata 13. I should upgrade to Stata 14 (or even 15) in September, so I'll keep it in mind.
        But again, I'm not stuck, since other formats were available. I was just amazed that such a basic rule could be violated in a .dta file.
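
        For Stata 13, where renamevarno is not available, renaming by position may also be possible through Mata. The following is only a sketch: it assumes that Mata's st_varrename() accepts a variable index as its first argument (as its help file describes) and that it tolerates an existing name containing a space, which has not been tested on the World Bank file. The positions 1 and 3 simply mirror the auto.dta example above.

        ```stata
        * Sketch only: rename variables by position via Mata, using auto.dta.
        sysuse auto, clear

        * Variable 1 is make and variable 3 is mpg in auto.dta.
        mata: st_varrename(1, "mymake")
        mata: st_varrename(3, "mympg")

        * Confirm that the new names are in place.
        describe mymake mympg
        ```

        On the problem dataset, the index would be the position of the offending variable as shown by describe.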

        Best,
        Charlie


        • #5
          Hua Peng (StataCorp) - wow! Thanks for the tip.

          Charlie Joyez - the problem is that R, not Stata, is "violating the basic rule": it wrote the .dta file you downloaded using the layout of a Stata dataset as described by help dta, but without following the guidelines for a valid Stata variable name given by help varname. I argue that this constitutes a programming error on the part of whoever wrote the R code. Stata cannot stop R from making this error. Do you mean, could Stata, while reading the file, prevent itself from creating variables in memory with inappropriate names? My own opinion is that this has the potential to add overhead to each and every use command - validating variable names, checking that the same variable name doesn't appear twice, ... - and it would still catch only those errors that Stata looked for, not other errors. That's why I'm sympathetic to Stata assuming .dta files will not have such basic errors in them.
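
          To make the kind of check discussed here concrete, a user can run a name validation by hand after use. This is a minimal sketch, not anything Stata does on its own: it assumes that Mata's st_varname() returns the stored name unchanged even when it is malformed, and it relies on st_isname() to apply the rules given in help varname. It is shown on auto.dta, where every name is valid, so nothing is listed.

          ```stata
          * Sketch only: list variables whose names fail Stata's naming rules.
          sysuse auto, clear

          mata:
          // Walk through the variables by position and test each name.
          for (i = 1; i <= st_nvar(); i++) {
              name = st_varname(i)
              if (!st_isname(name)) {
                  printf("variable %g has an invalid name: [%s]\n", i, name)
              }
          }
          end
          ```

          The loop only reports problems. It does not check for duplicate names, and any renaming of the listed variables would be a separate step.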
