  • Number of Characters in Variable Names

    Could Stata please double the number of characters allowed in a variable name, from 32 to 64?
    If nowhere else, then at least in the Variables window.
    The current limit makes Stata very difficult to use in survey work, because variable names in the data files from survey fieldwork routinely exceed 32 characters.

  • #2
    Long variable names in databases are also a major problem, because the variables (fields) then have to be renamed.



    • #3
      Welcome to Statalist.

      Stata has a long history of effective processing of survey data. Perhaps the problems you are encountering are representative of issues specific to the data collection process producing the surveys you are using. In general, I have seen that effectively designed surveys have short variable names for which long variable descriptions are available, and these descriptions work nicely as variable labels.

      Code:
      . sysuse cancer, clear
      (Patient Survival in Drug Trial)
      
      . describe studytime died drug age
      
                    storage   display    value
      variable name   type    format     label      variable label
      ------------------------------------------------------------------------------------------------
      studytime       byte    %8.0g                 Months to death or end of exp.
      died            byte    %8.0g                 1 if patient died
      drug            byte    %8.0g                 Drug type (1=placebo)
      age             byte    %8.0g                 Patient's age at start of exp.



      • #4
        Thank you, but that is not the case. We routinely use over 20,000 variable names, and the data are a mixture of SP and RP data that come straight out of databases. Currently, we can spend weeks adjusting variable names that were designed by clients and that we have no control over. Yes, Stata does have a long history with survey data, but we are increasingly seeing long variable names in both SP and RP data, and it would help a lot if Stata could go from 32 to 64 characters.
        Last edited by CON MENICTAS; 08 Jul 2018, 07:00.



        • #5
          I thought it was great when Stata upped the limit from 8 characters. Just googling around, I see that SAS has the same 32 character limit, while SPSS allows 64 and R allows 10,000! Such long names would seem like a nightmare to me but obviously some people want them.
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          Stata Version: 17.0 MP (2 processor)

          WWW: https://www3.nd.edu/~rwilliam



          • #6
            CON MENICTAS wrote: "We routinely use over 20,000 variable names and the data are a mixture of SP and RP data that come straight out of databases. Currently, we can spend weeks on adjusting variable names that have been designed by clients that we have no control over."
            Thank you for the elaboration on your situation. I haven't any idea what "SP and RP data" are, but I now have some idea of the sort of problem you are facing.

            When I have been in similar situations, I have found it helpful to precede the processing of the data with processing of the metadata - the data about the data.

            In Excel terms, the metadata is the first row (or if you're unlucky, first several rows) of a worksheet that function as column headings, and which usually are totally unsuitable for use as Stata variable names. In Oracle database terms, it's the ALL_TAB_COLUMNS view of the table with your data, which presents a table of data about each of the columns of the table that will become variables in Stata. Other SQL databases have similar constructs.

            The strategy I have in mind for dealing with the metadata, like the column names for a database table, is one I've never seen documented or formally taught, but seems to be passed from programmer to programmer. It's "writing a program that writes another program."

            Thinking in Excel terms, you read the first row in, and reshape it into a column of string variables containing overly verbose descriptions such as
            Code:
            "Age of the mother at the birth of the first child"
            "Family income (excluding profits and losses from a wholly-owned business)"
            Let's assume these are in the first two columns of the first row of the Excel worksheet, with the data following in rows 2 and onward. The "metadata" program would create a text file using the file command (so it is not a Stata dataset) containing Stata commands
            Code:
            import excel using ... , cellrange(A2:BQ42042) // assuming there are actually lots more columns and rows
            rename (A-BQ) (var#)
            label variable var1 "Age of the mother at the birth of the first child"
            label variable var2 "Family income (excluding profits and losses from a wholly-owned business)"
            If that file is named getmydata.txt then once it is created,
            Code:
            include getmydata.txt
            will run these commands, first importing the Excel worksheet (skipping the first row), then renaming the variables from the Excel column names to something a little simpler and easier to deal with, and then applying the information from the first row of the worksheet to the dataset as variable labels.
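            As a rough illustration of the "program that writes another program" idea, here is a minimal sketch. It is not a tested production workflow. The workbook name demo.xlsx, its two-column layout, and the cell ranges are assumptions made only so the example is self-contained, and it renames the variables one at a time rather than with a single rename group. Descriptions containing double quotes, or longer than the 80-character limit on variable labels, would need extra handling.

            ```stata
            * Build a tiny demonstration workbook (hypothetical file demo.xlsx):
            * row 1 holds verbose descriptions, rows 2-3 hold the data.
            clear
            set obs 3
            generate str80 A = cond(_n == 1, "Age of the mother at the birth of the first child", string(20 + _n))
            generate str80 B = cond(_n == 1, "Family income (excluding profits and losses from a wholly-owned business)", string(40000 + 1000*_n))
            export excel using "demo.xlsx", replace

            * Metadata program: read only the header row and write getmydata.txt.
            import excel using "demo.xlsx", cellrange(A1:B1) clear
            tempname fh
            file open `fh' using "getmydata.txt", write text replace
            file write `fh' `"import excel using "demo.xlsx", cellrange(A2:B3) clear"' _n
            local i = 0
            foreach v of varlist _all {
                local ++i
                local desc = `v'[1]
                * one rename and one label command per column
                file write `fh' `"rename `v' var`i'"' _n
                file write `fh' `"label variable var`i' "`desc'""' _n
            }
            file close `fh'

            * Data program: run the generated commands, then inspect the result.
            include getmydata.txt
            describe
            ```

            With 20,000 columns the loop is unchanged; only the cell ranges differ. The generated getmydata.txt can be read and corrected by eye before it is ever run.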

            Now, you might feel that in your case, variable names var1 through var20000 are not helpful. To this I would argue that with that many variables, any user should rely on the output of
            Code:
            codebook
            saved as a PDF or similarly searchable text file, or an Excel spreadsheet of metadata, rather than a variable name limited to 32 or even 64 characters, to ensure the best possible understanding of the data.
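
            One possible way to get the "Excel spreadsheet of metadata" is sketched below, using the replace option of describe, which turns the variable names, types, formats, and labels into a dataset. The auto dataset and the output file name metadata.xlsx are stand-ins chosen for illustration, not part of the original suggestion.

            ```stata
            sysuse auto, clear

            * Keep the real data safe while the metadata dataset is in memory.
            preserve
            describe, replace clear

            * Each observation now describes one variable of the original data.
            list name type varlab in 1/5, noobs abbreviate(32)
            export excel name type format vallab varlab using "metadata.xlsx", firstrow(variables) replace
            restore
            ```

            The resulting worksheet can be searched or filtered for a description, which then leads back to the short variable name.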

            Now, this isn't meant to be an exhaustive discussion of how to work around Stata's limitation on variable name lengths. It's just a signpost that might point you in a useful direction. Because you say you can spend weeks adjusting variable names, I fear you are doing a lot of manual work that it might be possible to automate following this strategy. And one advantage is that it's easier to debug the metadata program that writes the data program than it is to review and proofread the result of weeks of manual changes to a dataset.



            • #7
              Sad fact is that whatever limit Stata has, there are many circumstances in which you will only see abbreviated names. Like William, I have no idea what is meant by "SP and RP data".



              • #8
                Nick correctly reminds us that while 32 character variable names are possible today, there are many places in Stata output where displaying a full 32 character variable name is not possible, and would be ridiculous were it done.

                I view that as another reason to favor - or at least not look down on - non-meaningful short variable names like var1 and var2 - the variable name will be presented with full fidelity in Stata output.



                • #9
                  In addition, characteristics can store strings much longer than 80 bytes. Also, see Maximizing Stata's Metadata Capabilities for some ideas for using characteristics for defining attributes of variables and the use of -ds- and -findname-.
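
                  A minimal sketch of the idea, assuming the auto dataset as a stand-in: the characteristic names description and origname, and the long strings stored in them, are made up for illustration.

                  ```stata
                  sysuse auto, clear

                  * A description too long for an 80-character variable label.
                  char mpg[description] "Fuel economy in miles per US gallon, a hypothetical long description that would not fit in an 80-character variable label"

                  * A hypothetical original name longer than 32 characters.
                  char mpg[origname] "fuel_economy_miles_per_us_gallon_combined"

                  * Show everything attached to mpg.
                  char list mpg[]

                  * Find the variables that carry an origname characteristic.
                  ds, has(char origname)
                  display "`r(varlist)'"

                  * Length of the stored description.
                  display strlen("`: char mpg[description]'")
                  ```

                  Characteristics are saved with the dataset, so the long text stays attached to the variable across sessions.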



                  • #10
                    Question to Nick on the #7 response. Is there however a solution to get Stata to display a little more than 8 characters of the variable name or label?
                    This is a correlation table I generated but Stata has shortened my variable names to just 8 characters. I would like to be able to extend it to say 12 or 15 or 20. How do I do that?


                    . pwcorr Time_to_5ft Beginning_depth, sig

                                 | Time_t~t Beginn~h
                    -------------+------------------
                     Time_to_5ft |   1.0000
                                 |
                                 |
                    Beginning_~h |   0.7728   1.0000
                                 |   0.0032
                                 |





                    • #11
                      Code:
                      help pwcorr
                      shows no options for tuning the display width that I can see, so the answer appears to be that you would need to rewrite the code yourself.
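
                      As a rough idea of what "rewriting it yourself" could look like, the sketch below lists each pair of variables on its own line with unabbreviated names. It is not a replacement for pwcorr: it prints a list rather than a matrix, it uses the auto dataset and three of its variables as stand-ins, and the 24-character column width is an arbitrary choice. The p-value is computed from a t statistic with N - 2 degrees of freedom.

                      ```stata
                      sysuse auto, clear
                      local vars price mpg weight
                      local k : word count `vars'

                      display as text %-24s "variable 1" %-24s "variable 2" %9s "rho" %9s "p-value"
                      forvalues i = 1/`=`k'-1' {
                          local a : word `i' of `vars'
                          forvalues j = `=`i'+1'/`k' {
                              local b : word `j' of `vars'
                              * two variables at a time, so casewise deletion is pairwise
                              quietly correlate `a' `b'
                              local p = 2*ttail(r(N) - 2, abs(r(rho))*sqrt((r(N) - 2)/(1 - r(rho)^2)))
                              display as result %-24s "`a'" %-24s "`b'" %9.4f r(rho) %9.4f `p'
                          }
                      }
                      ```

                      Widening the %-24s formats to %-32s would leave room for any legal variable name.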



                      • #12
                        Is there a significant constraint (memory?) that precludes allowing arbitrarily long names? I run into this all of the time because I prefer to give variables 'readable' names (it's much easier for me to spot a misplaced variable when they are called mother_age_birth1 and family_income than when they're called var1 and var2).



                        • #13
                          Since a couple of people have asked:

                          SP = "Stated Preferences", and
                          RP = "Revealed Preferences".

                          CON MENICTAS wrote: "please double the number of characters allowed in a variable name from 32 to 64. If nowhere else then at least in the variables window".

                          The Variables window has hardly anything to do with the restriction on variable names. It just displays the names as they are stored in the data file. It is the file format that would need to be revised (once again). StataCorp has done that many times now, so that would not be the problem. The problem would be the thousands of user-written programs that may rely on variable names being no longer than 32 characters. When one of them silently tabulates or plots the wrong variable, one can be in trouble.

                          CON MENICTAS, with 20,000 variables you would be just as well off naming them x1...x20000, providing meaningful variable labels, and storing the original variable names in variable characteristics.
                          I can't imagine any of this being done manually rather than by looping over all or some of the variables.
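
                          A minimal sketch of such a loop, assuming the auto dataset as a stand-in for the real data: the characteristic name origname is a hypothetical choice. Here the names being replaced are already legal Stata names; names longer than 32 characters would have to come from the source metadata instead.

                          ```stata
                          sysuse auto, clear

                          local i = 0
                          foreach v of varlist _all {
                              local ++i
                              * remember the original name on the variable itself
                              char `v'[origname] "`v'"
                              * fall back to the old name when there is no label yet
                              local lbl : variable label `v'
                              if `"`lbl'"' == "" {
                                  label variable `v' "`v'"
                              }
                              rename `v' x`i'
                          }

                          * Look up what x2 used to be called.
                          display "`: char x2[origname]'"
                          describe x1-x3
                          ```

                          Characteristics travel with the variable when it is renamed, so the mapping back to the original name is never lost.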

                          I would (humbly, but strongly) oppose extension of variable names to longer than 32 characters.
                          Already, the previous decision to revise the rules for identifiers can give me headaches sometimes. Try, for example (in Stata 14.0+):
                          Code:
                          use "http://www.radyakin.org/statalist/2020/epidemics.dta", clear
                          summarize

                          Also, consider raising the issue with your data suppliers or the manufacturer(s) of the systems they employ. Your mileage may vary. For example, the Survey Solutions software for survey management and data collection automatically restricts the length of variable names to 32 characters, restricts it further for variables to which indices or suffixes will be appended (so that the final names stay within the limit), and enforces uniqueness of names to prevent collisions.



                          • #14
                            Stata currently allows up to 120,000 variables, depending on version and flavour, but the impact of longer variable names would not, so far as I can see, really be on the memory needed.

                            The issue is where and how people who want this expect variable names to be shown, not just in output, but also in other windows, and how that works out in terms of revising official and community-contributed code.

                            For every user with two or more enormous monitors, there are perhaps several more just using laptops. And although longer variable names would be at most permitted, and never obligatory, I can't sense that the upheaval involved for the sake of a few vocal users would be a gain compared with other priorities.
