
  • Labels for a string variable

    I have a string variable consisting of CPT codes, which are medical codes for various procedures. Stata has a nice set of routines for handling ICD9 codes, but not CPT codes. In the SAS version of this file, which is the form in which we originally got the data, the string variables are labeled via PROC FORMAT; e.g., a CPT code of “J507” might have a label of “MRI lower abdomen”. When I convert the file to Stata using StatTransfer, the string variables get no labels, of course, because Stata will not let you label a string.

    I know that I can use encode to create a set of numeric codes corresponding to the string values, with the original string values becoming labels. So J507 becomes, say, numeric value 22 with label “J507.” But I need to get the original long labels attached to the numeric values so that 22 gets labeled as “MRI lower abdomen” or, even better, “J507 MRI lower abdomen.” I can do this by brute force, by modifying the SAS code and changing it to label define statements, but I suspect that someone else has faced this problem, and I would appreciate any suggestions as to how to solve it efficiently.
    Richard T. Campbell
    Emeritus Professor of Biostatistics and Sociology
    University of Illinois at Chicago

  • #2
    It is not clear to me if you have access to SAS, or only to a copy of the SAS PROC FORMAT code that created the format. If you do have access to SAS, using PROC FORMAT with the CNTLOUT option can create a standard SAS dataset containing the format information. You could then write a simple SAS program to read that dataset and print out label define statements. Copy and paste that into your Stata do-file and away you go. Alternatively, you could transfer the CNTLOUT dataset into Stata and use it to create label define statements. If you lack access to SAS, perhaps the source of your data could be prevailed upon to also provide the CNTLOUT dataset.

    This may not be much less work than brute force, but it's probably less prone to typos.

    I hope this is helpful. I'd demonstrate with some sample SAS code, but I currently lack SAS access.
    Last edited by William Lisowski; 23 Jan 2015, 20:15. Reason: Correct typos.
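
For anyone without SAS at hand, here is a minimal Python sketch of the "print out label define statements" idea from #2. It assumes the CNTLOUT dataset has already been exported to CSV with START and LABEL columns. The function name, the label name cptlbl, the format name CPTDESC, and the A100 row are made up for illustration; only J507 comes from the thread. The integers are handed out in sorted order of the codes. encode also numbers strings alphabetically from 1, so the two numberings should line up only when the data contain exactly the same set of codes. Otherwise a crosswalk merge is safer.

```python
import csv
import io


def label_define_lines(cntlout_csv, label_name):
    """Build Stata -label define- commands from CSV text exported from a CNTLOUT dataset.

    Assumes the CSV has START (the code) and LABEL (the description) columns.
    """
    rows = list(csv.DictReader(io.StringIO(cntlout_csv)))
    # Sort by code so the integers follow alphabetical order of the codes.
    rows.sort(key=lambda row: row["START"].strip())
    commands = []
    for number, row in enumerate(rows, start=1):
        code = row["START"].strip()
        # Swap embedded double quotes so the Stata string literal stays simple.
        text = row["LABEL"].strip().replace('"', "'")
        commands.append(f'label define {label_name} {number} "{code} {text}", add')
    return commands


if __name__ == "__main__":
    # Hypothetical CNTLOUT export; only the J507 row comes from the thread.
    sample = (
        "FMTNAME,START,END,LABEL\n"
        "CPTDESC,J507,J507,MRI lower abdomen\n"
        "CPTDESC,A100,A100,Made-up procedure\n"
    )
    for command in label_define_lines(sample, "cptlbl"):
        print(command)
```

The printed lines can be pasted into a do-file. The combined "code description" label follows the preference stated in #1.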



    • #3
      I know of people who deal with problems like this on a recurring basis and have written scripts in Python (or similar languages) to translate SAS FORMAT commands into Stata -label define- commands. I've considered writing a C++ program to do this myself at times, but it doesn't seem to come up often enough in my work to be worth the effort, so I just hand-edit the FORMAT file into a do-file.



      • #4
        I don't use SAS nor StatTransfer, but you mention having StatTransfer and from my understanding of

        http://www.stattransfer.com/support/...ASFormats.html

        it is possible to export the SAS value labels. If so, can't you import to Stata and work with that?

        If this approach is feasible and you still have doubts, sharing an example of the data and labels exported by StatTransfer might help someone help you.
        You should:

        1. Read the FAQ carefully.

        2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

        3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

        4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.



        • #5
          After reflecting on this overnight, I'd like to recommend a different approach. At the overview level,
          1. Create a Stata file cptref containing two string variables CPT and CPTdescription.
          2. Use merge m:1 CPT using cptref to merge this file onto the Stata file containing the data imported via StatTransfer, adding CPTdescription to the file.
          3. Use encode CPTdescription to generate value labels in the process, rather than encoding CPT.
          For the first step, I'm thinking that what you have is the code for the SAS format. My "behavioral assumption" is that the code was written in a nicely formatted and easily readable "fixed" layout, something like

          Code:
          proc format;
          value cdrdesc
          "J507" = "MRI Lower Abdomen              "
            ...
          ;
          so the individual lines could be read using infix or maybe import delimited. Or the CNTLOUT information I discussed above would contain the necessary data.

          On the second step, initial sorting will be required, and care must be taken that nonmatches are handled correctly.

          My bottom line, though, is that the process of assigning labels to encoded values will be simplified greatly by having encode create the value labels, rather than trying to assign them after the fact.
          Last edited by William Lisowski; 24 Jan 2015, 06:46. Reason: Clarification
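
As a rough sketch of step 1 in #5, the following Python pulls the quoted code/description pairs out of PROC FORMAT text laid out like the example and writes a two-column CSV. It assumes each pair sits on its own line in the form "code" = "description" with no embedded double quotes. The function names are made up for illustration, and the column names CPT and CPTdescription follow the post.

```python
import csv
import io
import re

# Matches lines such as:  "J507" = "MRI Lower Abdomen              "
PAIR = re.compile(r'^\s*"([^"]*)"\s*=\s*"([^"]*)"')


def format_to_rows(sas_text):
    """Return (code, description) tuples found in PROC FORMAT value lines."""
    rows = []
    for line in sas_text.splitlines():
        match = PAIR.match(line)
        if match:
            # Strip the trailing padding used in the fixed layout.
            rows.append((match.group(1).strip(), match.group(2).strip()))
    return rows


def rows_to_csv(rows):
    """Return CSV text with a header row followed by one row per pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["CPT", "CPTdescription"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sas_text = '''proc format;
value cdrdesc
"J507" = "MRI Lower Abdomen              "
;
'''
    print(rows_to_csv(format_to_rows(sas_text)), end="")
```

The resulting CSV could then be read with import delimited and saved as cptref before the merge in step 2.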



          • #6
            Thanks to all who responded to this. After reading William Lisowski's second post I realized that I had forgotten that I had hit on a similar solution when I faced this problem a few years ago.

            I have never understood why Stata does not permit labels on string values. I can imagine how that was originally decided (if it's a string it's self-labeling), but there doesn't seem to be any fundamental reason for disallowing labels on strings, and they are sometimes necessary, e.g. when the string is a meaningless code and you want to add a translation. I would add such a request to the Stata 14 wish list. Actually, I would guess that Stata 14 is fully specified at this point and we will have to wait till Stata 15, but I hope this can be fixed someday.
            Richard T. Campbell
            Emeritus Professor of Biostatistics and Sociology
            University of Illinois at Chicago



            • #7
              Note that help label reports

              label define defines a list of up to 65,536 (1,000 for Small Stata) associations of integers and text called value labels.
              which suggests to me that the value label implementation is based on the restriction of the values to integers. Also, if value labels were supported for strings, we'd need to have encode modified to transfer the value labels to the encoded variable.

              And that leads to the more general concern. Stata's design choices generally reflect decisions made long ago that limit analysis to numeric data items. I've worked with JMP, and found it pleasant that character variables were treated naturally as categorical variables in any analysis, with no need to manually create an "encoded" version. Barring a wholesale reconsideration of how Stata treats character data, I think changes at the margins would be unlikely, an attempt to mask the underlying issue.

              I think the best that I can hope for is that, by having had this discussion here in Statalist, I'll find it here sometime in the future when I face this problem myself and don't remember having (eventually) worked out a reasonable approach. Which could be as soon as a few weeks, given my memory ...



              • #8
                I have encountered this same problem and am hoping for some advice. I have Stata 14 and am attempting to import a SAS .xpt file (actually there are two; the data consist of almost 2 million cases, and the distributor decided to split them into two separate files for reasons unknown). There is also a formats.cpt file, which I assume contains all the value label commands. I can import the data while ignoring the formats file, which I would rather not do for obvious reasons. If I include the formats file in the import, then the command halts and issues the error
                "D:\D10TC.XPT has string variable prepby with format (value label) $prepby; value labels for string variables are not allowed in Stata
                r(610);"

                It's not immediately clear to me if it is just the prepby variable that has labels attached to the strings or if this is simply the first instance of this issue Stata encountered and stopped. If it is the only problematic variable then I would be fine ignoring or omitting it from the import if that's possible.

                I do not have a copy of StatTransfer and I do not have a copy of SAS (or know how to use it, for that matter). I do have SPSS, although I am far from savvy with it. I'm not clear how to even go about editing the .cpt file (since I don't have SAS); opening it in Notepad, for example, results in incomprehensible text, not code.

                I would welcome any bright ideas.




                • #9
                  How is the file split, by variables or by cases? If the latter then you could read each file in separately and just append them. If the former you probably have to merge them. Regarding the string variable format issue, you may be able to get a handle on it by using the describe option for the import command. I don't know if Stata will stop if it encounters labels for a string variable in that case, but it is worth checking out.
                  Richard T. Campbell
                  Emeritus Professor of Biostatistics and Sociology
                  University of Illinois at Chicago



                  • #10
                    You should be able to open the file in SPSS. Once you have it in SPSS, save it as a Stata file. Numeric variables should retain labels; strings should lose them. Not quite as nice as having labels all around, but having labels for your numeric data should be nice.



                    • #11
                      Originally posted by Dick Campbell
                      How is the file split, by variables or by cases? If the latter then you could read each file in separately and just append them. If the former you probably have to merge them. Regarding the string variable format issue, you may be able to get a handle on it by using the describe option for the import command. I don't know if Stata will stop if it encounters labels for a string variable in that case, but it is worth checking out.
                      I can append the datasets with no problem; that's not an issue. The data are longitudinal and split by cases. Cases prior to 2010 are in one file, and cases from Jan 1, 2010 to the present are in the other. It looks to be the same set of variables in both files, but there are 290 of them, so I could be missing something. In any case, the primary issue is the labels.

                      The import command using the describe option does run. The interesting thing is the "prepby" variable is identified as a string but is not identified as having any sort of value label attached (the space under that column header is blank). The only variables identified as having labels are numeric and the one string variable that we know for certain has a label - "prepby" - is not identified as having a label. So it looks like the describe command is not going to identify string variables that Stata is going to attempt to label during import.

                      Ben - that's helpful, I will do that and at least have labels for the numeric variables. It would be nice to know how many of the string variables have labels. Is there any way to see this in SPSS, or maybe a way to get a peek at the formats.cpt file without having access to SAS or StatTransfer?

                      Thank you both for the quick and useful advice.




                      • #12
                        It would be nice to know how many of the string variables have labels.
                        If this is of any help: you will get one error message "Can't accomodate value labels for a string variable" for each string variable using value labels when importing an SPSS file into Stata with usespss. Conveniently, they are one per line, so you can log the import messages, then count the lines with this message in the log.

                        [Attached image: usespss_messages.png]


                        Best, Sergiy Radyakin
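
The log-counting step in #12 could be done with a few lines of Python. This is only a sketch: it assumes the import messages were saved to a plain-text log, and the message text, including its spelling, is copied from the post and may differ across versions of usespss. The function name and the sample log lines are hypothetical.

```python
# Message text as quoted in the post (spelling kept as quoted).
MESSAGE = "Can't accomodate value labels for a string variable"


def count_string_label_messages(log_text, message=MESSAGE):
    """Count the log lines that contain the given message."""
    return sum(1 for line in log_text.splitlines() if message in line)


if __name__ == "__main__":
    # Hypothetical log excerpt; a real log would be read from a file.
    sample_log = (
        "Can't accomodate value labels for a string variable\n"
        "some other import message\n"
        "Can't accomodate value labels for a string variable\n"
    )
    print(count_string_label_messages(sample_log))
```

Since the messages appear one per line, the count should equal the number of string variables that carry value labels.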

