  • Assigning set of value labels to one variable from list from another

    Hello, I would like to set the value labels of coded variables in a dataset based on the contents of another variable (code definitions from an accompanying dictionary file). The other (string) variable has the codes and corresponding definitions stored in a single line, all combined, as in the following example:
    Code:
    1, Yes | 2, No | 3, Unknown
    There are hundreds of these coded variables, each with corresponding codes/definitions in the dictionary file (with varying numbers of classes and kinds of definitions). Thank you for any help.
    Last edited by Straso Jovanovski; 09 May 2023, 10:51.

  • #2
    I'm unclear about what you're working with, so here are some questions:

    1) Is the "dictionary" file a Stata dataset, or what I'd take to be a conventional Stata dictionary, i.e. just a syntax file describing a raw data set? My *guess* is that it's a Stata data set with each value label definition contained in a string variable. That's potentially a useful thing, but not a conventional "dictionary file" in the Stata sense.

    2) Can you show more examples of entries from your dictionary file?

    3) What is it that links a particular value label definition such as you illustrate with its corresponding variable in your dataset?

    Without this information (and perhaps more), I don't think we can efficiently and effectively help you.


    • #3
      Thank you. Yes, the dictionary file is a Stata .dta file. Your guess is correct: it is not a conventional dictionary file, just a simple codebook with one row (observation) per variable in the dataset, holding various metadata for that variable. For example, for var1 the codebook file would have a variable named 'codes_definitions' with the string contents "1, male | 2, female | 3, other"; then for var2 that same variable may have the contents "1, agree | 2, neutral | 3, disagree | 4, don't know | 5, refused"; then for var3 that same variable may have "1, yes | 2, no | 0, maybe", etc. So, each code-definition pair in all instances is separated by the symbol "|".
      In the dataset itself, the variable var1 would be a numeric field with values 1,2,3; var2 (also numeric) would take on the values of 1 through 5; then var3 would have values 0,1,2; Etc.
      Some other variables in the codebook file are named 'description', 'type', 'valid', 'notes'. I would also like to use the variable 'description' to assign its values as variable labels to the corresponding variables in the dataset. Not interested in any of the other variables in the codebook file.
      Last edited by Straso Jovanovski; 09 May 2023, 13:47.


      • #4
        The situation is far too complex to work out a solution without a data example.

        Assuming that neither the | character nor a comma appears as a valid character in any of the labels,

        Code:
        generate values_and_labels = subinstr(subinstr(codes_definition, ",", "", .), "|", "", .)
        removes those characters and leaves you with a valid input for the label define command.

        Depending on where the value label names come from, you might be able to define all labels in a loop as

        Code:
        local N = c(N)
        forvalues i = 1/`N' {
            
            local lblname           = value_label_name
            local values_and_labels = values_and_labels[`i']
            
            label define `lblname' `values_and_labels'
            
        }


        • #5
          I was preparing an answer before I saw Daniel's. I've supplied some example material, but more to the point, I tried a somewhat different tack.

          My approach involved creating a data file that was virtually a labeling syntax do-file that could be applied to the data. As it happens, I did take on the problem of inserting double quotes into the -label def- command, a task for which the presence of "|" and "," can temporarily be useful. I was not particularly careful in my assumptions about what spacing might be like within the label definitions in the file. Here's what I have in mind, which I suspect could be made more compact and more robust by someone more skilled with regular expressions.
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str8 varname str64 codes_definition
          "thisvar"  "1, male | 2, female | 3, other"                                  
          "thatvar"  "1, agree | 2, neutral | 3, disagree | 4, don't know | 5, refused"
          "othervar" "1, yes | 2, no | 0, maybe"                                       
          end
          // Create the -label def- commands.
          gen wanted = ustrregexra(codes_definition, " *\| *", "|")  //   Clean up "|" and ","
          replace wanted = ustrregexra(wanted, " *, *", ",")
          replace wanted = subinstr(wanted, "|", `"" "', . )
          replace wanted = subinstr(wanted, ",", `" ""', . ) + `"""' 
          replace wanted = "label def " +  varname + "Lbl " + wanted 
          // Make a label values command for each label def
          expand 2
          bysort varname: replace wanted = ///
            "label values " + varname + " " + varname + "Lbl " if (_n == 2)
          //
          format wanted %-100s
          // I'd grab the following in a log file and save it as a do file.
          list wanted, clean noobs // for illustration
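
          The original question also asks for variable labels taken from the codebook's 'description' variable. The same command-building idea can be sketched for that. The description text below is made-up toy data and the name varlab_cmd is a placeholder; neither comes from the posts above. The sketch assumes no description contains a double quote, and note that Stata variable labels are limited to 80 characters.

          ```stata
          * Sketch only: toy codebook rows; the descriptions are invented for illustration.
          clear
          input str8 varname str40 description
          "thisvar"  "Sex of respondent"
          "thatvar"  "Agreement with statement"
          end
          // Build one -label variable- command per codebook row,
          // wrapping the description in double quotes.
          gen varlab_cmd = "label variable " + varname + `" ""' + description + `"""'
          format varlab_cmd %-80s
          list varlab_cmd, clean noobs // for illustration
          ```

          These commands could be appended to the same do-file as the -label def- and -label values- lines.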


          • #6
            I appreciate it very much. To Daniel's point: I'm not sure what should go under 'value_label_name'. Is that the codebook variable that contains a list of all variable names (in my case called 'var_names')?
            When I run this I get an error:
            Code:
            forvalues i = 1/`N' {
                
                local lblname           = var_names
                local values_and_labels = values_and_labels[`i']
                
                label define `lblname'l `values_and_labels'
                
            }
            label obs1 already defined
            r(110);
            
            end of do-file


            • #7
              You want

              Code:
              local lblname = var_names[`i']
              Note, however, that Mike has correctly pointed out that I have missed the obviously missing double quotes. You will need to fix this. Mike's approach seemed fine to me.
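
              For readers who want to stay with the loop approach, here is a minimal sketch of how the two fixes (the [`i'] subscript and the double quotes) might fit together. It is not code from the thread. It assumes that no label text contains "|" or ",", and it copies the definitions into local macros before switching datasets, because value labels belong to the dataset in memory. The toy data, the "Lbl" suffix, and the macro names name1, defs1, ... are illustrative choices.

              ```stata
              * Sketch only: toy codebook mirroring the examples in this thread.
              clear
              input str8 var_names str64 codes_definitions
              "thisvar"  "1, male | 2, female | 3, other"
              "othervar" "1, yes | 2, no | 0, maybe"
              end

              * Turn  1, male | 2, female  into  1 "male" 2 "female"
              generate values_and_labels = ustrregexra(codes_definitions, " *\| *", `"" "')
              replace  values_and_labels = ustrregexra(values_and_labels, " *, *", `" ""') + `"""'

              * Copy names and definitions into locals; they survive a change of dataset.
              local N = _N
              forvalues i = 1/`N' {
                  local name`i' = var_names[`i']
                  local defs`i' = values_and_labels[`i']
              }

              * Toy stand-in for the real dataset (in practice: -use- your data here).
              clear
              input thisvar othervar
              1 0
              2 1
              3 2
              end

              * Define each label and attach it to its variable.
              forvalues i = 1/`N' {
                  label define `name`i''Lbl `defs`i'', replace
                  label values `name`i'' `name`i''Lbl
              }
              ```

              One caveat with this naming scheme: label names share the 32-character limit of variable names, so appending "Lbl" to a very long variable name would fail.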


              • #8
                Thank you so much Mike and Daniel. Mike's code worked. One remaining question: any ideas on how to mass-delete the blank lines it produced? I have over 700 of these variables; after pasting the contents of the log file into a do-file I have close to 5,000 lines of code. I was able to delete all the ">" signs using "Edit-Find and Replace", but there are still lots of empty lines in between the lines of code.


                • #9
                  Straso, I believe you mean that your log file looked like this:

                  Code:
                      wanted                                        
                  >                                       
                      label def othervarLbl 1 "yes" 2 "no" 0 "maybe"
                  >                                       
                      label values othervar othervarLbl             
                  >                                       
                      label def thatvarLbl 1 "agree" 2 "neutral" 3 "
                  > disagree" 4 "don't know" 5 "refused"  
                      label values thatvar thatvarLbl               
                  >                                       
                      label def thisvarLbl 1 "male" 2 "female" 3 "ot
                  > her"                                  
                      label values thisvar thisvarLbl               
                  >
                  You should be able to avoid both the ">" and the blank lines by changing the -set linesize- value in Stata. (I created the preceding example by setting my linesize artificially low, to 50). I'd suggest you put in a -set linesize 200- command anywhere before the -list- command in the preceding syntax, and run it again. (By the way: I tried using the -linesize- option on the -list- command, but that didn't work.)

                  If you wanted to fix your existing do-file, you could likely do that with any text editor or word processor program that allows find/replace on end of line characters, since your "blank lines" arise from two consecutive end of line characters. I don't know if the current version of Stata's do-file editor can do that, but my old version can't. It's also possible to do this with Stata's -filefilter- command:
                  Code:
                  filefilter OldFile.do NewFile.do, from(\W\W) to(\W)
                  ("\W" applies to a Windows system, but it could be \U or \M for Unix or Mac, about which see -help filefilter-.)
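
                  Another option, sketched here as a suggestion rather than something tried in this thread, is to skip the log file altogether and write the contents of the command variable straight to a do-file with -file write-, which does not wrap lines at the linesize. The two toy rows below stand in for the variable wanted built earlier; the file name apply_labels.do and the handle name fh are placeholders.

                  ```stata
                  * Sketch only: two toy rows standing in for the variable -wanted-.
                  clear
                  set obs 2
                  generate str80 wanted = ""
                  replace wanted = `"label def othervarLbl 1 "yes" 2 "no" 0 "maybe""' in 1
                  replace wanted = "label values othervar othervarLbl" in 2

                  * Write one command per line, in the current sort order.
                  tempname fh
                  file open `fh' using "apply_labels.do", write text replace
                  forvalues i = 1/`=_N' {
                      file write `fh' (wanted[`i']) _n
                  }
                  file close `fh'

                  type apply_labels.do   // inspect the result
                  ```

                  The resulting file can then be run with -do- once the dataset to be labeled is in memory.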


                  • #10
                    Thank you very much. I did try:
                    Code:
                    set linesize 255
                    (which I believe is the maximum) without success, same result as before.
                    The do-file ended up with lots of instances of text breaking off at random points (for example, part of a word would end up on a new line, which Stata then treated as a non-existent command and threw an error; or a word would end on one line with the closing quotation mark falling on the next). Not sure where those random line breaks are coming from.
                    Last edited by Straso Jovanovski; 10 May 2023, 18:05.


                    • #11
                      That's strange all right. I can't understand why your Stata didn't obey the 255 linesize. Anyway, can you post some examples of these problem areas, with perhaps 4-5 lines around them? That might help us figure out the problem and solution.
                      One thing that occurs to me as a possible cause would be if your value labels file wasn't completely regular at that point, so that perhaps a " was missing, or a |. Or, perhaps there were some problem characters in the text. Or, maybe there wasn't a proper end of line on the line. I suppose you've already looked at those in the file, but if you haven't, that would be worth a try.
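
                      A rough way to screen the codebook for such irregularities is sketched below. It only tests two guesses, neither confirmed in this thread: that a definition contains an embedded line-break character, or that its count of commas does not equal its count of "|" plus one. The toy rows (including the deliberately malformed ones) and the variable names has_break, n_bars, n_commas, and irregular are invented for illustration.

                      ```stata
                      * Sketch only: toy codebook with two deliberately malformed rows.
                      clear
                      input str8 varname str64 codes_definition
                      "thisvar"  "1, male | 2, female | 3, other"
                      "badvar"   "1, yes  2, no | 0, maybe"
                      "brkvar"   "1, yes | 2, no"
                      end
                      * Plant a line feed inside the third definition.
                      replace codes_definition = "1, yes |" + char(10) + " 2, no" in 3

                      * Flag embedded line feeds (char 10) or carriage returns (char 13).
                      gen byte has_break = strpos(codes_definition, char(10)) > 0 | ///
                                           strpos(codes_definition, char(13)) > 0

                      * Count separators: k code-label pairs should give k commas and k-1 bars.
                      gen n_bars   = length(codes_definition) - length(subinstr(codes_definition, "|", "", .))
                      gen n_commas = length(codes_definition) - length(subinstr(codes_definition, ",", "", .))
                      gen byte irregular = n_commas != n_bars + 1

                      list varname has_break n_bars n_commas irregular if has_break | irregular, noobs
                      ```

                      Rows flagged here would be the first places to look before regenerating the do-file.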
