Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Labeling variables of large amount of stata files with one csv metadata file

    Hi friends
    I'm pretty new with Stata programming.
    I have a large amount of Stata files without variable labels (each file has different variables).
    I have also one csv file (as a metadata with more than 1000 rows) contains all variable names and variable labels in two columns (with heading MyVar and MyLabel).
    I need to load each Stata file from directory and loop over its variable names and if it matches with the variable name in csv, assign the label to it.
    Could you please help me about how to write .do file for this task?
    Last edited by Iman Faraji; 23 Feb 2022, 10:24. Reason: Stata programming

  • #2
    So the first step is to import the CXV file into Stata. The -import delmited- command will do that for you. Read -help import delimited- for the full details of the syntax. I would recommend that you save the resulting imported as a Stata data set and then work with that instead--that way if you make some mistakes along the way and have to backtrack, you won't need to reimport the metadata file over and over again. So let's say you've done that. Here's an example of what that file might look like:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 my_var str36 my_label
    "age"        "Age at date of survey"              
    "sex"        "Sex (M = 1, F = 2)"                  
    "n_children" "Number of children at date of survey"
    "income"     "Annual household income"            
    end
    So after you -use- this dataset, you loop over the files you want to apply it to, and within that you loop over the variables. The code below assumes that some of the files may not contain all of the variables in the metadata file. So it will skip over any non-existent variables without any warning. If, on the contrary, you expect every file to contain all these variables, post back and I will show you revised code that would verify that all of them are there and, if not, throw an error message to bring the problem to your attention.

    Code:
    use the_metadata_file, clear
    isid my_var
    
    local vbles
    forvalues i = 1/`=_N' {
        local v = my_var[`i']
        local lbl = my_label[`i']
        local `v'_lbl `lbl'
        local vbles `vbles' `v'
    }
    
    // INSERT CODE HERE TO DEFINE A LOCAL MACRO filenames CONTAINING A LIST OF ALL THE FILES
    //  IN WHICH THESE LABELS WILL BE APPLIED.  MIGHT BE A DIRECT LISTING, OR, GIVEN
    //  THAT THE NUMBER OF SUCH FILES IS LARGE, PROBABLY RELIES ON USING
    //  -local filenames: dir ...-
    
    foreach f of local filenames {
        use `"`f'"', clear
        foreach v of local vbles {
            capture confirm var `v', exact
            if c(rc) == 0 {
                label var `v' `"``v'_lbl'"'
            }
        }
        save `"`f'"', replace
    }
    Note that I haven't shown how to create the local macro containing the file names. That's because the best way to do it depends on how many files there are, to what extent their names follow simple paradigms that can be captured with wildcards, and how they are organized into directories. If you need help with creating that filename macro, post back with detailed information about the files.

    Note: This code is untested and may contain errors, but it gives the gist of the approach and should work, if not as is, then with modest modifications.
    Last edited by Clyde Schechter; 23 Feb 2022, 10:50.

    Comment


    • #3
      Thank you Clyde, the code was very useful and almost works, although using quotation in Stata is a bit tricky for me and I'd appreciate it if you could provide me with a brief applicable explanation about using quotation in macros. For example, in the foreach loop, when I played with quotation I got confusing result below:

      ...
      foreach v of local vbles {
      display `v' // shows the variable name
      display "`v'" // shows the variable name
      display `"`v'"' // shows the value of the first observation in variable column
      }

      I expanded the code by adding another column to my metadata excel file (mylabel_value) to define "label for all numerical values" (such as define label GENDER 1 "Female" 2 "Male" & label values var1 GENDER), the label values is assigned perfectly to all variables, but the problem is that all numerical values will be replaced by label value (all 1 in the variable column replace by Female and all 2 by Male). My last question is how to define non-numerical value label such as country variable which contain "BE" for Belgium, I don't want to replace in dataset.

      Comment


      • #4
        Originally posted by Clyde Schechter View Post
        So the first step is to import the CXV file into Stata. The -import delmited- command will do that for you. Read -help import delimited- for the full details of the syntax. I would recommend that you save the resulting imported as a Stata data set and then work with that instead--that way if you make some mistakes along the way and have to backtrack, you won't need to reimport the metadata file over and over again. So let's say you've done that. Here's an example of what that file might look like:
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str10 my_var str36 my_label
        "age" "Age at date of survey"
        "sex" "Sex (M = 1, F = 2)"
        "n_children" "Number of children at date of survey"
        "income" "Annual household income"
        end
        So after you -use- this dataset, you loop over the files you want to apply it to, and within that you loop over the variables. The code below assumes that some of the files may not contain all of the variables in the metadata file. So it will skip over any non-existent variables without any warning. If, on the contrary, you expect every file to contain all these variables, post back and I will show you revised code that would verify that all of them are there and, if not, throw an error message to bring the problem to your attention.

        Code:
        use the_metadata_file, clear
        isid my_var
        
        local vbles
        forvalues i = 1/`=_N' {
        local v = my_var[`i']
        local lbl = my_label[`i']
        local `v'_lbl `lbl'
        local vbles `vbles' `v'
        }
        
        // INSERT CODE HERE TO DEFINE A LOCAL MACRO filenames CONTAINING A LIST OF ALL THE FILES
        // IN WHICH THESE LABELS WILL BE APPLIED. MIGHT BE A DIRECT LISTING, OR, GIVEN
        // THAT THE NUMBER OF SUCH FILES IS LARGE, PROBABLY RELIES ON USING
        // -local filenames: dir ...-
        
        foreach f of local filenames {
        use `"`f'"', clear
        foreach v of local vbles {
        capture confirm var `v', exact
        if c(rc) == 0 {
        label var `v' `"``v'_lbl'"'
        }
        }
        save `"`f'"', replace
        }
        Note that I haven't shown how to create the local macro containing the file names. That's because the best way to do it depends on how many files there are, to what extent their names follow simple paradigms that can be captured with wildcards, and how they are organized into directories. If you need help with creating that filename macro, post back with detailed information about the files.

        Note: This code is untested and may contain errors, but it gives the gist of the approach and should work, if not as is, then with modest modifications.
        ----------------------------------------------------------------------------------------------

        Thank you Clyde, the code was very useful and almost works, although using quotation in Stata is a bit tricky for me and I'd appreciate it if you could provide me with a brief applicable explanation about using quotation in macros. For example, in the foreach loop, when I played with quotation I got confusing result below:

        ...
        foreach v of local vbles {
        display `v' // shows the variable name
        display "`v'" // shows the variable name
        display `"`v'"' // shows the value of the first observation in variable column
        }

        I expanded the code by adding another column to my metadata excel file (mylabel_value) to define "label for all numerical values" (such as define label GENDER 1 "Female" 2 "Male" & label values var1 GENDER), the label values is assigned perfectly to all variables, but the problem is that all numerical values will be replaced by label value (all 1 in the variable column replace by Female and all 2 by Male).

        My last question is how to define non-numerical value label such as country variable which contain "BE" for Belgium, I don't want to replace in dataset.

        Comment


        • #5
          foreach v of local vbles {
          display `v' // shows the variable name
          display "`v'" // shows the variable name
          display `"`v'"' // shows the value of the first observation in variable column
          }

          If that is true, then your Stata installation is somehow corrupted. I'm more inclined to believe you have misreported what happened. What should have happened, and I believe, really did happen is:

          -display "`v'"- and -display `"`v'"'- both display the name of the variable. By contrast, -display `v'- will display the value of the first observation for that variable. So let's unpack why it works this way.

          The first thing is to recognize that `"..."' and "..." are, in any context, more or less the same thing. The difference between them is that if the quoted material (what I have represented as ... here) itself contains quotation marks "..." will get it wrong, and `"..."' will get it right. The problem is that if you have "something1 "something2" something3", Stata has no way to know if the " before something2 is supposed to terminate the quoted material that begins with "something 1 , or if it is the start of a new, embedded quote. This arises because the same character, ", is used for beginning and ending quotes. The `"..."' notation overcomes this problem because `" is always the beginning of a quote, and "' is always the end. When the quoted material is a variable name, the problem of embedded quotes never arises because a variable name in Stata can never contain a " character. So "`v'" and `"`v'"' will always be the same. In general, it is always best in Stata to use `"..."' because it will never lead to difficulties, whereas the use of "..." can cause problems when ... contains ". Often with a local macro (not created by -foreach v of varlist...-) you can't be sure when you write the code whether it will contain a " character, so it is safest to always use `"..."'. The main advantage of "..." is that it is fewer characters to type, and, at least on a US keyboard, the ` character is difficult to reach from the home keys and, for me at least, is a cause of frequent typos. That said, about the only place I still do use "..." is when I know that ... is a list of variables, either because it is an iterator from -foreach v of varlist- or because I have created it as a list of variables in some other way (e.g. after -ds-, or with -unab-).

          Now what does `v' mean. `v' means the string that constitutes the contents of local macro v. In your particular situation, it is the name of the variable. Now, when Stata encounters the name of a variable, unadorned by quotes, in a -display- command, what does it do? Voila:
          Code:
          . sysuse auto, clear
          (1978 automobile data)
          
          . list price in 1
          
               +-------+
               | price |
               |-------|
            1. | 4,099 |
               +-------+
          
          . display price
          4099
          
          .
          In other words, when -display- encounters the name of a variable, with no quotes around it, it displays the value of that variable in the first observation. That is how -display- works whether the name of the variable comes from a local macro or not. That's just what -display- does with a variable name. To get -display- to show the actual name of the variable, you must surround that variable name in quotes:
          Code:
          . display "price"
          price
          
          . display `"price"'
          price
          I expanded the code by adding another column to my metadata excel file (mylabel_value) to define "label for all numerical values" (such as define label GENDER 1 "Female" 2 "Male" & label values var1 GENDER), the label values is assigned perfectly to all variables, but the problem is that all numerical values will be replaced by label value (all 1 in the variable column replace by Female and all 2 by Male).
          The numerical values are still there. It's just that when you -label values var1 GENDER- Stata understands that as meaning that you do not want to see those numerical values in output. The numerical values are still there. If, momentarily, you want to see them, you can add the -nolabel- option to your -browse- or -list- command and Stata will show them to you without the labels:
          Code:
          . display "price"
          price
          
          . display `"price"'
          price
          
          . des rep78
          
          Variable      Storage   Display    Value
              name         type    format    label      Variable label
          ------------------------------------------------------------------------------------------------------------------------------------------
          rep78           int     %8.0g                 Repair record 1978
          
          . des foreign
          
          Variable      Storage   Display    Value
              name         type    format    label      Variable label
          ------------------------------------------------------------------------------------------------------------------------------------------
          foreign         byte    %8.0g      origin     Car origin
          
          . list foreign in 1/5
          
               +----------+
               |  foreign |
               |----------|
            1. | Domestic |
            2. | Domestic |
            3. | Domestic |
            4. | Domestic |
            5. | Domestic |
               +----------+
          
          . list foreign in 1/5, nolabel
          
               +---------+
               | foreign |
               |---------|
            1. |       0 |
            2. |       0 |
            3. |       0 |
            4. |       0 |
            5. |       0 |
               +---------+
          Unlike many forms of popular software, Stata is not a "what you see is what you get" application. What you see in Stata is often quite different from what Stata is working with. Another good example of this is with Stata date variables. If you assign a correct datetime format to a Stata date variable, it will show you things that look like 13mar2022. But Stata has no 13mar2022 in its memory: it has the number 22717, which is Stata's code for the date 13 March 2022. So it is common in Stata for what you see to differ from what Stata has (and what you get when you do calculations with Stata). This is done specifically because what Stata shows you, and what you see, are easier for human eyes and brains to work with, than Stata's internal numerical representations of those same things. It is there for your convenience. If you don't like it, you never are forced to use display formats or value labels. But most people love being able to use them.

          My last question is how to define non-numerical value label such as country variable which contain "BE" for Belgium, I don't want to replace in dataset.
          If you create a value-labeled numerical variable for country by using the -encode- command, the original variable is not replaced. In fact -encode- doesn't even have a -replace- option*: -encode- requires you to specify the -gen()- option which causes a new variable to be created. So just decide on a name for the value-labeled-numeric variable and use that.

          *I should point out here that many of us actually prefer to eliminate the original string variable and have the value labeled numeric variable replace it. For that purpose, Daniel Klein's -encoder- program, available from SSC, does have a -replace- option. It's an enormous convenience that I use almost daily in my work. It also has an option to begin numbering from 0 instead of 1, an option I also use frequently. But I digress.

          Comment


          • #6
            Originally posted by Clyde Schechter View Post
            [/B]
            If that is true, then your Stata installation is somehow corrupted. I'm more inclined to believe you have misreported what happened. What should have happened, and I believe, really did happen is:

            -display "`v'"- and -display `"`v'"'- both display the name of the variable. By contrast, -display `v'- will display the value of the first observation for that variable. So let's unpack why it works this way.

            The first thing is to recognize that `"..."' and "..." are, in any context, more or less the same thing. The difference between them is that if the quoted material (what I have represented as ... here) itself contains quotation marks "..." will get it wrong, and `"..."' will get it right. The problem is that if you have "something1 "something2" something3", Stata has no way to know if the " before something2 is supposed to terminate the quoted material that begins with "something 1 , or if it is the start of a new, embedded quote. This arises because the same character, ", is used for beginning and ending quotes. The `"..."' notation overcomes this problem because `" is always the beginning of a quote, and "' is always the end. When the quoted material is a variable name, the problem of embedded quotes never arises because a variable name in Stata can never contain a " character. So "`v'" and `"`v'"' will always be the same. In general, it is always best in Stata to use `"..."' because it will never lead to difficulties, whereas the use of "..." can cause problems when ... contains ". Often with a local macro (not created by -foreach v of varlist...-) you can't be sure when you write the code whether it will contain a " character, so it is safest to always use `"..."'. The main advantage of "..." is that it is fewer characters to type, and, at least on a US keyboard, the ` character is difficult to reach from the home keys and, for me at least, is a cause of frequent typos. That said, about the only place I still do use "..." is when I know that ... is a list of variables, either because it is an iterator from -foreach v of varlist- or because I have created it as a list of variables in some other way (e.g. after -ds-, or with -unab-).

            Now what does `v' mean. `v' means the string that constitutes the contents of local macro v. In your particular situation, it is the name of the variable. Now, when Stata encounters the name of a variable, unadorned by quotes, in a -display- command, what does it do? Voila:
            Code:
            . sysuse auto, clear
            (1978 automobile data)
            
            . list price in 1
            
            +-------+
            | price |
            |-------|
            1. | 4,099 |
            +-------+
            
            . display price
            4099
            
            .
            In other words, when -display- encounters the name of a variable, with no quotes around it, it displays the value of that variable in the first observation. That is how -display- works whether the name of the variable comes from a local macro or not. That's just what -display- does with a variable name. To get -display- to show the actual name of the variable, you must surround that variable name in quotes:
            Code:
            . display "price"
            price
            
            . display `"price"'
            price

            The numerical values are still there. It's just that when you -label values var1 GENDER- Stata understands that as meaning that you do not want to see those numerical values in output. The numerical values are still there. If, momentarily, you want to see them, you can add the -nolabel- option to your -browse- or -list- command and Stata will show them to you without the labels:
            Code:
            . display "price"
            price
            
            . display `"price"'
            price
            
            . des rep78
            
            Variable Storage Display Value
            name type format label Variable label
            ------------------------------------------------------------------------------------------------------------------------------------------
            rep78 int %8.0g Repair record 1978
            
            . des foreign
            
            Variable Storage Display Value
            name type format label Variable label
            ------------------------------------------------------------------------------------------------------------------------------------------
            foreign byte %8.0g origin Car origin
            
            . list foreign in 1/5
            
            +----------+
            | foreign |
            |----------|
            1. | Domestic |
            2. | Domestic |
            3. | Domestic |
            4. | Domestic |
            5. | Domestic |
            +----------+
            
            . list foreign in 1/5, nolabel
            
            +---------+
            | foreign |
            |---------|
            1. | 0 |
            2. | 0 |
            3. | 0 |
            4. | 0 |
            5. | 0 |
            +---------+
            Unlike many forms of popular software, Stata is not a "what you see is what you get" application. What you see in Stata is often quite different from what Stata is working with. Another good example of this is with Stata date variables. If you assign a correct datetime format to a Stata date variable, it will show you things that look like 13mar2022. But Stata has no 13mar2022 in its memory: it has the number 22717, which is Stata's code for the date 13 March 2022. So it is common in Stata for what you see to differ from what Stata has (and what you get when you do calculations with Stata). This is done specifically because what Stata shows you, and what you see, are easier for human eyes and brains to work with, than Stata's internal numerical representations of those same things. It is there for your convenience. If you don't like it, you never are forced to use display formats or value labels. But most people love being able to use them.


            If you create a value-labeled numerical variable for country by using the -encode- command, the original variable is not replaced. In fact -encode- doesn't even have a -replace- option*: -encode- requires you to specify the -gen()- option which causes a new variable to be created. So just decide on a name for the value-labeled-numeric variable and use that.

            *I should point out here that many of us actually prefer to eliminate the original string variable and have the value labeled numeric variable replace it. For that purpose, Daniel Klein's -encoder- program, available from SSC, does have a -replace- option. It's an enormous convenience that I use almost daily in my work. It also has an option to begin numbering from 0 instead of 1, an option I also use frequently. But I digress.
            -------------------------------------------------------------------------------------------------------------------
            Thank you Clyde for complete explanation of quotation use in Stata. I checked again and you're right about difference of `v', "`v'" and `"`v'"'.
            I couldn't use Daniel Klein's -encoder- program but the -replace- option works for me.

            I have another problem, I want to go through some folder and subfolder to convert lots of .csv files to .dta. Let's suppose main folder names are A to F, and subfolders from 1 to 10. but the problem is some subfolders are not homogeneous, it means some of them doesn't contain some numbers, for example folder B doesn't have subfolder 1 or 7 and the code below terminated with error in that case.
            Could you please tell me how can I modify the code.

            PS: I try use cd `x'/`y', but it doesn't work!

            ---------------------------------
            local dir_list = "A B C D E F"
            local subdir_list = "1 2 3 4 5 6 7 8 9 10"
            foreach x of local dir_list {
            cd `"`x'"'
            foreach y of local subdir_list {
            cd `"`y'"'
            local filenames : dir . files "*.csv"
            foreach f of local filenames {
            import delimited "`f'", clear delim(",")
            local outfile = subinstr("`f'",".csv","",.)
            save "`outfile'", replace
            }
            cd ..
            }
            cd ..
            }
            ---------------------------------

            Comment


            • #7
              I think this will do it.
              Code:
              local dir_list = "A B C D E F"
              local subdir_list = "1 2 3 4 5 6 7 8 9 10"
              foreach x of local dir_list {
                  cd `"`x'"'
                  foreach y of local subdir_list {
                      capture cd `"`y'"'
                      if c(rc) == 0 {        
                          local filenames : dir . files "*.csv"
                          foreach f of local filenames {
                              import delimited "`f'", clear delim(",")
                              local outfile = subinstr("`f'",".csv","",.)
                              save "`outfile'", replace
                          }
                          cd ..
                      }
                      else if c(rc) == 170 {
                          continue
                      }
                      else {
                          display as error "Unexpected Error subdirectory `y'"
                          exit `c(rc)'
                      }
                  }
                  cd ..
              }
              Note: if the starting directory for this code (i.e. that one in which directories A B C D E and F) are located is such that all of the .csv files in it and in all of its subdirectories indefinitely far down the tree are to be processed, there is a different overall approach to this that can be used and is, in my opinion, simpler, relying on Robert Picard's -filelist- program.

              Comment


              • #8
                Originally posted by Clyde Schechter View Post
                I think this will do it.
                Code:
                local dir_list = "A B C D E F"
                local subdir_list = "1 2 3 4 5 6 7 8 9 10"
                foreach x of local dir_list {
                cd `"`x'"'
                foreach y of local subdir_list {
                capture cd `"`y'"'
                if c(rc) == 0 {
                local filenames : dir . files "*.csv"
                foreach f of local filenames {
                import delimited "`f'", clear delim(",")
                local outfile = subinstr("`f'",".csv","",.)
                save "`outfile'", replace
                }
                cd ..
                }
                else if c(rc) == 170 {
                continue
                }
                else {
                display as error "Unexpected Error subdirectory `y'"
                exit `c(rc)'
                }
                }
                cd ..
                }
                Note: if the starting directory for this code (i.e. that one in which directories A B C D E and F) are located is such that all of the .csv files in it and in all of its subdirectories indefinitely far down the tree are to be processed, there is a different overall approach to this that can be used and is, in my opinion, simpler, relying on Robert Picard's -filelist- program.
                --------------------------------------------------------------------

                Unfortunately it doesn't work. For example if the subfolder start with 2, I get an error r(170) "unable to change to Folder ... "

                Comment


                • #9
                  I do not see how you can get that error because the -cd- command for the subfolder is -capture-d. I tested the code on my own setup and it ran without error messages in a small directory structure similar to what you describe.

                  Please -set tracedepth 1- and -set trace on- and run again. If I can see the trace output I can figure out where that is coming from and advise.

                  Comment


                  • #10
                    Originally posted by Clyde Schechter View Post
                    I do not see how you can get that error because the -cd- command for the subfolder is -capture-d. I tested the code on my own setup and it ran without error messages in a small directory structure similar to what you describe.

                    Please -set tracedepth 1- and -set trace on- and run again. If I can see the trace output I can figure out where that is coming from and advise.
                    ---------------------------------------------------------

                    The error originated from folder without the last subfolder (in our case number 10), when I added an empty subfolder #10, code works.

                    Comment


                    • #11
                      OK, but I still don't see how you are getting that error. When there is no subfolder #10, the -cd 10- command is captured, and so the error message should be suppressed and the program continues to -else if c(rc) == 170-, from where the code should move on to the next folder in dir_list. If the expedient of adding an empty subfolder #10 works and you can live with that, then so can I. But I am truly perplexed as to how this is happening.

                      Comment

                      Working...
                      X