Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating new variable based on some observations of another variable


    Hi,

    Below is a piece of my dataset which shows the export value of different countries at each industry (Naics codes). The "country" variable has the name of countries but it is also grouped based on the industry classification (Naics codes). How can I make a new variable (naics), which equals to the 4-digit Naics codes shown in some observations of the "country" variable (like observations number 1 and 10 and 39) for a large dataset?
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str104 country double value
    "NAICS 3111 - Animal Food Manufacturing"                              .
    "United States"                                               684661842
    "Russia"                                                       37939608
    "Japan"                                                        37479174
    "Mexico"                                                       28458726
    "Hong Kong"                                                    20846344
    "NAICS 3112 - Grain and Oilseed Milling"                              .
    "United States"                                              5277482929
    "China"                                                       826853203
    "Mexico"                                                      154639652
    "Japan"                                                       142519122
    "Korea, South"                                                141465886
    "Malaysia"                                                     62299783
    "Hong Kong"                                                    48161451
    "Cameroon"                                                         8132
    "Brunei Darussalam"                                                6249
    "Congo (formerly Brazzaville)"                                     6049
    "NAICS 3113 - Sugar and Confectionery Product Manufacturing"          .
    "United States"                                              2878446114
    "Mexico"                                                       41351707
    "Australia"                                                    17047768
    "Korea, South"                                                 16787372
    "Armenia"                                                         36811
    end
    I want to make something like:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str104 country double value float naics
    "NAICS 3111 - Animal Food Manufacturing"                              . 3111
    "United States"                                               684661842 3111
    "Russia"                                                       37939608 3111
    "Japan"                                                        37479174 3111
    "Mexico"                                                       28458726 3111
    "Hong Kong"                                                    20846344 3111
    "NAICS 3112 - Grain and Oilseed Milling"                              . 3112
    "United States"                                              5277482929 3112
    "China"                                                       826853203 3112
    "Mexico"                                                      154639652 3112
    "Japan"                                                       142519122 3112
    "Korea, South"                                                141465886 3112
    "Malaysia"                                                     62299783 3112
    "Hong Kong"                                                    48161451 3112
    "Cameroon"                                                         8132 3112
    "Brunei Darussalam"                                                6249 3112
    "Congo (formerly Brazzaville)"                                     6049 3112
    "NAICS 3113 - Sugar and Confectionery Product Manufacturing"          . 3113
    "United States"                                              2878446114 3113
    "Mexico"                                                       41351707 3113
    "Australia"                                                    17047768 3113
    "Korea, South"                                                 16787372 3113
    "Armenia"                                                         36811 3113
    end
    Thanks,

    Hossein
    Last edited by Hossein Jebeli; 23 Jul 2017, 17:40.

  • #2
    If all of the NAICS codes are exactly 4 digits long, and if all of those lines begin with "NAICS" followed by a single space, followed by the code, then this will work:

    Code:
    gen int naics_code = real(substr(country, 7, 4)) ///
        if strpos(country, "NAICS")
    replace naics_code = naics_code[_n-1] if missing(naics_code)
    If the content of those lines varies from that rigid format, then it is more complicated and we would need to see more examples of those lines to see just what the range of variation is.

    Note tat you should probably -drop- all of those observations once you have done this, as leaving them in the data set may cause later errors in calculations if they are erroneously included.

    Comment


    • #3
      Try something like
      Code:
      generate int naics = real(substr(country, 7, 4)) if strpos(country, "NAICS") == 1
      replace naics = naics[_n-1] if mi(naics)

      Comment

      Working...
      X