Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create new variable based on text within a substring

    I want to be able to search for a string within an observation and then write a new file based on the value of the string. I have a list of school names in a flat file. Each school has a type in its name (i.e., elementary, primary, charter, middle, junior, high, secondary, etc.) but the location of the type varies for each observation varies as does the length of string. For example, if I have a variable called "School Name" and the first observation of that variable is the text "Johnson Elementary School", I want to create a second variable called "School Type" whose first observation is the text "Elementary".








  • #2
    Hello Kevin,

    Try the regexm() and regexs() functions.

    Code:
    help f_regexm

    Comment


    • #3
      Well, if the words "elementary, primary, ..." etc actually occur in the school name variable, it is easy enough to capture that information, even if it isn't always in the same location in the string.

      Code:
      gen school_type = ""
      foreach x in elementary primary charter middle junior high secondary {
          replace school_type = "`x'" if strpos(lower(school_name)), "`x'")
      }
      Notes:
      1. I think it is safer to seek the keyword in a lower case transform of the school name, so that we don't have to deal with possible inconsistencies in capitalization in the original data.
      2. The variable school_type created by this code is a string variable. Depending on what you are going to do with the information, you may find it more convenient to work with a value-labeled numeric variable. See -help encode-.

      Comment on posting technique: you don't have a variable called "School Name" -- you can't possibly, because that is not a legal variable name in Stata. Embedded blanks are not permitted. When describing Stata data as part of a question, it is best to be exact and use correct terminology.

      Added: Crossed with Roger Chu's response. While this can be done using the functions he suggests, I think that is using a sledge hammer to swat a fly. The regular expression functions are fairly complicated to use properly. Sometimes they are indispensible, but in this case a far simpler solution is available.
      Last edited by Clyde Schechter; 13 Jan 2017, 13:32.

      Comment


      • #4
        not entirely clear (e.g., I assume you mean you have a string variable that contains these descriptors) but I think that the following will do what you want:
        Code:
        gen stelem=strpos(School_Name,"Elementary")
        gen sthigh=strpos(School_Name,"High")
        etc.
        note that I have changed the illegitimate variable name "School Name" (spaces are not allowed) to something legitimate
        this will give you variables with the number where the string starts (if it doesn't exist then the new variable (stelem) will have a value of 0
        you can then do, e.g.,
        Code:
        gen str School_Type="Elementary" if stelem>0
        replace School_Type="High School" if sthigh>0
        etc.

        added: crossed with above replies from Roger and Clyde (and Clyde's is fuller)

        Comment


        • #5
          Clyde Schechter Rich Goldstein Roger Chu Joao Santos Silva Nick Cox
          Code:
           Example generated by -dataex-. To install: ssc install dataex
          clear
          input int year str9 iso3_d str253 ProductSector str33 iso3_o double tradeflow_wto_d
          
          2015 "IND" " Product: 2603 Copper  concentrates"                "EU"    2 0
          2015 "IND" " Product: 2603 Copper  concentrates"                "EU"    10
          2015 "IND" " Product: 2603 Copper ores "                         "EU"    37
          2015 "IND" " Product: 2603 Copper ores "                         "EU"    10
          2015 "IND" " Product: 2603 Copper ores "                         "EU"    10
          2015 "IND" " Product: 2603 Copper ores and concentrates"   AUS"    27
          2015 "IND" " Product: 2603 Copper ores and concentrates"   AUS     10
          2015 "IND" " Product: 2603 Copper ores and concentrates"   AUS     10
          2015 "IND" " Product: 2604 Nickel ores and concentrates"    "EU"     0
          2016 "IND" " Product: 2604 Nickel ores and concentrates"    EU"    11
          2017 "IND" " Product: 2604 Nickel ores and concentrates"    EU"     0
          2018 "IND" " Product: 2604 Nickel ores and concentrates"    EU"     2
          2019 "IND" " Product: 2604 Nickel ores and concentrates"    EU"     2
          2020 "IND" " Product: 2604 Nickel ores and concentrates"    "AUS"     0
          2021 "IND" " Product: 2604 Nickel ores and concentrates"    AUS    0
          2022 "IND" " Product: 2604 Nickel ores and concentrates"     AUS     0
          2015 "IND" " Product: 2605 Cobalt ores and concentrates"   EU"    58
          2016 "IND" " Product: 2605 Cobalt ores and concentrates"   EU"     0
          2017 "IND" " Product: 2605 Cobalt ores and concentrates"   EU"     0
          2018 "IND" " Product: 2605 Cobalt ores and concentrates"   AUS     3
          2019 "IND" " Product: 2605 Cobalt ores and concentrates"   AUS    60
          2020 "IND" " Product: 2605 Cobalt ores and concentrates"   AUS  123
          2021 "IND" " Product: 2605 Cobalt ores and concentrates"   AUS    73
          this my data example
          i need two new variables namely minerals (string variable) and trade_minerals (numerical variable) whose value should be the sum of variable tradeflow_imf_d on the basis of string variable ProductSector. i want a data like this

          Code:
          year iso3_d  iso3_o minerals trade_minerals
          2015 "IND"  "EU"      Copper    57
          2015  "IND"  "EU"      Nickel      47

          my dataset ranges from 2015 to 2022 .
          in other words i need to aggregate trade value according to mineral type on year and country basis .

          please help. my paper is stuck because of this problem.
          Last edited by Dr. Iqra Yaseen; 31 Oct 2023, 00:51.

          Comment


          • #6
            Well, based on your description of what you want it sounds like you would get it with:
            Code:
            collapse (sum) tradeflow_wto_d, by(iso3_d iso3_o year ProductSector)
            But, I do not see how you arrived at a total of 47 for 2015 "IND" "EU" Nickel. The only observation I find that matches that year, origin, destination, and product has 0 tradeflow. Did you make a mistake, or am I misunderstanding what you are trying to do?

            Comment


            • #7
              Clyde Schechter yes i have made a mistake here. but we have already discussed this query at https://www.statalist.org/forums/for...03#post1732203

              thank you so much

              Comment

              Working...
              X