  • Renaming duplicate observations

    Hello Statalist users,

    I have found a lot of information on the duplicates command; however, most of it concerns dropping duplicate observations or renaming variables. Instead, I want to rename my duplicate observations. The data I received has categories (e.g. crops, livestock, etc.) with multiple subcategories. Each category and subcategory is imported as an observation, and I will eventually be transposing this data. However, some subcategories carry the same label under different categories. I need to relabel these subcategory observations so that each name is unique before I transpose.

    Below is a simplified version of my data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str23(statename B)
    "Crop"        "1"
    "consumption" "2"
    "adjustment"  "3"
    "Animal"      "4"
    "consumption" "5"
    "adjustment"  "6"
    end
    I have found a quick fix; however, this is not dynamic.
    Code:
    sort statename
    quietly by statename: gen dup=cond(_N==1,0,_n)
    tabulate dup
    
    
    replace statename = "consumption_crop" if dup>1
    sort statename
    
    quietly by statename: replace dup=cond(_N==1,0,_n)
    tabulate dup
    replace statename = "adjustment_crop" if dup>1
    
    sort B
    Again, this works for this situation, but I want the code to be dynamic so that it handles whatever names happen to be duplicated, not just the current ones.

    I tried to adapt the code from the discussion about renaming variables, as it seemed relevant. However, I could not work out how to edit Daniel Klein's suggested code, posted there:

    Code:
    // Create data set 
    clear
    input str23 A str23 B // note str23
    "This is my desired name" "This is my desired name"
    "9098" "8676878" 
    end 
    
    // rename 
    foreach var of var A-B { 
        loc original_text : di `var'[1] 
        loc newname = strtoname(`"`original_text'"') 
        loc newname : permname `newname' 
        ren `var' `newname'
        char `newname'[original_text] `"`original_text'"'
    } 
    
    d ,f 
    l 
    char l
    I recognize that permname would help in this process, but I am not sure how to use it to rename a duplicated observation instead of a variable.

    I appreciate any and all information regarding this topic.

    Regards,
    Amie

  • #2
    I think the following will create your unique variable names:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str23(statename B)
    "Crop"        "1"
    "consumption" "2"
    "adjustment"  "3"
    "Animal"      "4"
    "consumption" "5"
    "adjustment"  "6"
    end
    
    * rename the variable, these are not state names
    rename statename category
    
    * you need an observation identifier
    gen long obs = _n
    
    * tag observations that are unique, we assume there are main categories
    bysort category: gen tag = _N == 1
    
    * restore the sort order
    sort obs
    
    * use a running sum to create a main category identifier
    gen mainid = sum(tag) 
    
    * create a unique name
    bysort mainid (obs): gen newname = category[1]
    by mainid: replace newname = newname + " " +  category if _n > 1
    
    * make the name a valid Stata name that can be used as a variable name
    gen goodname = strtoname(newname, 20)
    
    list, sepby(mainid)
    Note that if this is a continuation of yesterday's thread, I suspect that you would be better off making a master list of all possible variable labels across all your files and deciding manually on an appropriate and unique name for each (i.e. create a dataset that contains the unique labels and the variable name you want to use, and then use merge to add the new variable names to your original data).
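
    A minimal sketch of that master-list idea, under stated assumptions: the master list here is typed in by hand, and the variable wanted and its short names (crop_cons and so on) are hypothetical placeholders, not names from the original data. The newname variable is rebuilt the same way as in the code above so that the example runs on its own.

    Code:
    * hypothetical master list: one row per label, with a hand-picked name
    clear
    input str40 newname str20 wanted
    "Crop"               "crop"
    "Crop consumption"   "crop_cons"
    "Crop adjustment"    "crop_adj"
    "Animal"             "animal"
    "Animal consumption" "animal_cons"
    "Animal adjustment"  "animal_adj"
    end
    
    * each label may appear only once in the master list
    isid newname
    tempfile master
    save `master'
    
    * rebuild the example data and newname as in the code above
    clear
    input str23(category B)
    "Crop"        "1"
    "consumption" "2"
    "adjustment"  "3"
    "Animal"      "4"
    "consumption" "5"
    "adjustment"  "6"
    end
    gen long obs = _n
    bysort category: gen tag = _N == 1
    sort obs
    gen mainid = sum(tag)
    bysort mainid (obs): gen newname = category[1]
    by mainid: replace newname = newname + " " + category if _n > 1
    
    * add the hand-picked names; keep labels that have no match
    merge m:1 newname using `master', keep(master match)
    sort obs
    
    * labels that are not yet in the master list
    list obs newname if _merge == 1
    
    * stops with an error if a chosen name is missing or repeated
    isid wanted

    Any label that later shows up in the data but not in the master list is listed with _merge == 1, and the final isid then stops with an error, so the list can be extended by hand before going further.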

    • #3
      Hi Robert,

      Thank you for the suggestion of a master list with all variable labels. I originally tried to do this; however, I am worried that future versions of the dataset will add new names or edit the current category labels. Therefore, I think the code above, added to the code you provided in yesterday's thread, will give me a dynamic approach that accounts for any duplicate names that may appear in the future.

      I did not think about replacing based on a main identifier! Thank you for the suggestion; I was able to modify it to my data.

      Best,
      Amie
