Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing unnecessary spaces from string values

    Hello the Statalist Community,

    I have a string variables which contains spaces in some of the values as Prefixes and suffixes as shown below
    string_var
    " Kenya"
    "Ireland "
    " South Africa"
    " India "
    "Burkina Faso"
    "Mexico"


    How can I remove those unnecessary backspaces from this string variable? Please note that I want to remove only blank spaces in the prefix and suffix of the string values (not those in the middle)

    Best, Anagaw

  • #2
    Code:
    help strtrim()
    If you did want to remove all spaces including those in the middle, then
    Code:
    replace my_string_variable = subinstr(my_string_variable, " ", "", .)

    Comment


    • #3
      Joseph's advice is spot on. I just want to flag that leading and trailing spaces is also common jargon here.

      And that stritrim() can be useful. That standardize internal spaces to one only. This example is blatant:

      Code:
      . di stritrim("South        Africa")
      South Africa
      but it's quite common in messy text that people mean to type one space, but type two. Or even that data was entered by different people with different habits or conventions.

      Comment


      • #4
        Dear Joseph and Nick, thanks indeed for the useful advises!!!
        The problem is solved based on your suggestions

        Comment


        • #5
          Hi Nick, What is the full syntax for the di stritrim command please? Is it di stritrim (variable)?

          Comment


          • #6
            Code:
            h stritrim()

            Comment


            • #7
              display with a variable will
              only show results for the first observation — unless you specify otherwise. With a variable use
              generate or replace and then list or edit.

              Comment


              • #8
                Thank you very much for your kind help.

                Comment


                • #9
                  Hello, I have question that is somewhat similar to that of Anagaw. I am trying to merge two datasets using the household identifier (hhid) as the variable. I was able to merge most of the data. There were some unwanted spaces in the unmerged data. So, i used the excellent advice given by Joseph, above, to deal with that problem. I am still left with about 40 odd observations that don't match. Eyeballing the hhid for these unmatched observations, i cannot find any difference between them. And yet, i cannot seem to get these observations to merge. i am using the following code. Would you be so kind as to help me out here; see if there is anything that i might be overlooking? Thank you in advance.

                  *to remove unwanted gaps in hhid
                  replace hhid = subinstr(hhid, " ", "", .)

                  *collapsed the master data by hhid
                  collapse (sum) agri_prod land_poss, by(hhid)

                  *generated hhid_new so that i could compare the unmatched observations post-merging. hhid and hhid_new are the exact same variables.
                  gen hhid_new = hhid

                  *merged two datasets using hhid as the key identifier.
                  merge 1:1 hhid using "C:\Users\kandikus\OneDrive\East West Center\Data\SituationAssessmentofAgriHHs\Climate Data\CRU_APRIL32023\Apr182023\HHID List\NSSO_HHID_2019.dta", generate(_merge_rough)
                  Click image for larger version

Name:	Screenshot (5).png
Views:	1
Size:	221.9 KB
ID:	1711623

                  Last edited by Sandeep Kandikuppa; 28 Apr 2023, 12:55.

                  Comment


                  • #10
                    In the future, please post text rather than an image (the latter is deprecated in the StataList FAQ).

                    Some thoughts:

                    1) Did you remove spaces from the hhid variable in the "using file" named in the merge? I'd presume you did, but you didn't say anything about that, and that certainly could produce the observed difficulty.

                    2) Showing us the frequency distribution for the _merge_rough variable would be helpful, as it would show what kind of "observations that don't match" is happening.

                    3) One (unlikely) possibility is that you have some non-visible characters embedded in hhid. In your particular situation, in which it appears that only numerals should appear in the hhid string, you could try this to detect the location of any such problems:
                    Code:
                    gen hhidlen = strlen(hhid)
                    summ hhidlen
                    gen byte bad_locations = ""
                    forval i = 1/`r(max)' {
                       replace bad_locations = bad_locations + "`i' " if  (`i' <= hhidlen) &
                       !inrange(substr(hhidnew, `i',1), "0", "9")
                    }
                    You might even find such problems with something as simple as scanning the data for values of hhidlen that are shorter/longer than expected.

                    Comment


                    • #11
                      Mistake: I accidentally left off the line continuation "///" in the preceding code, it should be:
                      Code:
                      forval i = 1/`r(max)' {
                         replace bad_locations = bad_locations + "`i' " if (`i' <= hhidlen) & ///
                            !inrange(substr(hhidnew, `i',1), "0", "9")
                      }  

                      Comment

                      Working...
                      X