Removing unnecessary spaces from string values

Anagaw Derseh

Join Date: Oct 2016

Posts: 56
#1

Removing unnecessary spaces from string values

11 Jul 2018, 23:10

Hello the Statalist Community,

I have a string variables which contains spaces in some of the values as Prefixes and suffixes as shown below
string_var
" Kenya"
"Ireland "
" South Africa"
" India "
"Burkina Faso"
"Mexico"

How can I remove those unnecessary backspaces from this string variable? Please note that I want to remove only blank spaces in the prefix and suffix of the string values (not those in the middle)

Best, Anagaw
Tags: None
Joseph Coveney

Join Date: Apr 2014

Posts: 4368
#2

11 Jul 2018, 23:30

Code:

help strtrim()

If you did want to remove all spaces including those in the middle, then

Code:

replace my_string_variable = subinstr(my_string_variable, " ", "", .)
4 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35327
#3

11 Jul 2018, 23:38

Joseph's advice is spot on. I just want to flag that leading and trailing spaces is also common jargon here.

And that stritrim() can be useful. That standardize internal spaces to one only. This example is blatant:

Code:

. di stritrim("South Africa") South Africa

but it's quite common in messy text that people mean to type one space, but type two. Or even that data was entered by different people with different habits or conventions.
2 likes
Comment
Anagaw Derseh

Join Date: Oct 2016

Posts: 56
#4

12 Jul 2018, 01:37

Dear Joseph and Nick, thanks indeed for the useful advises!!!
The problem is solved based on your suggestions
Comment
Morris Muzyamba

Join Date: Jul 2021

Posts: 4
#5

10 Oct 2022, 05:19

Hi Nick, What is the full syntax for the di stritrim command please? Is it di stritrim (variable)?
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4425
#6

10 Oct 2022, 05:59

Code:

h stritrim()
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35327
#7

10 Oct 2022, 06:16

display with a variable will
only show results for the first observation — unless you specify otherwise. With a variable use generate or replace and then list or edit.
Comment
Morris Muzyamba

Join Date: Jul 2021

Posts: 4
#8

10 Oct 2022, 07:22

Thank you very much for your kind help.
Comment
Sandeep Kandikuppa

Join Date: Nov 2017

Posts: 6
#9

28 Apr 2023, 11:53

Hello, I have question that is somewhat similar to that of Anagaw. I am trying to merge two datasets using the household identifier (hhid) as the variable. I was able to merge most of the data. There were some unwanted spaces in the unmerged data. So, i used the excellent advice given by Joseph, above, to deal with that problem. I am still left with about 40 odd observations that don't match. Eyeballing the hhid for these unmatched observations, i cannot find any difference between them. And yet, i cannot seem to get these observations to merge. i am using the following code. Would you be so kind as to help me out here; see if there is anything that i might be overlooking? Thank you in advance.

*to remove unwanted gaps in hhid
replace hhid = subinstr(hhid, " ", "", .)

*collapsed the master data by hhid
collapse (sum) agri_prod land_poss, by(hhid)

*generated hhid_new so that i could compare the unmatched observations post-merging. hhid and hhid_new are the exact same variables.
gen hhid_new = hhid

*merged two datasets using hhid as the key identifier.
merge 1:1 hhid using "C:\Users\kandikus\OneDrive\East West Center\Data\SituationAssessmentofAgriHHs\Climate Data\CRU_APRIL32023\Apr182023\HHID List\NSSO_HHID_2019.dta", generate(_merge_rough)

Last edited by Sandeep Kandikuppa; 28 Apr 2023, 11:55.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2400
#10

28 Apr 2023, 19:44

In the future, please post text rather than an image (the latter is deprecated in the StataList FAQ).

Some thoughts:

1) Did you remove spaces from the hhid variable in the "using file" named in the merge? I'd presume you did, but you didn't say anything about that, and that certainly could produce the observed difficulty.

2) Showing us the frequency distribution for the _merge_rough variable would be helpful, as it would show what kind of "observations that don't match" is happening.

3) One (unlikely) possibility is that you have some non-visible characters embedded in hhid. In your particular situation, in which it appears that only numerals should appear in the hhid string, you could try this to detect the location of any such problems:

Code:

gen hhidlen = strlen(hhid) summ hhidlen gen byte bad_locations = "" forval i = 1/`r(max)' { replace bad_locations = bad_locations + "`i' " if (`i' <= hhidlen) & !inrange(substr(hhidnew, `i',1), "0", "9") }

You might even find such problems with something as simple as scanning the data for values of hhidlen that are shorter/longer than expected.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2400
#11

29 Apr 2023, 10:29

Mistake: I accidentally left off the line continuation "///" in the preceding code, it should be:

Code:

forval i = 1/`r(max)' { replace bad_locations = bad_locations + "`i' " if (`i' <= hhidlen) & /// !inrange(substr(hhidnew, `i',1), "0", "9") }
Comment

Announcement

Removing unnecessary spaces from string values

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment