The summary of this post is that (a) replace, nopromote functions differently on string variables than on numeric variables, (b) with multibyte unicode characters, replace, nopromote functions poorly on string variables, and (c) because replace, nopromote is infrequently used in Stata code, the choice to have the st_store() and especially st_sstore() Mata functions not trigger promotion in Stata (and not have any means of triggering it) leads to unexpected results.
I think this behavior is at a minimum not well documented, and arguably a bug rather than a feature, especially with respect to the handling of strings.
Here's some sample code and results; the discussion follows.
Code:
clear
set obs 5
* three copies each of a byte, a str1, and a str1 destined to receive Unicode
generate byte b1 = 1
generate byte b2 = 1
generate byte b3 = 1
generate str1 s1 = "-"
generate str1 s2 = "-"
generate str1 s3 = "-"
generate str1 u1 = "-"
generate str1 u2 = "-"
generate str1 u3 = "-"
describe b* s* u*
* series 1: plain replace
replace b1 = 666
replace s1 = "abc"
display "unicode character " ustrunescape("\u2022") " = 3 bytes " tobytes(ustrunescape("\u2022"),1)
replace u1 = ustrunescape("\u2022")
* series 2: replace, nopromote
replace b2 = 666, nopromote
replace s2 = "abc", nopromote
replace u2 = ustrunescape("\u2022"), nopromote
* series 3: st_store() and st_sstore() from Mata
mata: bb = J(5,1,666)
mata: st_store(.,"b3",bb)
mata: ss = J(5,1,"abc")
mata: st_sstore(.,"s3",ss)
mata: uu = J(5,1,ustrunescape("\u2022"))
mata: st_sstore(.,"u3",uu)
* series 4: getmata, replace into existing narrow variables
generate byte bb = 1
generate str1 ss = "-"
generate str1 uu = "-"
getmata bb, replace
getmata ss, replace
getmata uu, replace
describe b* s* u*
list b* s* u*, clean noobs
Code:
. describe b* s* u*

              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------------
b1              byte    %8.0g
b2              byte    %8.0g
b3              byte    %8.0g
s1              str1    %9s
s2              str1    %9s
s3              str1    %9s
u1              str1    %9s
u2              str1    %9s
u3              str1    %9s

. replace b1 = 666
variable b1 was byte now int
(5 real changes made)

. replace s1 = "abc"
variable s1 was str1 now str3
(5 real changes made)

. display "unicode character " ustrunescape("\u2022") " = 3 bytes " tobytes(ustrunescape("\u2022"),1)
unicode character • = 3 bytes \xe2\x80\xa2

. replace u1 = ustrunescape("\u2022")
variable u1 was str1 now str3
(5 real changes made)

. replace b2 = 666, nopromote
(5 real changes made, 5 to missing)
(5 values changed to missing because of storage type)

. replace s2 = "abc", nopromote
(5 real changes made)
(5 values truncated because of storage type)

. replace u2 = ustrunescape("\u2022"), nopromote
(5 real changes made)
(5 values truncated because of storage type)

. mata: bb = J(5,1,666)

. mata: st_store(.,"b3",bb)

. mata: ss = J(5,1,"abc")

. mata: st_sstore(.,"s3",ss)

. mata: uu = J(5,1,ustrunescape("\u2022"))

. mata: st_sstore(.,"u3",uu)

. generate byte bb = 1

. generate str1 ss = "-"

. generate str1 uu = "-"

. getmata bb, replace

. getmata ss, replace

. getmata uu, replace

. describe b* s* u*

              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------------
b1              int     %8.0g
b2              byte    %8.0g
b3              byte    %8.0g
bb              int     %8.0g
s1              str3    %9s
s2              str1    %9s
s3              str1    %9s
ss              str1    %9s
u1              str3    %9s
u2              str1    %9s
u3              str1    %9s
uu              str1    %9s

. list b* s* u*, clean noobs

     b1   b2   b3    bb    s1   s2   s3   ss   u1   u2   u3   uu
    666    .    .   666   abc    a    a    a    •    �    �    �
    666    .    .   666   abc    a    a    a    •    �    �    �
    666    .    .   666   abc    a    a    a    •    �    �    �
    666    .    .   666   abc    a    a    a    •    �    �    �
    666    .    .   666   abc    a    a    a    •    �    �    �
From the "b" series, we see that replace and getmata, replace promote the byte to an int to accommodate the new value, while replace, nopromote and st_store() store a missing value instead.
From the "s" series, we see that only replace promotes the str1 to str3 to accommodate the longer value. replace, nopromote, st_sstore(), and getmata, replace all store just as much of the replacement value as will fit, rather than storing a missing value to signal failure, as was the case for numeric data. replace, nopromote at least reports the truncation; st_sstore() and getmata, replace truncate without any message.
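Since st_sstore() gives no message, one way to notice the truncation is to read the stored values back and compare them with the source vector. This is only a minimal sketch that rebuilds the s3 case from the sample code; it uses st_sdata(), the string counterpart of st_data(), and given the listing above I would expect the count it displays to equal the number of observations.

```stata
clear
set obs 5
generate str1 s3 = "-"
mata: ss = J(5,1,"abc")
mata: st_sstore(.,"s3",ss)
* read the stored values back and count those that differ from the source
mata: sum(st_sdata(.,"s3") :!= ss)
```

A nonzero count means at least one value did not survive the store intact.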
From the "u" series, we again see that only replace promotes the str1 to str3 to accommodate the three-byte Unicode character. replace, nopromote, st_sstore(), and getmata, replace again store just what will fit, which in this case is only the first byte of the three-byte character, and that lone byte is not a valid Unicode character.
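The underlying issue is that strN widths are counted in bytes, not characters. As a hedged illustration, strlen() reports the byte length (3 for this character) and ustrlen() the character length (1), and ustrinvalidcnt() can be used to flag observations left holding invalid UTF-8 after a truncating store. The sketch below rebuilds the u3 case from the sample code.

```stata
clear
set obs 5
generate str1 u3 = "-"
mata: uu = J(5,1,ustrunescape("\u2022"))
mata: st_sstore(.,"u3",uu)
* length in bytes versus length in Unicode characters
display strlen(ustrunescape("\u2022"))
display ustrlen(ustrunescape("\u2022"))
* observations whose stored value contains an invalid UTF-8 sequence
count if ustrinvalidcnt(u3) > 0
```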
I will add that in code not demonstrated here, getmata (with no replace option) with character data functioned as you would expect it to, creating ss and uu as str3 variables.
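For reference, that case can be sketched as follows. This is my reconstruction rather than the exact code I ran; the point is that the variables ss and uu do not exist beforehand, so getmata creates them itself.

```stata
clear
set obs 5
mata: ss = J(5,1,"abc")
mata: uu = J(5,1,ustrunescape("\u2022"))
* no replace option: getmata creates ss and uu rather than filling existing variables
getmata ss uu
describe ss uu
list ss uu, clean noobs
```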
IMHO, ideally st_store() and st_sstore() would trigger promotion of the receiving variable. In any event, commands that do not trigger promotion should, for string variables, replace values that will not fit with missing values, as is done for numeric variables. Also, ideally st_store(), st_sstore(), and getmata, replace should provide diagnostics similar to those from replace.
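Until something like that exists, the promotion can be done by hand before storing. The following is a minimal sketch, not an official facility: sstore_promote() is a hypothetical name of my own, it assumes a string variable and a vector covering every observation, and it measures widths in bytes with strlen() because that is how strN types are sized.

```stata
clear
set obs 5
generate str1 s3 = "-"

mata:
// Hypothetical helper, not part of Mata: widen a string variable if
// needed, then store the vector into all observations.
void sstore_promote(string scalar varname, string colvector s)
{
    real scalar need, have
    string scalar type

    need = max(strlen(s))            // longest value, in bytes
    type = st_vartype(varname)       // for example "str1" or "strL"
    if (type != "strL") {
        have = strtoreal(substr(type, 4, .))
        if (need > have) {
            // str2045 is the widest strN type; beyond that use strL
            stata("recast " + (need <= 2045 ? "str" + strofreal(need) : "strL") + " " + varname)
        }
    }
    st_sstore(., varname, s)
}

sstore_promote("s3", J(5, 1, "abc"))
end

describe s3
list s3, clean noobs
```

The same idea applies to the Unicode case, since strlen() already counts the three bytes of the character. For numeric variables a plain recast to a wider type before calling st_store() serves the same purpose.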
This post was precipitated by the following topic on the Mata forum.
https://www.statalist.org/forums/for...sing-st_sstore