The thread https://www.statalist.org/forums/for...-mata-function meandered from its title, which was natural enough, but makes starting a new thread sensible for what I am going to post.
The thread latterly focused on the Myers index for measuring age heaping, i.e. the tendency of people to report inaccurate ages that err in the last digit.
Chen Samulsion unearthed a nice-looking program by German Rodriguez that is more serious than my own token effort in that thread.
However, I did a little more work and changed the focus and direction of my code to be about last digits of integer numeric variables. So, possible last digits with nothing else said are 0 1 ... 8 9.
I calculate what I know as the dissimilarity index 1/2 SUM | p_j - 0.1 | where p_j is the proportion reporting digit j and 0.1 is the expectation if frequencies were equal. I now leave on one side whether this is, or is a relative of, the Myers index.
I also add a standard chi-square test.
At this moment I am not sure that I want to take this much further and can't even summon the inclination and energy to write a help file.
The thread latterly focused on the Myers index for measuring age heaping, i.e. the tendency of people to report inaccurate ages that err in the last digit.
Chen Samulsion unearthed a nice-looking program by German Rodriguez that is more serious than my own token effort in that thread.
However, I did a little more work and changed the focus and direction of my code to be about last digits of integer numeric variables. So, possible last digits with nothing else said are 0 1 ... 8 9.
I calculate what I know as the dissimilarity index 1/2 SUM | p_j - 0.1 | where p_j is the proportion reporting digit j and 0.1 is the expectation if frequencies were equal. I now leave on one side whether this is, or is a relative of, the Myers index.
I also add a standard chi-square test.
At this moment I am not sure that I want to take this much further and can't even summon the inclination and energy to write a help file.
Code:
. sysuse auto, clear (1978 automobile data) . lastdigit mpg +---------------------------+ | digit counts percents | |---------------------------| | 0 5 6.8 | | 1 7 9.5 | | 2 7 9.5 | | 3 3 4.1 | | 4 11 14.9 | |---------------------------| | 5 9 12.2 | | 6 7 9.5 | | 7 4 5.4 | | 8 12 16.2 | | 9 9 12.2 | +---------------------------+ Dissimilarity index: .15405405 Chi-square: 10.324324 P-value: .32487296 . lastdigit price +---------------------------+ | digit counts percents | |---------------------------| | 0 8 10.8 | | 1 3 4.1 | | 2 7 9.5 | | 3 4 5.4 | | 4 7 9.5 | |---------------------------| | 5 11 14.9 | | 6 7 9.5 | | 7 7 9.5 | | 8 4 5.4 | | 9 16 21.6 | +---------------------------+ Dissimilarity index: .17297297 Chi-square: 17.621622 P-value: .03982599
Code:
*! 1.0.0 NJC 22 Jan 2024 * nod to Daniel Klein for earlier comments program lastdigit, rclass version 9 syntax varname(numeric) [if] [in] marksample touse * integer argument? assert `varlist' == floor(`varlist') if `touse' quietly count if `touse' if r(N) == 0 error 2000 local N = r(N) tempvar digits counts percents which quietly { gen byte `digits' = mod(`varlist', 10) gen `counts' = . gen `percents' = . gen `which' = _n - 1 in 1/10 } quietly forval i = 0/9 { local I = `i' + 1 count if `touse' & `digits' == `i' replace `counts' = r(N) in `I' replace `percents' = 100 * r(N)/`N' in `I' } char `which'[varname] "digit" char `counts'[varname] "counts" char `percents'[varname] "percents" format `percents' %2.1f list `which' `counts' `percents' in 1/10, noobs subvarname mata : _dissim("`counts'") di _n "Dissimilarity index: " r(dissim) mata : _chisq("`counts'") di _n "Chi-square: " r(chisq) di "P-value: " r(P_value) return scalar dissim = r(dissim) return scalar chi_sq = r(chisq) return scalar P_value = r(P_value) end mata : void _dissim(string scalar countsname) { real vector counts, props counts = st_data((1::10), countsname) props = counts :/ sum(counts) st_numscalar("r(dissim)", sum(abs(props :- 0.1))/2) } void _chisq(string scalar countsname) { real vector counts, props real scalar n, expected, chisq counts = st_data((1::10), countsname) n = sum(counts); expected = n/10 chisq = sum((counts :- expected):^2 :/ expected) st_numscalar("r(chisq)", chisq) st_numscalar("r(P_value)", chi2tail(9, chisq)) } end