Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Last digits

    The thread https://www.statalist.org/forums/for...-mata-function meandered from its title, which was natural enough, but makes starting a new thread sensible for what I am going to post.

    The thread latterly focused on the Myers index for measuring age heaping, i.e. the tendency of people to report inaccurate ages that err in the last digit.

    Chen Samulsion unearthed a nice-looking program by German Rodriguez that is more serious than my own token effort in that thread.

    However, I did a little more work and changed the focus and direction of my code to be about last digits of integer numeric variables. So, possible last digits with nothing else said are 0 1 ... 8 9.

    I calculate what I know as the dissimilarity index 1/2 SUM | p_j - 0.1 | where p_j is the proportion reporting digit j and 0.1 is the expectation if frequencies were equal. I now leave on one side whether this is, or is a relative of, the Myers index.

    I also add a standard chi-square test.

    At this moment I am not sure that I want to take this much further and can't even summon the inclination and energy to write a help file.


    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . lastdigit mpg
    
      +---------------------------+
      | digit   counts   percents |
      |---------------------------|
      |     0        5        6.8 |
      |     1        7        9.5 |
      |     2        7        9.5 |
      |     3        3        4.1 |
      |     4       11       14.9 |
      |---------------------------|
      |     5        9       12.2 |
      |     6        7        9.5 |
      |     7        4        5.4 |
      |     8       12       16.2 |
      |     9        9       12.2 |
      +---------------------------+
    
    Dissimilarity index: .15405405
    
    Chi-square:          10.324324
    P-value:             .32487296
    
    . lastdigit price
    
      +---------------------------+
      | digit   counts   percents |
      |---------------------------|
      |     0        8       10.8 |
      |     1        3        4.1 |
      |     2        7        9.5 |
      |     3        4        5.4 |
      |     4        7        9.5 |
      |---------------------------|
      |     5       11       14.9 |
      |     6        7        9.5 |
      |     7        7        9.5 |
      |     8        4        5.4 |
      |     9       16       21.6 |
      +---------------------------+
    
    Dissimilarity index: .17297297
    
    Chi-square:          17.621622
    P-value:             .03982599
    Code:
    *! 1.0.0 NJC 22 Jan 2024
    * nod to Daniel Klein for earlier comments
    program lastdigit, rclass
        version 9
        syntax varname(numeric) [if] [in]
        marksample touse
        
        * integer argument?
        assert `varlist' == floor(`varlist') if `touse'
        
        quietly count if `touse'
        if r(N) == 0 error 2000
        local N = r(N)
        
        tempvar digits counts percents which
        
        quietly {
            gen byte `digits' = mod(`varlist', 10)
            gen `counts' = .
            gen `percents' = .
            gen `which' = _n - 1 in 1/10
        }
        
        quietly forval i = 0/9 {
            local I = `i' + 1
            count if `touse' & `digits' == `i'
            replace `counts' = r(N) in `I'
            replace `percents' = 100 * r(N)/`N' in `I'
         }
        
        char `which'[varname]    "digit"
        char `counts'[varname]   "counts"
        char `percents'[varname] "percents"
        
        format `percents' %2.1f
        
        list `which' `counts' `percents' in 1/10, noobs subvarname  
        
        mata : _dissim("`counts'")
        di _n "Dissimilarity index: " r(dissim)
        
        mata : _chisq("`counts'")
        di _n "Chi-square:          " r(chisq)
        di    "P-value:             " r(P_value)
        
        return scalar dissim = r(dissim)
        return scalar chi_sq = r(chisq)
        return scalar P_value = r(P_value)
    end
    
    mata :
    
    void _dissim(string scalar countsname) {
        real vector counts, props
        counts = st_data((1::10), countsname)    
        props = counts :/ sum(counts)
        
        st_numscalar("r(dissim)", sum(abs(props :- 0.1))/2)
    }
    
    void _chisq(string scalar countsname) {
        real vector counts, props
        real scalar n, expected, chisq
        counts = st_data((1::10), countsname)
        n = sum(counts); expected = n/10
        chisq = sum((counts :- expected):^2 :/ expected)
        
        st_numscalar("r(chisq)", chisq)
        st_numscalar("r(P_value)", chi2tail(9, chisq))
    }
    
    end
Working...
X