  • -isid- sorting data in Stata 16.1?

    Recently I was running some normally reliable code on another computer that I don't use frequently, and I hit an error. After some troubleshooting I found that the command -isid- behaves differently across the two computers, which I assume has to do with the Stata version. This short program, run on both computers, illustrates the behavior:

    Code:
    cap program drop test_isid
    program test_isid
        noi version
        sysuse auto, clear
        keep make
        sort make
        replace make = "string with a trailing space " in 1
        gen n1 = _n
        isid make
        *gisid make
        gen n2 = _n
        count
        loc N = `r(N)'
        count if n1==n2
        noi di "n1==n2 in `r(N)' of `N' observations"
        list * if _n<=5
    end
    
    test_isid
    This is the output when I run -test_isid- on my everyday computer (version 16.0):

    version 16.0
    (1978 Automobile Data)
    variable make was str18 now str29
    (1 real change made)
    74
    74
    n1==n2 in 74 of 74 observations


    This is the output when I run -test_isid- on my other computer (version 16.1):

    version 16.1
    (1978 Automobile Data)
    variable make was str18 now str29
    (1 real change made)
    74
    0
    n1==n2 in 0 of 74 observations

    I didn't expect that -isid- would ever sort data ... Is this intended behavior? It seems as if the data get sorted by the string variable that is in the -isid- varlist (with the trailing spaces thrown to the bottom of the data set). If I use -gisid- instead of -isid- then there is no sorting. After updating to 16.1 on my everyday computer, it also displays this sorting behavior.

  • #2
    Note: -gisid- is part of the gtools suite. It's great. https://gtools.readthedocs.io/en/lat...sid/index.html


    • #3
      -isid- is performing exactly as intended, in part because sorting is needed at some point to check whether the variables uniquely identify the observations. Indeed, -isid- even offers the -sort- option to end with a sorted dataset. What is different in the Stata 16 version is the addition of a feature that checks whether uniqueness is due only to leading/trailing spaces. This second check requires an additional sort, and this is where the discrepancy comes in. I suppose Stata could have restored the original sort order after concluding, but neither operation harms the dataset.
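
      If the original order matters, one way to restore it by hand is to record the order before calling -isid- and sort on it afterwards. This is only a minimal sketch of that idea, not something -isid- does itself; the variable name orig_order is a hypothetical helper introduced for illustration.

      ```stata
      sysuse auto, clear
      keep make
      replace make = "string with a trailing space " in 1

      * Record the current observation order (orig_order is a hypothetical helper name)
      gen long orig_order = _n

      * -isid- may re-sort the data while checking uniqueness
      isid make

      * Put the observations back in their recorded order, then clean up
      sort orig_order
      assert orig_order == _n
      drop orig_order
      ```

      The extra -sort- costs little on a dataset this size and does not depend on which Stata version is installed.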

      You can inspect the code with this:

      Code:
      which isid
      viewsource isid.ado
      However, knowing this, I think your test is flawed. This version of the test sorts the data after altering one value to have a trailing space (the other differences merely make the same code more compact).

      Code:
      sysuse auto, clear
      keep make
      replace make = "string with a trailing space " in 1
      sort make
      gen n1 = _n
      isid make
      gen n2 = _n
      list in 1/5
      assert n1==n2
      This test results in success under both Stata 15.1 and 16.1. Alternatively, you could have pre-processed the data with -trim(make)- after the sort, and again, the results would be successful in both versions.

      To test the string trimming feature of the newer -isid-, you should consider this example instead.

      Code:
      sysuse auto, clear
      keep make
      replace make = "string with a trailing space " in 1
      replace make = " string with a trailing space" in 2
      isid make
      Last edited by Leonardo Guizzetti; 16 Mar 2020, 19:03.


      • #4
        Thanks Leonardo! I should have been clearer: I'm not interested in testing the new trimming feature. I was simply surprised that -isid- sometimes sorts the data without the -sort- option being specified (it seems only when the string contains a leading/trailing space?). And yes, I would have preferred that the original sort order be preserved, as I believe was the previous behavior.

        Code:
        cap program drop test_isid
        program test_isid
            syntax , new_str(str)
            qui sysuse auto, clear
            keep make
            qui replace make = "`new_str'" in 1
            gen n1 = _n                /* order before -isid- */
            isid make
            gen n2 = _n                /* order after -isid- */
            assert n1==n2
        end
        test_isid, new_str("string without leading or trailing space")          /* assert does not fail */
        test_isid, new_str("string with a trailing space ")                     /* assert fails */
        test_isid, new_str(" string with a leading space")                      /* assert fails */
        What do others think?


        • #5
          Hi Brian,
          I would say that, perhaps, this is just a bug. It is also easy to work around.
          1. It may be worth contacting Stata technical support to inform them about this "misbehavior".
          2. Make the following change to the isid.ado file: on the line where the program is defined, add ", sortpreserve" at the end. This should deal with the problem you found, and I would expect it to have no impact on other programs.
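
          For anyone who would rather not edit the official isid.ado, the same idea can be sketched as a small wrapper program. This is an untested illustration: the name isid_keeporder is hypothetical, and the wrapper passes through only the -missok- option of -isid-.

          ```stata
          capture program drop isid_keeporder
          * Hypothetical wrapper, not part of official Stata:
          * -sortpreserve- restores the caller's sort order when the program exits
          program define isid_keeporder, sortpreserve
              syntax varlist [, MISSok]
              isid `varlist', `missok'
          end

          * Usage: the observation order should be unchanged afterwards
          sysuse auto, clear
          keep make
          replace make = "string with a trailing space " in 1
          gen long n1 = _n
          isid_keeporder make
          assert n1 == _n
          ```

          A wrapper leaves the shipped ado-file untouched, so a later Stata update cannot silently overwrite the change.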
          Best Regards
          Fernando
