  • Has anyone benchmarked the performance of the new -stintcox- "Interval censored Cox proportional hazards model" command?

    I've been testing the new stintcox command, which fits an interval-censored Cox model to survival-type data. It's fairly new, so I can't find much written about it.

    I naively tried to throw a fairly sizable dataset with over 30,000 observations at it, and I couldn't really tell whether I was going to get convergence in a reasonable time. The data is Case-I interval-censored, which is to say each subject is observed at a single point in time, and the model has a single covariate. The longest observable survival time is around 150.

    To get a rough estimate of how it scales, I reran the command on small random samples of the data, starting with 1,000 observations and going up to 15,000, and timed each run, using the favorspeed option. I'm using 4-core Stata/MP, and according to the latest Stata/MP Performance Report, the 4-core performance is only 1.3x and doesn't scale much beyond that at 16 cores. This is going to be a little quick-and-dirty, so take the results with a grain of salt. I got the following results:

    Observations   Time (s)   Time increase (%)
     1000             4.1
     2000            10.8     167%
     3000            19.1      77%
     4000            36.2      89%
     5000            63.6      76%
     6000            63.5       0%
     7000           110.9      75%
     8000           147.3      33%
     9000           172.1      17%
    10000           218.1      27%
    11000           250.3      15%
    12000           346.5      38%
    13000           425.6      23%
    14000           488.5      15%
    15000           592.1      21%


    I was initially worried that the time was going to nearly double with every additional 1,000 observations, which would have made estimation on the whole dataset infeasible, but the growth rate seems to ease from around the 8,000-observation mark. I'm guessing estimation on the full dataset will converge in a couple of hours, though I don't have a sense of how additional covariates are going to affect things.
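    One rough way to turn the timings above into an extrapolation is to fit a power law, time = a * n^b, by least squares on the log-log scale. The sketch below is in Python rather than Stata and is only illustrative: the power-law form, the function names, and the figure of 30,000 observations for the full dataset are assumptions, and the fit says nothing about how extra covariates or a different machine would change things. If the exponent b comes out near 2, doubling the sample roughly quadruples the run time.

```python
import math

# Timings copied from the table above: (observations, seconds).
timings = [
    (1000, 4.1), (2000, 10.8), (3000, 19.1), (4000, 36.2), (5000, 63.6),
    (6000, 63.5), (7000, 110.9), (8000, 147.3), (9000, 172.1),
    (10000, 218.1), (11000, 250.3), (12000, 346.5), (13000, 425.6),
    (14000, 488.5), (15000, 592.1),
]


def fit_power_law(points):
    """Least-squares fit of log(t) = log(a) + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope on the log-log scale
    a = math.exp(y_bar - b * x_bar)    # intercept, back-transformed
    return a, b


def predict_seconds(a, b, n):
    """Predicted run time in seconds for n observations under the fit."""
    return a * n ** b


a, b = fit_power_law(timings)
full_n = 30000  # assumption: the post only says "over 30,000 observations"
est = predict_seconds(a, b, full_n)

print(f"fitted exponent b = {b:.2f}")
print(f"extrapolated time at n = {full_n}: {est / 60:.0f} minutes")
```

    Extrapolating to twice the largest timed sample is shaky, so the result is best read as an order-of-magnitude check on the "couple of hours" guess rather than a prediction.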

    Anyhow, to the meat of my question. Does anyone have any experience estimating stintcox on datasets with 20,000+ observations under different models and scenarios? If so, do you have any rough observations about the speed of convergence under different scenarios, or any particular data-structure issues to look out for that might make convergence practically infeasible?

  • #2
    Not really a comment about this specific scenario, but there are combinations of models and datasets that I have fit that take several hours. Sometimes it is what it is (or, relatedly, simulations that can take days). In those cases, I let Stata do its thing overnight or over the weekend, or farm it out to a computer I don't mind being tied up for a while.

    • #3
      Thanks. I was looking for more general guidance anyway. The thing about new commands is that it's sometimes a fool's errand to set them running for hours or a day or two, because they may never finish on any useful timescale, and it's hard to intuit when that will happen without experience.

      • #4
        Like #2, not directly on point here. I have generally found that maximum-likelihood estimates of non-linear models on data sets with tens of thousands of observations often need to run overnight. When the data set has millions of observations, they can take weeks to run, especially if they are multi-level models. I hardly have the world's fastest computers at my disposal, but my equipment is mid-range quality and I run MP4.

        Particularly annoying is the situation where the estimation has started and gets to a point like "Refining starting values" where there is no periodic output to show progress. You can let that spin for a very long time and have no idea whether Stata is hung or not. One three-level -melogit- model that I fit to a data set with about 5,000,000 observations spent 2 days just refining the starting values. But it eventually did move on to the "Fitting full model" stage and finally converged after 15 days.

        If you're going to be fitting complicated models to large data sets often, you need a lot of patience. Probably a good idea to invest in the fastest equipment you can reasonably afford (or a spare machine you don't need for anything else), and find other things to keep yourself occupied or amused while waiting!
