I want to get a better understanding, and possibly some suggestions, of why Newton-Raphson consistently yields somewhat different (better?) results than either bfgs or dfp in a maximum likelihood routine I've written. The issue, essentially, is the consequence of choosing nr versus dfp versus bfgs in -optimize_init_technique(S, "whatever")-.
Context: I have a nonstandard maximum likelihood estimation routine (i.e., can't work with ML) that I implemented with -optimize()- and a d1 evaluator. It works pretty well, I think, but I'm trying to speed it up, and I discovered that for larger problems, Newton-Raphson takes 50 to 100 times longer than bfgs or dfp. (The number of parameters being estimated equals _N, so methods that use the Hessian are expensive.) Using bfgs or dfp yields parameter estimates that are similar to each other and are on average within +/- 1% of what nr gives. The final LL from nr is consistently somewhat larger than what bfgs or dfp reaches.
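For concreteness, here is a stripped-down sketch of the kind of setup I mean. The evaluator below is just a toy concave quadratic standing in for my real likelihood, and the 5 starting values stand in for the _N parameters; only the -optimize()- calls themselves mirror what I'm actually doing:

    mata:
    // toy d1 evaluator (a stand-in for my real routine):
    // f(b) = -0.5*(b - 1)(b - 1)', with analytic gradient -(b - 1)
    void mydeval(todo, b, f, g, H)
    {
        f = -0.5 * (b :- 1) * (b :- 1)'
        if (todo >= 1) g = -(b :- 1)
    }

    S = optimize_init()
    optimize_init_evaluator(S, &mydeval())
    optimize_init_evaluatortype(S, "d1")    // analytic gradient; Hessian computed numerically
    optimize_init_technique(S, "nr")        // swap in "bfgs" or "dfp" to compare
    optimize_init_params(S, J(1, 5, 0))     // 5 parameters here; _N of them in my problem
    b = optimize(S)
    b
    optimize_result_value(S)                // final LL
    end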
Question: There really isn't any external criterion for the accuracy of the parameter estimates in my situation. Is Newton-Raphson to be regarded as the gold standard for accuracy, even if it is slower? Might there be any way to tweak -optimize()- so that bfgs or dfp gives results closer to those from nr? I've tried adjusting the parameter tolerance as a convergence criterion (the one that matters in this case), but that didn't change anything.
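In case it helps to see exactly what I mean by adjusting the tolerances, this continues the toy sketch above (same mydeval()); the specific tolerance values here are made up for illustration, not the ones I actually used:

    mata:
    S = optimize_init()
    optimize_init_evaluator(S, &mydeval())
    optimize_init_evaluatortype(S, "d1")
    optimize_init_technique(S, "bfgs")      // or "dfp"
    optimize_init_params(S, J(1, 5, 0))
    optimize_init_conv_ptol(S, 1e-8)        // tighter parameter-change tolerance
    optimize_init_conv_vtol(S, 1e-9)        // tighter objective-change tolerance
    b = optimize(S)
    optimize_result_value(S)                // final LL, to compare against the nr run
    end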
On the more practical side, my thinking is that any data I have for this routine (categorical attitude responses from humans) is much less accurate and precise than the +/- 1% variation in the parameter estimates, so I suppose there's no compelling reason to prefer any one of nr, bfgs, or dfp.
I realize this is a pretty technical question, and please don't mistake me for someone who has much theoretical knowledge about optimization techniques, but I'm hoping this happens to fall within someone's area of expertise.