  • Unstable st_view connections

    My query does not focus on a specific Mata problem but an issue of program design. I am writing a set of programs in Stata which rely heavily on calling Mata routines for executing the main calculations, while I use Stata for data management. The datasets being processed may be quite large and I have found that transferring data variables between Stata and Mata using putmata or getmata commands can take considerable amounts of time. Under some circumstances there can be memory problems too. Hence, I have switched to using views of variables using st_view rather than copies of variables. This appears to save time and memory.

    But ... the st_view help information contains a warning that worries me (a lot). It says:

    " Cautions when using views 3 ... For faster data access, an st_view() connection accesses data using variable indices, not variable names. However, variable indices can change when variables are created or removed. If a variable is created or removed while your code is using a view connection, there is a chance the view will switch to another variable."

    If the "chance" were to happen, this would be disastrous because I and the program would have no way of knowing that it has happened and the results would be completely misleading. The whole point of organising my program as I have is that Stata is more suitable for managing and storing lots of data, whereas Mata is more efficient for carrying out optimisations and other calculations - my program relies heavily on the Mata linear programming class. My questions are:

    A. Has anyone actually had this happen, and under what circumstances? Is it very rare or a serious potential issue?

    B. For now, I have built in protection by redeclaring all operative views before calling any Mata routine, but this seems like overkill. I believe that st_view is efficient because it relies upon index manipulation, but nonetheless there are processing and memory costs. What is the happy medium? William Gould's Mata book provides very little guidance on views. For small problems this does not matter but I am writing code that may take 1 or 2 hours of CPU time to run in actual use, even with 4 or 8 cores.


  • #2
    I don't think this is a big mystery. Watch:

    Code:
    . sysuse auto
    (1978 automobile data)
    
    .
    . mata {
    >    
    >     st_view(mpg=.,.,"mpg")
    >
    >     mpg[1]
    >    
    >     st_dropvar("price")
    >    
    >     mpg[1]
    >    
    > }
      22
      3
    When loading the auto dataset, variable mpg is in third position. After dropping the variable price, which is in the second position, mpg moves to the second position and rep78 is now in the third position. The view still points to the third position.

    I don't think this behavior is a big problem in programs because you (can) control the data. You just need to take extra care whenever you add, drop, or reorder variables while a view is in use.
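    A minimal sketch of the remedy, assuming the view is simply re-created by variable name after the drop (same auto dataset as above). The second st_view() call resolves "mpg" to its new position, so both reads should show the first value of mpg rather than the first value of rep78.

```stata
sysuse auto, clear

mata:
// View onto mpg, resolved by name at this moment (mpg is variable 3)
st_view(mpg=., ., "mpg")
mpg[1]

// Dropping price shifts mpg to position 2; the existing view is now stale
st_dropvar("price")

// Re-create the view by name so it points at mpg's current position
st_view(mpg, ., "mpg")
mpg[1]
end
```

    The only change from the example above is the second st_view() call; nothing is copied, so the cost is that of resolving the name again.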


    Oh, I forgot. Take extra caution when you code stuff like this:
    Code:
    void don_t_do_that_then()
    {
        real colvector x
        
        pragma unset x
        
        
        st_view(x,.,st_nvar()) // <- watch out! st_nvar() might not be what you expect
        
        x[1]
    }
    Here, I use the program property sortpreserve, which adds a temporary variable to the end of the dataset, so st_nvar() no longer refers to the last variable of the original data. Watch:

    Code:
    . program got_cha , sortpreserve // <- adds a variable
      1.    
    .     mata : don_t_do_that_then()
      2.    
    . end
    
    .
    . sysuse auto
    (1978 automobile data)
    
    .
    . mata : don_t_do_that_then()
      0
    
    .
    . got_cha
      1
    This might or might not bite depending on the level of control over the data that you and those who might call your programs in different environments have. Generally speaking, be careful when setting up views with variable indices.
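    One way to sidestep the st_nvar() trap, sketched here with hypothetical names (first_value_by_name() and safer are not from any package): pass the variable name into the Mata function and let st_view() resolve it at call time, so a variable appended by sortpreserve cannot be picked up by accident.

```stata
mata:
// Hypothetical helper: the caller passes a variable name, not a position
real scalar first_value_by_name(string scalar varname)
{
    real colvector x

    pragma unset x

    // The name is resolved to an index here, at the moment the view is created;
    // st_view() aborts with an error if varname does not exist
    st_view(x, ., varname)

    return(x[1])
}
end

capture program drop safer
program safer, sortpreserve   // still adds a variable at the end
    mata: first_value_by_name("foreign")
end

sysuse auto, clear
mata: first_value_by_name("foreign")
safer
```

    Both calls should report the same value, because the extra variable added by sortpreserve never enters the name lookup.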
    Last edited by daniel klein; 27 May 2024, 09:19.



    • #3
      I accept that there is no great mystery but there is, from my perspective, a huge problem. A single (relatively) static program can be managed in the way you suggest - especially by not dropping variables in the main body of the program. Unfortunately, I am dealing with a changing set of programs that organise large amounts of external data in Stata and then call Mata routines to carry out a variety of calculations and procedures. I am particularly concerned that Stata procedures - especially user-written code - may inadvertently change the order of variables. The sortpreserve parameter helps for Stata programs.

      My inference from your comments is that redeclaration of views immediately prior to each call to a Mata routine is a tedious but necessary safeguard. Still, I would appreciate any comment from StataCorp staff, as I have never seen that point made in any guidance on the use of Mata. However, I am not using Mata in the conventional framework of providing a list of variables for use in a statistical procedure but as a way of using matrices to automate some messy data calculations that Stata can't handle properly.



      • #4
        Originally posted by Gordon Hughes:
        My inference from your comments is that redeclaration of views immediately prior to each call to a Mata routine is a tedious but necessary safeguard.
        Yes. I guess this should also be (much) faster and consume less memory than any alternative.
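        For cases where redeclaring before every call feels excessive, a cheap staleness check is one possible complement. This is only a sketch under the assumption that comparing variable indices is enough for the use case; view_is_current() is a hypothetical helper, and it does not detect changes to observations. The indices are recorded when the view is created, and the view is rebuilt only if any name has since moved or disappeared.

```stata
mata:
// Hypothetical guard: returns 1 when every name still sits at the index
// recorded at view creation, 0 otherwise
real scalar view_is_current(string rowvector names, real rowvector idx)
{
    // _st_varindex() returns missing for names that no longer exist,
    // so a dropped variable also makes the comparison fail
    return(_st_varindex(names) == idx)
}
end

sysuse auto, clear

mata:
names = ("mpg", "weight")
idx   = st_varindex(names)      // positions at the time the view is made
st_view(X=., ., names)

st_dropvar("price")             // shifts both variables by one position

// Rebuild the view only when the recorded positions are out of date
if (!view_is_current(names, idx)) {
    st_view(X, ., names)
    idx = st_varindex(names)
}
X[1, .]
end
```

        The check costs one name lookup per variable, so it can sit at the top of each Mata routine; unconditional redeclaration, as discussed above, remains the simpler safeguard.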

