  • What is the best choice of Amazon EC2 instance for Stata/MP?

    I am setting up an Amazon EC2 instance (Windows Server 2022) with Stata/MP installed, to serve as a platform for several research projects. Which type of EC2 instance should I choose if Stata/MP is our main data analytics software? There are several instance types available, and it is not clear to me which one best suits the characteristics of Stata. I have three specific questions:
    • Which instance family (compute optimized, memory optimized, etc.) is best for Stata/MP?

      My first thought was a "Memory Optimized" instance, since Stata holds the dataset in memory and therefore needs more memory when we deal with huge datasets.
    • How much memory is best for Stata/MP?

      If we are dealing with huge datasets, how much memory does the instance need? Is a 128 GB instance a good choice for Stata/MP?
    • How many CPU cores are best for Stata/MP?

      According to the Stata/MP product description (https://www.stata.com/statamp/), Stata/MP supports up to 64 cores/processors. Does that mean the instance we use for Stata/MP should have no more than 64 cores?
    Any advice or comments would be appreciated.
    Last edited by Tatsuru Kikuchi; 20 Feb 2023, 05:49.

  • #2
    My quick take:
    1. It really depends on the size of your dataset. For most projects, I like to have memory of about 10x the dataset size to allow for overhead. Some may find this isn't enough; others may find it overkill.
    2. It also depends on which version of Stata/MP you need and can afford. Stata/MP supports up to 64 cores, but check how much that will cost you. I am using 8 cores, and it doesn't come cheap.
    3. Do you need a Windows Server instance? Are you considering multiple users? If so, you may need to purchase a network license, which may cost more. You may also need to factor this into the number of cores you need (for instance, the number of cores that your version supports x the number of users).
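
The rules of thumb in points 1 and 3 can be turned into a quick back-of-the-envelope calculation. The Python sketch below is only an illustration: the function names and the list of candidate memory sizes are hypothetical (they are not actual EC2 instance sizes or anything from Stata), and the 10x multiplier is a personal preference, not an official requirement.

```python
from typing import Optional, Sequence

# Hypothetical candidate memory sizes in GB, for illustration only.
# Check the current EC2 documentation for the sizes actually offered.
CANDIDATE_SIZES_GB = [16, 32, 64, 128, 256, 512]


def required_memory_gb(dataset_gb: float, headroom: float = 10.0) -> float:
    """Memory to provision for a dataset, given a headroom multiplier.

    headroom=10.0 mirrors the "about 10x the dataset size" preference above.
    """
    if dataset_gb < 0 or headroom < 1:
        raise ValueError("dataset_gb must be >= 0 and headroom must be >= 1")
    return dataset_gb * headroom


def smallest_fitting_size(required_gb: float,
                          sizes_gb: Sequence[float]) -> Optional[float]:
    """Return the smallest candidate size that covers required_gb, or None."""
    fitting = [size for size in sorted(sizes_gb) if size >= required_gb]
    return fitting[0] if fitting else None


def cores_for_users(cores_per_user: int, concurrent_users: int) -> int:
    """Cores needed if every concurrent user runs Stata/MP at full width."""
    return cores_per_user * concurrent_users


if __name__ == "__main__":
    need = required_memory_gb(8)  # an 8 GB dataset with 10x headroom
    print(need, smallest_fitting_size(need, CANDIDATE_SIZES_GB))
    print(cores_for_users(8, 4))  # four users, each on an 8-core license
```

If `smallest_fitting_size` returns `None`, no candidate is large enough, which is a signal to split the data into smaller pieces or to accept a smaller headroom multiplier.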


    • #3
      Thanks for your quick comments.
      • In fact, I am managing about 10 ongoing economic research projects. We share one platform because all the projects use the same data source, which totals about 2 TB. The official Stata website (https://www.stata.com/products/compa...ng-systems-mp/) says that we need enough memory to hold 1.5 times the largest dataset we may use.
      • I know that the required memory depends on the datasets used, which differ from project to project. From a system administrator's point of view, it is hard to choose a separate instance type for each project once cost management is taken into account.
      • Hence, for now, I have decided to set up one common instance shared by most users. In addition, depending on specific needs, I am thinking of setting up additional instances whose specifications depend on the project. In fact, some users may need to process datasets containing more than 2 billion observations. In such a case, an instance with 128 GB of memory or so is necessary, according to the official Stata website (https://www.stata.com/features/overview/huge-datasets/).
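
To see what these figures imply together, here is a rough Python sketch. It assumes the 1.5x memory-to-data ratio mentioned above and treats 128 GB as 128 GiB; the function name is hypothetical, and the calculation ignores the memory used by the operating system and by Stata itself.

```python
def max_bytes_per_observation(memory_gib: float,
                              n_obs: int,
                              headroom: float = 1.5) -> float:
    """Average row width (in bytes) that fits in memory for n_obs rows.

    headroom is the assumed ratio of installed memory to dataset size
    (1.5 here, following the figure quoted from the Stata website).
    """
    if n_obs <= 0 or headroom < 1:
        raise ValueError("n_obs must be > 0 and headroom must be >= 1")
    usable_bytes = memory_gib * 1024**3 / headroom
    return usable_bytes / n_obs


if __name__ == "__main__":
    width = max_bytes_per_observation(128, 2_000_000_000)
    print(f"about {width:.1f} bytes per observation")
```

Under these assumptions, 2 billion observations on a 128 GB instance leave room for roughly 45 bytes per observation, that is, only a handful of numeric variables. Wider datasets of that length would need more memory or would have to be processed in pieces.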

      Any comments and advice are truly welcome!


      • #4
        I've worked with datasets that add up to terabytes in aggregate; thankfully, every component was small enough to be loaded with 128 GB of memory. Most of the memory was needed for generating the datasets, and the final analytic files were in the 2 GB range. If you are already familiar with your dataset and can work efficiently in smaller chunks, I think 128 GB would be reasonable. If someone needs to load all of your data simultaneously (all 2 TB), then you will likely need an expensive EC2 instance. If multiple projects are run by different people, each user should work in their own account. I would think that running multiple Stata instances simultaneously under a single account is a recipe for disaster.

        Regarding the number of cores, my experience is that MP is definitely a luxury, and that I could probably get away with running things on a single-core version. If you have a truly large final analytic dataset (you should estimate how large the dataset will be from the number of observations and variables), MP is probably the way to go. Stata probably provides performance statistics on its website.
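
One way to make that size estimate is to multiply the number of observations by the total width of the variables. The Python sketch below assumes Stata's fixed-width storage types (byte = 1, int = 2, long = 4, float = 4, double = 8 bytes, and strN = N bytes); it ignores strL variables and any per-dataset overhead, and the function names are hypothetical.

```python
from typing import Iterable

# Bytes per value for Stata's numeric storage types.
STATA_NUMERIC_BYTES = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def variable_width(storage_type: str) -> int:
    """Width in bytes of one value of the given Stata storage type."""
    if storage_type in STATA_NUMERIC_BYTES:
        return STATA_NUMERIC_BYTES[storage_type]
    # Fixed-length strings such as "str10" take 10 bytes per value.
    if storage_type.startswith("str") and storage_type[3:].isdigit():
        return int(storage_type[3:])
    raise ValueError(f"unsupported storage type: {storage_type!r}")


def estimate_dataset_bytes(n_obs: int, storage_types: Iterable[str]) -> int:
    """Approximate in-memory size: observations x sum of variable widths."""
    row_width = sum(variable_width(t) for t in storage_types)
    return n_obs * row_width


if __name__ == "__main__":
    # One million observations of a long, a float, a double and a str10.
    size = estimate_dataset_bytes(1_000_000, ["long", "float", "double", "str10"])
    print(f"{size / 1e6:.0f} MB")
```

Running the estimate on the planned final analytic file, rather than on the raw source data, gives a better sense of whether the extra cores of MP are worth paying for.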
