  • Quick questions: list combined with an if qualifier, and tabmiss

    Dear all,

    I have a very large panel dataset in long format. I use the list command quite often, among other things to sample a portion of my data so that I can understand it better, or because I need to print it for future reference. The problem comes when I include an if qualifier in my list command. For example, if I want to list 10 observations for ID 100, which has thousands of rows, the list command may not return any observations.
    Code:
    list ID in 1/10 if ID==100
    This is presumably because the observations with ID 100 are not among the first 10 observations of my dataset. Is there a way to apply "in 1/10" only to the observations I want, for example to return the first 10 observations with ID==100?

    Also, a common data cleaning step for me is to check for missing values. However, missing values can be system missing for numeric variables (.) or empty or blank strings ("", " ", etc.) for string variables. I use the tabmiss command to check for missing values, but it does not detect missing strings. Is there a way to check for all kinds of missing values at once, both numeric and string (including blanks)?

  • #2
    1. To list the first 10 observations for ID 100 you need to sort the data with ID 100 to the top.
    Code:
    gen byte id100 = (ID == 100)
    gen long obs_no = _n // REGISTER THE CURRENT SORT ORDER
    gsort -id100 obs_no
    list in 1/10 if ID == 100
    Note: The -if ID == 100- clause is there just in case ID 100 happens to have fewer than 10 observations.

    2. The -missing()- function handles both numeric and string variables as arguments.
    Code:
    whatever if missing(my_variable)
    will work correctly whether my_variable is string or numeric.
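    One caveat worth illustrating: -missing()- treats only the empty string "" as missing for string variables, so a string that contains just blanks is not flagged unless it is trimmed first. The following is a minimal sketch on made-up data; the variable names x and s are hypothetical, and -trim()- could be replaced by -strtrim()- in newer Stata versions.

```stata
* Toy data: x is numeric, s is string (both names are made up for illustration)
clear
input float x str5 s
1  "a"
.  ""
3  " "
.a "b"
end

* Numeric: system missing (.) and extended missing (.a) are both caught
count if missing(x)

* String: only the empty string "" counts as missing
count if missing(s)

* Trimming first also catches strings that contain only blanks
count if missing(trim(s))
```

    On this toy data the three counts should be 2, 1 and 2 respectively.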

    • #3
      See also the concurrent thread http://www.statalist.org/forums/foru...missing-values which specifies some basic commands that you are missing in pursuit of the missing.

      • #4
        Thanks Clyde! Thanks for the solution for my list question but it seems a little too complicated. I thought there was an easier way to do it. But thanks anyway!

        • #5
          Marvin: You are missing a deeper point when you make comments like that about perfectly good code by Clyde.

          It would be possible to write hundreds of highly specific commands, each to do something highly specific in data management; and again in graphics; to say nothing of statistics. Then Stata gets presented as thousands of commands, like an enormously large toolkit, and you would be, no doubt, among those to complain about Stata being, well, complicated. In fact we already have thousands of such commands but fortunately most people never need more than a few.

          The point is that with a lot of experience and a little effort Clyde could quickly break down your problem into components: identify what is of interest; sort those observations to the top; and then list them.

          You're not expected to be as good as Clyde at doing it. That's fine. The point is to see the strategy and learn a little of the technique.

          Another answer is to

          Code:
           
          edit if ID == 100
          and then you'll see the first so many, regardless of where they are in the dataset. It won't be 10, usually, but that's another technique.
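          A further variation that avoids re-sorting, offered only as a sketch: combine the condition with a running count of how many times it has been true so far, using the running-sum function -sum()-. The toy dataset below is made up for illustration; only the variable name ID comes from the question.

```stata
* Toy panel: 20 hypothetical IDs (10, 20, ..., 200) with 50 rows each
clear
set obs 1000
gen long ID = 10 * ceil(_n / 50)

* ID == 100 sits in observations 451-500, far from the top of the data.
* sum(ID == 100) is the running count of matches up to the current
* observation, so this keeps only the first 10 matching rows.
list ID if ID == 100 & sum(ID == 100) <= 10
```

          The current sort order is left untouched, which may matter in panel data where the order carries meaning.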

          • #6
            I got your point, Nick, and thanks for sharing. This is not a crucial issue for me; I was just wondering if I was doing something wrong with the code. Anyway, I will give the code a try. Thanks anyway!

            Best,
            Marvin

            • #7
              Note that listsome (from SSC) can list the first observations that meet a certain condition. It can also list a random sample of observations that meet the condition. I use it all the time to put in the log file examples of the data. It is particularly useful when using regex functions on large datasets to show a small sample of the results. Since you can't eyeball all the changes, the next best thing is to have a good sampling of the changes.

              The following lists the first 10 observations of company 4 and then 10 random observations of the same company:

              Code:
              . webuse grunfeld, clear
              
              . listsome if company == 4, max(10)
              
                   +--------------------------------------------------+
                   | company   year   invest   mvalue   kstock   time |
                   |--------------------------------------------------|
               61. |       4   1935    40.29    417.5     10.5      1 |
               62. |       4   1936    72.76    837.8     10.2      2 |
               63. |       4   1937    66.26    883.9     34.7      3 |
               64. |       4   1938     51.6    437.9     51.8      4 |
               65. |       4   1939    52.41    679.7     64.3      5 |
                   |--------------------------------------------------|
               66. |       4   1940    69.41    727.8     67.1      6 |
               67. |       4   1941    68.35    643.6     75.2      7 |
               68. |       4   1942     46.8    410.9     71.4      8 |
               69. |       4   1943     47.4    588.4     67.1      9 |
               70. |       4   1944    59.57    698.4     60.5     10 |
                   +--------------------------------------------------+
              
              . listsome if company == 4, max(10) random
              
                   +--------------------------------------------------+
                   | company   year   invest   mvalue   kstock   time |
                   |--------------------------------------------------|
               62. |       4   1936    72.76    837.8     10.2      2 |
               63. |       4   1937    66.26    883.9     34.7      3 |
               65. |       4   1939    52.41    679.7     64.3      5 |
               66. |       4   1940    69.41    727.8     67.1      6 |
               68. |       4   1942     46.8    410.9     71.4      8 |
                   |--------------------------------------------------|
               69. |       4   1943     47.4    588.4     67.1      9 |
               72. |       4   1946    74.12    893.8     84.8     12 |
               73. |       4   1947    62.68      579     96.8     13 |
               76. |       4   1950   100.66    693.5    163.2     16 |
               78. |       4   1952      145      727    290.6     18 |
                   +--------------------------------------------------+
              
              .

              • #8
                I'd forgotten about Robert's listsome (SSC) and indeed about my own earlier effort http://www.stata.com/statalist/archi.../msg00448.html

                • #9
                  Thank you so much, Robert! That was exactly what I needed! Thank you!!!
