Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Looping through records to produce a summary record

    Hi,

    I am using Stata 13.1 for Windows and have a dataset with the following fields: ID, nlab, ntesttype, testdate where the fields nlab and ntesttype are numeric variables (recoded from string variables) and contain the values 1 to 9 and 1 to 4 respectively. A person can have multiple records for repeated tests at the same or multiple labs occurring on different dates.

    How can I loop through all records for each person (using their ID) and produce a file of a single summary record (for each person) which contains the following information: Id, lab, testtype where both lab and testtype contain a summary of all combinations occurring e.g. if a person had a test done at lab 1 only, then compute lab=1; if tests done at labs 1 and 2 then lab=12; if tests done at labs 1 to lab 3 then lab=123. Similarly for test type. If test type 1 only done, compute testtype=1; if test types 1 & 2 done, then testtype=12; if test types 1 to 3, then testtype=123 etc. There could be duplicates by lab or test type which can be included e.g. if a person had the same test 1 done on different dates at lab 1, then compute lab=11 & testtype=11; if 2 different tests 2 & 3 done at lab 2 on different dates then compute lab=22 & testtype=23.

    What would be the best way of doing this to collapse the file to produce one single summary record per person? I have looked at a few different commands in Stata, however, there appears to be no command capable of directly doing this. Any help or suggestions would be much appreciated.

    Thanks
    Kevin

  • #2
    Welcome to Statalist!

    I suspect the reason there's no command capable of directly doing what you seek is that the format you describe would, on the face of it, be very difficult to use effectively in Stata for many of the sorts of analyses you might hope to do on data of this sort. There are a fair number of Statalist readers who enjoy puzzling out answers regardless of any practical value, so you might well get an answer to your question. Should you not get an answer to this question, you might be able to get advice on how to reach your analytical goals through a different path more natural to Stata, were you to share them, and a sample of your data (or made up data similar to your real data), with Statalist readers.

    To increase the likelihood that Statalist readers will be able to assist you, please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. The more you help others understand your problem, the more likely others are to be able to help you solve your problem. It would be particularly helpful to post a small hand-made example, perhaps with just a few variables and observations, showing the data before the process and how you expect it to look after the process. In particular, please read FAQ #12 and use dataex and CODE delimiters when posting to Statalist.


    Comment


    • #3
      Thanks for the feedback William. I apologise for not posting in accordance with the posting rules. I should have read them more closely.

      Included below is a sample of my data using -dataex- and the desired outcome. I know how to readily achieve the outcome using other software such as Visual Foxpro (VFP) which I regularly use (which is a database management tool). However, the agency making this request has requested that the solution be written in Stata. Unfortunately my knowledge of Stata is limited where I do not understand how to tell Stata to process all records for a person (using their Id) within a for loop.

      In the desired output file, the variables lab and testtype are simply concatenations of all the different combinations of values in each field respectively. However, these values should be sorted in ascending order and combinations not to be repeated e.g. if a person had tests done at lab 5 and lab 2 then lab would have a value of 25 and not 52 in the summary record for this person. The string function can be used to concatenate the values of lab and testtype respectively for all records for a person.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input double id str20 testdate float(nlab ntesttype)
      2011000320 "08 Jan 2011"          2 2
      2011000321 "08 Jan 2011"          3 2
      2011000321 "12 Jan 2011"          2 2
      2011000363 "09 Jan 2011"          3 2
      2011000365 "10 Jan 2011"          2 2
      2011000366 "10 Jan 2011"          2 2
      2011000368 "10 Jan 2011"          7 2
      2011000368 "11 Jan 2011"          7 2
      2011000369 "10 Jan 2011"          7 2
      2011000369 "11 Jan 2011"          7 2
      2011000370 "10 Jan 2011"          7 2
      2011000370 "11 Jan 2011"          5 3
      2011000371 "10 Jan 2011"          7 2
      2011000371 "12 Jan 2011"          7 4
      2011000372 "10 Jan 2011"          7 2
      2011000372 "11 Jan 2011"          7 2
      2011000495 "12 Jan 2011"          4 2
      2011000510 "12 Jan 2011"          7 2
      2011000525 "12 Jan 2011"          4 2
      2011000538 "12 Jan 2011"          2 2
      2011000604 "13 Jan 2011"          2 2
      2011000605 "13 Jan 2011"          2 2
      2011000615 "13 Jan 2011"          1 2
      2011000615 "14 Jan 2011"          3 4
      2011000616 "13 Jan 2011"          3 2
      2011000616 "14 Jan 2011"          5 2
      2011000617 "13 Jan 2011"          7 2
      2011000653 "14 Jan 2011"          7 2
      2011000657 "14 Jan 2011"          4 2
      2011000657 "15 Jan 2011"          5 2
      2011000731 "15 Jan 2011"          7 2
      2011000731 "16 Jan 2011"          7 2
      2011000732 "15 Jan 2011"          2 2
      2011000733 "15 Jan 2011"          2 2
      2011000734 "15 Jan 2011"          2 2
      2011000735 "15 Jan 2011"          1 2
      2011000736 "15 Jan 2011"          1 2
      2011000737 "15 Jan 2011"          3 2
      2011000737 "23 Jan 2011"          3 2
      2011000789 "17 Jan 2011"          7 2
      2011000790 "11 May 2011"          1 2
      2011000790 "12 May 2011"          2 2
      2011000790 "17 Jan 2011"          4 3
      2011001052 "21 Jan 2011"          1 2
      2012001108 "19 May 2012"          2 4
      2012001108 "22 Jan 2012"          6 1
      2012001111 "22 Jan 2012"          7 2
      2012001111 "25 Jan 2012"          7 2
      2011001112 "22 Jan 2012"          3 4
      2011001112 "23 Jan 2012"          3 4
      end
      The desired output for the above sample data.
      Code:
      clear
      * id        lab testtype
      2011000320  2   2
      2011000321  23  22
      2011000363  3   2
      2011000365  2   2
      2011000366  2   2
      2011000368  77  22
      2011000369  77  22
      2011000370  57  23
      2011000371  77  24
      2011000372  77  22
      2011000495  4   2
      2011000510  7   2
      2011000525  4   2
      2011000538  2   2
      2011000604  2   2
      2011000605  2   2
      2011000615  13  24
      2011000616  35  22
      2011000617  7   2
      2011000653  7   2
      2011000657  45  22
      2011000731  77  22
      2011000732  2   2
      2011000733  2   2
      2011000734  2   2
      2011000735  1   2
      2011000736  1   2
      2011000737  33  22
      2011000789  7   2
      2011000790  124 223
      2011001052  1   2
      2012001108  26  14
      2012001111  77  22
      2012001111  7   2
      2011001112  33  44
      end

      Comment


      • #4
        Thanks for the clear presentation of your data and desired output. My understanding and assumptions are that
        • for each individual, you want to sort your data by id, nlab, and testtype, and then collapse all the observations for each id into one, converting the nlab values into string and concatenating them, and similarly for the ntesttype values
        • testdate plays no role in the process and is dropped
        • no value of nlab or ntesttype is more than a single digit
        Starting with your data read into memory, the following seems to do what you want.
        Code:
        drop testdate
        assert inrange(nlab,0,9)
        assert inrange(ntesttype,0,9)
        format id %10.0f
        sort id nlab ntesttype
        by id: generate int seq = _n
        reshape wide nlab ntesttype, i(id) j(seq)
        egen lab = concat(nlab*)
        egen testtype = concat(ntesttype*)
        replace lab = subinstr(lab,".","",.)
        replace testtype = subinstr(testtype,".","",.)
        format lab testtype %-20s
        list id lab testtype, clean noobs
        Code:
                    id   lab   testtype  
            2011000320   2     2        
            2011000321   23    22        
            2011000363   3     2        
            2011000365   2     2        
            2011000366   2     2        
            2011000368   77    22        
            2011000369   77    22        
            2011000370   57    32        
            2011000371   77    24        
            2011000372   77    22        
            2011000495   4     2        
            2011000510   7     2        
            2011000525   4     2        
            2011000538   2     2        
            2011000604   2     2        
            2011000605   2     2        
            2011000615   13    24        
            2011000616   35    22        
            2011000617   7     2        
            2011000653   7     2        
            2011000657   45    22        
            2011000731   77    22        
            2011000732   2     2        
            2011000733   2     2        
            2011000734   2     2        
            2011000735   1     2        
            2011000736   1     2        
            2011000737   33    22        
            2011000789   7     2        
            2011000790   124   223      
            2011001052   1     2        
            2011001112   33    44        
            2012001108   26    41        
            2012001111   77    22
        Again, I cannot for a minute imagine using Stata for any sort of serious analysis of this transformed data, other than very specific reporting of these transformed values. The original data is ideally formatted (with the exception of converting the string testdate to a numeric Stata Internal Format date value) for pretty much any analysis to be done in Stata.

        It sounds to me like you are a relatively inexperienced user of Stata being required to use it for this task when there are other tools you are much more productive using. If so, you have my sympathy. And it's even worse if perhaps you are under pressure to produce some output quickly. Nevertheless, I'd like to encourage you to take a step back from your immediate tasks.

        When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.

        All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu. The objective in doing the reading was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and manual.

        The Stata documentation is really exemplary - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.
        Last edited by William Lisowski; 20 Sep 2016, 20:52.

        Comment


        • #5
          Here is another approach that produces the same results. It has the advantage of demonstrating that an explicit loop is unnecessary because commands like generate and replace implicitly loop over the observations.
          Code:
          drop testdate
          assert inrange(nlab,0,9)
          assert inrange(ntesttype,0,9)
          format id %10.0f
          sort id nlab ntesttype
          by id: generate int nobs = _N
          egen length = max(nobs)
          generate str`length' lab = strofreal(nlab)
          generate str`length' testtype = strofreal(ntesttype)
          by id: replace lab = lab[_n-1]+lab if _n>1
          by id: replace testtype = testtype[_n-1]+testtype if _n>1
          by id: keep if _n==_N
          format lab testtype %-20s
          list id lab testtype, clean noobs
          Code:
                      id   lab   testtype  
              2011000320   2     2        
              2011000321   23    22        
              2011000363   3     2        
              2011000365   2     2        
              2011000366   2     2        
              2011000368   77    22        
              2011000369   77    22        
              2011000370   57    32        
              2011000371   77    24        
              2011000372   77    22        
              2011000495   4     2        
              2011000510   7     2        
              2011000525   4     2        
              2011000538   2     2        
              2011000604   2     2        
              2011000605   2     2        
              2011000615   13    24        
              2011000616   35    22        
              2011000617   7     2        
              2011000653   7     2        
              2011000657   45    22        
              2011000731   77    22        
              2011000732   2     2        
              2011000733   2     2        
              2011000734   2     2        
              2011000735   1     2        
              2011000736   1     2        
              2011000737   33    22        
              2011000789   7     2        
              2011000790   124   223      
              2011001052   1     2        
              2011001112   33    44        
              2012001108   26    41        
              2012001111   77    22

          Comment


          • #6
            Thanks for your advice, assistance and suggested solutions. Much appreciated. The requirements for this task/solution are still a work in progress for the agency that requested the Stata solution. Because of this, I came up with my own solution where at this point in time I have decided to keep all records in the file and not reduce the file to one record per person because the agency is unsure on how they are going to deal with duplicate disease notification records. The data is a reflection of communicable disease data where for any given disease, a person could have multiple lab tests conducted on different test dates at the same or different laboratories. At the end of the day the files will need to be reduced to one record per person for analysis purposes.

            Regarding my knowledge of Stata, I would consider myself to be a light to moderate user of Stata but am a little rusty in some areas at the present point in time. I learn to use software as I go along and over the years I have been able to apply and adapt my knowledge of VFP to statistical packages such as Stata and SPSS. However, statistical software processes records in a completely different manner to VFP which created confusion for me in writing a program in Stata to achieve the desired outcome where I originally thought I would have to use for loops to do so but as you have noted, and as I discovered in my own reading and research to find a solution to this problem, an explicit loop is not required.

            The syntax that I wrote and included below concatenates all values of lab and testtype respectively and adds this to each person’s record. At this point in time, I have not written syntax to collapse the file to one record per person for the reasons explained above i.e. due to the unresolved issue of duplicate notifications. However, your solutions will be very useful for reducing the file to one record per person.
            Code:
            gen recno = _n
            gen str10 lab = "."
            replace lab = string(nlab)
            replace lab = trim(lab[_n-1])+trim(lab) if id==id[_n-1] & nlab > 0
            gsort -id -lab
            replace lab = lab[_n-1] if id==id[_n-1]
            sort recno
            gen str10 testtype = "."
            replace testtype = string(ntesttype)
            replace testtype = trim(testtype[_n-1])+trim(testtype) if id==id[_n-1] & ntesttype > 0
            gsort -id -testtype
            replace testtype = testtype[_n-1] if id==id[_n-1]
            sort recno

            Comment


            • #7
              Regarding my knowledge of Stata, I would consider myself to be a light to moderate user of Stata but am a little rusty in some areas at the present point in time.
              You will find the reading I recommended quickly and efficiently polishes the rust off your Stata knowledge.

              Comment


              • #8
                Yes, I certainly agree with you and intend to so.

                Comment

                Working...
                X