Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Need Help making stata Dataset from Excel

    Hi Everyone,

    Thank you very much for your help from all the statalist members. I am having diffculty making a stata dataset from excel file which was poorly prepared by data collectors. Significant effort has gone into the data colection and cannot be repeated due to financial and time constraints. I was hoping to see if any of the stata gurus can help me.

    I am uploading the problem_dataset(deidentified and truncated) and the desired dataset formats for clarification pupose. The problem_dataset contains variables DXCCS1 DXCCS2 ..DXCCS9 and their counts as COUNT1...COUNT9

    The goal is to create one column DXCCS, which would include all the values of each variable DXCCS1...DXCCS9. Some values of one variable may not be present in other .

    EG-1:DXCCS9 doesnt have value 31 but present in DXCCS1-8
    Eg-2:DXCCS1 and DXCCS2 has value 56 but not in DXCCS3-9.

    My code so far that has been working for me is as follows, however I cannot run it in a loop as the cell ranges differ from sheet to sheet and I get a error. There are nearly 200 sheets and it would take really long time to run it manually for each excel workbook.

    Can any of the members please help me. I thank you very much for your time and effort. Thanks again.

    Sincerely
    Anwar

    Code:
    import excel "C:\Users\Anwar\Dropbox\Temporary\problemdataset.xlsx", sheet("1") cellrange(A1:B247) firstrow allstring  clear
    drop if DXCCS2==""
    rename DXCCS2 DXCCS3 
    tempfile temp
    save "`temp'"
    import excel "C:\Users\Anwar\Dropbox\Temporary\problemdataset.xlsx", sheet("1") cellrange(C1:D247) firstrow allstring  clear
    drop if DXCCS3==""
    merge 1:1 DXCCS3 using "`temp'"
    drop _merge 
    rename DXCCS3  DXCCS4
    tempfile temp
    save "`temp'"
    import excel "C:\Users\Anwar\Dropbox\Temporary\problemdataset.xlsx", sheet("1") cellrange(E1:F247) firstrow allstring  clear
    drop if DXCCS4==""
    merge 1:1 DXCCS4 using "`temp'"
    drop _merge 
    rename DXCCS4  DXCCS5
    tempfile temp
    save "`temp'"
    import excel "C:\Users\Anwar\Dropbox\Temporary\problemdataset.xlsx", sheet("1") cellrange(G1:H247) firstrow allstring  clear
    drop if DXCCS5==""
    merge 1:1 DXCCS5 using "`temp'"
    drop _merge 
    rename DXCCS5  DXCCS6
    tempfile temp
    save "`temp'"
    import excel "C:\Users\Anwar\Dropbox\Temporary\problemdataset.xlsx", sheet("1") cellrange(I1:J247) firstrow allstring  clear
    drop if DXCCS6==""
    merge 1:1 DXCCS6 using "`temp'"
    drop _merge 
    rename DXCCS6  DXCCS7
    tempfile temp
    save "`temp'"
    import excel "C:\Users\Anwar\Dropbox\Temporary\problemdataset.xlsx", sheet("1") cellrange(K1:L247) firstrow allstring  clear
    drop if DXCCS7==""
    merge 1:1 DXCCS7 using "`temp'"
    drop _merge 
    rename DXCCS7  DXCCS8
    tempfile temp
    save "`temp'"
    import excel "C:\Users\Anwar\Dropbox\Temporary\problemdataset.xlsx", sheet("1") cellrange(M1:N247) firstrow allstring  clear
    drop if DXCCS8==""
    merge 1:1 DXCCS8 using "`temp'"
    drop _merge 
    rename DXCCS8  DXCCS9
    tempfile temp
    save "`temp'"
    import excel "C:\Users\Anwar\Dropbox\Temporary\problemdataset.xlsx", sheet("1") cellrange(O1:P247) firstrow allstring  clear
    drop if DXCCS9==""
    merge 1:1 DXCCS9 using "`temp'"
    drop _merge 
    rename DXCCS9 DXCCS2
    Attached Files

  • #2
    The goal is to create one column DXCCS, which would include all the values of each variable DXCCS1...DXCCS9
    What does "include all the values" mean here? Is it the sum, some count, etc.? Do you want one file like the desired file for each sheet in the problem file (or like I mentioned: a sum or something else of all the count variables in all sheets? The problem file has non-integer values and the desired file has only integers.

    I'm not sure I understand exactly what you want.
    You should:

    1. Read the FAQ carefully.

    2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

    3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

    4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

    Comment


    • #3
      Anwar,

      I have the same question as Roberto regarding the fact that your "counts" are not integers. Ignoring that for the moment, however, you can try something like this:

      Code:
      import excel "problem_dataset.xlsx", sheet("1") firstrow clear
      
      gen id=_n
      reshape long DXCCS COUNT, i(id) j(j)
      
      drop if DXCCS==.
      
      tsset j DXCCS
      tsfill , full
      
      drop id
      reshape wide COUNT, i(DXCCS) j(j)
      Basically, what we are doing is making this into time-series data so that we can trick Stata into filling out a complete set of DXCCS codes for each DXCCS-COUNT pair.

      Regards,
      Joe

      Comment


      • #4
        Hello Everyone,

        Sorry Robert, I did not pose the question clearly. Let me explain what "To include all values" meant.

        Eg:

        The variable DXCCS1 has the following values 1,2, 3,5,7,9, 10,11
        The corresponding COUNT1 has values 0,16.9,4.6, 0,0,0,2,3

        The variable DXCCS2 has the following values 2,3,4,6,8,11
        The corresponding COUNT2 has values 0,10,20,5,10,40

        I wanted one column DXCCS with values 1,2,3,4,5,6,7,8,9,10. Note that 4,6 and 8 were present only in DXCC2 and not in DXCCS1.
        Therefore the value for the variable COUNT2 for obeservation 4(which was not present in )

        I highlighted the cells to make my point clear and have attached the excel sheets.

        I think the data collection team has used some estimation commands to come up with some numbers..but I am not sure. I apologize Robert for not being clear enough.


        Joe Canner:

        Your solution has worked wonderfully. I am not aware of the ttset command and am learning it now. Thank you Joe Canner.


        Thank you again Robert and Joe.

        Sincerely ,
        Anwar

        Attached Files

        Comment


        • #5
          Anwar,

          You're quite welcome. I don't do much with time series data or panel data, but I recently saw Nick Cox demonstrate how it could be used in other contexts to solve certain problems. I was glad for the opportunity to test it out in this situation, as I have an ongoing project similar to yours that could benefit from such an approach.

          Regards,
          Joe

          Comment


          • #6
            Hi Joe,
            when you have time, can you please reply with the posting details where Nick Cox demonstrated using time-series in other contexts...I would like to read it.
            Thank you again,
            Anwar

            Comment


            • #7
              I think I was thinking of this one: http://www.stata.com/statalist/archi.../msg00866.html

              Comment


              • #8
                Thank you Joe canner,
                Anwar

                Comment

                Working...
                X