Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Exporting data quality information to Excel

    Hello,

    I am new to Stata, so apologies for the convoluted question - I am not yet familiar with all of the language around the programme.

    I am working with a dataset with 5 variables and around 200 million observations. I unfortunately cannot share an example of the data as it is confidential, but here is how it looks when I browse:
    countries date 1 date 2 non-profit type
    USA 20766 . true Unincorporated Nonprofit Association (D)
    France 20467 . false Corporation - Professional - Domestic
    Italy 20660 . false Limited Partnership (D)
    USA 20453 . true Unincorporated Nonprofit Association (D)
    USA 14770 20291 false For-profit Corporation
    China 11983 21886 false Professional Corporation
    Russia . . false Limited Liability Company (D)
    China 12235 17217 false
    Corporation - Business - Foreign
    China . . true Unincorporated Nonprofit Association (D)
    China . . false Corporation - Professional - Domestic
    UK 14770 22410 false Limited Partnership (D)
    Germany 4336 9831 false Corporation - Professional - Domestic
    France 8068 10251 true Corporation - Nonprofit - Domestic
    France -8299 9817 true Corporation - Nonprofit - Domestic
    Each country is encoded as a number from 1-140 (there are 140 countries). I am able to generate variables counting the missing observations and total observations for each variable grouped by country using the following two commands:

    egen missingvar = total(missing(var)) , by(country)

    egen totalvar = total(var) , by(country)

    I would like to create an Excel spreadsheet like the one below (going all the way to 140), to record the data quality of each variable per jurisdiction, disregarding the non-profit variable:
    country code date 1 date 2 type
    total missing total missing total missing
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    I feel I am missing a step (or 10) between the missing and total observation variables I generate using the above commands, and the resulting exported data in Excel.

    Also, if I simply export the relevant variables to Excel I receive an error message stating this is not possible because I have too many observations, but this perhaps relates to the missing steps.

    Thank you in advance for your help.

    With my best wishes,
    Clara

  • #2
    First, -egen totalvar = total(var) , by(country)- does not calculate the total number of observations for a variable. It calculates the sum of all the non-missing values of the variable, which is going to be a radically different thing for most variables.

    Also, it is unclear why you want to create a variable with the total number of observations for each variable for each country: it's going to be the same thing for all variables--it will be the total number of observations for the country. Do you mean you want variable with the total number of non-missing observations for each variable for each country? If that's what you want:

    Code:
    by country_code, sort: gen total_obs = _N
    foreach v of varlist date1 date2 type {
        by country_code: egen non_missing_`v' = count(`v')
        gen missing_`v' = total_obs - non_missing_`v'
    }
    
    keep country_code total_obs *missing_*
    by country_code: keep if _n == 1
    will result in a data set in memory containing a sensible table similar to the one you say you want. It will not give you a problem with -export excel- because it will contain only one observation per country code.

    Note: Because you did not provide example data using the -dataex- command, this code was developed based on some assumptions about characteristics of the data not shown in the tableau shown in #1. It is not tested, so it may contain typos or other errors, but should at least point you in the right direction.

    I unfortunately cannot share an example of the data as it is confidential
    Yes you can share an example of the data with -dataex-. In fact you could have just used the -dataex- command to share exactly the same variables and observations that you showed in your hand-created tableau. Doing so would have enabled the data to be easily imported to Stata by anyone who wanted to help you, and would have included important metadata that cannot be shown in a tableau. By not doing that, you force would-be respondents to choose between passing over your question, or making a guess about the metadata. If we guess wrong, our code will not work and we will have wasted our time. You also will get a proposed solution that you will waste your time trying to implement, and then you will need to waste more time posting back to request further help. There is no reason not to use -dataex- here.

    Comment


    • #3
      I don't know why you'd wanna make an excel sheet about the percentage of data missing.

      Surely, you must mean exporting summary statistics (sums, means, likely) to an excel file? And in your case, if what I just said is really want you want, there's a way to do this which saves time.

      Comment


      • #4
        Dear Clyde,

        Thank you for your reply, and for pointing out the error in my command. I am clearly out of my depth, so I really appreciate your help.

        The aim of the Excel spreadsheet I present in #1 (to both yours and Jared Greathouse's point) is to show the quality of the data. So, for each jurisdiction or country, how many of the total observations are missing in each variable. For example: for the variable "company_type", I would like to count by jurisdiction the total number of of missing observations.

        You are absolutely right that the total number of observations per country would be the same for each variable. Would it then be possible to have the total number of observations of each jurisdiction as a first column, and then the number of missing observations per jurisdiction for each variable?

        Thank you for taking the time to help me with this.

        Best wishes,
        Clara

        Please find an example of the data generated by dataex, as you suggested. The jurisdiction variable corresponds to the country variable I mention in #1.

        clear
        input int jurisdiction str3 string_jur float(incorp_date dissol_date) byte non_profit int company_type
        110 "110" 20467 . 1 1320
        110 "110" 20660 . 1 1320
        110 "110" 20453 . 1 1320
        80 "80" 20766 . 1 3240
        14 "14" . . 1 .
        117 "117" 14770 . 1 792
        129 "129" 11983 12449 1 2711
        100 "100" 12235 . 1 1705
        117 "117" -19789 . 1 1170
        129 "129" 17532 17533 1 1047
        100 "100" 8039 11925 1 1757
        114 "114" 3650 . 1 548
        129 "129" 19628 . 1 .
        100 "100" 15001 15046 1 1335
        114 "114" 13313 16485 1 538
        129 "129" 18263 . 1 .
        100 "100" 18465 20291 1 2024
        117 "117" -8967 . 1 792
        100 "100" 12283 . 2 3301
        114 "114" 10778 . 2 545
        100 "100" 15021 21886 1 1335
        100 "100" 12971 15741 1 1741
        100 "100" 8040 17266 1 1757
        117 "117" 4064 . 2 756
        100 "100" 12235 13879 1 1705
        114 "114" 12386 15386 1 538
        4 "4" 12933 . 1 173
        75 "75" 18312 . 1 .
        105 "105" 16541 17217 1 726
        113 "113" . . 1 291
        133 "133" 21096 . 1 3364
        100 "100" -34793 -17197 1 1324
        117 "117" 4381 . 1 2359
        100 "100" 8041 12079 1 1198
        117 "117" 4606 . 1 2359
        100 "100" 12971 . 1 1741
        117 "117" 5374 . 1 2359
        100 "100" -20591 -17197 1 1128
        117 "117" 4514 . 1 2359
        100 "100" 12235 16351 1 1705
        117 "117" 3867 . 1 2359
        100 "100" 15067 . 1 1335
        114 "114" 12486 15365 1 538
        100 "100" 19641 21517 1 2024
        117 "117" 4394 . 1 2359
        100 "100" 20228 . 1 2024
        117 "117" 4995 . 1 2359
        100 "100" 12319 . 2 3301
        100 "100" 12974 13465 1 1741
        114 "114" 4029 . 1 548
        105 "105" 14369 . 1 1149
        125 "125" 7357 9831 1 792
        100 "100" 8053 10251 1 1757
        114 "114" 10778 11535 1 537
        117 "117" 4336 . 1 2359
        100 "100" 12235 14270 1 1189
        114 "114" 12487 14754 1 538
        117 "117" 3876 . 1 2359
        129 "129" 19628 . 1 .
        140 "140" . . 1 408
        100 "100" 15070 15214 1 1335
        114 "114" -1191 13367 1 537
        100 "100" -19733 -17197 1 1324
        100 "100" -24796 -6534 1 1324
        114 "114" 14879 17309 1 1718
        117 "117" 3591 . 1 2359
        129 "129" 17532 . 1 1062
        100 "100" 8056 17266 1 1757
        114 "114" 12479 22410 1 538
        117 "117" 4535 . 1 2359
        100 "100" 12983 13087 1 1741
        117 "117" 4918 . 1 2359
        100 "100" 15119 21886 1 1335
        114 "114" 8948 . 1 505
        100 "100" 12235 17301 1 1705
        117 "117" 4066 . 1 2359
        100 "100" 12331 . 2 3301
        117 "117" 4604 . 1 2359
        105 "105" 10593 . 1 792
        125 "125" 3923 15620 1 792
        128 "128" 3804 10226 1 1118
        100 "100" 20228 . 1 2024
        117 "117" 4746 . 1 2359
        100 "100" 12358 22231 2 3301
        114 "114" 12290 12901 1 538
        117 "117" 4961 . 1 2359
        100 "100" 12236 . 1 1705
        117 "117" 4108 . 1 2359
        100 "100" 12988 21859 1 1741
        114 "114" 6880 15193 1 537
        100 "100" 8055 17266 1 1757
        114 "114" 1357 15340 1 537
        117 "117" 5402 . 1 2359
        100 "100" 15124 15795 1 1335
        114 "114" 10778 16083 1 548
        117 "117" 4374 . 1 2359
        114 "114" 13124 . 1 1719
        117 "117" 5143 . 1 2359
        100 "100" -20853 -16467 1 1324
        114 "114" 12333 12453 1 1718
        end
        label values jurisdiction jurisdiction2
        label def jurisdiction2 4 "au", modify
        label def jurisdiction2 14 "br", modify
        label def jurisdiction2 75 "rw", modify
        label def jurisdiction2 80 "th", modify
        label def jurisdiction2 100 "us_id", modify
        label def jurisdiction2 105 "us_ma", modify
        label def jurisdiction2 110 "us_mo", modify
        label def jurisdiction2 113 "us_nc", modify
        label def jurisdiction2 114 "us_nd", modify
        label def jurisdiction2 117 "us_nj", modify
        label def jurisdiction2 125 "us_ri", modify
        label def jurisdiction2 128 "us_tn", modify
        label def jurisdiction2 129 "us_tx", modify
        label def jurisdiction2 133 "us_wa", modify
        label def jurisdiction2 140 "za", modify
        label values non_profit non_profit2
        label def non_profit2 1 "false", modify
        label def non_profit2 2 "true", modify
        label values company_type company_type
        label def company_type 173 "Australian Proprietary Company, Limited By Shares", modify
        label def company_type 291 "Business Corporation - Foreign", modify
        label def company_type 408 "Close Company", modify
        label def company_type 505 "Cooperative Association - Domestic", modify
        label def company_type 537 "Corporation - Business - Domestic", modify
        label def company_type 538 "Corporation - Business - Foreign", modify
        label def company_type 545 "Corporation - Nonprofit - Domestic", modify
        label def company_type 548 "Corporation - Professional - Domestic", modify
        label def company_type 726 "Domestic Limited Liability Company (LLC)", modify
        label def company_type 756 "Domestic Non-Profit Corporation", modify
        label def company_type 792 "Domestic Profit Corporation", modify
        label def company_type 1047 "FOREIGN PROFIT CORPORATION", modify
        label def company_type 1062 "FRGN LIMITED PRTNSHP", modify
        label def company_type 1118 "For-profit Corporation", modify
        label def company_type 1128 "Foreign Business Corporation", modify
        label def company_type 1149 "Foreign Corporation", modify
        label def company_type 1170 "Foreign For-Profit Corporation", modify
        label def company_type 1189 "Foreign Limited Liability Company", modify
        label def company_type 1198 "Foreign Limited Partnership", modify
        label def company_type 1320 "Gen. Business - For Profit", modify
        label def company_type 1324 "General Business Corporation (D)", modify
        label def company_type 1335 "General Partnership (D)", modify
        label def company_type 1705 "Limited Liability Company (D)", modify
        label def company_type 1718 "Limited Liability Company - Business - Domestic", modify
        label def company_type 1719 "Limited Liability Company - Business - Foreign", modify
        label def company_type 1741 "Limited Liability Partnership (D)", modify
        label def company_type 1757 "Limited Partnership (D)", modify
        label def company_type 2024 "Nonregistered Entity Designation of Registered Agent", modify
        label def company_type 2359 "Professional Corporation", modify
        label def company_type 2711 "ST", modify
        label def company_type 3240 "Trade Association (สมาคมการค้า)", modify
        label def company_type 3301 "Unincorporated Nonprofit Association (D)", modify
        label def company_type 3364 "WA LIMITED LIABILITY COMPANY", modify

        Comment


        • #5
          Code:
          by jurisdiction, sort: gen total_obs = _N
          foreach v of varlist incorp_date dissol_date non_profit company_type {
              by jurisdiction: egen non_missing_`v' = count(`v')
              gen missing_`v' = total_obs - non_missing_`v'
          }
          
          keep jurisdiction total_obs missing_*
          by jurisdiction: keep if _n == 1

          Comment


          • #6
            Dear Clyde,

            Thank you so much, this works well!

            Best wishes,
            Clara

            Comment

            Working...
            X