Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Determine number of obs

    Dear Profs and Colleagues,

    I am going to determine "the number of observations associated with firms that have both native and non-native workforces". There are several variables: firm ID:NPC_FIC, year, 2010-2019. if national=0 is non-native , if national=1 is native. I need to know among 441,475 firms, how many have both native and non-native workers. Surely would be few firms. What would be the command? Any ideas are appreciated.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double(year NPC_FIC) float national
    2015 500003242 1
    2016 500003242 1
    2017 500003242 .
    2018 500003242 .
    2019 500003242 .
    2010 500003770 1
    2011 500003770 1
    2012 500003770 1
    2013 500003770 1
    2014 500003770 1
    2015 500003770 1
    2016 500003770 1
    2017 500003770 1
    2018 500003770 .
    2010 500004847 .
    2010 500008310 1
    2011 500008310 1
    2013 500008310 1
    2014 500008310 1
    2015 500008310 1
    2016 500008310 .
    2017 500008310 .
    2018 500008310 .
    2019 500008310 .
    2010 500009155 1
    2011 500009155 1
    2012 500009155 1
    2013 500009155 1
    2014 500009155 1
    2015 500009155 1
    2016 500009155 1
    2017 500009155 1
    2018 500009155 1
    2019 500009155 1
    2010 500013486 1
    2011 500013486 1
    2012 500013486 1
    2013 500013486 .
    2014 500013486 .
    2015 500013486 .
    2016 500013486 .
    2017 500013486 .
    2018 500013486 .
    2019 500013486 1
    2010 500017393 .
    2011 500017393 .
    2014 500017854 .
    2015 500017854 .
    2010 500019512 1
    2011 500019512 1
    2012 500019512 1
    2013 500019512 1
    2014 500019512 1
    2015 500019512 .
    2016 500019512 .
    2017 500019512 .
    2018 500019512 .
    2019 500019512 .
    2010 500020809 1
    2011 500020809 1
    2012 500020809 1
    2013 500020809 1
    2014 500020809 1
    2015 500020809 1
    2016 500020809 1
    2017 500020809 .
    2018 500020809 .
    2019 500020809 .
    2016 500023255 .
    2010 500026382 .
    2011 500026382 .
    2012 500026382 .
    2010 500027353 1
    2011 500027353 1
    2012 500027353 1
    2013 500027353 1
    2014 500027353 1
    2015 500027353 1
    2016 500027353 1
    2017 500027353 1
    2018 500027353 1
    2019 500027353 .
    2010 500031300 .
    2011 500031300 .
    2012 500031300 .
    2013 500031300 .
    2014 500031300 .
    2015 500031300 .
    2016 500031300 .
    2017 500031300 .
    2018 500031300 .
    2019 500031300 .
    2010 500049799 0
    2011 500049799 0
    2012 500049799 0
    2013 500049799 1
    2014 500049799 1
    2015 500049799 1
    2016 500049799 1
    2017 500049799 0
    end
    . tab national
    
       national |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |     70,372        4.52        4.52
              1 |  1,487,765       95.48      100.00
    ------------+-----------------------------------
          Total |  1,558,137      100.00
    . egen firm_count = tag(NPC_FIC)
    
    . tab firm_count, gen(num_firms)
    
    tag(NPC_FIC |
              ) |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |  1,867,665       80.88       80.88
              1 |    441,475       19.12      100.00
    ------------+-----------------------------------
          Total |  2,309,140      100.00
    Cheers,
    Paris

  • #2
    what is the meaning of a missing value for the "national" variable? i.e., is it a yes or a no or really missing - if missing how does it affect what you want? (the logic I am thinking of is to count the number of observations per NPC_FIC, then count the number of "1s" and compare those two numbers; having both would mean that the numbers of 1's is greater than 0 but less than the total number of observations for that NPC_FIC)

    Comment


    • #3
      Thank you Rich for getting back to me. Missing is just missing I should have dropped it before posting it.

      Comment


      • #4
        drop if national==.
        bys national: egen sum_native=sum(national)
        bys national: gen sum_all=_N
        gen both=(sum_native>0&sum_all>sum_native)
        bys national: keep if _n==_N

        Comment


        • #5
          Dear Jin, this command does not work.
          Code:
           sum both 
          
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                  both |          3           0           0          0          0

          Comment


          • #6
            In your example data, each firm has only a single observation in each year. For the question you are posing, I would ordinarily assume that there would be many observations for each firm in each year, corresponding to single employees. Perhaps that is true of your full data set, and your example was just not well chosen to illustrate the problem.

            If your full data really has just one observation for each firm in each year, then I don't understand the variable national. How would a firm characterize its entire workforce as either non-native or native? In fact, the very firms you are seeking to identify, those who have both native and non-native, would be uncategorizable in this scheme!

            All of that said, the code below assumes that you are interested in the aggregate workforce of the firm over the entire period 2010-2019.
            Code:
            drop if missing(national)
            by NPC_FIC (national), sort: gen byte both = (national[1] != national[_N])
            egen firm_tag = tag(NPC_FIC)
            tab both if firm_tag
            Bear in mind, however, that what this is telling us is the number of firms that have a "native" workforce (whatever that means?!?) in some year(s) and a "non-native" workforce in other(s).

            If you have, as I hope you do, actually employee-level data in your full data set and you want to identify in each year how many firms have mixed native/non-native workforces in each year:
            Code:
            drop if missing(national)
            by NPC_FIC year (national), sort: gen byte both = (national[1] != national[_N])
            egen firm_tag = tag(NPC_FIC year)
            tab both if firm_tag

            Comment


            • #7
              Prof Clyde, Thank you for getting back to me.
              I believe that I made a mistake in expressing the question. Its a panel dataset. The question is that I am going to determine the number of firms that have both PT and non-PT which here is mentioned with the variable nacio. So which firms have both PT and non-PT during the 10 years? There 569,426 firms in total, surely the answer would way fewer than this amount.
              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input double(year NPC_FIC) str6 nacio float national
              2010 500000001 "PT" 1
              2010 500000002 "PT" 1
              2011 500000002 "PT" 1
              2012 500000002 "AO" 0
              2013 500000002 "PT" 1
              2014 500000002 "PT" 1
              2015 500000002 "PT" 1
              2016 500000002 "PT" 1
              2017 500000002 "PT" 1
              2018 500000002 "PT" 1
              2010 500000033 "PT" 1
              2011 500000033 "PT" 1
              2012 500000033 "PT" 1
              2013 500000033 "PT" 1
              2014 500000033 "PT" 1
              2015 500000033 "PT" 1
              2016 500000033 "PT" 1
              2017 500000033 "PT" 1
              2018 500000033 "PT" 1
              2019 500000033 "PT" 1
              2010 500000050 "PT" 1
              2011 500000050 "PT" 1
              2012 500000050 "PT" 1
              2013 500000050 "BR" 0
              2014 500000050 "PT" 1
              2015 500000050 "PT" 1
              2016 500000069 "PT" 1
              2017 500000069 "PT" 1
              2019 500000073 "PT" 1
              2010 500000083 "PT" 1
              2011 500000083 "PT" 1
              2012 500000083 "PT" 1
              2013 500000083 "AN" 0
              2014 500000083 "BR" 0
              2015 500000083 "PT" 1
              2016 500000083 "PT" 1
              2017 500000083 "PT" 1
              2018 500000083 "PT" 1
              2019 500000083 "PT" 1
              2015 500000101 "PT" 1
              2016 500000101 "PT" 1
              2017 500000101 "PT" 1
              2018 500000101 "PT" 1
              2019 500000101 "PT" 1
              2010 500000104 "UA" 0
              2011 500000106 "PT" 1
              2011 500000113 "PT" 1
              2012 500000113 "PT" 1
              2013 500000113 "PT" 1
              2014 500000113 "PT" 1
              2010 500000119 "PT" 1
              2011 500000119 "PT" 1
              2012 500000119 "PT" 1
              2010 500000121 "PT" 1
              2011 500000150 "PT" 1
              2012 500000150 "PT" 1
              2013 500000150 "PT" 1
              2014 500000150 "PT" 1
              2010 500000156 "PT" 1
              2016 500000156 "PT" 1
              2017 500000156 "PT" 1
              2018 500000156 "PT" 1
              2019 500000156 "PT" 1
              2010 500000157 "PT" 1
              2011 500000157 "PT" 1
              2012 500000157 "PT" 1
              2010 500000165 "PT" 1
              2011 500000165 "PT" 1
              2012 500000165 "PT" 1
              2013 500000165 "PT" 1
              2014 500000165 "BR" 0
              2015 500000165 "PT" 1
              2016 500000165 "BR" 0
              2017 500000165 "PT" 1
              2010 500000180 "PT" 1
              2010 500000198 "PT" 1
              2011 500000198 "BR" 0
              2012 500000198 "PT" 1
              2013 500000198 "PT" 1
              2014 500000198 "PT" 1
              2010 500000201 "PT" 1
              2011 500000201 "PT" 1
              2012 500000201 "PT" 1
              2013 500000201 "PT" 1
              2010 500000204 "PT" 1
              2011 500000204 "UK" 0
              2012 500000204 "PT" 1
              2013 500000204 "SP" O
              2014 500000204 "PT" 1
              2010 500000212 "PT" 1
              2011 500000212 "UK" 0
              2012 500000212 "PT" 1
              .
              .
              end

              Comment


              • #8
                Code:
                assert national == (nacio == "PT")
                by NPC_FIC (national), sort: gen byte has_both = (national[1] != national[_N])
                egen firm_tag = tag(NPC_FIC)
                tab has_both if firm_tag

                Comment


                • #9
                  Prof Clyde, Thank you so much. It works perfectly!
                  Code:
                  tab has_both if firm_tag
                  
                     has_both |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                            0 |    503,999       91.66       91.66
                            1 |     45,866        8.34      100.00
                  ------------+-----------------------------------
                        Total |    549,865      100.00

                  Comment

                  Working...
                  X