Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data Arrangement

    Hello!

    I am trying to clean a dataset which is a long chain of merges from various versions of EPA data and Compustat data. Below is the example from how my data looks (just few variables attached here for ease);

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long gvkey double(frsid activity_id) float year double fyear float(total_cost total_penalty)
    1075 110042068473 2600020355 2015 1970 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1971 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1972 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1973 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1974 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1975 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1976 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1977 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1978 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1979 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1980 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1981 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1982 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1983 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1984 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1985 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1986 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1987 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1988 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1989 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1990 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1991 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1992 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1993 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1994 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1995 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1996 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1997 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1998 1.685e+08 1500000
    1075 110042068473 2600020355 2015 1999 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2000 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2001 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2002 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2003 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2004 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2005 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2006 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2007 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2008 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2009 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2010 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2011 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2012 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2013 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2014 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2015 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2016 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2017 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2018 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2019 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2020 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2021 1.685e+08 1500000
    1075 110042068473 2600020355 2015 2022 1.685e+08 1500000
    1078 110007806188       4128 1991 1973     10000   10000
    1078 110007806188       4128 1991 1974     10000   10000
    1078 110007806188       4128 1991 1975     10000   10000
    1078 110007806188       4128 1991 1976     10000   10000
    1078 110007806188       4128 1991 1977     10000   10000
    1078 110007806188       4128 1991 1978     10000   10000
    1078 110007806188       4128 1991 1979     10000   10000
    1078 110007806188       4128 1991 1980     10000   10000
    1078 110007806188       4128 1991 1981     10000   10000
    1078 110007806188       4128 1991 1982     10000   10000
    1078 110007806188       4128 1991 1983     10000   10000
    1078 110007806188       4128 1991 1984     10000   10000
    1078 110007806188       4128 1991 1985     10000   10000
    1078 110007806188       4128 1991 1986     10000   10000
    1078 110007806188       4128 1991 1987     10000   10000
    1078 110007806188       4128 1991 1988     10000   10000
    1078 110007806188       4128 1991 1989     10000   10000
    1078 110007806188       4128 1991 1990     10000   10000
    1078 110007806188       4128 1991 1991     10000   10000
    1078 110007806188       4128 1991 1992     10000   10000
    1078 110007806188       4128 1991 1993     10000   10000
    1078 110007806188       4128 1991 1994     10000   10000
    1078 110007806188       4128 1991 1995     10000   10000
    1078 110007806188       4128 1991 1996     10000   10000
    1078 110007806188       4128 1991 1997     10000   10000
    1078 110007806188       4128 1991 1998     10000   10000
    1078 110007806188       4128 1991 1999     10000   10000
    1078 110007806188       4128 1991 2000     10000   10000
    1078 110007806188       4128 1991 2001     10000   10000
    1078 110007806188       4128 1991 2002     10000   10000
    1078 110007806188       4128 1991 2003     10000   10000
    1078 110007806188       4128 1991 2004     10000   10000
    1078 110007806188       4128 1991 2005     10000   10000
    1078 110007806188       4128 1991 2006     10000   10000
    1078 110007806188       4128 1991 2007     10000   10000
    1078 110007806188       4128 1991 2008     10000   10000
    1078 110007806188       4128 1991 2009     10000   10000
    1078 110007806188       4128 1991 2010     10000   10000
    1078 110007806188       4128 1991 2011     10000   10000
    1078 110007806188       4128 1991 2012     10000   10000
    1078 110007806188       4128 1991 2013     10000   10000
    1078 110007806188       4128 1991 2014     10000   10000
    1078 110007806188       4128 1991 2015     10000   10000
    1078 110007806188       4128 1991 2016     10000   10000
    1078 110007806188       4128 1991 2017     10000   10000
    1078 110007806188       4128 1991 2018     10000   10000
    1078 110007806188       4128 1991 2019     10000   10000
    end
    This is a panel data of firms (gvkey) and year (fyear) from Compustat. Further these firms have multiple facilities (frsid), and each facility can have multiple federal cases going on against it (activity_id) because of its environmental violations and in which year (year) the cases were lodged. So in this panel data, I only want the information of each facility to show in the year the case was lodged against it and all other rows should be missing. The structure should look like there is a firm panel data, then under each firm whenever the case was started against its either of facility, data should show up and all rows for other years of the firm panel should be missing with plant level information.

    As trivial as it may seem, I can't come up with the code. Any help will be greatly appreciated. Thanks!

  • #2
    Thank you for using -dataex- on your first post.

    I don't know if I'm understanding your request correctly. But it sounds like all you need to do is -keep if year == fyear-.

    If that's not right, and if nobody else posts a correct solution, please post back and for one firm (with multiple facilities) show what the result you want looks like.

    Comment


    • #3
      Thanks for getting back! I think you understand the structure but keep if year==fyear would delete all other rows where i have other firm data. I realize this data has a lot of levels firms, facilities, cases, year of cases, and panel years for firm -level variables. I am just confused how to structure it better for my analysis.

      But thanks for getting back. I will spend some more time on it.

      Comment


      • #4
        Hi Sumaya,

        I can't help with your question because I don't understand your request either. But there's another issue you need to consider. The two variables you describe as "year" may not measure the same thing. I bet the EPA variable year represents calendar year. Compustat variable fyear is not calendar year, but fiscal year. Many fiscal years end in some month other than December. Compustat variable fyr gives the month in which the fiscal year ends. The Compustat definition of fyr explains how fyear is defined when the fiscal year does not end in December.
        Devra Golbe
        Professor Emerita, Dept. of Economics
        Hunter College, CUNY

        Comment


        • #5
          Thank you so much Devra for bringing this to my knowledge. I rechecked and Compustat treats fiscal years ending January 1 through May 31 as ending in the prior calendar year. In the mean time I also worked on my data but the data arrangement is still my concern. I hope I can better explain it now;

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input long gvkey double(frsid activity_id) float complaint_year double year float(num_cases num_facility)
          1078 110007806188  5084 1994 1988 6 3
          1078 110007806188  5084 1994 1989 6 3
          1078 110007806188  4441 1992 1990 6 3
          1078 110007806188  5084 1994 1991 6 3
          1078 110007806188  5084 1994 1993 6 3
          1078 110007806188  5084 1994 1994 6 3
          1078 110007806188  5084 1994 1995 6 3
          1078 110007806188  5084 1994 1996 6 3
          1078 110007806188  5084 1994 1997 6 3
          1078 110007806188  5084 1994 1998 6 3
          1078 110007806188  5084 1994 1999 6 3
          1078 110007806188  5084 1994 2000 6 3
          1078 110007806188  5084 1994 2001 6 3
          1078 110007806188  5084 1994 2002 6 3
          1078 110007806188  5084 1994 2003 6 3
          1078 110007806188  5084 1994 2004 6 3
          1078 110007806188  4128 1991 2005 6 3
          1078 110007806188  5084 1994 2006 6 3
          1078 110007806188  5084 1994 2007 6 3
          1078 110007806188  5084 1994 2008 6 3
          1078 110007806188  5084 1994 2009 6 3
          1078 110007806188  5084 1994 2010 6 3
          1078 110007806188  5084 1994 2011 6 3
          1078 110007806188  5084 1994 2012 6 3
          1078 110007806188  5084 1994 2014 6 3
          1078 110008460744 88388 2002 1988 1 3
          1078 110008460744 88388 2002 1989 1 3
          1078 110008460744 88388 2002 1990 1 3
          1078 110008460744 88388 2002 1991 1 3
          1078 110008460744 88388 2002 1992 1 3
          1078 110008460744 88388 2002 1993 1 3
          1078 110008460744 88388 2002 1994 1 3
          1078 110008460744 88388 2002 1995 1 3
          1078 110008460744 88388 2002 1996 1 3
          1078 110008460744 88388 2002 1998 1 3
          1078 110008460744 88388 2002 1999 1 3
          1078 110008460744 88388 2002 2000 1 3
          1078 110008460744 88388 2002 2001 1 3
          1078 110018042562 28875 1994 1988 1 3
          1078 110018042562 28875 1994 1989 1 3
          1078 110018042562 28875 1994 1990 1 3
          1078 110018042562 28875 1994 1991 1 3
          1078 110018042562 28875 1994 1992 1 3
          1078 110018042562 28875 1994 1993 1 3
          1078 110018042562 28875 1994 1994 1 3
          1078 110018042562 28875 1994 1995 1 3
          1078 110018042562 28875 1994 1996 1 3
          1078 110018042562 28875 1994 1997 1 3
          1078 110018042562 28875 1994 1998 1 3
          1078 110018042562 28875 1994 1999 1 3
          1078 110018042562 28875 1994 2000 1 3
          1078 110018042562 28875 1994 2001 1 3
          1078 110018042562 28875 1994 2002 1 3
          1078 110018042562 28875 1994 2003 1 3
          1078 110018042562 28875 1994 2004 1 3
          1078 110018042562 28875 1994 2005 1 3
          1078 110018042562 28875 1994 2006 1 3
          1078 110018042562 28875 1994 2007 1 3
          1078 110018042562 28875 1994 2008 1 3
          1078 110018042562 28875 1994 2009 1 3
          1078 110018042562 28875 1994 2010 1 3
          1078 110018042562 28875 1994 2011 1 3
          1078 110018042562 28875 1994 2013 1 3
          end
          Now my data has a firm identifier (gvkey), establishment identifier (frsid), enforcement case identifier (activity_id), the year the monitoring started against the establishment (complaint_year) and the fiscal year (year). (num_cases) is the number of cases ever filed against each establishment and (num_facilities) is the number of facilities for each firm. My objective would be to analyze the behaviors of multi-establishment firms in the face of enforcement; do they reduce their emissions or shift emissions to their other plants where there has been an enforcement and what drives such choices.

          I am not sure how I can arrange this data for such an analysis but I think that the data for the enforcement cases should only be shown for years when the monitoring year coincides with the financial year. For other years, the data should be missing for the cases. Now I am not confident if this should be the case. I am looking for a panel structure at plant-level when i have to do plant-level analysis, and then at firm-level for firm-level analysis. Any help in how to best arrange this data will be greatly appreciated.

          Comment


          • #6
            I'm still having trouble understanding what you want to do. You say you want to do both firm-level and plant-level analysis, but it's not clear what questions you want to ask at each level.

            I also don't understand your data. Your case identifier 5084 (activity-id, gvkey =1078) was evidently brought in fiscal year 1994. Was it still active 20 years later in fy 2014? How was it associated with the firm in 1988? How does establishment 110018042562 have both 6 and 1 cases ever filed (num_cases) in 1988? I wonder if some of your merges did not do what you meant for them to do. Perhaps that has to do with the "various versions" of EPA data you mention. Are the versions of EPA data in fact different EPA datasets defined over different units of observation (i.e., plant vs firm) or over different variables or both?

            Also, I notice now that you have defined your identifier variables as numeric. I think there's less room for trouble if you define them as strings. For instance, when I read in your data, the frsid is 1.1...e+11. And I would stick with the original name for fiscal year-- fyear-- rather than calling it year. You are undoubtedly going to have items defined by calendar year, and mixing up the two is another potential source of trouble.

            Devra Golbe
            Professor Emerita, Dept. of Economics
            Hunter College, CUNY

            Comment

            Working...
            X