Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Correlation between two observations across multiple variables

    Hello fellow Stata users:

    I have a panel dataset of patenting footprints of various companies. Each line in the dataset is a unique "company-year" identified by variables Company and Year. The other variables (columns) are patent classes and each value in the dataset is the number of patents that Firm X had in Year Y in patent class Z. So the patent data are arranged "horizontally" for each firm-year.

    Now, here is my question. In Excel, I can easily compute correlations between any two firm-years. I simply type CORREL(..., ...) and then highlight the two horizontal arrays of firm-year patenting data. Is there a way to do the same in Stata? The "corr" command computes "vertical" correlations, that is correlations between specific variables across all observations. I need correlations between observations across all patent class variables. I have searched high and low but have not been able to figure out how to do it in Stata. Of course, I can export the data into Excel and compute those correlations manually in Excel but I would really like to be able to do it in Stata.

    Any help will be greatly appreciated! Thank you! I will be happy to clarify if necessary.

  • #2
    This question would benefit greatly from some sample data, say three different firms in four different years with five patent classes: 12 firm/year observations, each with 7 variables: firm, year, and patent counts for five patent classes. Even the best descriptions of data are no substitute for an actual example of the data. And that data should not be provided as a screen shot, an Excel spreadsheet, or a copy-and-paste into a table in the Statalist post.

    Instead, be sure to follow the advice of the Statalist FAQ and use the dataex command to do this. If you are running version 15.1 or a fully updated version 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use dataex.
    Last edited by William Lisowski; 09 Mar 2019, 18:32.

    Comment


    • #3
      William,

      Thank you for your response. I have included the dataex code below so you can see a sample of my data:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str33 Company int(Year P424 P435 P514 P546) byte P623
      "ABBOTT"     2000 15  8  61 19  2
      "ABBOTT"     2001 19 13  34 13  6
      "ABBOTT"     2002 12 13  31 17 10
      "ABBOTT"     2003  9 15  27 18 14
      "MERCK & CO" 2000 32 46 162 76  1
      "MERCK & CO" 2001 23 50 149 74  0
      "MERCK & CO" 2002 25 36 140 81  0
      "MERCK & CO" 2003 17 23 169 96  0
      "PFIZER"     2000 19 23 160 84  0
      "PFIZER"     2001 15 26 129 53  0
      "PFIZER"     2002 28 16 124 74  0
      "PFIZER"     2003 10  8 132 85  0
      end
      To restate my problem: I need to compute correlations between e.g. ABBOTT and PFIZER for year 2000 across all patent classes (each patent class is a variable whose name starts with P followed by a number). Instead of normal "vertical" correlations between variables, I would like to compute "horizontal correlations" between observations. I have hundreds of patent classes and many thousands of firm-year observations in my dataset.

      Thank you!

      Comment


      • #4
        Thanks for the excellent example data. Here is some sample code that I hope will start you in a useful direction. Discussion follows the code.
        Code:
        generate str20 cyear = "C_" + substr(word(Company,1),1,10) + "_" + strofreal(Year)
        drop Company Year
        reshape long P, i(cyear) j(class) string
        list, clean noobs
        reshape wide P, i(class) j(cyear) string
        list, clean noobs
        rename (PC_*) (C_*)
        replace class = "P"+class
        ds, varwidth(20)
        list class C_*_2000, abbreviate(17) noobs clean
        correlate C_*_2000
        Code:
        . ds, varwidth(20)
        class          C_ABBOTT_2002  C_MERCK_2001   C_PFIZER_2000  C_PFIZER_2003
        C_ABBOTT_2000  C_ABBOTT_2003  C_MERCK_2002   C_PFIZER_2001
        C_ABBOTT_2001  C_MERCK_2000   C_MERCK_2003   C_PFIZER_2002
        
        . list class C_*_2000, abbreviate(16) noobs clean
        
            class   C_ABBOTT_2000   C_MERCK_2000   C_PFIZER_2000  
             P424              15             32              19  
             P435               8             46              23  
             P514              61            162             160  
             P546              19             76              84  
             P623               2              1               0  
        
        . correlate C_*_2000
        (obs=5)
        
                     | C_A~2000 C_M~2000 C_P~2000
        -------------+---------------------------
        C_ABBOT~2000 |   1.0000
        C_MERCK_2000 |   0.9653   1.0000
        C_PFIZE~2000 |   0.9528   0.9843   1.0000
        The key is that Stata observations are not interchangeable with variables: correlate computes the correlation between variables across observations. (In Excel, there's no such distinction between variables and observations, CORREL just works on bunches of numbers, for all it matters one could be a column and the other a row.)

        So while you now have observations identified by Company and Year and variables identified by Patent Class (P*), you need to have variables identified by Company and Year and observations identified by Patent Class.

        The hard part is turning an arbitrary company name into part of a variable name. I did it by taking the first word or first 10 characters, whichever is shorter, and hoping that would be distinct. I end up with C_ in front of it so that if you have a company name that starts with a digit it will still be a legitimate variable name. The easier way is to use encode to assign each company name to a distinct number - but they you have to keep track of what's what. So you will need to pay attention to this part of the code, which is why I hope it starts you in a useful direction rather than promise it will be what you need.

        There was something else I meant to say, but it doesn't come to me at the moment. Run my code on your example data and understand the output. Ask any followup questions that come to mind. Then expand to your entire dataset.

        Comment


        • #5
          William,

          Thank you so much for your help and advice! I will run the code on the entire sample and will get back to report the outcome.

          Sincerely,
          Alex Martynov

          Comment


          • #6
            William,

            I have been able to do what you suggested on the entire sample. Thank you for your invaluable help!

            I did run into a few problems but I managed to solve all of them. The only surprising thing for me was the fact that Stata seems to need a certain number of observations in order to compute correlations. I got the "no observations" error when trying to calculate correlations for companies with relatively few observations. For example, for a company with 5 observations per year, Stata refused to report any correlations telling me that there were "no observations." Do you know if this is normal?

            Thank you!
            Alex Martynov

            Comment


            • #7
              When you specify the -correlate- command with a list of variables, only observations with no missing values on any of those variables are included in the calculations. So a firm may have five observations, but if each of them has a missing value for at least one of the variables mentioned in the -correlate- command, then there will be no observations for the purposes of calculating -correlate-.

              Comment


              • #8
                When I run the correlation command like this:

                corr *CORIXA*2006 *CORIXA*2007

                I get the correlation between these two years (two separate variables) of the company CORIXA. Yet when I try

                corr *CORIXA*

                I get the "no observations" error. It is strange since when I run the following command:

                corr *AMGEN*

                I get the full correlation table for every possible pair of years for company AMGEN.

                Why is it not working for CORIXA for at least 2006 and 2007 when I write *CORIXA*?

                Comment


                • #9
                  When you run -corr *CORIXA*- the list of variables to be included in the correlation includes all variables in your data set that includes "CORIXA" somewhere in its name. There are, presumably, of those than just the ones that also end in 2006 or 2007. So there are a lot more opportunities for an observation to drop out of the calculations: as I said in #7, if an observation has missing values for any of the variables mentioned, it is excluded from all of the correlations. So an observation might have no missing values for the *CORIXA*2006 and *CORIXA*2007 variables, but it might have a missing value on some other variable like CORIXA2008. In that case, the observation is not used for any of the correlations at all.

                  If you want to get calculations where an observation is always included for any correlation where it has non-missing values for the two particular variables involved, then use -pwcorr-, not -correlate-. Beware, however, that for some purposes, correlations calculated in this way are not usable. I do not see anywhere in this thread an explanation of how you plan to use these correlations once you get them, so I can't advise you whether the results from -pwcorr- will be valid for your purpose or not.

                  Comment


                  • #10
                    Thank you! I have run pwcorr *CORIXA* and it worked. I got exactly the same correlation for CORIXA_2006 and CORIXA_2007 as with corr *CORIXA*2006 *CORIXA*2007.

                    I will be using these correlations to measure the "closeness" of a company's patenting portfolios in different years (high correlation = close patenting portfolios). I will also be comparing patenting portfolios of different companies in the same year and in different years.

                    Comment


                    • #11
                      For these purposes, you are fine.

                      Just make sure you don't try to do any matrix algebra with the matrix of correlations or use that matrix as summary input data for procedures like factor analysis, or regressions.

                      Comment

                      Working...
                      X