Euclidean distance: Pre-processing methods/requirements of the components

Chan Ge

Join Date: Feb 2024

Posts: 27
#1


01 Mar 2024, 14:41

Dear all,

could anyone with expertise in Euclidean distance advise me on pre-processing the components?
Say ED = sqrt(x^2 + y^2 + z^2). Should x, y, z (each in the range [0,1)) be standardized, min-max normalized, or percentile-ranked to [0,1]?
I have seen standardization used in practice. Is standardization a requirement or a recommendation?
Thank you.
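
For readers comparing the three options named in the question, here is a minimal Python sketch of each rescaling. The function names and the sample values are made up for illustration, and the percentile rank shown is only one of several common definitions (the share of observations less than or equal to each value).

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def percentile_rank(values):
    """Share of observations less than or equal to each value (ties share a rank)."""
    n = len(values)
    return [sum(1 for w in values if w <= v) / n for v in values]


# Hypothetical component values in [0, 1); one observation is far from the rest.
x = [0.02, 0.05, 0.10, 0.80]

z_scores = standardize(x)
scaled = min_max(x)
ranks = percentile_rank(x)
```

The three transformations treat the outlying value differently: standardization and min-max scaling preserve its relative distance from the rest, while percentile ranking keeps only the ordering.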
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

01 Mar 2024, 18:41

The question, posed generically as you have done, has no answer. Or rather, it has infinitely many answers. The Euclidean distance can be calculated with any kind of variable. The decision of whether, or in which way, to transform the variables before you apply it depends on the meanings of those variables and the use to which you will be putting the resulting Euclidean distance.
Chan Ge

Join Date: Feb 2024

Posts: 27
#3

02 Mar 2024, 11:08

Thank you for pointing out the issue, Clyde. I'm sorry that I didn't realize this.
I want to measure two firms' asymmetric strengths on five dimensions:
x1 = the dissimilarity of the two firms' product portfolio descriptions
x2 = the difference in the two firms' intangible stock, each scaled by the firm's total assets
x3 = the difference in the two firms' profit margins
x4 = the difference in the two firms' advertising intensity (advertising expense / net sales)
x5 = the difference in the two firms' fixed assets intensity (property, plant, and equipment scaled by total assets)

Viewing the five dimensions as collectively important for a firm's operational success, I want to assess how different the two firms are in their collective resources.

I hope the above provides useful information. Please let me know if any further information would help in answering the question.
Thank you again.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

02 Mar 2024, 12:41

This kind of content is way beyond my area of expertise, so I'm not going to be able to give you specific advice here. What I can say is that it is likely that these five variables are all on very different scales. In fact, they aren't even all in the same units: X1 is some kind of abstract measure, X2, X4, and X5 are dimensionless, and X3 is in currency units.

Consequently, a Euclidean distance calculation is going to be most sensitive to the numerically largest one (or, more accurately, to the one that exhibits the largest differences among the firms). That might be just what you need, or it might be entirely destructive to the purposes of your research, or somewhere in between. So these variables may need to be rescaled so that their influence on the distance metric is proportional to their importance, where importance is defined in terms of relevance to whatever you are trying to relate this distance measure to in your research. This kind of importance weighting needs to be done by somebody who is familiar with the substance of your research; it is not a statistical question. It may be that other Forum members who work in your area will see this post and respond.
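
To make the sensitivity point concrete, here is a minimal Python sketch with two made-up firms and two made-up components, one a ratio in [0, 1] and one measured in currency units. The weights in the second calculation are purely illustrative; choosing real weights is the substantive judgment described above.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def weighted_euclidean(a, b, weights):
    """Euclidean distance with a weight applied to each squared difference."""
    return math.sqrt(sum(w * (p - q) ** 2 for p, q, w in zip(a, b, weights)))


# Hypothetical firms: (a ratio in [0, 1], an amount in currency units).
firm_a = (0.10, 1_000_000.0)
firm_b = (0.90, 1_000_500.0)

# Unscaled: the currency component contributes almost all of the distance,
# even though the ratio differs by 0.8 out of a possible 1.0.
d_raw = euclidean(firm_a, firm_b)

# Share of the squared distance that comes from the currency component.
currency_share = (firm_a[1] - firm_b[1]) ** 2 / d_raw ** 2

# Illustrative weights that shrink the currency difference to a comparable size.
d_weighted = weighted_euclidean(firm_a, firm_b, weights=(1.0, 1.0 / 500.0 ** 2))
```

In the unscaled case the ratio component is effectively ignored; after weighting, both components contribute on a similar order of magnitude.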

I will say that if your intent is to give all five of these variables equal influence on your difference measure, despite their differences in units and scale, you would probably be better off using the Mahalanobis distance, in that it is scale invariant and also accounts for correlations among the variables. It can be thought of as the multivariate generalization of the standardized difference. But, of course, I have no idea if this is appropriate for your purposes.
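
The scale invariance mentioned here can be checked with a small Python sketch. It is restricted to two variables so that the covariance matrix can be inverted by hand; the data and function names are made up for illustration, and a real analysis would use all five components and a matrix library.

```python
import math
from statistics import mean


def covariance_2d(xs, ys):
    """Sample variances and covariance of two variables: (sxx, sxy, syy)."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(a, b, xs, ys):
    """Mahalanobis distance between points a and b, using the covariance of (xs, ys)."""
    sxx, sxy, syy = covariance_2d(xs, ys)
    det = sxx * syy - sxy ** 2  # must be non-zero (variables not perfectly collinear)
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Quadratic form d' S^{-1} d with S^{-1} = (1/det) * [[syy, -sxy], [-sxy, sxx]].
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


# Hypothetical sample of five observations on two variables.
xs = [0.1, 0.4, 0.5, 0.9, 0.3]
ys = [2.0, 3.5, 3.0, 6.0, 2.5]

d_original = mahalanobis_2d((xs[0], ys[0]), (xs[3], ys[3]), xs, ys)

# Re-express the second variable in units 1000 times smaller.
ys_rescaled = [1000.0 * y for y in ys]
d_rescaled = mahalanobis_2d(
    (xs[0], ys_rescaled[0]), (xs[3], ys_rescaled[3]), xs, ys_rescaled
)
```

Up to floating-point error, `d_original` and `d_rescaled` agree, whereas a plain Euclidean distance would change by orders of magnitude under the same change of units.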
Chan Ge

Join Date: Feb 2024

Posts: 27
#5

02 Mar 2024, 13:54

Thank you, Clyde. Your explanation is very helpful and highly appreciated. Yes, I intend to give them equal influence. I had thought of standardization to account for differences in units and scale if using Euclidean distance. But your point about possible correlations among the component proxies is worth further consideration. I'll see whether and how the two approaches differ. Thanks a lot again.
Chan Ge

Join Date: Feb 2024

Posts: 27
#6

02 Mar 2024, 13:59

A further clarification of my description of x1 to x5: I would say all five components are dimensionless. x1 is in essence one minus the cosine similarity score of two pieces of text. x3 is profit margin scaled by net sales.
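
For readers unfamiliar with the x1 construction, here is a minimal Python sketch of one minus a cosine similarity. It assumes a simple word-count representation split on whitespace, which is only an illustration; the thread does not say how the product descriptions are actually vectorized, and the function name and sample texts are made up.

```python
import math
from collections import Counter


def cosine_dissimilarity(text_a, text_b):
    """One minus the cosine similarity of two texts' word-count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Words missing from a Counter count as zero, so only shared words add to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical product descriptions.
same = cosine_dissimilarity("cloud storage software", "cloud storage software")
partial = cosine_dissimilarity("cloud storage software", "cloud security software")
disjoint = cosine_dissimilarity("cloud storage software", "steel pipe manufacturing")
```

With non-negative word counts the result lies between 0 (same word proportions) and 1 (no shared words), so this component is bounded and unit-free.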