Saturday, 18 July 2020

Difference between correlation coefficient & coefficient of determination

Hello friends, today I am going to discuss the difference between the coefficient of determination and the correlation coefficient. This concept confused me while I was doing extreme value analysis for meteorological data. I was finally able to clear my doubt, and I felt that someone out there might have the same doubt or misunderstanding about this topic.

Please be patient with the equations. I have tried to give enough theoretical explanation to go with the math. After all, we need the math to implement this on a computer.

General definitions and significance:
The correlation coefficient, or Pearson’s correlation coefficient, ‘r’ for two data sets ‘y1’ and ‘y2’ is defined as

$$ r = \frac{\sum_i (y_{1,i} - \bar{y}_1)(y_{2,i} - \bar{y}_2)}{\sqrt{\sum_i (y_{1,i} - \bar{y}_1)^2 \, \sum_i (y_{2,i} - \bar{y}_2)^2}} $$

where ȳ1 and ȳ2 are the means of y1 and y2. This parameter lies in the range [-1, 1], and its magnitude signifies how well the relation between y1 and y2 can be represented by some (arbitrary) straight line. The sign denotes whether they are directly related (y1 increases -> y2 increases) or inversely related (y1 increases -> y2 decreases).
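As a small illustration, here is a minimal Python sketch of Pearson's formula (sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations). It assumes NumPy is available; the function name `pearson_r` and the sample numbers are made up for this example.

```python
import numpy as np

def pearson_r(y1, y2):
    """Pearson correlation coefficient between two equal-length data sets."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    d1 = y1 - y1.mean()          # deviations of y1 from its mean
    d2 = y2 - y2.mean()          # deviations of y2 from its mean
    return np.sum(d1 * d2) / np.sqrt(np.sum(d1**2) * np.sum(d2**2))

# Made-up sample data for illustration only
y1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y2_direct = 2.0 * y1 + 1.0                       # exactly linear, increasing
y2_inverse = -0.5 * y1 + 4.0                     # exactly linear, decreasing
y2_noisy = np.array([1.2, 1.9, 3.4, 3.8, 5.3])   # roughly linear

r_direct = pearson_r(y1, y2_direct)
r_inverse = pearson_r(y1, y2_inverse)
r_noisy = pearson_r(y1, y2_noisy)
print(r_direct, r_inverse, r_noisy)
```

The exactly linear cases sit at the two ends of the range (the sign follows the direction of the relation), while the noisy case falls a little short of 1.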
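The coefficient of determination R² for data y as a function of x and a fitted function ŷ is defined as

$$ R^2 = 1 - \frac{S_r}{S_t}, \qquad S_r = \sum_i (y_i - \hat{y}_i)^2, \qquad S_t = \sum_i (y_i - \bar{y})^2 $$

Here, ȳ is the mean of y. Sr is also called SSE (sum of squared errors), and St is the total variation of y about its mean.
This parameter signifies what fraction of the total variation in the data y w.r.t. x is explained by the fitted function ŷ.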
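A minimal Python sketch of this quantity, assuming the usual form R² = 1 - Sr/St with Sr the sum of squared errors and St the total variation about the mean. The helper name `r_squared`, the data and the three candidate functions are all made up for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat for data y."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    s_r = np.sum((y - y_hat) ** 2)       # SSE: variation left unexplained
    s_t = np.sum((y - y.mean()) ** 2)    # total variation about the mean
    return 1.0 - s_r / s_t

# Made-up data and hypothetical candidate functions
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.7])

y_hat_good = 1.0 + 2.0 * x               # a line close to the data
y_hat_mean = np.full_like(y, y.mean())   # predicts the mean everywhere
y_hat_bad = 10.0 - 2.0 * x               # a line with the wrong trend

r2_good = r_squared(y, y_hat_good)
r2_mean = r_squared(y, y_hat_mean)
r2_bad = r_squared(y, y_hat_bad)
print(r2_good, r2_mean, r2_bad)
```

With this definition, R² is tied to one specific ŷ: a line that predicts worse than the plain mean of y even gives a negative value.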
For linear fits:
For a linear fit ŷ(x) to a data set y(x), r is the correlation coefficient between y and ŷ. For a linear fit, the coefficient of determination says what fraction of the total variation in the data y w.r.t. x is explained by a given straight line ŷ. This definition and the definition of ‘r’ mentioned above seem similar. Both talk about linear relations. However, the main idea behind these two parameters differs. Let’s see in detail…
‘r’ tells us how close the data ‘y(x)’ is to linear, while ‘R²’ tells us how well a given linear fit ŷ predicts the actual data y.
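We can observe from the above statements that, for ‘r’, the slope and intercept of ŷ do not matter as long as ŷ is linear in x with a non-zero slope (a negative slope only flips the sign of r). It is like saying, “to identify the unknown nature of data y, we find its correlation with a straight line. If r comes out high, it means data y is more like a straight line”. For that matter, even ‘x’ itself can serve as ŷ if one wants to find ‘r’ for the data set ‘y’. So, ‘r’ can be determined as

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$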
Due to this nature of ‘r’, it helps in identifying whether a linear model can predict the data ‘y’ well or not, even before fitting a line ŷ. R², on the other hand, helps in identifying whether a given linear function predicts the data ‘y’ well or not. Note the difference between “a linear model” and “a given linear function” carefully.
A linear fit can be determined for given data ‘y’ by various methods. In the context of extreme value analysis, we are using the least squares method and the order statistics (Lieblein method) approach.
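The point that ‘r’ does not care which straight line is used can be checked numerically. The sketch below is only an illustration with synthetic data (the seed, noise level and the three lines are arbitrary assumptions): it correlates y with x itself and with three different straight lines in x.

```python
import numpy as np

# Synthetic, roughly linear data (illustrative assumption, not real measurements)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 0.8 * x + rng.normal(0.0, 1.0, x.size)

# r computed directly between x and y, before any fitting
r_xy = np.corrcoef(x, y)[0, 1]

# Three arbitrary straight lines in x (none of them fitted to y)
line_a = 1.0 + 2.0 * x      # steep, increasing
line_b = -5.0 + 0.1 * x     # shallow, increasing
line_c = 4.0 - 3.0 * x      # decreasing

r_a = np.corrcoef(line_a, y)[0, 1]
r_b = np.corrcoef(line_b, y)[0, 1]
r_c = np.corrcoef(line_c, y)[0, 1]
print(r_xy, r_a, r_b, r_c)
```

The two increasing lines give the same r as x itself, and the decreasing line gives the same magnitude with the opposite sign.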

Now let’s come to our original question.
Why do different linear fits have the same ‘r’ value?
Answer: because both are linear fits! Yes. ‘r’ is just checking whether the data shows linear variation or not. That’s all!
BUT, we are under the impression that the fit should be good to get a good ‘r’ value. No!
Actually, a better fit gives a better R² value, not a better ‘r’ value. Let’s see the basic question now.
Is r² = R² or not?
Yes, if the linear fit ŷ is obtained by a least squares fit of the data y (so, for the graphical method, r² = R²). See the mathematical proof in Appendix 1.
Not necessarily, for any other straight line ŷ (so, for the numerical method, r² ≠ R² in general).
That is why we are seeing the same ‘r’ for both lines but different SSE (or R²), as expected.
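To make this concrete, here is a hedged Python sketch with synthetic data. The least squares line comes from `np.polyfit`; the second line is a deliberately perturbed copy of it and merely stands in for “some other straight-line fit” (it is not the Lieblein method).

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE / total variation about the mean."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 + 1.5 * x + rng.normal(0.0, 2.0, x.size)

# Least squares line (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, 1)
y_ls = intercept + slope * x

# A hypothetical alternative line: shifted intercept, slightly smaller slope
y_other = (intercept + 1.0) + 0.9 * slope * x

r_ls = np.corrcoef(y, y_ls)[0, 1]
r_other = np.corrcoef(y, y_other)[0, 1]
R2_ls = r_squared(y, y_ls)
R2_other = r_squared(y, y_other)

print("r:  ", r_ls, r_other)                 # same r for both lines
print("R^2:", R2_ls, R2_other)               # different R^2
print("r^2:", r_ls**2, r_other**2)           # matches R^2 only for least squares
```

Both lines share one r, the least squares line has the higher R², and r² agrees with R² only for the least squares line.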

So, to conclude our case…

To check whether a given model (Gumbel, Frechet, Weibull, Log-normal) is suitable for a data set or not, the data set is pre-processed into the specific coordinates (for example, ln(dP) and -ln(-ln(P(x))) for Gumbel) in which that model takes a linear shape. Then, if the data in these coordinates is well correlated with a straight line (r is high), the corresponding model is suitable for the data set. The model with the highest correlation coefficient is the best one for this data set.
Now, we have two approaches to get the fit, viz. least squares and order statistics (graphical and numerical respectively). They may give different straight-line fits for the same data set. This is where the coefficient of determination R² comes to our help. The higher its value, the better the fit. We can alternatively use SSE for the same purpose. The lower the SSE, the better the fit.

Is there any problem in interpreting the r or R² values published in the literature?
After the above discussion, this doubt may arise in our mind.
Mostly, we use least squares fits for regression. For a simple (one-variable) linear regression fitted by least squares, r and R² are directly related (R² = r²), so there will not be any problem in our interpretation or intuition. However, we need to be cautious if the least squares method is not used OR if the fit is not linear. Even in multiple linear regression, the r between y and any single predictor is not tied to R² in this simple way. So, beware of this.
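The multiple-regression caution can be illustrated with a small sketch. The data are synthetic and the two predictors `x1` and `x2` are hypothetical; the fit uses `np.linalg.lstsq` with an intercept column.

```python
import numpy as np

# Synthetic data with two hypothetical predictors (illustrative assumption)
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

# Least squares fit of y on [1, x1, x2]
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

R2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

r_y_yhat = np.corrcoef(y, y_hat)[0, 1]   # correlation with the fitted values
r_y_x1 = np.corrcoef(y, x1)[0, 1]        # correlation with one predictor
r_y_x2 = np.corrcoef(y, x2)[0, 1]        # correlation with the other predictor

print("R^2          :", R2)
print("r(y, y_hat)^2:", r_y_yhat**2)
print("r(y, x1)^2   :", r_y_x1**2)
print("r(y, x2)^2   :", r_y_x2**2)
```

In this sketch R² still equals the squared correlation between y and the fitted values, but neither single-predictor r² comes close to it, so a published r for one variable should not be read as the R² of the whole model.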

Appendix 1
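Let’s find the correlation coefficient between y and ŷ. By definition (all sums run over the data points, and a bar denotes a mean),

$$ r = \frac{\sum (y - \bar{y})(\hat{y} - \bar{\hat{y}})}{\sqrt{\sum (y - \bar{y})^2 \, \sum (\hat{y} - \bar{\hat{y}})^2}} $$ ---------- (1)

To obtain the least squares fit ŷ = a + bx, the fundamental relations that we use are

$$ \sum (y - \hat{y}) = 0 $$ ---------- (2)

$$ \sum (y - \hat{y})\, x = 0 $$ ---------- (3)

From these relations (2,3), it is easy to show that

$$ \bar{\hat{y}} = \bar{y} $$ ---------- (4)

$$ \sum (y - \hat{y})\, \hat{y} = 0 $$ ---------- (5)

Using relation (4) in relation (1), and writing y - ȳ = (y - ŷ) + (ŷ - ȳ),

$$ r = \frac{\sum (y - \hat{y})(\hat{y} - \bar{y}) + \sum (\hat{y} - \bar{y})^2}{\sqrt{\sum (y - \bar{y})^2 \, \sum (\hat{y} - \bar{y})^2}} $$

The first sum in the numerator of the above equation can be shown to be zero using relations (4) and (5). This results in

$$ r = \sqrt{\frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}} $$

The same vanishing cross term gives Σ(y - ȳ)² = Σ(y - ŷ)² + Σ(ŷ - ȳ)². With St = Σ(y - ȳ)² and Sr = Σ(y - ŷ)², this means Σ(ŷ - ȳ)² = St - Sr, so

$$ r = \sqrt{\frac{S_t - S_r}{S_t}} = \sqrt{R^2} $$

Hence, proved!
Here, relations (4) and (5) are the central requirements for r = sqrt(R²), and the least squares method is built on these relations. In this way, r and R² are related only for the least squares fit.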
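For readers who prefer a numerical check to algebra, the sketch below fits a least squares line to synthetic data and tests the properties the proof leans on. The reading used here is an assumption: the residuals sum to zero (so ŷ reproduces the mean of y), and the residuals are orthogonal to x and to the fitted values.

```python
import numpy as np

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(3)
x = np.linspace(1.0, 20.0, 30)
y = 5.0 - 0.7 * x + rng.normal(0.0, 1.5, x.size)

# Least squares line y_hat = a + b*x (np.polyfit returns slope first)
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
e = y - y_hat                              # residuals

sum_e = np.sum(e)                          # residuals sum to zero
sum_ex = np.sum(e * x)                     # residuals orthogonal to x
mean_gap = y_hat.mean() - y.mean()         # fitted values keep the mean of y
sum_eyhat = np.sum(e * y_hat)              # residuals orthogonal to fitted values

s_t = np.sum((y - y.mean()) ** 2)
s_r = np.sum(e ** 2)
R2 = 1.0 - s_r / s_t
r = np.corrcoef(y, y_hat)[0, 1]

print(sum_e, sum_ex, mean_gap, sum_eyhat)  # all zero up to rounding error
print(r, np.sqrt(R2))                      # equal for the least squares fit
```

Replacing the fitted `a` and `b` with any other values breaks these zero sums, which is where r = sqrt(R²) stops holding.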