Hello friends, today I am going to discuss the difference between the coefficient of determination and the correlation coefficient. This concept confused me when I was doing extreme value analysis for meteorological data. Once I finally cleared my doubt, I felt there might be someone out there with the same doubt or misunderstanding about this topic.

Please be patient with the equations. I have tried to give sufficient theoretical explanation to go with the math. After all, we need the math to implement this stuff on a computer.

**Background:**

__General definitions and significance:__
The correlation coefficient, or Pearson’s correlation coefficient, ‘r’ for two data sets ‘y1’ and ‘y2’ is defined as

$$r = \frac{\sum_i (y_{1i} - \bar{y}_1)(y_{2i} - \bar{y}_2)}{\sqrt{\sum_i (y_{1i} - \bar{y}_1)^2 \, \sum_i (y_{2i} - \bar{y}_2)^2}}$$

This parameter takes values in the range [-1, 1], and its magnitude signifies how well the relation between data y1 and y2 can be represented by some (arbitrary) straight line. The sign denotes whether they are directly related (y1 increases -> y2 increases) or inversely related (y1 increases -> y2 decreases).
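To make the definition concrete, here is a minimal from-scratch sketch in plain Python (the function name `pearson_r` is my own choice for illustration):

```python
import math

def pearson_r(y1, y2):
    """Pearson correlation coefficient between two equal-length data sets."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(y1, y2))
    den = math.sqrt(sum((a - m1) ** 2 for a in y1) *
                    sum((b - m2) ** 2 for b in y2))
    return num / den

# Perfectly inversely linear data (y2 = 12 - 2*y1) gives r = -1
y1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y2 = [10.0, 8.0, 6.0, 4.0, 2.0]
print(pearson_r(y1, y2))  # -1.0
```

Noisy data would give |r| < 1; the closer the points lie to a straight line, the closer |r| is to 1.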

The coefficient of determination R² for data y as a function of x, with fitted function ŷ, is defined as

$$R^2 = \frac{S_t - S_r}{S_t} = 1 - \frac{S_r}{S_t}, \qquad S_t = \sum_i (y_i - \bar{y})^2, \qquad S_r = \sum_i (y_i - \hat{y}_i)^2$$

Here, ȳ signifies the mean of y. S_r is also called SSE (sum of squared errors). This parameter signifies how much (what fraction) of the total variation in the data y w.r.t. x is explained by the fitted function ŷ.
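A small sketch of this definition in plain Python (function name and sample values are my own, purely illustrative):

```python
def coefficient_of_determination(y, y_hat):
    """R^2 = 1 - S_r / S_t, where S_r is the SSE of the fit and
    S_t is the total variation of y about its mean."""
    y_bar = sum(y) / len(y)
    s_r = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # SSE
    s_t = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    return 1.0 - s_r / s_t

y     = [2.1, 3.9, 6.2, 8.0, 9.8]
y_hat = [2.0, 4.0, 6.0, 8.0, 10.0]  # predictions from some fitted model
print(coefficient_of_determination(y, y_hat))  # close to 1: fit explains most variation
```

A value near 1 means the fitted function explains nearly all of the variation in y; a value near 0 means it explains almost none.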

__For linear fits:__
For a linear fit ŷ(x) to a data set y(x), r is the correlation coefficient between y and ŷ. For a linear fit, the coefficient of determination says how much (what fraction) of the total variation in the data y w.r.t. x is explained by the given straight line ŷ. This definition and the definition of ‘r’ mentioned above seem similar; both talk about linear relations. However, the main idea behind these two parameters differs. Let’s see in detail…

‘r’ tells us how close the data ‘y(x)’ is to being linear, while ‘R²’ tells us how well a given linear fit ŷ predicts the actual data y. We can observe from the above statements that for ‘r’, the slope and intercept of ŷ do not matter, as long as ŷ is linear. It is like, “to identify the unknown nature of data y, we find its correlation with a straight line. If r comes out high, it means data y is more like a straight line”. For that matter, even ‘x’ itself serves as ŷ, if one wants to find ‘r’ for the data set ‘y’. So, ‘r’ can be determined as

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

Due to this nature of ‘r’, it helps in identifying **whether a linear model** is able to predict the data ‘y’ well or not, even before fitting a line ŷ. Whereas, R² helps in identifying **whether a given linear function** is able to predict the data ‘y’ well or not. Observe these bold-font phrases carefully.
Okay.

A linear fit can be determined for a given data set ‘y’ by various methods. In the context of extreme value analysis, we use the least squares method and the order statistics (Lieblein method) approach.

**Now let’s come to our original question.**

**Why different linear fits have same ‘r’ value?**

**Answer:** because both are linear fits! Yes. ‘r’ is just checking whether the data shows linear variation or not. That’s all!

BUT, we are under the impression that the fit should be good to have a good ‘r’ value. No!
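A quick numerical sketch of this point (sample values are my own, purely illustrative): any two lines in x with positive slope, however badly they fit y, give exactly the same r with y, because slope and intercept cancel out of the correlation.

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return num / math.sqrt(sum((a - mu) ** 2 for a in u) *
                           sum((b - mv) ** 2 for b in v))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]        # roughly linear data

line1 = [2.0 * xi + 3.0 for xi in x]  # an arbitrary line, not a good fit
line2 = [0.5 * xi - 7.0 for xi in x]  # a very different arbitrary line

# Both lines give exactly the same r with y:
print(pearson_r(y, line1), pearson_r(y, line2))
```

Their SSE values, of course, differ wildly; only R² (or SSE) can tell the two lines apart.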

Actually, a better fit gives a better R² value, not a better ‘r’ value. Now let’s see the basic question: is r² equal to R²?

**Yes**, if the linear fit ŷ is obtained by a least squares fit of the data y (so, for the graphical method, r² = R²). See the mathematical proof in Appendix 1.

**Need not be**, for any other straight line ŷ (so, for the numerical method, r² ≠ R²).

That is why we are seeing the same ‘r’ for both lines but different SSE (or R²), as expected.
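This equality (and inequality) can be checked numerically. A sketch under my own illustrative data: r² matches R² for the least squares line, but not for an arbitrary other line.

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return num / math.sqrt(sum((a - mu) ** 2 for a in u) *
                           sum((b - mv) ** 2 for b in v))

def r_squared(y, y_hat):
    """Coefficient of determination R^2 = 1 - S_r / S_t."""
    y_bar = sum(y) / len(y)
    s_r = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    s_t = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - s_r / s_t

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)

# Least squares line through (x, y)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
ls_line = [a + b * xi for xi in x]

# Some other straight line, chosen arbitrarily
other_line = [1.0 * xi + 0.5 for xi in x]

print(pearson_r(y, ls_line) ** 2, r_squared(y, ls_line))        # equal
print(pearson_r(y, other_line) ** 2, r_squared(y, other_line))  # differ
```

For the least squares line the two printed numbers coincide; for the other line, R² drops while r² stays the same.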

*So, to conclude our case…*

To check whether a given model (Gumbel, Frechet, Weibull, Log-normal) is suitable for a data set, the data set is pre-processed into the specific coordinates (for example, ln(dP) and -ln(-ln(P(x))) for Gumbel) in which the respective model takes a linear shape. Then, if in these coordinates the data correlates well with a straight line (r is high), the corresponding model is suitable for the data set. The highest correlation coefficient denotes the best model for this data set.
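As a rough sketch of this model check for the Gumbel case: sort the sample, assign empirical probabilities, transform to the Gumbel reduced variate, and compute r. The sample values are hypothetical, and I assume the Weibull plotting position P_i = i/(n+1); other conventions (Gringorten, etc.) exist.

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return num / math.sqrt(sum((a - mu) ** 2 for a in u) *
                           sum((b - mv) ** 2 for b in v))

# Hypothetical annual-maximum sample (illustrative values only)
data = sorted([12.1, 14.3, 15.0, 16.8, 18.2, 19.5, 21.0, 23.4])
n = len(data)

# Weibull plotting positions P_i = i / (n + 1) -- one common convention
P = [(i + 1) / (n + 1) for i in range(n)]

# Gumbel reduced variate: the Gumbel model is linear in these coordinates
reduced = [-math.log(-math.log(p)) for p in P]

# A high r here suggests the Gumbel model suits this sample
print(pearson_r(data, reduced))
```

The same recipe with ln(data) on one axis would test a Frechet-type model; the model whose coordinates give the highest r wins.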

Now, we have two approaches to get the fit, *viz.* least squares and order statistics (graphical and numerical respectively). So, they may give different straight-line fits for the same data set. Here the coefficient of determination R² comes to our help: the higher its value, the better the fit. We can alternately use SSE for the same purpose: the lower the SSE, the better the fit.

*Is there any problem in interpreting the r or R² values published in the literature?*
After the above discussion, this doubt may arise in our mind.

Mostly, we use the least squares fit for regression. As far as simple linear regression is concerned, the values of r and R² are directly related, so there will not be any problem in our interpretation or intuition. However, we need to be cautious if the least squares method is not used OR if the fit is not linear. Even in the case of multiple linear regression, the pairwise correlation coefficients and R² are not simply related. So, beware of this.

**Appendix 1**

Let’s find the correlation coefficient between y and ŷ:

$$r = \frac{\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_i (y_i - \bar{y})^2 \, \sum_i (\hat{y}_i - \bar{\hat{y}})^2}} \qquad (1)$$

To obtain the least squares fit, the fundamental relations (the normal equations) that we use are

$$\sum_i (y_i - \hat{y}_i) = 0 \qquad (2)$$

$$\sum_i (y_i - \hat{y}_i)\, x_i = 0 \qquad (3)$$

From these relations (2, 3), it is easy to show that

$$\bar{\hat{y}} = \bar{y} \qquad (4)$$

$$\sum_i (y_i - \hat{y}_i)\, \hat{y}_i = 0 \qquad (5)$$

Using relation (4) in relation (1), and splitting (y_i - ȳ) as (y_i - ŷ_i) + (ŷ_i - ȳ),

$$r = \frac{\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_i (\hat{y}_i - \bar{y})^2}{\sqrt{\sum_i (y_i - \bar{y})^2 \, \sum_i (\hat{y}_i - \bar{y})^2}}$$

The first sum in the numerator of the above equation can be shown to be zero using relations (2) and (5). The same argument gives $\sum_i (\hat{y}_i - \bar{y})^2 = S_t - S_r$, since the cross term in $S_t = \sum_i \left[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\right]^2$ also vanishes. This results in

$$r = \sqrt{\frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}} = \sqrt{\frac{S_t - S_r}{S_t}} = \sqrt{R^2}$$

Hence, proved!

**Here, the relations (2)–(5) are the central requirements for r = sqrt(R²), and the least squares method is defined by these relations. In this way, r and R² are related only for the least squares fit.**
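These relations can be verified numerically for any least squares line. A sketch with my own illustrative data:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)

# Least squares slope and intercept (solving the normal equations)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, y_hat)]           # residuals

print(sum(e))                                       # relation (2): ~ 0
print(sum(ei * xi for ei, xi in zip(e, x)))         # relation (3): ~ 0
print(sum(y_hat) / n - y_bar)                       # relation (4): ~ 0
print(sum(ei * fi for ei, fi in zip(e, y_hat)))     # relation (5): ~ 0
```

All four printed values are zero up to floating-point round-off; for any straight line other than the least squares one, they generally are not, which is exactly why r = sqrt(R²) fails there.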