Hello friends! Today I am going to discuss the difference between the coefficient of determination and the correlation coefficient. This concept confused me when I was doing extreme value analysis for meteorological data. I finally cleared my doubt, and I felt there might be someone out there with the same doubt or misunderstanding about this topic.
Please be patient with the equations. I have tried to give sufficient theoretical explanation to match the math. After all, we need the math to implement this stuff on a computer.
Background:
General definitions and significance:
The correlation coefficient, or Pearson’s correlation coefficient, ‘r’ for two data sets y1 and y2 is defined as

$$ r = \frac{\sum_i (y_{1,i} - \bar{y}_1)(y_{2,i} - \bar{y}_2)}{\sqrt{\sum_i (y_{1,i} - \bar{y}_1)^2 \; \sum_i (y_{2,i} - \bar{y}_2)^2}} $$

This parameter takes values in the range [-1, 1], and its magnitude signifies how well the relation between y1 and y2 can be represented by (any arbitrary) straight line. The sign denotes whether they are directly related (y1 increases -> y2 increases) or inversely related (y1 increases -> y2 decreases).
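As a quick sanity check on this definition, here is a minimal Python sketch with made-up data; NumPy’s built-in np.corrcoef gives the same value:

```python
import numpy as np

y1 = np.array([1.0, 2.1, 2.9, 4.2, 5.1])   # made-up data set 1
y2 = np.array([2.0, 4.1, 6.2, 7.9, 10.1])  # made-up data set 2

# Pearson's r computed directly from the definition above
num = np.sum((y1 - y1.mean()) * (y2 - y2.mean()))
den = np.sqrt(np.sum((y1 - y1.mean())**2) * np.sum((y2 - y2.mean())**2))
r = num / den

print(r)                           # manual value
print(np.corrcoef(y1, y2)[0, 1])   # NumPy's built-in; should match
```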
The coefficient of determination R2 for data y as a function of x and a fitted function ŷ is defined as

$$ R^2 = \frac{S_t - S_r}{S_t} = 1 - \frac{S_r}{S_t}, \qquad S_t = \sum_i (y_i - \bar{y})^2, \qquad S_r = \sum_i (y_i - \hat{y}_i)^2 $$

Here, ȳ signifies the mean of y, and Sr is also called the SSE (sum of squared errors). This parameter signifies how much (what fraction) of the total variation in the data y w.r.t. x is explained by the fitted function ŷ.
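Again as a small sketch (the data and the fitted values here are hypothetical, just to show the formula):

```python
import numpy as np

y    = np.array([1.0, 2.1, 2.9, 4.2, 5.1])  # observed data (made up)
yhat = np.array([1.1, 2.0, 3.0, 4.0, 5.0])  # values predicted by some fitted function

St = np.sum((y - y.mean())**2)  # total variation of y around its mean
Sr = np.sum((y - yhat)**2)      # SSE: variation left unexplained by the fit
R2 = 1 - Sr / St

print(R2)
```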
For linear fits:
For a linear fit ŷ(x) to a data set y(x), r is the correlation coefficient between y and ŷ. For a linear fit, the coefficient of determination says how much (what fraction) of the total variation in the data y w.r.t. x is explained by the given straight line ŷ. This definition and the definition of ‘r’ above seem similar: both talk about linear relations. However, the main idea behind these two parameters differs. Let’s see in detail…
‘r’ tells us how close the data y(x) is to being linear, while R2 tells us how well a given linear fit ŷ predicts the actual data y.
We can observe from the above statements that, for ‘r’, the slope and intercept of ŷ do not matter as long as ŷ is linear. It is like saying, “to identify the unknown nature of the data y, we find its correlation with a straight line; if r comes out high, the data y is more like a straight line”. For that matter, even x itself serves as ŷ if one wants to find ‘r’ for the data set y. So ‘r’ can be determined as

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \; \sum_i (y_i - \bar{y})^2}} $$
Due to this nature of ‘r’, it helps in identifying whether a linear model is able to predict the data y well, even before fitting any line ŷ. Whereas R2 helps in identifying whether a given linear function is able to predict the data y well. Observe these two phrases carefully: “before fitting” versus “a given linear function”.
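A small sketch of this invariance, with made-up data: the slope and intercept of the straight line do not change the magnitude of r.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 3.9, 6.2, 7.8, 10.4])  # made-up, roughly linear data

# r between y and x directly -- no fit needed at all
print(np.corrcoef(x, y)[0, 1])

# r between y and two very different straight "fits"
for a, b in [(0.0, 2.0), (100.0, 0.5)]:
    yhat = a + b * x
    print(np.corrcoef(yhat, y)[0, 1])  # identical to r(x, y) for any slope b > 0
```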
Okay.
A linear fit can be determined for given data y by various methods. In the context of extreme value analysis, we use the least squares method and the order statistics (Lieblein) method.
Now let’s come to our original question.
Why do different linear fits have the same ‘r’ value?
Answer: because both are linear fits! Yes, ‘r’ is just checking whether the data shows linear variation or not. That’s all!
BUT, we are under the impression that the fit should be good to get a good ‘r’ value. No! Actually, a better fit gives a better R2 value, not a better ‘r’ value. Let’s see the basic question now: is r2 equal to R2 for a linear fit?
Yes, if the linear fit ŷ is obtained by a least squares fit of the data y (so, for the graphical method, r2 = R2). See the mathematical proof in Appendix 1.
Not necessarily, for any other straight line ŷ (so, for the numerical method, r2 ≠ R2 in general).
That is why we are seeing the same ‘r’ for both lines, but different SSE (or R2), as expected.
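Here is a minimal sketch of this whole point, with made-up data; the second line is deliberately a poor fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 3.9, 6.2, 7.8, 10.4])  # made-up, roughly linear data

def r_and_R2(y, yhat):
    r  = np.corrcoef(y, yhat)[0, 1]                               # correlation of y with the fit
    R2 = 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)    # 1 - Sr/St
    return r, R2

# Line 1: the least squares fit
b, a = np.polyfit(x, y, 1)          # slope, intercept
print(r_and_R2(y, a + b * x))       # r high, and here R2 = r**2

# Line 2: an arbitrary (worse) straight line
print(r_and_R2(y, 1.0 + 1.0 * x))   # same r, but much lower R2
```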
So, to conclude our case…
To check whether a given model (Gumbel, Frechet, Weibull, Log-normal) is suitable for a data set or not, the data set is pre-processed into the specific, respective coordinates (for example, ln(dP) and –ln(-ln(P(x))) for Gumbel) in which these models take a linear shape. Then, if in these coordinates the data correlates well with a straight line (r is high), the corresponding model is suitable for the data set. The highest correlation coefficient denotes the best model for this data set.
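As an illustration of this pre-processing step, here is a sketch for the Gumbel case, plotting the variable against the reduced variate −ln(−ln(P)) and assuming the common i/(n+1) plotting positions for P; swap in whichever transform and plotting-position formula your own analysis prescribes.

```python
import numpy as np

data = np.array([31.2, 28.5, 35.1, 40.3, 27.8, 33.6, 38.9, 29.4])  # hypothetical annual maxima

xs = np.sort(data)
n  = len(xs)
P  = np.arange(1, n + 1) / (n + 1)   # non-exceedance plotting positions (assumed i/(n+1))

reduced = -np.log(-np.log(P))        # Gumbel reduced variate: the model is linear here

# If the Gumbel model suits the data, xs vs. reduced is nearly a straight line
r = np.corrcoef(reduced, xs)[0, 1]
print(r)   # repeat with each candidate model's coordinates; the highest r wins
```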
Now, we have two approaches to get the fit, viz. least squares and order statistics (graphical and numerical, respectively). They may give different straight-line fits for the same data set. Here the coefficient of determination R2 comes to our help: the higher its value, the better the fit. We can alternatively use the SSE for the same purpose: the lower the SSE, the better the fit.
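So the selection step is just this (a sketch; the two (intercept, slope) pairs below are hypothetical stand-ins for whatever the two methods return):

```python
import numpy as np

def sse(y, yhat):
    return np.sum((y - yhat)**2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 3.9, 6.2, 7.8, 10.4])

fits = {"least squares":    (0.22, 2.03),   # hypothetical (intercept, slope)
        "order statistics": (0.35, 1.98)}

best = min(fits, key=lambda name: sse(y, fits[name][0] + fits[name][1] * x))
print(best)  # the fit with the lower SSE (equivalently, the higher R2)
```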
Is there any problem in interpreting the r or R2 values published in the literature?
After the above discussion, this doubt may arise in our mind.
Mostly, we use the least squares fit for regression. As far as simple linear regression is concerned, the values of r and R2 are directly related (r2 = R2), so there will not be any problem in our interpretation or intuition. However, we need to be cautious if the least squares method is not used OR if the fit is not linear. Even in the case of multiple linear regression, the pairwise r values and R2 are not related in this simple way. So, beware of this.
Appendix 1
Let’s find the correlation coefficient between y and ŷ:

$$ r = \frac{\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_i (y_i - \bar{y})^2 \; \sum_i (\hat{y}_i - \bar{\hat{y}})^2}} \qquad (1) $$

To obtain the least squares fit ŷ = a + bx, the fundamental relations (the normal equations) that we use are

$$ \sum_i (y_i - \hat{y}_i) = 0 \qquad (2) $$

$$ \sum_i (y_i - \hat{y}_i)\,x_i = 0 \qquad (3) $$

From these relations (2, 3), it is easy to show that

$$ \bar{\hat{y}} = \bar{y} \qquad (4) $$

$$ \sum_i (y_i - \hat{y}_i)\,\hat{y}_i = 0 \qquad (5) $$

(Relation (4) follows directly from (2); relation (5) follows because ŷᵢ = a + bxᵢ, so its left side is a times (2) plus b times (3).)

Using the relation (4) in relation (1), and splitting yᵢ − ȳ as (yᵢ − ŷᵢ) + (ŷᵢ − ȳ),

$$ r = \frac{\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_i (\hat{y}_i - \bar{y})^2}{\sqrt{\sum_i (y_i - \bar{y})^2 \; \sum_i (\hat{y}_i - \bar{y})^2}} $$

The first sum in the numerator of the above equation can be shown to be zero using the relations (4) and (5): expanding it gives Σᵢ(yᵢ − ŷᵢ)ŷᵢ − ȳ Σᵢ(yᵢ − ŷᵢ) = 0 − 0. This will result in

$$ r = \sqrt{\frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}} = \sqrt{\frac{S_t - S_r}{S_t}} = \sqrt{R^2} $$

(the last step uses the decomposition St = Sr + Σᵢ(ŷᵢ − ȳ)², which follows from the same relations). Hence, proved!

Here, the relations (4) and (5) are the central requirements for r = sqrt(R2), and the least squares method is built on exactly these relations. In this way, r and R2 are related only for the least squares fit.
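A quick numeric check of relations (2)-(5) and of r = sqrt(R2), again with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 3.9, 6.2, 7.8, 10.4])

b, a = np.polyfit(x, y, 1)   # least squares slope and intercept
yhat = a + b * x
resid = y - yhat

print(np.isclose(resid.sum(), 0))            # relation (2)
print(np.isclose((resid * x).sum(), 0))      # relation (3)
print(np.isclose(yhat.mean(), y.mean()))     # relation (4)
print(np.isclose((resid * yhat).sum(), 0))   # relation (5)

r  = np.corrcoef(y, yhat)[0, 1]
R2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(np.isclose(r**2, R2))                  # r = sqrt(R2) for the least squares fit
```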