Correlation and Linear Regression, Scatterplot


Pearson's product-moment correlation coefficient r:
a measure of the strength and direction of the linear association/relationship between two variables

-1 ≤ r ≤ 1

X the independent/treatment/explanatory/predictor variable (can be controlled or manipulated)
Y the dependent/response variable.
Linear:
   Scatterplot points are straight line-ish. For each unit change in X, Y changes by a constant amount on average.
Direction:
   x,y both go up together or both go down together: positive correlation, positive r.
   One goes up, the other goes down: negative correlation, negative r.
Strength: the closer |r| is to 1, the stronger the relationship/association. Observed Y values are close to the values predicted from X, so the residuals are small.
Correlation might mean one variable causes the other but by itself is insufficient to establish causation.

Type or paste the X and Y data. Can be multiple lines for colored groups in scatterplot.

      

n=    ∑x=    ∑y=    ∑x²=    ∑y²=    ∑xy=
SSx=∑(x-x̄)²:    SSy=∑(y-ȳ)²:     SSxy=∑(x-x̄)(y-ȳ):       ∑zxzy=
Centroid: (x̄= ,    ȳ= )    sx=     sy=    
Covariance=∑(x-x̄)(y-ȳ)/(n-1) =     (r=Cov/(sx·sy))    CovN
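The quantities listed above can be computed by hand. The following is a minimal Python sketch, not the page's own code, using the six-point "test data" set from further down this page. The variable names are illustrative, and the z-score route assumes the sample standard deviations (dividing by n-1), so that r = ∑zxzy/(n-1).

```python
import math

# Data copied from the second "test data" set in this document.
x = [34, 108, 64, 88, 99, 51]
y = [5, 17, 11, 8, 14, 5]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Centroid: the point of means.
x_bar, y_bar = sum_x / n, sum_y / n

# Sums of squares and cross-products about the means.
ss_x = sum((a - x_bar) ** 2 for a in x)
ss_y = sum((b - y_bar) ** 2 for b in y)
ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

# Sample standard deviations and sample covariance (divide by n - 1).
s_x = math.sqrt(ss_x / (n - 1))
s_y = math.sqrt(ss_y / (n - 1))
cov = ss_xy / (n - 1)

# Three equivalent routes to Pearson's r.
r_from_ss = ss_xy / math.sqrt(ss_x * ss_y)
r_from_cov = cov / (s_x * s_y)
z_products = [((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(x, y)]
r_from_z = sum(z_products) / (n - 1)

print(f"n={n}  centroid=({x_bar}, {y_bar})  SSx={ss_x}  SSy={ss_y}  SSxy={ss_xy}")
print(f"r = {r_from_ss:.4f} = {r_from_cov:.4f} = {r_from_z:.4f}")
```

All three routes give the same r because the (n-1) factors cancel.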

r=    
r²= Coefficient of determination: the proportion of the variation in y explained by the regression line (the proportion of the variance in Y attributable to the variance in X). The linear relationship between the variables explains 100·r² % of the variation in y.
SST (total variation SSy, ∑(y-ȳ)²) = SSR (explained variation ∑(ŷ-ȳ)²) + SSE (unexplained variation ∑(y-ŷ)²)
    r²= SSR/SST= explained variation / total variation
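The decomposition SST = SSR + SSE can be checked numerically. This is a small sketch under the same assumptions as the formulas above (least-squares line, slope SSxy/SSx), again using the six-point "test data" set from this page; the variable names are illustrative.

```python
# Data copied from the second "test data" set in this document.
x = [34, 108, 64, 88, 99, 51]
y = [5, 17, 11, 8, 14, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((a - x_bar) ** 2 for a in x)
ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

# Least-squares line and its predictions.
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

sst = sum((b - y_bar) ** 2 for b in y)              # total variation
ssr = sum((p - y_bar) ** 2 for p in y_hat)          # explained variation
sse = sum((b - p) ** 2 for b, p in zip(y, y_hat))   # unexplained variation

r_squared = ssr / sst
print(f"SST={sst:.3f}  SSR={ssr:.3f}  SSE={sse:.3f}  SSR+SSE={ssr + sse:.3f}")
print(f"r^2 = {r_squared:.4f}, i.e. about {100 * r_squared:.1f}% of the variation in y")
```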
MSE=SSE/df:   RMSE (standard error of estimate, se)=√(SSE/df): (spread of points around regression line)
sβ=se/√SSx:    tc=t.05/2:    C.I. slope ± tc·sβ: (, )   
t=β/sβ=r/√((1-r²)/(n-2)):   If |t|>tc, reject H0    df=n-2:          F=
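The slope test above can be sketched as follows. This is an illustration, not the page's implementation. It assumes the six-point "test data" set from this page (so df = 4) and hard-codes the two-tailed critical value 2.776 from a standard t table for α = .05, df = 4. It also assumes F is the usual MSR/MSE ratio, which for one predictor equals t².

```python
import math

# Data copied from the second "test data" set in this document.
x = [34, 108, 64, 88, 99, 51]
y = [5, 17, 11, 8, 14, 5]

n = len(x)
df = n - 2
# Assumption: t table value for alpha = .05 (two-tailed), df = 4.
# It must be replaced if n or alpha changes.
t_crit = 2.776

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((a - x_bar) ** 2 for a in x)
ss_y = sum((b - y_bar) ** 2 for b in y)
ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

mse = sse / df
rmse = math.sqrt(mse)            # standard error of estimate
s_b1 = rmse / math.sqrt(ss_x)    # standard error of the slope

# The t statistic computed two ways; they agree.
t_from_slope = b1 / s_b1
r = ss_xy / math.sqrt(ss_x * ss_y)
t_from_r = r / math.sqrt((1 - r * r) / df)

# F = MSR / MSE, where MSR = SSR / 1 and SSR = SSy - SSE.
f_stat = (ss_y - sse) / mse

ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
reject_h0 = abs(t_from_slope) > t_crit

print(f"RMSE={rmse:.4f}  s_b1={s_b1:.5f}  t={t_from_slope:.4f}  F={f_stat:.4f}")
print(f"95% CI for slope: ({ci[0]:.4f}, {ci[1]:.4f})  reject H0: {reject_h0}")
```

With these data the interval for the slope excludes 0, which agrees with rejecting H0.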

H0: ρ=0 (no linear correlation), HA: ρ≠0 (a linear correlation exists).    ρ is "rho", the population correlation coefficient.
r critical value α=.05: α=.01: (or look in r table)    Reject H0 if |r|>|crit.val|
p-value:   If p < α, reject H0.

Scatterplot

Line of best fit (trendline). Use it for prediction only if r is significant.
Linear regression line: ŷ = b1x + b0     slope b1=r(sy/sx)=SSxy/SSx:    y-intercept b0=ȳ-b1x̄=
It goes thru the centroid.     The slope b1 is the same sign as r.
A residual is the vertical difference between a y and the line.
The line minimizes the sum of the squared residuals: ∑(y-ŷ)²=     SD of the residuals:
The greater the |r|, the more clustering of the data around the regression line.
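The properties of the regression line listed above (two slope formulas, passing through the centroid, same sign as r, least squares) can be checked with a short sketch. It uses the nine-point "test data" set from this page. The nudge sizes used to probe the minimum are arbitrary choices made for illustration.

```python
import math

# Data copied from the first "test data" set in this document.
x = [35, 12, 65, 47, 21, 32, 52, 15, 57]
y = [210, 160, 285, 255, 180, 220, 190, 170, 275]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((a - x_bar) ** 2 for a in x)
ss_y = sum((b - y_bar) ** 2 for b in y)
ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

s_x = math.sqrt(ss_x / (n - 1))
s_y = math.sqrt(ss_y / (n - 1))
r = ss_xy / math.sqrt(ss_x * ss_y)

# Slope and intercept of y-hat = b1*x + b0.
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
slope_from_r = r * s_y / s_x


def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances from the points to a line."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))


best = sum_squared_residuals(b1, b0)
# Any other line, here a few nudged ones, has a larger sum.
nudged = [
    sum_squared_residuals(b1 + ds, b0 + di)
    for ds in (-0.1, 0.0, 0.1)
    for di in (-1.0, 0.0, 1.0)
    if (ds, di) != (0.0, 0.0)
]

print(f"y-hat = {b1:.4f}x + {b0:.4f}   r = {r:.4f}")
print(f"line at x-bar: {b1 * x_bar + b0:.4f}   y-bar: {y_bar:.4f}")
print(f"least squares: {best:.2f}   smallest nudged: {min(nudged):.2f}")
```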

X min: X max: Y min: Y max:
Changing the axis ranges changes the appearance of the line (its apparent "angle") but not its slope.

mean distance from (x̄,ȳ) =    SD of distance=   

Covariance matrix Σ:
Eigenvalues: λ1          λ2
Eigenvectors:
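For two variables the covariance matrix is 2×2 and symmetric, so its eigenvalues and eigenvectors have a closed form. The sketch below assumes the sample covariance matrix (dividing by n-1); the page may use a different divisor. It uses the six-point "test data" set, and the helper name unit_eigenvector is hypothetical.

```python
import math

# Data copied from the second "test data" set in this document.
x = [34, 108, 64, 88, 99, 51]
y = [5, 17, 11, 8, 14, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Entries of the covariance matrix [[a, b], [b, c]] (assumed divisor n - 1).
a = sum((v - x_bar) ** 2 for v in x) / (n - 1)                      # var(x)
c = sum((v - y_bar) ** 2 for v in y) / (n - 1)                      # var(y)
b = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y)) / (n - 1)  # cov(x, y)

# Eigenvalues of a symmetric 2x2 matrix.
mean_diag = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mean_diag + radius, mean_diag - radius


def unit_eigenvector(lam):
    """Unit eigenvector for eigenvalue lam; valid when b != 0."""
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)


v1, v2 = unit_eigenvector(lam1), unit_eigenvector(lam2)

print(f"covariance matrix = [[{a:.3f}, {b:.3f}], [{b:.3f}, {c:.3f}]]")
print(f"lambda1 = {lam1:.4f}  v1 = ({v1[0]:.4f}, {v1[1]:.4f})")
print(f"lambda2 = {lam2:.4f}  v2 = ({v2[0]:.4f}, {v2[1]:.4f})")
```

The eigenvector for the larger eigenvalue points along the direction of greatest spread in the scatterplot.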


ŷ is the value of y predicted by the regression formula. Residual (or prediction error): the difference between y and ŷ, i.e. y-ŷ.

Residual plot: (x,y-ŷ) i.e. (x,residual)   (x,actual y - predicted y)
Regression is good (i.e. relationship between x and y is linear and regression line can be used for prediction) if the residual plot has no pattern and doesn't widen or narrow.
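To illustrate a residual plot with a pattern, the sketch below fits a line to the first Anscombe data set listed on this page (the curved one) and prints a crude text version of the residuals ordered by x. The residuals are negative at both ends and positive in the middle, which is the kind of pattern that signals a non-linear relationship. The text bars are only a stand-in for a real plot.

```python
# Data copied from the first "Anscombe" set in this document.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((a - x_bar) ** 2 for a in x)
ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

# Residual = actual y minus predicted y.
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
by_x = dict(zip(x, residuals))

# Crude residual "plot": one row per point, sorted by x.
for a, e in sorted(zip(x, residuals)):
    print(f"x={a:2d}  residual={e:+.2f}  {'#' * round(abs(e) * 10)}")

print(f"sum of residuals = {sum(residuals):.10f}")
```

The residuals of a least-squares line always sum to zero (up to rounding), so the sum alone says nothing about fit; the pattern does.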


Spearman Rank Correlation Coefficient

Use when the populations are not normal; it works on ranks rather than raw values.
    d=difference in ranks.
If both data sets have the same ranks, rs will be +1; if they have exactly opposite ranks, rs will be -1.

rs:     Critical value α=0.05:
If |rs| > C.V., reject H0. There is a significant monotonic (rank) correlation, which need not be linear.
If |rs| < C.V., fail to reject H0. There is not enough evidence of a rank correlation.
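The rank calculation can be sketched as below, using the nine-pair Spearman data set listed later on this page (labelled Rs=.817, CV=.700). The sketch assumes the standard no-ties shortcut rs = 1 - 6∑d²/(n(n²-1)); the simple ranks helper is hypothetical and would need average ranks if the data contained ties.

```python
# Data copied from the second "Spearman" set in this document.
x = [361, 270, 306, 22, 35, 10, 8, 12, 21]
y = [2844, 1967, 1371, 1064, 667, 241, 188, 154, 125]


def ranks(values):
    """Rank from 1 (smallest) to n. Assumes there are no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


rank_x, rank_y = ranks(x), ranks(y)

# d = difference in ranks for each pair.
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]
sum_d2 = sum(v * v for v in d)

n = len(x)
r_s = 1 - 6 * sum_d2 / (n * (n * n - 1))

# Critical value for n = 9, alpha = 0.05, as quoted with the data set.
critical_value = 0.700
reject_h0 = abs(r_s) > critical_value

print(f"ranks of x: {rank_x}")
print(f"ranks of y: {rank_y}")
print(f"sum of d^2 = {sum_d2}   rs = {r_s:.3f}   reject H0: {reject_h0}")
```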


Prediction.

x0: should lie within the range of the observed X values (interpolation). Avoid extrapolation beyond that range.

ŷ:

CL 95% Margin of error E:    CL 95% Prediction interval (PI): (,)

CL 99% Margin of error E:    CL 99% Prediction interval (PI): (,)
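A prediction interval for a single new observation at x0 is commonly computed as ŷ ± E with E = tc·se·√(1 + 1/n + (x0-x̄)²/SSx). The sketch below assumes that formula, the six-point "test data" set from this page, an illustrative x0 = 80, and hard-coded t table values for df = 4 (2.776 for 95%, 4.604 for 99%). Whether the page computes E exactly this way is not stated.

```python
import math

# Data copied from the second "test data" set in this document.
x = [34, 108, 64, 88, 99, 51]
y = [5, 17, 11, 8, 14, 5]
x0 = 80  # illustrative value inside the observed X range

n = len(x)
df = n - 2
# Assumption: two-tailed t table values for df = 4.
t_crit = {95: 2.776, 99: 4.604}

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((a - x_bar) ** 2 for a in x)
ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
rmse = math.sqrt(sse / df)  # standard error of estimate


def margin(t_c, at_x):
    """Margin of error E for predicting one new y at at_x."""
    return t_c * rmse * math.sqrt(1 + 1 / n + (at_x - x_bar) ** 2 / ss_x)


y0_hat = b0 + b1 * x0
e95 = margin(t_crit[95], x0)
e99 = margin(t_crit[99], x0)

print(f"x0={x0}  y-hat={y0_hat:.3f}")
print(f"95%: E={e95:.3f}  PI=({y0_hat - e95:.3f}, {y0_hat + e95:.3f})")
print(f"99%: E={e99:.3f}  PI=({y0_hat - e99:.3f}, {y0_hat + e99:.3f})")
```

The margin is smallest at x0 = x̄ and grows as x0 moves away from the centroid, which is one reason extrapolation is risky.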


Examples      Anscombe's Quartet      1969 draft
Generate correlated data

no linear association/relationship if r≈0

r unchanged:  X±c and/or Y±c;   aX and/or bY (a,b same sign, ab>0)
r sign flips: aX and/or bY (a,b different signs, ab<0)
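These invariance rules are easy to confirm numerically. The sketch below uses the nine-point "test data" set from this page. The helper pearson_r and the particular shifts and multipliers are illustrative choices.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from sums of squares about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_x = sum((a - x_bar) ** 2 for a in xs)
    ss_y = sum((b - y_bar) ** 2 for b in ys)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return ss_xy / math.sqrt(ss_x * ss_y)


# Data copied from the first "test data" set in this document.
x = [35, 12, 65, 47, 21, 32, 52, 15, 57]
y = [210, 160, 285, 255, 180, 220, 190, 170, 275]

r = pearson_r(x, y)

# Adding constants, or scaling with multipliers of the same sign: r unchanged.
r_shifted = pearson_r([a + 5 for a in x], [b - 3 for b in y])
r_scaled = pearson_r([2 * a for a in x], [3 * b for b in y])
r_both_negative = pearson_r([-2 * a for a in x], [-3 * b for b in y])

# Multipliers of opposite sign: the sign of r flips.
r_opposite = pearson_r([-2 * a for a in x], [3 * b for b in y])

# Swapping the roles of X and Y: r unchanged.
r_swapped = pearson_r(y, x)

print(f"r={r:.4f}  shifted={r_shifted:.4f}  scaled={r_scaled:.4f}")
print(f"both negative={r_both_negative:.4f}  opposite signs={r_opposite:.4f}")
print(f"swapped={r_swapped:.4f}")
```

A change of measuring units is a rescaling by a positive constant, so it falls under the "r unchanged" case.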

r≈0: no linear relationship but might be non-linear

r measures association, not causation ("cause-and-effect")
r is sensitive to outliers and can be greatly affected by them
r unaffected by swapping X and Y
r unaffected by changing measuring units
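The sensitivity of r to a single point can be seen with the first and third "Clusters" data sets from this page: a 3×3 grid of points has r = 0, and one far-away point pushes r to about ±0.91. The sketch below recomputes this; the helper pearson_r is an illustrative function, not the page's code.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from sums of squares about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_x = sum((a - x_bar) ** 2 for a in xs)
    ss_y = sum((b - y_bar) ** 2 for b in ys)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return ss_xy / math.sqrt(ss_x * ss_y)


# First "Clusters" set without its last point: a 3x3 grid.
grid_x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
grid_y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
r_grid = pearson_r(grid_x, grid_y)

# The same grid plus the far point (10, 10), as listed in the document.
r_with_far_point = pearson_r(grid_x + [10], grid_y + [10])

# Third "Clusters" set: a grid at x = 8..10 plus the far point (1, 10).
x3 = [1, 8, 8, 8, 9, 9, 9, 10, 10, 10]
y3 = [10, 1, 2, 3, 1, 2, 3, 1, 2, 3]
r_third = pearson_r(x3, y3)

print(f"grid only:          r = {r_grid:.3f}")
print(f"grid plus (10, 10): r = {r_with_far_point:.3f}")
print(f"grid plus (1, 10):  r = {r_third:.3f}")
```

In both cases nine of the ten points show no linear association at all, so a large |r| here reflects the single distant point rather than the bulk of the data.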

|r| .2 to .3 "weakly" correlated
|r| .3 to .5 "moderately" correlated
|r| .7+      "strongly" correlated

Homoscedasticity: the spread of data points about the regression line is the same 
  thruout the interval of the independent variable.

"outliers" within  the pattern of data strengthen r
"outliers" outside the pattern of data weaken r

positive residual: regression line underestimated actual value
negative residual: regression line overestimated  actual value

slope of regression line is average increase/decrease of response variable Y 
 for one unit increase of X


lottery
334 127 300 227 202 180 164 145 255
54   16  41  27  23  18  18  16  26

Clusters
1 1 1 2 2 2 3 3 3 10
1 2 3 1 2 3 1 2 3 10

1 1 2 2 9 9 10 10
1 2 1 2 9 10 9 10

1  8 8 8 9 9 9 10 10 10
10 1 2 3 1 2 3  1  2  3

Anscombe
10   8    13   9    11   14  6    4   12   7    5
9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 4.74

10   8    13    9    11   14   6    4    12   7    5
7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73

small MPG
2844 3109 2870 3095 2915 2985 2563 3009 2798 2468 2598 2558
  36   38   41   33   42   31   40   37   35   40   38   37
r=-.395

large MPG
3608 3962 4253 4006 3754 3859 3874 4674 4321 4346 3891 3957
  32   27   25   31   28   30   30   24   27   26   33   31

tar
20  27  27  20  20  24  20  23  20  22  20  20  20  20  20  20  24  20  21  25  23  20  22  20  10
1.1 1.7 1.7 1.1 1.1 1.4 1.1 1.4 1.0 1.2 1.1 1.1 1.1 1.1 1.1 1.8 1.6 1.1 1.2 1.5 1.3 1.1 1.3 1.1 1.8
(10,1.8) outlier--big effect

year(1960=1),CPI
1    13    26    35    42  43  49    53  55  59
29.6 44.4 109.6 152.4 180 184 214.5 233 237 252.2

video BP
138 130 135 140 120 125 120 130 130 144 143 140 130 150
82   91 100 100  80  90  80  80  80  98 105  85  70 100

video boats-manatees
68 68 68 70 71 73 76 81 83 84
53 38 35 49 42 60 54 67 82 78
The video reports RMSE (se)=6.61; this page computes 6.85. The video uses the wrong numbers for m and b...

POTUS-opponent heights
177 191 169 190 196 174 179 181 171 177 168 184 195 180 173 177
183 190 169 177 179 177 182 181 194 195 190 177 181 193 184 194

quadratic?
10   8    13   9    11   14   6    4    12   7    5
9.15 8.14 8.74 8.77 9.27 8.09 6.13 3.09 9.13 7.26 4.74

test data:
35   12  65  47  21  32  52  15  57
210 160 285 255 180 220 190 170 275

34 108 64 88 99 51
 5  17 11  8 14  5

uniform random 0-10
1 5 2 6 7 1 2 2 1 2 8 3 3 0 6 5 3 6 9 1 2 4 3 5 2 10 2 1 4 5
3 6 10 7 1 4 7 2 9 2 8 3 1 1 7 8 10 2 4 7 3 0 1 6 6 10 6 7 7 1

Spearman rank correlation coefficient.   Rs=0.657  CV=0.738
209 353 19 201 344 132 401 126
 23  31  7  12  26   5  24   4

Spearman    Rs=.817  CV=.700
361   270  306   22  35  10   8  12  21
2844 1967 1371 1064 667 241 188 154 125

1 1 1 1 2 2 2 2 3  3 3 3 4 4 4 4 5 5 5  5 6 6 6 6 7 7 7 7 8  8 8 8 1 2 3 4 5 6 7  8
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

**************************************
Exercise data

r≈0
116 99 99 96 97 86 98 102 95 107 97 102 100 108 94 93 95 93 101 114 109 113 94 90 99 108 81 107 98 90 109 118 103 99 93 104 100 84 99 92 98 105 117 90 108 121 111 107 92 136 99 102 81 106 93 119 104 108 99 102 86 95 96 97 90 99 96 100 96 105 123 109 93 105 115 100 94 110 89 94 93 95 99 87 95 104 88 114 95 98 105 93 118 92 115 95 88 84 81 92
11.3 9.3 9.3 9.9 9.4 10.9 9.9 10.2 9.4 9.6 9.8 9.1 9.4 9.6 9.9 10.1 8.3 10.7 9.7 10.1 11.5 9.9 10.9 10.4 10.9 10.8 9.2 9.8 10.7 10.5 9.7 9.4 11.3 10.2 8.7 10.1 10.4 8.7 9.3 10.4 9.7 11.4 9.3 9.6 9.2 9.5 10.5 10.0 7.5 10.0 9.2 12.3 9.1 10.1 9.9 10.9 10.7 8.8 10.8 10.3 9.7 10.2 9.2 9.2 10.8 8.0 11.0 11.2 10.4 9.7 9.4 11.4 9.7 9.4 10.5 10.0 9.6 10.1 10.3 9.7 9.4 10.0 11.4 11.5 9.8 10.0 9.1 8.7 10.1 10.8 9.7 10.3 9.4 9.7 10.8 10.1 10.2 11.5 10.4 9.9

r≈.2
102 104 97 111 103 96 93 81 102 95 109 104 110 103 98 100 91 93 108 119 94 100 81 111 86 108 80 103 96 99 112 104 94 104 84 118 71 91 98 109 106 106 90 81 104 115 105 88 118 97 113 97 104 89 105 103 100 122 97 88 102 109 104 114 91 88 106 108 82 104 107 110 90 89 113 103 90 81 94 117 115 100 102 98 102 97 94 93 105 103 96 123 85 91 102 106 100 98 99 95
9.7 9.8 10.9 10.9 10.7 10.1 9.3 9.2 9.6 9.7 9.1 9.4 10.3 9.8 8.6 10.7 7.5 11.8 12.0 9.8 10.5 11.0 10.7 10.5 10.1 10.6 10.4 9.9 10.3 8.9 9.5 10.0 11.0 11.6 7.7 11.5 10.3 8.2 10.6 9.9 10.4 8.0 8.8 9.5 10.5 10.9 10.2 10.7 11.0 8.9 11.0 8.9 10.7 9.9 9.6 10.5 10.8 11.2 10.5 9.8 11.5 9.3 11.6 11.7 10.4 9.4 10.1 8.5 9.4 9.9 9.2 9.7 9.8 10.1 9.0 10.9 9.3 9.6 9.2 10.8 10.2 9.6 10.0 8.4 9.1 9.0 10.9 11.0 9.7 11.7 10.3 9.6 8.6 9.6 9.7 8.4 10.0 9.7 10.9 10.3

r≈.3
102 97 107 69 118 90 100 116 81 113 94 100 104 91 93 104 107 102 93 79 89 109 95 100 119 116 103 108 106 78 93 98 126 104 102 102 103 92 105 88 109 97 99 99 111 119 82 80 96 85 116 97 98 106 96 95 101 100 102 99 82 91 101 84 99 101 111 98 118 107 124 111 118 98 102 96 100 93 102 108 112 110 115 87 113 94 102 111 99 79 104 85 90 103 89 93 90 102 79 114
9.5 11.0 11.0 8.4 11.0 10.0 10.7 11.0 9.6 10.0 8.6 8.1 9.9 10.6 8.8 10.4 10.0 10.4 9.2 8.9 9.9 9.1 9.2 11.0 10.3 10.2 9.4 9.8 9.9 10.0 8.5 10.8 10.8 11.7 10.8 9.3 11.5 9.8 8.9 9.3 10.1 9.5 9.7 11.9 9.4 10.7 8.6 9.2 9.3 10.8 8.9 10.4 9.9 11.3 8.9 9.3 10.7 10.0 10.9 11.9 10.3 10.2 9.6 10.4 11.1 10.2 11.1 7.7 9.6 8.9 11.2 9.1 8.3 10.1 9.3 9.6 9.4 10.8 11.4 9.8 10.3 9.3 10.3 9.2 10.9 9.8 11.2 12.1 10.3 8.4 9.9 8.3 10.4 9.5 9.7 9.8 11.7 11.1 9.9 11.3

r≈.5
88 108 101 109 95 107 90 92 105 86 105 102 97 99 89 110 100 101 87 112 108 108 99 91 73 101 104 104 102 91 96 94 99 113 110 87 98 100 101 101 106 121 106 113 106 83 90 100 83 119 83 101 109 96 92 111 105 91 99 90 107 109 89 101 123 103 89 95 94 93 85 113 107 89 117 93 99 99 100 100 91 101 98 93 81 139 88 115 102 100 112 104 114 100 113 106 89 101 82 100
11.0 10.0 9.6 11.4 9.5 10.6 9.9 9.5 10.4 9.6 9.8 9.6 9.2 10.5 8.6 10.2 10.8 10.1 9.0 12.2 10.5 10.1 9.7 9.5 7.7 10.5 11.3 10.2 10.2 9.2 11.3 11.0 8.8 10.9 9.6 9.7 10.0 9.5 9.6 10.5 8.4 11.8 10.7 10.5 10.2 8.9 9.5 10.5 7.8 11.2 10.1 9.7 10.6 10.7 9.0 11.6 9.7 11.1 10.2 9.6 10.7 11.0 8.3 9.8 11.0 10.5 9.8 9.0 10.4 9.2 10.0 12.2 9.8 9.0 10.0 9.6 9.9 8.4 8.4 10.1 10.0 9.8 10.7 10.6 8.9 10.6 10.6 10.5 9.5 9.8 10.0 10.3 9.7 10.1 11.4 10.3 10.3 10.1 8.5 8.3

r≈.7
95 103 107 105 104 89 99 96 106 99 97 94 89 99 111 100 120 107 106 103 98 107 109 96 108 99 114 95 90 100 85 108 108 113 104 106 98 114 118 104 108 97 96 109 97 101 90 111 116 86 100 93 98 96 92 92 78 89 82 111 96 84 100 103 102 110 92 100 91 100 93 107 100 95 114 83 102 103 117 86 110 102 104 70 82 116 100 93 93 87 115 110 91 104 110 95 101 90 102 104
8.3 11.5 10.3 10.2 10.3 9.0 9.8 8.3 10.7 10.1 11.0 9.4 9.8 10.2 11.0 9.2 11.2 10.3 10.5 9.5 11.1 12.2 11.3 10.1 10.5 9.2 10.0 8.8 9.1 9.9 8.3 10.9 9.7 11.9 10.0 10.3 9.2 11.5 12.2 10.7 10.1 9.3 9.4 11.3 9.5 10.7 9.4 11.2 12.8 9.9 7.9 8.0 10.4 9.6 10.3 9.1 9.7 6.8 8.8 9.7 10.8 9.0 10.0 9.4 11.2 10.8 10.2 9.2 8.4 9.5 10.0 11.3 10.1 9.4 10.8 8.6 9.3 11.0 10.1 9.8 11.4 9.6 11.0 7.0 7.8 12.3 8.2 9.6 10.3 9.1 12.2 10.2 10.1 10.1 10.7 9.5 10.4 10.2 9.5 10.6

r≈.9
122 102 115 109 93 97 85 102 106 98 112 76 90 113 110 121 99 93 108 82 111 99 94 95 95 109 118 82 96 120 111 94 95 108 104 100 91 91 106 106 106 93 108 104 85 103 94 83 103 79 109 95 101 107 98 86 98 93 104 103 107 96 89 101 95 107 109 110 110 87 83 121 119 108 100 93 86 96 124 92 92 106 113 94 97 106 91 98 103 99 93 111 79 92 91 99 102 92 86 113
11.7 10.6 11.5 11.4 10.0 9.6 9.2 10.4 11.0 9.6 10.9 8.0 9.5 12.2 12.0 11.8 9.7 9.3 10.2 8.0 11.0 9.0 9.2 9.7 8.9 10.3 11.5 8.9 9.1 11.4 10.7 10.4 10.1 11.0 10.3 10.0 9.0 9.6 10.3 10.7 11.1 9.8 11.3 9.9 8.5 9.8 9.7 8.6 10.5 8.5 11.3 9.9 9.7 10.4 9.9 8.5 10.5 9.0 10.7 10.4 10.5 9.5 8.9 11.0 9.5 10.2 10.8 10.7 11.0 8.8 7.0 10.9 11.7 10.6 9.6 8.9 8.7 9.5 12.4 9.5 9.3 10.1 10.6 10.0 9.9 11.4 9.9 9.3 10.3 9.9 8.7 10.6 6.8 9.1 9.8 9.8 10.2 9.6 8.5 10.8