Correlation and Linear Regression, Scatterplot

Correlation: a measure of the strength and direction of the linear association/relation between two variables (little or none if r≈0).
Correlation may arise because one variable causes the other, but correlation by itself is insufficient to establish causation.
It is measured by Pearson's product-moment correlation coefficient, r.

-1 ≤ r ≤ 1

X the independent/treatment/explanatory/predictor variable; Y the dependent/response variable.
Direction:
   x,y both go up together or both go down together: positive r.
   One goes up, the other goes down: negative r.
Strength: the closer |r| is to 1, the stronger the relationship/association.

Type or paste the X and Y data. Can be multiple lines for colored groups in scatterplot.

      

n=    ∑x=    ∑y=    ∑x²=    ∑y²=    ∑xy=
SSx=∑(x-x̄)²:    SSy=∑(y-ȳ)²:     SSxy=∑(x-x̄)(y-ȳ):       ∑zxzy=
Centroid: (x̄= ,    ȳ= )    sx=     sy=
Covariance=∑(x-x̄)(y-ȳ)/(n-1)=    (r=Cov/(sx·sy))    CovN
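The quantities listed above can be sketched in a few lines of Python. This is only an illustration, not the page's own code: the function name summary and the dictionary keys are made up here, and the data are the boats-manatees example from further down the page.

```python
import math

# Boats/manatees example data from this page (x on the first line, y on the second).
x = [68, 68, 68, 70, 71, 73, 76, 81, 83, 84]
y = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]


def summary(x, y):
    """Return the sums of squares, covariance and Pearson's r for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_x = math.sqrt(ss_x / (n - 1))  # sample standard deviations
    s_y = math.sqrt(ss_y / (n - 1))
    cov = ss_xy / (n - 1)            # sample covariance
    r = cov / (s_x * s_y)
    # Sum of z-score products; dividing it by n-1 gives r again.
    sum_zxzy = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                   for xi, yi in zip(x, y))
    return {"n": n, "x_bar": x_bar, "y_bar": y_bar,
            "ss_x": ss_x, "ss_y": ss_y, "ss_xy": ss_xy,
            "s_x": s_x, "s_y": s_y, "cov": cov, "r": r,
            "sum_zxzy": sum_zxzy}


stats = summary(x, y)
print(round(stats["ss_x"], 1), round(stats["ss_xy"], 1), round(stats["r"], 3))
```

For this data SSx is 367.6, SSxy is 845.4 and r is about 0.916, a strong positive association.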

r=    
r²= Coefficient of determination: the proportion of the variation in y explained by the regression line (the proportion of the variance in Y attributable to the variance in X). The linear relationship between the variables explains 100·r² % of the variation in the data.
SST (total variation, SSy=∑(y-ȳ)²) = SSR (explained variation, ∑(ŷ-ȳ)²) + SSE (unexplained variation, ∑(y-ŷ)²)
    r²= SSR/SST= explained variation / total variation
MSE=SSE/df:    RMSE (standard error of estimate, se)=√MSE=√(SSE/df), df=n-2:    (spread of the points around the regression line)
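A minimal sketch of the variation split, again on the boats-manatees data from this page. The variable names are chosen here for illustration, and df=n-2 is assumed for simple linear regression.

```python
import math

x = [68, 68, 68, 70, 71, 73, 76, 81, 83, 84]
y = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares line, needed to get the fitted values y_hat.
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

r2 = ssr / sst
mse = sse / (n - 2)
rmse = math.sqrt(mse)  # standard error of estimate, se

print(round(r2, 3), round(rmse, 2))
```

Here r² comes out near 0.838 and the RMSE near 6.85, which agrees with the 6.85 noted for this data set in the examples.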
sb1 (standard error of the slope b1)=se/√SSx:    t.05/2=    95% C.I. for the slope: b1 ± t.05/2·sb1 = ( , )
t=b1/sb1=r/√((1-r²)/(n-2)):    df=n-2:    H0: ρ=0 (no correlation), HA: ρ≠0 (there is a correlation).    p-value:
r critical value α=.05: α=.01: (or look in table) Reject H0 if |r|>|crit.val|
F=
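The test statistics above can be sketched as follows. This is an illustration under stated assumptions: the critical value T_CRIT=2.306 is the usual table value for df=8 at two-tailed α=.05 and is typed in by hand, since the Python standard library has no t distribution. The r critical value is derived from it as t/√(t²+df).

```python
import math

x = [68, 68, 68, 70, 71, 73, 76, 81, 83, 84]
y = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]

n = len(x)
df = n - 2
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = ss_xy / math.sqrt(ss_x * ss_y)
b1 = ss_xy / ss_x
ssr = b1 * ss_xy           # explained variation
sse = ss_y - ssr           # unexplained variation
mse = sse / df
s_e = math.sqrt(mse)       # standard error of estimate
s_b1 = s_e / math.sqrt(ss_x)  # standard error of the slope

t_from_slope = b1 / s_b1
t_from_r = r / math.sqrt((1 - r ** 2) / df)
f_stat = (ssr / 1) / mse   # equals t squared in simple regression

T_CRIT = 2.306  # assumption: table value, df = 8, two-tailed alpha = .05
ci_slope = (b1 - T_CRIT * s_b1, b1 + T_CRIT * s_b1)
r_crit = T_CRIT / math.sqrt(T_CRIT ** 2 + df)
reject_h0 = abs(r) > r_crit

print(round(t_from_r, 3), round(f_stat, 2), round(r_crit, 3), reject_h0)
```

Both routes to t give the same value (about 6.44), and |r| exceeds the critical value of about 0.632, so H0 is rejected for this data.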

Linear regression line: ŷ = b1x + b0     slope b1=r(sy/sx)=SSxy/SSx:    y-intercept b0=ȳ-b1x̄=
It goes through the centroid (x̄,ȳ). It minimizes the sum of the squared residuals (the vertical differences between each observed y and the line).
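A short sketch of the line fit, using the boats-manatees data from this page. The helper names predict and sum_sq_resid are made up for the illustration. The last lines nudge the slope and intercept to show that any other line has a larger sum of squared residuals.

```python
x = [68, 68, 68, 70, 71, 73, 76, 81, 83, 84]
y = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = ss_xy / ss_x          # slope
b0 = y_bar - b1 * x_bar    # y-intercept


def predict(x0, slope=b1, intercept=b0):
    """Value of the line at x0."""
    return slope * x0 + intercept


def sum_sq_resid(slope, intercept):
    """Sum of squared vertical distances from the points to a line."""
    return sum((yi - predict(xi, slope, intercept)) ** 2
               for xi, yi in zip(x, y))


best = sum_sq_resid(b1, b0)
print(round(b1, 4), round(b0, 2))
print(predict(x_bar) - y_bar)                 # essentially 0: the line passes through the centroid
print(best < sum_sq_resid(b1 + 0.1, b0))      # True
print(best < sum_sq_resid(b1, b0 + 1.0))      # True
```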

X min: X max: Y min: Y max:

mean distance from (x̄,ȳ) =    SD of distance=   

Covariance matrix Σ:
Eigenvalues: λ1          λ2
Eigenvectors:
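For two variables the covariance matrix is a symmetric 2×2 matrix, so its eigenvalues and eigenvectors have a closed form. The sketch below assumes the sample (n-1) covariance; whether the page divides by n-1 or n is not stated. The function name eigen_symmetric_2x2 is invented for this example.

```python
import math

x = [68, 68, 68, 70, 71, 73, 76, 81, 83, 84]
y = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Covariance matrix: variances on the diagonal, covariance off the diagonal.
sigma = [[var_x, cov_xy], [cov_xy, var_y]]


def eigen_symmetric_2x2(a, b, c):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, c]]."""
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + half_gap, mid - half_gap
    if b == 0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        vecs = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vecs = []
        for lam in (lam1, lam2):
            vx, vy = b, lam - a  # solves (a - lam)*vx + b*vy = 0
            norm = math.hypot(vx, vy)
            vecs.append((vx / norm, vy / norm))
    return (lam1, lam2), vecs


(lam1, lam2), (v1, v2) = eigen_symmetric_2x2(var_x, cov_xy, var_y)
print(round(lam1, 2), round(lam2, 2))
print(v1, v2)
```

The first eigenvector points along the direction of greatest spread in the scatterplot; the second is perpendicular to it.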


Residual plot: (x,y-ŷ) i.e. (x,residual)
A linear model is a reasonable fit if the residual plot shows no pattern and its vertical spread doesn't widen or narrow across x.
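The points of a residual plot can be computed as below (a sketch on the boats-manatees data; plotting is left out so the example needs only the standard library). For a least-squares line the residuals always sum to zero, so the thing to look at is their shape against x, not their average.

```python
x = [68, 68, 68, 70, 71, 73, 76, 81, 83, 84]
y = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residual = observed y minus the value predicted by the line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
residual_points = list(zip(x, residuals))

for xi, res in residual_points:
    print(xi, round(res, 2))
```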

Prediction.
x0:

ŷ:

95% Margin of error E:    95% Prediction interval (PI): (,)

99% Margin of error E:    99% Prediction interval (PI): (,)
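A sketch of the prediction interval for a single new y at x0, using the common formula E = t·se·√(1 + 1/n + (x0-x̄)²/SSx). Whether the page uses exactly this formula is an assumption. The critical values 2.306 (95%) and 3.355 (99%) are table values for df=8, entered by hand, and x0=75 is a made-up input.

```python
import math

x = [68, 68, 68, 70, 71, 73, 76, 81, 83, 84]
y = [53, 38, 35, 49, 42, 60, 54, 67, 82, 78]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))  # standard error of estimate


def prediction_interval(x0, t_crit):
    """Return (y_hat, margin of error E, (low, high)) for one new y at x0."""
    y_hat = b0 + b1 * x0
    e = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat, e, (y_hat - e, y_hat + e)


T_95 = 2.306  # assumption: t table, df = 8, two-tailed alpha = .05
T_99 = 3.355  # assumption: t table, df = 8, two-tailed alpha = .01

x0 = 75  # hypothetical x value to predict at
y_hat, e95, pi95 = prediction_interval(x0, T_95)
_, e99, pi99 = prediction_interval(x0, T_99)
print(round(y_hat, 2), round(e95, 2), round(e99, 2))
```

The margin of error is smallest at x0=x̄ and grows as x0 moves away from the centroid, and the 99% interval is always wider than the 95% one.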


Examples      Anscombe's Quartet

lottery
334 127 300 227 202 180 164 145 255
54 16 41 27 23 18 18 16 26

Clusters
1 1 1 2 2 2 3 3 3 10
1 2 3 1 2 3 1 2 3 10

1 1 2 2 9 9 10 10
1 2 1 2 9 10 9 10

Anscombe
10 8 13 9 11 14 6 4 12 7 5
9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 4.74

10 8 13 9 11 14 6 4 12 7 5
7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73
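The two Anscombe sets above make the usual point that very different scatterplots can share nearly the same r. A small check, with pearson_r as a helper name chosen for this example:

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / math.sqrt(ss_x * ss_y)


# The two Anscombe data sets listed on this page (same x values for both).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]
y_outlier = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

r_curved = pearson_r(x, y_curved)
r_outlier = pearson_r(x, y_outlier)
print(round(r_curved, 3), round(r_outlier, 3))
```

Both come out at about 0.816, even though the first set follows a curve and the second is a straight line with one outlier, which is why the scatterplot should be looked at alongside r.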

small MPG
2844 3109 2870 3095 2915 2985 2563 3009 2798 2468 2598 2558
36 38 41 33 42 31 40 37 35 40 38 37
r=-.395

large MPG
3608 3962 4253 4006 3754 3859 3874 4674 4321 4346 3891 3957
32 27 25 31 28 30 30 24 27 26 33 31

tar
20 27 27 20 20 24 20 23 20 22 20 20 20 20 20 10 24 20 21 25 23 20 22 20 20
1.1 1.7 1.7 1.1 1.1 1.4 1.1 1.4 1.0 1.2 1.1 1.1 1.1 1.1 1.1 1.8 1.6 1.1 1.2 1.5 1.3 1.1 1.3 1.1 1.1
(10,1.8) outlier
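The effect of the outlier on r can be checked by computing r with and without the point (10,1.8). This is a sketch; pearson_r is a helper name introduced here.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / math.sqrt(ss_x * ss_y)


# Tar data from this page.
x = [20, 27, 27, 20, 20, 24, 20, 23, 20, 22, 20, 20, 20,
     20, 20, 10, 24, 20, 21, 25, 23, 20, 22, 20, 20]
y = [1.1, 1.7, 1.7, 1.1, 1.1, 1.4, 1.1, 1.4, 1.0, 1.2, 1.1, 1.1, 1.1,
     1.1, 1.1, 1.8, 1.6, 1.1, 1.2, 1.5, 1.3, 1.1, 1.3, 1.1, 1.1]

# Drop the single outlier (10, 1.8) and recompute.
kept = [(xi, yi) for xi, yi in zip(x, y) if (xi, yi) != (10, 1.8)]
x_kept = [xi for xi, _ in kept]
y_kept = [yi for _, yi in kept]

r_with = pearson_r(x, y)
r_without = pearson_r(x_kept, y_kept)
print(round(r_with, 3), round(r_without, 3))
```

With the outlier r is only about 0.245; without it r rises to about 0.976, so a single point can hide an otherwise strong linear relationship.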

year(1960=1),CPI
1 13 26 35 42 43 49 53 55 59
29.6 44.4 109.6 152.4 180 184 214.5 233 237 252.2

video BP
138 130 135 140 120 125 120 130 130 144 143 140 130 150
82 91 100 100 80 90 80 80 80 98 105 85 70 100

video boats-manatees
68 68 68 70 71 73 76 81 83 84
53 38 35 49 42 60 54 67 82 78
The video says RMSE(se)=6.61; mine: 6.85. They use the wrong numbers for m and b...

test data:
35 12 65 47 21 32 52 15 57
210 160 285 255 180 220 190 170 275

34 108 64 88 99 51
5 17 11 8 14 5

uniform random 0-10
1 5 2 6 7 1 2 2 1 2 8 3 3 0 6 5 3 6 9 1 2 4 3 5 2 10 2 1 4 5
3 6 10 7 1 4 7 2 9 2 8 3 1 1 7 8 10 2 4 7 3 0 1 6 6 10 6 7 7 1