Statistics pdf - Dr. Ahmed

THE CORRELATION / REGRESSION MODEL

Asst Prof Dr. Ahmed Samir Al-Naaimi

MBChB, MSc epi, PhD

Department of Community Medicine

Baghdad College of medicine

Email: info@topmedresearch.com

Learning Objectives

Obtain a measure of the linear relationship between two quantitative random variables (X and Y).

Interpret the value of the linear correlation coefficient (r).

Master the calculation of the t test statistic for (r).

Interpret the scatter diagram.

Understand the elements of the simple linear regression model.

Interpret the parameters of the linear regression model.

Understand the requirements and uses of linear correlation and simple linear regression.

Pearson's Correlation Coefficient (r)

It is a measure of the strength of linear (or straight line)
relationship between two interval/ ratio scale
quantitative (continuous) variables under the
assumption of normal distribution.

Its value lies between (-1 to +1) inclusive.

-1: perfect inverse linear correlation.

+1: perfect direct linear correlation.

0: no linear correlation.

Pearson's Correlation Coefficient (r)-2

The absolute value of (r) indicates the strength of the relationship between the 2 quantitative variables:

<0.2: very weak

0.2 to 0.39: weak

0.4 to 0.69: moderate

0.7 to 0.89: strong

≥0.9: very strong
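The bands above can be written as a small lookup. This is only a sketch: the function name `correlation_strength` is made up for illustration, and it assumes that the bands apply to the absolute value of r and that values falling between two printed bands (for example 0.395) belong to the lower band.

```python
def correlation_strength(r: float) -> str:
    """Return the verbal strength label for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # strength ignores the sign; the sign only gives direction
    if size < 0.2:
        return "very weak"
    if size < 0.4:
        return "weak"
    if size < 0.7:
        return "moderate"
    if size < 0.9:
        return "strong"
    return "very strong"


print(correlation_strength(0.50))   # moderate
print(correlation_strength(-0.81))  # strong
```

Note that a negative r is classified by its size only, so r = -0.81 is as strong as r = +0.81.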

Pearson's Correlation Coefficient (r)-3

—

The sign of (r) indicates the direction of the relationship. Positive correlation indicates that high scores on one variable are associated with high scores on the second variable (i.e. an increase in one variable is associated with a corresponding increase in the second one).

—

Negative correlation indicates that high scores on one variable are associated with low scores on the second variable (i.e. an increase in one variable is associated with a corresponding reduction in the second one).

Testing significance of (r)

The (r) value is a sample value and can be used to test the hypotheses:

H0: ρ = 0 (there is NO linear relationship between X and Y in the population)

Ha: ρ ≠ 0

where

r = linear correlation coefficient (statistic) between two variables in the sample

ρ (rho) = linear correlation coefficient (parameter) between the same two variables in the population

Testing significance of (r)

The sampling distribution of r is approximately normal (but bounded at -1.0 and +1.0) when n is large, and distributes as t when n is small. The simplest formula for computing the appropriate t value to test the significance of a correlation coefficient employs the t-distribution with df = n - 2:

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

Testing significance of (r)-Example

Example: Suppose you observe that r = 0.50 between literacy rate and political stability in 10 nations. Is this relationship "strong"? Is it statistically significant?

—

Since r is between 0.4 and 0.69, the linear correlation is "moderately strong".

For α = 0.05 and a one-tailed test (unidirectional test, since we want to see whether ρ > 0 at α = 0.05), the critical value (decision rule) is t(1 - α = 1 - 0.05 = 0.95, df = 8) = 1.86.

—

We calculate the "test statistic", t = 0.50 × √8 / √(1 - 0.50²) = 1.63, which is less than the critical value of 1.86. So the null hypothesis of no relationship in the population (ρ = 0) cannot be rejected, and we conclude that there is no statistically significant linear correlation between literacy and political stability.
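The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch: the function name `t_statistic_for_r` is illustrative, and the critical value 1.86 is typed in from the example rather than computed, because the standard library has no t-distribution table.

```python
import math


def t_statistic_for_r(r: float, n: int) -> float:
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2."""
    if n < 3:
        raise ValueError("need at least 3 pairs of observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt((n - 2) / (1 - r * r))


r, n = 0.50, 10
critical_value = 1.86  # one-tailed, alpha = 0.05, df = 8 (from the example)

t = t_statistic_for_r(r, n)
print(f"t = {t:.3f}, df = {n - 2}")
if t > critical_value:
    print("reject the null hypothesis (rho = 0)")
else:
    print("cannot reject the null hypothesis (rho = 0)")
```

With r = 0.50 and n = 10 the statistic is about 1.63, so the second branch is taken.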

Testing significance of (r)-comments

Note that a relationship can be strong and yet not
significant. Conversely, a relationship can be weak but
significant. The key factor is the size of the sample (n).
For large samples, it is easy to achieve significance, and
one must pay attention to the strength of the correlation
to determine if the relationship makes sense.
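To illustrate the role of sample size, the sketch below keeps a weak correlation fixed and lets n grow. The value r = 0.2 and the sample sizes are hypothetical choices made only for this illustration.

```python
import math


def t_for_r(r: float, n: int) -> float:
    """t statistic for a sample correlation r based on n pairs (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


r = 0.2  # hypothetical weak correlation, held constant
for n in (10, 30, 100, 1000):
    print(f"n = {n:4d}   df = {n - 2:3d}   t = {t_for_r(r, n):.2f}")
```

The one-tailed 5% critical value lies between about 1.65 and 1.86 for these degrees of freedom, so the same weak correlation is not significant at n = 10 or n = 30 but becomes significant at n = 100 and beyond.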

Example - Systolic Blood Pressure Readings (mmHg) by two methods in 25 Patients with Essential Hypertension

Patient No.   Method I   Method II
1             132        130
2             138        134
3             144        132
4             146        140
5             148        150
6             152        144
7             158        150
8             130        122
9             162        160
10            168        150
11            172        160
12            174        178
13            180        168
14            180        174
15            188        186
16            194        172
17            194        182
18            200        178
19            200        196
20            204        188
21            210        180
22            210        196
23            216        210
24            220        190
25            220        202

Systolic Blood pressure readings (mm Hg), 25 Patients with essential hypertension

Patient No.   Method I (X)   Method II (Y)   X²       Y²       XY
1             132            130             17424    16900    17160
2             138            134             19044    17956    18492
3             144            132             20736    17424    19008
4             146            140             21316    19600    20440
5             148            150             21904    22500    22200
6             152            144             23104    20736    21888
7             158            150             24964    22500    23700
8             130            122             16900    14884    15860
9             162            160             26244    25600    25920
10            168            150             28224    22500    25200
11            172            160             29584    25600    27520
12            174            178             30276    31684    30972
13            180            168             32400    28224    30240
14            180            174             32400    30276    31320
15            188            186             35344    34596    34968
16            194            172             37636    29584    33368
17            194            182             37636    33124    35308
18            200            178             40000    31684    35600
19            200            196             40000    38416    39200
20            204            188             41616    35344    38352
21            210            180             44100    32400    37800
22            210            196             44100    38416    41160
23            216            210             46656    44100    45360
24            220            190             48400    36100    41800
25            220            202             48400    40804    44440
Total         4440           4172            808408   710952   757276

$$ r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} \quad \text{(not for memorization)} $$

$$ r = \frac{(25)(757276) - (4440)(4172)}{\sqrt{\left[(25)(808408) - (4440)^2\right]\left[(25)(710952) - (4172)^2\right]}} = \frac{408220}{427616.73} = 0.955 $$

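H0: ρ = 0

Ha: ρ ≠ 0

Critical value (decision rule): t(1 - α = 0.95, df = 23) = 1.714 (this is the one-tailed value; the two-tailed value for df = 23, 2.069, leads to the same decision)

Test statistic: t = 0.955 × √23 / √(1 - 0.955²) ≈ 15.4

|Test statistic| > |Decision rule|

So we reject H0 in favor of Ha. There is a statistically significant, very strong, positive (direct) linear correlation between the 2 methods of measuring blood pressure.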
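The calculation for the blood pressure example can be reproduced from the raw readings. The sketch below uses the sums formula for r and the t formula with df = n - 2; the names `method_1`, `method_2` and `pearson_r` are chosen here for illustration.

```python
import math

# Systolic blood pressure (mmHg) of the 25 patients, in table order.
method_1 = [132, 138, 144, 146, 148, 152, 158, 130, 162, 168, 172, 174, 180,
            180, 188, 194, 194, 200, 200, 204, 210, 210, 216, 220, 220]
method_2 = [130, 134, 132, 140, 150, 144, 150, 122, 160, 150, 160, 178, 168,
            174, 186, 172, 182, 178, 196, 188, 180, 196, 210, 190, 202]


def pearson_r(xs, ys):
    """Pearson's r from the column totals (sum X, sum Y, sum X^2, sum Y^2, sum XY)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


n = len(method_1)
r = pearson_r(method_1, method_2)
t = r * math.sqrt((n - 2) / (1 - r * r))

print(f"r = {r:.3f}")
print(f"t = {t:.1f} with df = {n - 2}")
```

Run on these readings, r comes out at about 0.955 and t at about 15.4, far above the tabulated critical value.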

Scatter Diagram

—

The form of the relationship between two variables can be presented visually in a scatter diagram, a graphic device that summarizes the relationship between two variables.

—

The X-axis is the horizontal axis and represents the independent variable, while the Y-axis is the vertical axis and represents the dependent variable. In the correlation model one need not know which variable is dependent and which is independent, while in the regression model this distinction is crucial.

—

The closer the dots (each representing the pair of observations for one study subject) lie to the regression line, the stronger the linear correlation.

Scatter Diagram-example

Scatter diagram with fitted regression line (r=0.81)

There is a strong (r=0.81) positive linear correlation
between body weight and body height in children and
adolescents.

Scatter Diagram-examples

Simple Linear Regression

It is helpful in:

—

Ascertaining the probable form of the relationship
between variables.

—

Predicting or estimating the value of one variable corresponding to a given value of another variable.

—

Quantifying, in another way, the strength of association between 2 quantitative variables under the assumption of normal distribution.

The independent variable (x) is pre-selected and is called a non-random or mathematical variable. For each value of x there is a set of normally distributed values of Y.

Simple Linear Regression-2

The least squares method is used to estimate the regression line that best represents the linear relationship between X and Y, as shown in the formula below:

Y = a + bX

a = intercept (constant): the point where the line crosses the vertical axis (i.e. the value of Y when X = 0)

b = slope (regression coefficient): the amount by which Y changes for each unit change in X. If its value is negative, Y is expected to decrease by a mean quantity of |b| for each unit increase in X. If its value is positive, Y is expected to increase by a mean quantity of b for each unit increase in X.

X = independent (explanatory) variable

Y = dependent (response) variable
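As a sketch of how a and b are obtained and read, the function below fits the line by least squares using deviations from the means (an algebraically equivalent form of the sums formula). The function name `least_squares_line` and the four data points are hypothetical; the points were chosen to lie exactly on a falling line so that a negative slope is easy to see.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of X around their means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx          # slope: mean change in Y per unit increase in X
    a = mean_y - b * mean_x  # intercept: value of Y when X = 0
    return a, b


# Hypothetical data lying exactly on Y = 10 - 2X.
xs = [0, 1, 2, 3]
ys = [10, 8, 6, 4]

a, b = least_squares_line(xs, ys)
print(f"Y = {a:.1f} + ({b:.1f})X")
```

Here b = -2, so Y is expected to fall by 2 units for each unit increase in X, and a = 10 is the value of Y at X = 0.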

Simple Linear Regression-3

Use of the regression model for prediction: If we enter a specific value of X in the regression equation, we can predict the value of Y.

Use of the regression model for assessing the effect size, or strength of association, between 2 quantitative variables measured on an interval/ratio scale: The larger the absolute value of b (regression coefficient), the stronger the effect of X (independent, explanatory or exposure variable) on the value of Y (dependent, response or outcome variable), i.e. a stronger linear dose-response relation.

Simple Linear Regression-3

Power of prediction of the model: The overall prediction power of the model is measured by R² (the coefficient of determination), which is equal to the square of r (the linear correlation coefficient). It measures the proportion of the observed variation in the response variable that is explained by the regression model.
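The two readings of R² (the square of r, and the proportion of variation explained) can be checked against each other numerically. The sketch below does this on five hypothetical data points; the function name `r_squared_two_ways` is illustrative.

```python
import math


def r_squared_two_ways(xs, ys):
    """Return (r squared, proportion of variation in Y explained by the line)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)  # total variation in Y
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    r = s_xy / math.sqrt(s_xx * s_yy)

    # Fit the least-squares line and measure the variation it leaves unexplained.
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    explained_share = 1 - unexplained / s_yy

    return r * r, explained_share


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_squared, explained_share = r_squared_two_ways(xs, ys)
print(f"r squared          = {r_squared:.3f}")
print(f"explained fraction = {explained_share:.3f}")
```

Both numbers are 0.600 for these points: the line accounts for 60% of the variation in Y.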

Simple Linear Regression-4

—

The least squares method is used to estimate the 2 points needed to draw the regression line. The predicted value of Y, which lies on the regression line at a specific value of X, should give the least possible error from the actual values of Y associated with that X value.

—

The calculated regression coefficient (beta or slope) is also tested for statistical significance by a t-test against the null hypothesis of beta = 0 at the population level.

—

The overall regression equation is tested for statistical significance. The model should be statistically significant before we are able to generalize the results to the reference population.
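For reference, the slope test is usually set up as below. This is the standard textbook form rather than something taken from the slides: the symbols SE(b), s, and the fitted value Ŷ = a + bX are notation introduced here.

```latex
t = \frac{b}{SE(b)}, \qquad
SE(b) = \frac{s}{\sqrt{\sum (X - \bar{X})^2}}, \qquad
s^2 = \frac{\sum (Y - \hat{Y})^2}{n - 2}, \qquad
df = n - 2
```

In simple linear regression this t value is identical to the one obtained when testing r, namely r√(n - 2) / √(1 - r²), so the slope and the correlation coefficient are significant or non-significant together.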

Example: To evaluate the performance of a new test, an experiment was done on 11 patients, with paired measurements of scores obtained on the new test and the standardized test. The results are shown below.

Patient No.   Score on New Test (X)   Score on standardized Test (Y)
1             50                      61
2             55                      61
3             60                      59
4             65                      71
5             70                      80
6             75                      76
7             80                      90
8             85                      106
9             90                      98
10            95                      100
11            100                     114

Patient No.   Score on New Test (X)   Score on standardized Test (Y)   X²      Y²      XY
1             50                      61                               2500    3721    3050
2             55                      61                               3025    3721    3355
3             60                      59                               3600    3481    3540
4             65                      71                               4225    5041    4615
5             70                      80                               4900    6400    5600
6             75                      76                               5625    5776    5700
7             80                      90                               6400    8100    7200
8             85                      106                              7225    11236   9010
9             90                      98                               8100    9604    8820
10            95                      100                              9025    10000   9500
11            100                     114                              10000   12996   11400
Total         825                     916                              64625   80076   71790

$$ b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2} \quad \text{(not for memorization)} $$

$$ b = \frac{(11)(71790) - (825)(916)}{(11)(64625) - (825)^2} = 1.124 $$

$$ a = \frac{\sum Y - b\sum X}{n} = \frac{916 - 1.1236(825)}{11} = -0.997 $$

Y = a + bX
Y = -0.997 + 1.124 X
For each one-point increase in X, the value of Y is expected to increase by a mean of 1.124 points.
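The slope and intercept can be recomputed directly from the column totals of the working table. The sketch below does so and then uses the fitted line for prediction; the helper `predict` and the chosen input X = 80 are illustrative only.

```python
# Column totals from the working table (11 patients).
n = 11
sum_x, sum_y = 825, 916
sum_x2, sum_xy = 64625, 71790

# Least-squares slope and intercept from the sums formulas.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n


def predict(x: float) -> float:
    """Predicted standardized-test score for a new-test score x."""
    return a + b * x


print(f"b = {b:.3f}")
print(f"a = {a:.3f}")
print(f"predicted Y at X = 80: {predict(80):.1f}")
```

Carrying b at full precision gives a = -1.000 rather than -0.997; the small difference arises because the hand calculation rounds b to 1.1236 before computing a. Predictions are only trustworthy for X values inside the observed range (50 to 100 here).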
