Medical statistics
Lecture One
Introduction to medical statistics
Statistics
Is the discipline concerned with the treatment (handling) of numerical data derived from groups of individuals. Statistical methods are the methods especially adapted to the elucidation of data affected by a multiplicity of causes.
We are living in the information age (information revolution). For example, about 0.5 million new articles are published annually in the medical field alone.
Thus we need to know how to obtain, how to analyse, and how to interpret this information (which is called data). Data are available in the form of numbers (values).
Biostatistics
Is that field of statistics in which the data being analysed are derived from the biological sciences and medicine.
Data (datum) The raw material of statistics is called data. It is obtained either as a measurement or as a process of counting.
Value It is the numerical representative of the measurement of the variable
Sources of data
1. Routine records, such as hospital medical records
2. Surveys, if the data needed to answer a question are not available from routine records
3. Experiments
4. External sources, in the form of published reports, data banks, or the research literature
Variable Any characteristic that can take different values in different occasions, places, persons, and time, e.g. height, weight, age, etc...
Variables are one of two types
Quantitative variables are of two main types
There is another classification of variables according to measurements or measurement scales. Measurement means the assignment of numbers to objects or events according to a set of rules these rules include
Measurements and measurement scales
Population It is the largest collection of entities in which we have an interest at a particular time, sharing at least one characteristic in common.
Sample A sample is a part (subset) of the population, chosen (randomly or non-randomly) so as to be as representative as possible of the population. The method applied to collect a sample is called sampling.
Lectures Two and Three
Summarisation and presentation of data
Data organisation (Ordered array)
It is the listing or arrangement of the data according to their magnitude, from the smallest to the largest or vice versa. The benefits of the ordered array are:
1. To determine the smallest value (Xs) and the largest value (Xl)
2. To determine the range
3. To make it easy to present the data in a table
4. To find the value of the median
Data presentation
Data are presented in one of the following ways:
1- Numerical (numbers)
2- Tables, as
a- Master table
b- Simple frequency distribution table
c- Class-interval frequency distribution table
3- Graphs (pictorial presentation of data)
When the sample size is small (n < 20), it is easy to present the data numerically as simple values; when there are more than 20 values or observations, it is better to present them in tables.
Master table
It contains the information regarding all variables included in the study (e.g. a spreadsheet in Excel). From the master table, the information regarding one or two variables is taken and presented in a simple frequency table or another type of table.
Simple frequency distribution table
It is the arrangement of data according to their magnitude and the frequency of occurrence of each magnitude.
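| Parity | Frequency | Cum.F | R.F. | C.R.F. | R.F. (%) | C.R.F. (%) |
|---|---|---|---|---|---|---|
| Primigravida (0) | 25 | 25 | 0.25 | 0.25 | 25 | 25 |
| 1 | 14 | 39 | 0.14 | 0.39 | 14 | 39 |
| 2 | 16 | 55 | 0.16 | 0.55 | 16 | 55 |
| 3 | 18 | 73 | 0.18 | 0.73 | 18 | 73 |
| 4 and more | 27 | 100 | 0.27 | 1.00 | 27 | 100 |
| Total | 100 | -- | 1.00 | -- | 100 | -- |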
Table (1) The parity distribution of mothers attending ANC clinic in the Al-Muntezeh PHCC for the year 2010
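The derived columns of Table (1) follow mechanically from the frequency column. The sketch below is illustrative only: the variable names are assumptions, and the frequencies are taken in the row order of the table.

```python
from itertools import accumulate

# Frequencies from Table (1), in row order (primigravida first, "4 and more" last).
frequencies = [25, 14, 16, 18, 27]
n = sum(frequencies)

cumulative = list(accumulate(frequencies))            # Cum.F
relative = [f / n for f in frequencies]               # R.F.
cumulative_relative = [c / n for c in cumulative]     # C.R.F.

# Print one row per category: frequency, Cum.F, R.F., C.R.F.
for f, c, rf, crf in zip(frequencies, cumulative, relative, cumulative_relative):
    print(f"{f:>4} {c:>4} {rf:>6.2f} {crf:>6.2f}")
```

The last cumulative frequency always equals n, and the last cumulative relative frequency always equals 1.00, which is a quick check on any hand-built table.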
The characteristics of tables
Class-interval frequency distribution table
Data of the continuous quantitative type are presented here as intervals. The steps to present the data in a class-interval table are as follows:
1. Count the number of observations (n).
2. Determine the smallest and the largest values.
3. Decide whether to present them in a simple or in a class-interval table.
4. To present them in a class-interval table, determine the number of class intervals (K) according to Sturges' formula: $K = 1 + 3.322 \log_{10} n$
5. Determine the width of the class interval, $W = R / K$, where R is the range (largest value minus smallest value).
6. Determine the class intervals.
7. Present the frequency of observations in each class interval by tallying.
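As a minimal sketch of the Sturges step, the code below computes K and W for the haemoglobin example used in this lecture (n = 70, smallest value 8.8, largest value 15.1). The function names are hypothetical, and rounding K to the nearest whole number is an assumption about how the result is used.

```python
import math

def sturges_classes(n):
    """Number of class intervals by Sturges' formula: K = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

def class_width(smallest, largest, k):
    """Width of the class interval: W = R / K, with R = largest - smallest."""
    return (largest - smallest) / k

n = 70
k_exact = sturges_classes(n)         # a little above 7
k = round(k_exact)                   # number of class intervals actually used
w = class_width(8.8, 15.1, k)        # 6.3 / 7 = 0.9, rounded up to 1 in practice

print(k, round(w, 2))
```

With the unrounded logarithm the formula gives a value slightly different from a hand calculation that uses log10(70) rounded to 1.85, but both round to 7 class intervals.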
The additional characteristics of class interval tables
The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010
10.2  13.7  10.4  14.9  11.5  12.0  11.0  13.3  12.9  12.1
9.4  13.2  10.8  11.7  10.6  10.5  13.7  11.8  14.1  10.3
13.6  12.1  12.9  11.4  12.7  10.6  11.4  11.9  9.3  13.5
14.6  11.2  11.7  10.9  10.4  12.0  12.9  11.1  8.8  10.2
11.6  12.5  13.4  12.1  10.9  11.3  14.7  10.8  13.3  11.9
11.4  12.5  13.0  11.6  13.1  9.7  11.2  15.1  10.7  12.9
13.4  12.3  11.0  14.6  11.1  13.5  10.9  13.1  11.8  12.2
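$K = 1 + 3.322 \log_{10} n = 1 + 3.322 \times \log_{10} 70 = 1 + 3.322 \times 1.85 = 1 + 6.15 = 7.15 \approx 7$
The width of the class interval: $W = R / K = (\text{Max} - \text{Min}) / K = (15.1 - 8.8) / 7 = 0.9 \approx 1$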
Table 3 The haemoglobin level in g/ dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010
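| Haemoglobin (g/dL) | Tallying | Freq. | Cum.F | R.F. | C.R.F. | R.F. (%) | C.R.F. (%) |
|---|---|---|---|---|---|---|---|
| 8- | I | 1 | 1 | 0.014 | 0.014 | 1.4 | 1.4 |
| 9- | III | 3 | 4 | 0.043 | 0.057 | 4.3 | 5.7 |
| 10- | IIIII IIIII IIII | 14 | 18 | 0.2 | 0.257 | 20.0 | 25.7 |
| 11- | IIIII IIIII IIIII IIII | 19 | 37 | 0.27 | 0.528 | 27.1 | 52.8 |
| 12- | IIIII IIIII IIII | 14 | 51 | 0.2 | 0.728 | 20.0 | 72.8 |
| 13- | IIIII IIIII III | 13 | 64 | 0.186 | 0.914 | 18.6 | 91.4 |
| 14- | IIIII | 5 | 69 | 0.071 | 0.985 | 7.1 | 98.5 |
| 15-15.9 | I | 1 | 70 | 0.014 | 1.00 | 1.4 | 100 |
| Total | | 70 | -- | 1.00 | -- | 100 | -- |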
The Graphical Representation of Data
Types of graphs
Table 4 The method of delivery of 600 babies born in Bintul Huda Teaching Hospital for the year 2010
| Method of delivery | No. of births | Percentage |
|---|---|---|
| Normal vaginal delivery | 478 | 79.7 |
| Forceps delivery | 65 | 10.8 |
| Caesarean section | 57 | 9.5 |
| Total | 600 | 100 |
Figure 1 The method of delivery of 600 babies born in Bintul Huda Teaching Hospital for the year 2010
Figure 2 The method of delivery of 600 babies born in Bintul Huda Teaching Hospital for the year 2010
Pie chart
It is a graphic representation used to present data of the qualitative type in the shape of a circle. The size of the slice for each category is determined by the equation $\frac{f}{n} \times 360^\circ$.
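The slice equation can be checked against the delivery data of Table 4. This is only an illustrative sketch; the dictionary and variable names are assumptions.

```python
# Number of births by method of delivery (Table 4).
births = {
    "Normal vaginal delivery": 478,
    "Forceps delivery": 65,
    "Caesarean section": 57,
}
n = sum(births.values())

# Angle of each slice in degrees: (f / n) * 360.
angles = {method: f / n * 360 for method, f in births.items()}

for method, angle in angles.items():
    print(f"{method}: {angle:.1f} degrees")
```

The angles must add up to 360 degrees, just as the percentages in the table add up to 100.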
Figure 3 The method of delivery of 600 babies born in Bintul Huda Teaching Hospital for the year 2010
Histogram
It is a graphic representation used to present continuous quantitative data arranged in class intervals. It is composed of a number of bars adjacent to each other.
Figure 4 The haemoglobin level in g/ dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010
Line graph (frequency polygon)
So the line graph will join the X-axis at these two ends. The area below the line and above the X-axis is equal to the area of the histogram, which is equal to one unit, i.e. 100%, the total probability.
The line graph is also used when we want to present two groups in one graph for the purpose of comparison, which is not possible with a histogram (as a bar of group 1 would cover a bar of group 2).
Figure 5 The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010
The characteristics of graphs
Stem-and-Leaf display
Another graphical method of representing data
Lecture Four
Measurement of central location
Data summarisation
Data summarisation is by:
1. Measurements of central tendency (average measurements, measurements of location, and measurements of position)
2. Measurements of variability (dispersion, distribution measurements)
3. Skewness
4. Kurtosis
Measures of central tendency
Descriptive measure is a single number used as a means to summarize data. Statistic is a descriptive measure computed from the data of a sample. Parameter is a descriptive measure computed from the data of a population.
Mean Mode Median
Mean
Properties of the mean
1. Uniqueness
2. Simplicity
3. Since each and every value in a set of data enters into the computation of the mean, it is affected by each value. Therefore, extreme values have an influence on the mean.
Mode
Is that value which occurs most frequently.
Median
Is the value that divides the set into two equal parts after the values have been sorted into ascending or descending order.
If the number of values is odd, the median will be the middle value when all values have been arranged in order of magnitude.
When the number of values is even, there is no single middle value. Instead, there are two middle values. In this case, the median is taken to be the average of these two middle values, when all values have been arranged in order of magnitude.
Properties of median
1. Uniqueness
2. Simplicity
3. It is not as drastically affected by extreme values as is the mean.
Mode
Is that value which occurs most frequently in a set of observations. A set of values may have more than one mode (e.g. bimodal, trimodal).
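The three measures can be tried out with Python's standard `statistics` module. The data sets below are hypothetical values chosen only for illustration; they are not taken from the lecture.

```python
import statistics

# Hypothetical values, for illustration only.
odd_set = [7, 3, 5, 3, 2]            # odd number of values
even_set = [7, 3, 5, 3, 2, 40]       # even number of values, one extreme value

mean_odd = statistics.mean(odd_set)          # (7 + 3 + 5 + 3 + 2) / 5
median_odd = statistics.median(odd_set)      # middle value of 2, 3, 3, 5, 7
median_even = statistics.median(even_set)    # average of the two middle values, 3 and 5
mean_even = statistics.mean(even_set)        # pulled upwards by the extreme value 40

# A set with two equally frequent values is bimodal.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean_odd, median_odd, mean_even, median_even, modes)
```

Adding the single extreme value 40 moves the mean from 4 to 10 but the median only from 3 to 4, which illustrates why the median is less affected by extreme values.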
Lecture Five
Measurement of variability
Measures of dispersion
Measures of variability (Dispersion)
The degree to which numerical (quantitative) data tend to spread about an average value is called the variation or dispersion of the data. Variation is in the nature of data, i.e. observations do not all take the same value.
There are a lot of measures of variation (dispersion) available, but the most commonly used are
Range
The range is of limited use in statistics as a measure of variability because it takes into consideration only two values and neglects the others.
These two values are the two extremes (the smallest and the largest), which on their own are not enough to describe the variation well in biostatistics.
The uses of range
1. It gives an idea about the extent of the data distribution (the scale over which the data spread).
2. It is used to determine the width of the class interval in a class-interval table ($W = R / K$).
Variance
The variance is defined as the average of the squared deviations of observations from their mean in a set of observations, or, more loosely, as the scatter of values about their mean.
E.g. Suppose we have five persons with their haemoglobin level (g/dl) measurements (8, 9, 10, 11, 12).
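| Haemoglobin level (g/dl) | Deviation $d = X - \bar{X}$ | $d^2 = (X - \bar{X})^2$ |
|---|---|---|
| 8 | 8 - 10 = -2 | 4 |
| 9 | 9 - 10 = -1 | 1 |
| 10 | 10 - 10 = 0 | 0 |
| 11 | 11 - 10 = 1 | 1 |
| 12 | 12 - 10 = 2 | 4 |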
Standard deviation
The SD is defined as the square root of the variance. It can be interpreted as the typical deviation of observations from their mean in a set of observations.
It is widely used in biostatistics as a measure of variability. If the value of the SD is high, the data possess a large variation, and vice versa.
Coefficient of variation (CV)
It is the standard deviation expressed as a percentage of the mean: $CV = \frac{SD}{\bar{x}} \times 100$
It is used to compare the variability of two groups for the same variable, measured in the same unit, when they have the same SD value but different means.
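The five haemoglobin values from the variance example (8, 9, 10, 11, 12 g/dl) can be used to work through variance, SD and CV in one pass. This is a sketch under one stated assumption: the sum of squared deviations is divided by n - 1 (the sample variance).

```python
import statistics

hb = [8, 9, 10, 11, 12]                     # haemoglobin levels in g/dl

mean = statistics.mean(hb)                  # 10
deviations = [x - mean for x in hb]         # -2, -1, 0, 1, 2
sum_sq = sum(d ** 2 for d in deviations)    # 4 + 1 + 0 + 1 + 4 = 10

# Assumption: sample variance, i.e. division by n - 1.
variance = sum_sq / (len(hb) - 1)
sd = variance ** 0.5                        # square root of the variance
cv = sd / mean * 100                        # SD as a percentage of the mean

print(variance, round(sd, 2), round(cv, 1))
```

The same variance is returned by `statistics.variance(hb)`, which also divides by n - 1.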
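Table The parity distribution of mothers attending ANC clinic in the Al-Muntezeh PHCC for the year 2010
| Parity (x) | Frequency (f) | Cum. f | xf | r.f. | c.r.f. | r.f. (%) | c.r.f. (%) | x²f |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 0 | 0.03 | 0.03 | 3 | 3 | 0 |
| 1 | 15 | 18 | 15 | 0.15 | 0.18 | 15 | 18 | 15 |
| 2 | 24 | 42 | 48 | 0.24 | 0.42 | 24 | 42 | 96 |
| 3 | 27 | 69 | 81 | 0.27 | 0.69 | 27 | 69 | 243 |
| 4 | 15 | 84 | 60 | 0.15 | 0.84 | 15 | 84 | 240 |
| 5 | 10 | 94 | 50 | 0.10 | 0.94 | 10 | 94 | 250 |
| 6 | 6 | 100 | 36 | 0.06 | 1.00 | 6 | 100 | 216 |
| Total | n = 100 | -- | 290 | 1.00 | -- | 100 | -- | 1060 |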
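For the calculations:
Mean ($\bar{x}$) $= \frac{\sum xf}{n} = \frac{290}{100} = 2.9$
Mode = 3 (it has the highest frequency, i.e. 27)
Median position $= \frac{n + 1}{2} = \frac{100 + 1}{2} = 50.5$ (between the 50th and 51st values). From the cumulative frequency column, the median = 3.
Or: the median is the 50th percentile (half of 100% = 50%), so from the C.R.F. column the median = 3.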
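The same results can be reproduced from the frequency column alone. In this sketch the parity values 0 to 6 are inferred from the xf column (for example 48 / 24 = 2), and the variance line assumes division by n - 1; neither is stated explicitly in the table.

```python
import statistics

# Parity value -> frequency (parity values inferred from xf / f).
parity_freq = {0: 3, 1: 15, 2: 24, 3: 27, 4: 15, 5: 10, 6: 6}

n = sum(parity_freq.values())                               # 100
sum_xf = sum(x * f for x, f in parity_freq.items())         # 290
sum_x2f = sum(x * x * f for x, f in parity_freq.items())    # 1060

mean = sum_xf / n
mode = max(parity_freq, key=parity_freq.get)                # value with the highest frequency

# Expand the table back into individual observations to locate the median.
observations = [x for x, f in parity_freq.items() for _ in range(f)]
median = statistics.median(observations)                    # average of 50th and 51st values

# Assumption: sample variance computed from the x2f column, dividing by n - 1.
variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)

print(mean, mode, median, round(variance, 2))
```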
Table The haemoglobin level in g/ dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010
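| Haemoglobin (g/dL) | Freq. | Mid point (MP) | MP x f | Cum. f | r.f. | c.r.f. | r.f. (%) | c.r.f. (%) | MP² x f |
|---|---|---|---|---|---|---|---|---|---|
| 8- | 1 | 8.5 | 8.5 | 1 | 0.014 | 0.014 | 1.4 | 1.4 | 72.25 |
| 9- | 3 | 9.5 | 28.5 | 4 | 0.043 | 0.057 | 4.3 | 5.7 | 270.75 |
| 10- | 14 | 10.5 | 147.0 | 18 | 0.2 | 0.257 | 20 | 25.7 | 1543.5 |
| 11- | 19 | 11.5 | 218.5 | 37 | 0.27 | 0.528 | 27.1 | 52.8 | 2512.75 |
| 12- | 14 | 12.5 | 175.0 | 51 | 0.2 | 0.728 | 20 | 72.8 | 2187.5 |
| 13- | 13 | 13.5 | 175.5 | 64 | 0.186 | 0.914 | 18.6 | 91.4 | 2369.25 |
| 14- | 5 | 14.5 | 72.5 | 69 | 0.071 | 0.985 | 7.1 | 98.5 | 1051.25 |
| 15-15.9 | 1 | 15.5 | 15.5 | 70 | 0.014 | 1.00 | 1.4 | 100 | 240.25 |
| Total | n = 70 | -- | 841.0 | -- | 1.00 | -- | 100 | -- | 10247.5 |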
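For the calculations:
Mean ($\bar{x}$) $= \frac{\sum (MP \times f)}{n} = \frac{841}{70} = 12.01$ g/dl
Mode = 11.5 g/dl (the midpoint of the class interval 11-11.9, which has the highest frequency, i.e. 19)
Median position $= \frac{n}{2} = \frac{70}{2} = 35$th value. From the cumulative frequency column, the median lies in the class interval 11-11.9.
The C.R.F. curve can be used to calculate the exact value of the median in continuous quantitative data arranged in class intervals.
Median $= L + \frac{n/2 - F}{f} \times W = 11 + \frac{35 - 18}{19} \times 1 = 11.89$ g/dl, where L is the lower limit of the median class, F is the cumulative frequency before the median class, f is the frequency of the median class, and W is the class width.
Variance $= \frac{\sum (MP^2 \times f) - \left(\sum MP \times f\right)^2 / n}{n - 1} = \frac{10247.5 - 841^2 / 70}{69} = 2.08$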
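For data grouped in class intervals, the calculations use the class midpoints, and the median is interpolated within its class. The sketch below reproduces the haemoglobin figures; the variable names are assumptions, and the linear-interpolation formula for the median is the standard grouped-data formula rather than something spelled out in the slides.

```python
# (lower class limit, frequency) for the haemoglobin table; class width is 1 g/dl.
classes = [(8, 1), (9, 3), (10, 14), (11, 19), (12, 14), (13, 13), (14, 5), (15, 1)]
width = 1
n = sum(f for _, f in classes)                                   # 70

# Mean and variance from the midpoints (lower limit + half the width).
sum_mpf = sum((low + width / 2) * f for low, f in classes)       # 841.0
sum_mp2f = sum((low + width / 2) ** 2 * f for low, f in classes) # 10247.5
mean = sum_mpf / n
variance = (sum_mp2f - sum_mpf ** 2 / n) / (n - 1)               # assumes n - 1 divisor

# Median by interpolation inside the class that contains the (n / 2)th value.
half = n / 2
cumulative = 0
for low, f in classes:
    if cumulative + f >= half:
        median = low + (half - cumulative) / f * width
        break
    cumulative += f

print(round(mean, 2), round(median, 2), round(variance, 2))
```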
Introduction to sampling
Data collection
It is difficult to study the whole population of interest to reach a conclusion regarding a certain parameter (variable) and the effect of different factors on that parameter. Studying everyone through a census of the population needs time, money, effort, and manpower.
Moreover, a census determines only the demographic characteristics of the population. No medical information can be gathered from a census.
So a sample that is as representative as possible of the population is taken by sampling. When this is done properly, we can generalise its findings to the population.
Reasons for sampling
1. A sample can be studied easily (a population needs time, manpower, and effort).
2. It is less expensive than studying the entire population.
3. Sample results are usually more accurate than results obtained from the population.
4. If samples are properly selected, probability methods can be used to estimate the error in the resulting statistics.
5. To reduce heterogeneity, so that a sample with specific characteristics can be studied rather than the whole population.
Sampling
There are two main types of sampling:
1. Probability (random) sampling, which is the best method that allows us to infer from the sample drawn to the population
2. Non-probability (non-random) sampling
Random sampling
In this type of sampling, each person in the population has the same chance (probability) of being included in the sample as the others. So there is no bias that favours the inclusion of any person in the sample.
This method allows us to select a sample that is as representative as possible of the population, making it possible to generalise the findings in the sample to the population. There are different methods of random sampling.
Simple random sampling
The individuals are coded by letters or numbers (to make it more random than names). Next, the required number of individuals is selected, and each one has the same chance of being chosen in the sample.
This selection can be achieved by labeling a card for each individual in the population, shuffling them well, and then selecting the appropriate required number of cards.
A more convenient method is the random digit table. First, look at the total number of the population and see how many digits it comprises (for example, five digits, 00000), so that numbers are selected within this digit range.
An arbitrary starting point is chosen in the random digit table, and numbers are then read moving down the columns or across the rows, as preferred, until the required number of different individuals has been selected.
If a selected random number is larger than the population total, or if it is zero, it is ignored. If a chosen number is repeated, it is also ignored.
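A computer's pseudo-random generator can play the role of the shuffled cards or the random digit table. The sketch below is one possible way to do it, not the lecture's method: individuals are assumed to be coded 1 to N, and the population size, sample size and seed are hypothetical.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw distinct codes from 1..population_size, each equally likely to be chosen."""
    rng = random.Random(seed)
    # random.sample never repeats a code, which mirrors ignoring repeated numbers.
    return rng.sample(range(1, population_size + 1), sample_size)

# Hypothetical population of 500 coded individuals, sample of 100.
sample = simple_random_sample(500, 100, seed=1)
print(sorted(sample)[:10])
```

Because codes are drawn only from 1 to N, numbers that are zero or larger than the population total never appear, so nothing has to be discarded.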
Systematic random sampling
Sometimes we are dealing with an infinite population (a population composed of an endless number of individuals, such as patients attending the outpatient clinic), so it is convenient to carry out sampling in a systematic way (at a regular interval).
Example
The total number of patients attending the outpatient clinic in Al-Hussein Teaching Hospital is about 500 daily. We want to select a sample of 100 patients, so the sampling interval is 500/100 = 5, i.e. every 5th patient is selected.
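The outpatient example can be sketched as follows. Choosing the first patient at random within the first interval is a common convention and an assumption here; the function name and seed are also hypothetical.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Select every k-th individual, where k = population_size // sample_size."""
    k = population_size // sample_size        # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)                 # assumed: random start within the first interval
    return [start + i * k for i in range(sample_size)]

# About 500 patients a day, sample of 100: interval 500 / 100 = 5.
selected = systematic_sample(500, 100, seed=1)
print(selected[:5])
```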
Stratified random sampling
This type of sampling is used when we have a population composed of quite different strata or distinct subgroups, for example a population composed of males and females.
The selection of a sample that does not take these distinct subgroups into account may yield a sample that is composed entirely of males or entirely of females, or one in which the percentages of males and females differ from those in the population.
So we use stratified random sampling, in which we divide the population according to these subgroups and then select the required number from each subgroup by simple random sampling.
By this method we select a sample in which the percentages of males and females are the same as those in the population.
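A minimal sketch of proportional stratified sampling is given below. The population of 600 males and 400 females, the codes and the seed are hypothetical, and allocating the sample in proportion to stratum size is an assumption consistent with keeping the same percentages as in the population.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Simple random sample from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation; rounding may need adjusting for awkward sizes.
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population: codes 1-600 are males, 601-1000 are females.
strata = {"male": list(range(1, 601)), "female": list(range(601, 1001))}
sample = stratified_sample(strata, 100, seed=1)
print(len(sample["male"]), len(sample["female"]))
```

With 60% males in the population, a sample of 100 contains 60 males and 40 females.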
Clustered random sampling
When we have a large population extended over a large geographical area, it is better to carry out multi-stage random sampling.
By this method we select in stages. In each stage, the selection is done by simple random sampling.
Probability
Example: tossing a die. Experiment: tossing a six-sided die. Sample space: S = {1, 2, 3, 4, 5, 6}. Event A: rolling an even number, A = {2, 4, 6}.
Methods of assigning probability
Example: tossing a die. S = {1, 2, 3, 4, 5, 6}, and $P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6$. A: rolling an even number, A = {2, 4, 6}, so $P(A) = 3/6 = 0.5$
Empirical probability is simply the relative frequency with which some event is observed to happen (or fail): the number of times an event occurred divided by the number of trials, $P(A) = n_A / N$, where N = total number of trials and $n_A$ = number of outcomes producing A.
Relative frequency example
| Children No. | Frequency | Relative frequency |
|---|---|---|
| | 40 | 40/215 = 0.19 |
| | 80 | 80/215 = 0.37 |
| | 50 | 50/215 = 0.23 |
| | 30 | 30/215 = 0.14 |
| | 10 | 10/215 = 0.05 |
| | 5 | 5/215 = 0.02 |
| Sum | 215 | 215/215 = 1.00 |
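The relative frequencies in this example are empirical probabilities, each frequency divided by the total of 215. A short sketch, with the frequencies taken in the row order of the table and the variable names assumed:

```python
# Frequencies from the relative frequency example, in row order.
frequencies = [40, 80, 50, 30, 10, 5]
total = sum(frequencies)                                # 215

# Relative frequency = frequency / total number of observations.
relative = [f / total for f in frequencies]

for f, rf in zip(frequencies, relative):
    print(f"{f}/{total} = {rf:.2f}")
```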
Basic concepts of probability
Definitions
Laws of Probability
Examples
Type of position
Gender
Total
Managerial
11
Professional
31
13
44
Technical
52
17
69
Clerical
2.7
31
Total
100
55
155
$P(T \cup C) = P(T) + P(C)$
Probability Distribution
The normal probability distribution
Central Limit theorem
Standard normal distribution (curve)
Z- score
Normal curve table
Steps for figuring percentage above or below a z-score
1. Convert the raw score to a z-score, if necessary.
2. Draw a normal curve, indicate where the z-score falls, and shade the area you are trying to find.
3. Find the exact percentage with the normal curve table.
Steps for figuring a z-score or raw score from a percentage
We can also determine how much of the area under the normal curve is found between any point on the curve and the mean. Once you have a z-score, you can use the table to find the area. For z = 0.61, table A gives 0.2291 ≈ 0.23. Therefore, 22.9% (about 23%) of the area lies between the mean and z = 0.61.
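The table lookup for z = 0.61 can be checked with `statistics.NormalDist` from the Python standard library. This is a sketch: the helper `z_score` and the variable names are assumptions, while the z value is the one used in the example.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Convert a raw score to a z-score: z = (x - mean) / sd."""
    return (x - mean) / sd

standard_normal = NormalDist(mu=0, sigma=1)

z = 0.61
# Area between the mean (z = 0) and z: cumulative area up to z minus one half.
area_mean_to_z = standard_normal.cdf(z) - 0.5
# Area above z: whatever remains of the upper half of the curve.
area_above_z = 1 - standard_normal.cdf(z)

print(round(area_mean_to_z, 4), round(area_above_z, 4))
```

The two areas add up to 0.5, the half of the curve that lies above the mean.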
The confidence interval and limit
The Confidence Interval approach is based upon the normal curve distribution
The characteristics of normal distribution could be applied to the distribution of the sample means
95% of the means of samples drawn from a population fall within ±1.96 × standard error of the mean.
99% of the means of samples drawn from a population fall within ±2.58 × standard error of the mean.
Alternatively
The probability that $\bar{x} \pm 1.96$ SE contains the population mean ($\mu$) is 0.95.
The probability that $\bar{x} \pm 2.58$ SE contains the population mean ($\mu$) is 0.99.
A confidence interval is a range of values within which the population parameter is expected to occur
If we have a sample mean and we want to estimate the population mean, we need to construct confidence limits around the sample mean.
Confidence Interval (CI)
$CI = \bar{x} \pm Z \times SE$, where Z is the critical value in a normal probability distribution for computing the upper and lower estimates, and SE is the standard error of the mean.
95% confidence limits: $\bar{x} - 1.96\,SE$, $\bar{x} + 1.96\,SE$
Confidence interval for the population mean
Assumptions:
1. The population SD is known.
2. The population is normally distributed.
3. If the population is not normal, use a large sample ($n \geq 30$).
Factors affecting interval width
1. Sample size (n)
2. Variability in the population, usually estimated by the SD
3. Desired level of confidence, usually 95% or 99%
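The effect of these three factors can be explored with a small sketch. It assumes the usual normal-based interval, sample mean ± Z × SD / √n; the sample mean, SD and sample size below are hypothetical, and the function name is an assumption.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Normal-based confidence limits: mean +/- Z * SD / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%, 2.58 for 99%
    se = sd / math.sqrt(n)                           # standard error of the mean
    return sample_mean - z * se, sample_mean + z * se

# Hypothetical sample: mean 12.0, SD 1.5, n = 100, so SE = 0.15.
low95, high95 = confidence_interval(12.0, 1.5, 100, confidence=0.95)
low99, high99 = confidence_interval(12.0, 1.5, 100, confidence=0.99)

print(round(low95, 2), round(high95, 2))
print(round(low99, 2), round(high99, 2))
```

Raising the confidence level or the SD widens the interval, while a larger sample size narrows it.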
For Qualitative data
Probability distributions
1. Probability distributions of discrete variables
a. Binomial distribution
b. The Poisson distribution
2. Probability distributions of continuous variables
a. The normal distribution
Probability distribution of a discrete random variable
It is a table, graph, formula, or other device used to specify all possible values of a discrete random variable along with their respective probabilities. Cumulative probability distribution
Binomial distribution
The Poisson distribution
N.B. In Poisson distribution, the mean and variance are equal.
Continuous probability distributions
When the number of observations, n, approaches infinity, and the width of the class intervals approaches zero, the frequency polygon approaches a smooth curve. Such a curve is used to represent graphically the distribution of continuous random variables
The
The standard normal distribution
Is a normal distribution with mean zero and standard deviation 1
The binomial Distribution
It is a probability distribution of a binomial variable
Properties of a binomial experiment
1. A sequence of (n) identical trials
2. Two outcomes are possible for each trial: success and failure
3. The probability of success, denoted p, does not change from trial to trial
4. The trials are independent
In a binomial experiment the number of successes in (n) repeated trials is a discrete random variable called (r)
Binomial Equation
Example
Solution
$P(0) = \frac{5!}{0!\,(5 - 0)!} (0.2)^0 (1 - 0.2)^{5 - 0} = 1 \times 1 \times (0.8)^5 = 0.328$
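The solution above can be checked numerically with the binomial formula $P(r) = \frac{n!}{r!\,(n-r)!} p^r (1-p)^{n-r}$. This is a sketch using the values visible in the solution (n = 5, p = 0.2, r = 0); the function name is an assumption.

```python
from math import comb

def binomial_probability(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

p0 = binomial_probability(0, 5, 0.2)      # (0.8) ** 5
all_outcomes = [binomial_probability(r, 5, 0.2) for r in range(6)]

print(round(p0, 3))
```

The probabilities for r = 0 to 5 add up to 1, as they must for any probability distribution.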
Standard Normal Distribution Curve
Z-Score
A data value greater than the sample mean will have a Z-score greater than zero.
A data value less than the sample mean will have a Z-score less than zero.
A data value equal to the mean will have a Z-score of zero.
Normal Curve Table
It gives the precise percentage of scores (values) between the mean (Z-score of zero) and any other Z-score
It can be used to determine:
1. The proportion of scores above or below a particular Z-score
2. The proportion of scores between the mean and a particular Z-score
3. The proportion of scores between two Z-scores
Steps for figuring percentage above or below a Z-score
1. Convert the raw score to a Z-score.
2. Draw a normal curve, indicate where the Z-score falls, and shade the area you are trying to find.
3. Find the exact percentage with the normal curve table.
Tests of significance
The Z test, the t test, and the χ² test
Tests of Significance
What is a test of significance? It is a formal procedure for comparing observed data with a hypothesis whose truth we want to assess. The results of tests are expressed in terms of a probability that measures how well the data and the hypothesis agree.
Stating Hypothesis
A hypothesis is a statement about parameters in the population, e.g. $\mu_1 = \mu_2$. Hypotheses are only concerned with the population.
Null hypothesis (Ho)
Alternative Hypothesis (Ha)
Types of statistical tests
Steps of hypothesis testing
Significance level
The Z test
Z
5. Find the critical value:
a. Z = 1.96 for α = 0.05
b. Z = 2.58 for α = 0.01
6. Decision: reject Ho if the test statistic > the critical value, i.e. the P value < the significance level.
Z- Test for differences between 2 means Z
Testing the difference between 2 sample proportions Z Where Sp1-p2 P (Pooled)
T-test
One-sample t-test
The df is the number of scores in a sample that are free to vary. The df is a function of the sample size and determines how spread out the distribution is (compared to the normal distribution).
The T-distribution
Finding tcrit using t-table
The t-table is very similar to the standard normal table. The bigger the sample size (or df), the closer the t-distribution is to a normal distribution.
T-test for two sample means
The χ² Test
Lectures Thirteen - Fifteen
The concept of community diagnosis as an application of statistics in measuring population health
In clinical practice and in clinical surveys, we describe patients as individuals with some labelling (diagnosis).
In epidemiology, the concern is different. Here, we are interested in both the sick and the non-sick persons. We describe the sick, the characteristics, and the events in relation to the total population to which these attributes are related.
Rates
All rates have:
1. Numerator: cases or events
2. Denominator: population at risk
3. Time limit or reference period
4. A standard multiplication factor, usually a multiple of 10
Population at risk
are those individuals who are at risk of getting ill and thus contributing to the cases (they become ill or diseased, die, or give birth to live babies), which form the numerator. Generally, the numerator is part of the denominator.
Proportions
Ratios
express the number of persons with a characteristic relative to the number of persons without the characteristics. The numerator is not part of the denominator. Ratio is not a common epidemiological parameter.
Example
In a village, there were 6000 persons. During the year 2001, a total of 240 live births took place, of which 115 were female births. Use these data to measure the frequencies of births as events in this population.
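One possible way to work through this example is sketched below: a rate (births per 1000 population), a proportion (female births among all live births) and a ratio (male to female births). Treating the 6000 persons as the population at risk for the year, and the choice of 1000 as the multiplication factor, are assumptions; the variable names are illustrative.

```python
population = 6000          # persons in the village (assumed to be the denominator for the year)
live_births = 240          # live births during 2001
female_births = 115
male_births = live_births - female_births           # 125

# Rate: events in the numerator, population in the denominator, per 1000 per year.
crude_birth_rate = live_births / population * 1000

# Proportion: the numerator is part of the denominator.
proportion_female = female_births / live_births

# Ratio: the numerator is not part of the denominator.
sex_ratio = male_births / female_births

print(crude_birth_rate, round(proportion_female, 3), round(sex_ratio, 3))
```

The birth rate comes to 40 per 1000 population, about 48% of the births were female, and there were about 109 male births for every 100 female births.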
Definition of basic rates
For convenience, commonly used epidemiological rates can be grouped into three groups
1. Rates related to fertility
These are useful indicators in health and demographic characteristics of population. The rates include
a. Crude birth rate (CBR)
b. General fertility rate (GFR)
c. Marital specific fertility rate
d. Total fertility rate (TFR)
This refers to the total number of births a woman has during her full reproductive life. It equals the sum of the age-specific fertility rates when based on cross-sectional studies.
2. Rates related to morbidity
Which of these two rates do you expect to have the higher value if both are calculated for the same population during a given year?
Incidence rate is more useful in the following situations
1. To study diseases of short duration
2. To study the aetiology of disease
3. To evaluate preventive measures
4. To determine the risk of acquiring a disease
5. To assess the transmission of an infectious agent
Prevalence rate
Point prevalence rate (PnPR)
Period prevalence rate (PrPR)
Notes
Relationship of incidence and prevalence
The point prevalence of any disease is a function of its incidence rate and the rate at which cases die or completely recover.
3. Rates related to mortality
These rates measure the impact of disease on the population in terms of death; thus they reflect, in general, the severity of disease and the quality of health care services.
The commonly used mortality rates are
4. Other useful rates
Exercise