Hypothesis Testing

by Georgina Wilson González and Karpagam Sankaran

Table of Contents

General Description
Types of Tests
Steps in Hypothesis Testing
Practical Examples
    A) One Tailed
    B) Two Tailed
Limitations for Environmental Sampling
    A) Multiple Comparisons
    B) Multiple Constituents
    C) Difficulty in Meeting Assumptions

General Description

There are two types of statistical inference: estimation of population parameters and hypothesis testing. Hypothesis testing is one of the most important tools for applying statistics to real-life problems. Decisions about populations must often be made on the basis of sample information, and statistical tests are used to arrive at these decisions.

There are five ingredients to any statistical test :

    (a) Null Hypothesis
    (b) Alternate Hypothesis
    (c) Test Statistic
    (d) Rejection/Critical Region
    (e) Conclusion

In attempting to reach a decision, it is useful to make an educated guess or assumption about the population involved, such as the type of distribution.

Statistical Hypotheses : A statistical hypothesis is an assertion or conjecture about the parameter or parameters of a population, for example the mean or the variance of a normal population. It may also concern the type, nature or probability distribution of the population.

Statistical hypotheses are based on the concept of proof by contradiction. For example, suppose we test the mean (μ) of a population to see if an experiment has caused an increase or decrease in μ. We do this by contradiction: we formulate a null hypothesis and then check whether the sample data are inconsistent with it.

Null Hypothesis : It is a hypothesis which states that there is no difference between the procedures and is denoted by H0. For the above example the corresponding H0 would be that there has been no increase or decrease in the mean. It is always the null hypothesis that is tested, i.e., we decide whether or not to reject H0, because we have information only for the null hypothesis.

Alternative Hypothesis : It is a hypothesis which states that there is a difference between the procedures and is denoted by HA.

Table 1. Various types of H0 and HA
Case    Null Hypothesis H0    Alternate Hypothesis HA
1       μ1 = μ2               μ1 ≠ μ2
2       μ1 ≤ μ2               μ1 > μ2
3       μ1 ≥ μ2               μ1 < μ2

Test Statistic : It is the random variable whose value, computed from the sample, is tested to arrive at a decision. The Central Limit Theorem states that for large sample sizes (n > 30) drawn randomly from a population, the distribution of the means of those samples will approximate normality, even when the data in the parent population are not distributed normally. A z statistic is usually used for large sample sizes (n > 30), but large samples are often not easy to obtain, in which case the t-distribution can be used, with the population standard deviation σ estimated by the sample standard deviation s. The t curves are bell shaped and centered on t = 0. The exact shape of a given t curve depends on the degrees of freedom. When multiple comparisons are performed by one-way ANOVA, the F statistic is normally used. It is defined as the ratio of the mean square due to the variability between groups to the mean square due to the variability within groups. The critical value of F is read from tables of the F-distribution, given the Type I error α and the degrees of freedom between and within the groups.

Rejection Region : It is the part of the sample space (critical region) where the null hypothesis H0 is rejected. The size of this region is determined by the probability (α) of the sample point falling in the critical region when H0 is true. α is also known as the level of significance. Note that the term "statistical significance" refers only to the rejection of a null hypothesis at some level α. It implies only that the observed difference between the sample statistic and the mean of the sampling distribution is unlikely to have occurred by chance alone.

Conclusion : If the test statistic falls in the rejection/critical region, H0 is rejected; otherwise there is not enough evidence to reject H0.


Types of Tests

Tests of hypothesis can be carried out on one or two samples. One sample tests are used to test if the population parameter (μ) is different from a specified value. Two sample tests are used to detect the difference between the parameters of two populations (μ1 and μ2).

Two sample tests can further be classified as unpaired or paired two sample tests. In unpaired two sample tests the sample data are not related, whereas in paired two sample tests the sample data are paired according to some identifiable characteristic. For example, when testing a hypothesis about the effect of a treatment on (say) a landfill, we would like to pair the data taken at different points before and after implementation of the treatment.
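As a rough illustration of the paired case, the sketch below computes a paired t statistic from the differences between measurements taken at the same points before and after a treatment. The data values and the function name paired_t are hypothetical, chosen only to show the arithmetic; the result would still have to be compared with a critical t value for n - 1 degrees of freedom.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Work with the per-point differences, not the raw values.
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1

# Hypothetical concentrations at four points, before and after treatment.
before = [10.0, 12.0, 9.0, 11.0]
after = [8.0, 11.0, 7.0, 10.0]
t_stat, dof = paired_t(before, after)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Pairing removes the point-to-point variability from the comparison, which is why the statistic is built from the differences alone.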

Both one sample and two sample tests can be classified as :

One tailed test : Here the alternate hypothesis HA is one-sided and we test whether the test statistic falls in the critical region on only one side of the distribution.

  1. One sample test: For example, we are measuring the concentration of a constituent in a lake and we need to know if the mean concentration is greater than a specified value of 10 mg/L.
    Hence, H0: μ ≤ 10 mg/L, vs, HA: μ > 10 mg/L.

  2. Two sample test: In Table 1, cases 2 and 3 are illustrations of two sample, one tailed tests. In case 3 we want to test whether the population mean of the first sample is less than that of the second sample.
    Hence, H0: μ1 ≥ μ2 , vs, HA: μ1 < μ2.

Two tailed test : Here the alternate hypothesis HA is formulated to test for difference in either direction, i.e., for either an increase or a decrease in the random variable. Hence the test statistic is tested for occurrence within either of the two critical regions on the two extremes of the distribution.

  1. One sample test: For the lake example we need to know if the mean concentration of the lake is the same as or different from a specified value of 10 mg/L.
    Hence, H0: μ = 10 mg/L, vs, HA: μ ≠ 10 mg/L.

  2. Two sample test: In Table 1, case 1 is an illustration of a two sample two tailed test. In case 1 we want to test whether the population mean of the first sample (μ1) is the same as or different from the mean of the second sample (μ2).
    Hence, H0: μ1 = μ2 , vs, HA: μ1 ≠ μ2.

Given the same level of significance the two tailed test is more conservative, i.e., it is more rigorous than the one-tailed test because the rejection point is farther out in the tail. It is more difficult to reject H0 with a two-tailed test than with a one-tailed test.
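The effect of splitting the significance level between two tails can be seen numerically. The sketch below uses the standard normal distribution (i.e. a z statistic, assumed here for simplicity; t critical values behave the same way) and a 5% level of significance chosen only for illustration.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed: all of alpha sits in a single tail.
z_one_tailed = std_normal.inv_cdf(1 - alpha)
# Two-tailed: alpha is split equally between the two tails.
z_two_tailed = std_normal.inv_cdf(1 - alpha / 2)

print(f"one-tailed critical z: {z_one_tailed:.3f}")
print(f"two-tailed critical z: {z_two_tailed:.3f}")
```

The two-tailed critical value is the larger of the two, so an observed statistic has to lie farther from zero before H0 is rejected.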

[Figure (linked in the original page): critical region(s) for one and two tailed tests]



When using probability to decide whether a statistical test provides evidence for or against our predictions, there is always a chance of drawing the wrong conclusion. Even when choosing a probability level of 95%, there is always a 5% chance that one rejects the null hypothesis when it was actually correct. This is called Type I error, represented by the Greek letter α.

It is possible to err in the opposite way if one fails to reject the null hypothesis when it is, in fact, incorrect. This is called Type II error, represented by the Greek letter β. These two errors are represented in the following chart.

Table 2. Types of error
Type of decision    H0 true                   H0 false
Reject H0           Type I error (α)          Correct decision (1-β)
Accept H0           Correct decision (1-α)    Type II error (β)

A related concept is power, which is the probability of rejecting the null hypothesis when it is actually false. Power is simply 1 minus the Type II error rate, and is usually expressed as 1-β.

When choosing the probability level of a test, it is possible to control the risk of committing a Type I error by choosing an appropriate α.

This also affects the Type II error, since the two are inversely related: as one increases, the other decreases.

[Figure (linked in the original page): Choice of α]

There is little control over the risk of committing a Type II error, because it also depends on the actual difference being evaluated, which is usually unknown. At a fixed α value, the β value changes according to the actual distribution of the population.

[Figure (linked in the original page): Changes in β]
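To make the dependence of the Type II error on the true (unknown) mean concrete, here is a minimal sketch for an upper one-tailed z-test with a known standard deviation. The limit of 10 mg/L echoes the lake example; the standard deviation, sample size and candidate true means are hypothetical, and the function name type_ii_error is introduced only for this illustration.

```python
import math
from statistics import NormalDist

def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """Type II error for an upper one-tailed z-test of H0: mu <= mu0."""
    std_normal = NormalDist()
    se = sigma / math.sqrt(n)  # standard error of the sample mean
    # Sample means above this cutoff fall in the critical region.
    cutoff = mu0 + std_normal.inv_cdf(1 - alpha) * se
    # Probability that the sample mean stays below the cutoff
    # when the true mean is mu_true (i.e. H0 is not rejected).
    return NormalDist(mu_true, se).cdf(cutoff)

# Hypothetical setting: limit of 10 mg/L, sigma = 2 mg/L, 16 samples.
for mu_true in (10.5, 11.0, 12.0):
    beta = type_ii_error(10.0, mu_true, 2.0, 16, 0.05)
    print(f"true mean {mu_true}: Type II error = {beta:.3f}, power = {1 - beta:.3f}")
```

With the level of significance held fixed, the Type II error shrinks (and power grows) as the true mean moves farther from the hypothesized value.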

The consequences of these different types of error are very different. For example, if one tests for the significant presence of a pollutant, incorrectly deciding that a site is polluted (Type I error) will cause a waste of resources and energy cleaning up a site that does not need it. On the other hand, failure to determine presence of pollution (Type II error) can lead to environmental deterioration or health problems in the nearby community.


Steps in Hypothesis Testing


  1. Identify the null hypothesis H0 and the alternate hypothesis HA.

  2. Choose α. The value should be small, usually less than 10%. It is important to consider the consequences of both types of errors.

  3. Select the test statistic and determine its value from the sample data. This value is called the observed value of the test statistic. Remember that a t statistic is usually appropriate for a small number of samples; for a larger number of samples, a z statistic can work well if data are normally distributed.

  4. Compare the observed value of the statistic to the critical value obtained for the chosen α.

  5. Make a decision.
     If the test statistic falls in the critical region: reject H0 in favour of HA.
     If the test statistic does not fall in the critical region: conclude that there is not enough evidence to reject H0.
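The comparison and decision steps can be sketched as a small helper. The function names and the example numbers are hypothetical; the critical value is assumed to be the positive tabulated value for the chosen level of significance, and the tail argument states which side(s) of the distribution the alternate hypothesis points to.

```python
def in_critical_region(observed, critical, tail):
    """Return True if the observed statistic falls in the critical region.

    critical is the positive tabulated value; tail is "lower", "upper" or "two".
    """
    if tail == "lower":   # HA: parameter is smaller
        return observed < -critical
    if tail == "upper":   # HA: parameter is larger
        return observed > critical
    if tail == "two":     # HA: parameter differs in either direction
        return abs(observed) > critical
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

def decide(observed, critical, tail):
    """Translate the comparison into the wording used for the decision step."""
    if in_critical_region(observed, critical, tail):
        return "Reject H0 in favour of HA"
    return "Not enough evidence to reject H0"

# Hypothetical observed and critical values for a two-tailed test.
print(decide(observed=1.20, critical=2.10, tail="two"))
```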


Practical Examples

A) One tailed Test

An aquaculture farm takes water from a stream and returns it after it has circulated through the fish tanks. The owner thinks that, since the water circulates rather quickly through the tanks, there is little organic matter in the effluent. To find out if this is true, he takes some samples of the water at the intake and other samples downstream of the outlet, and tests for Biochemical Oxygen Demand (BOD). If BOD increases, it can be said that the effluent contains more organic matter than the stream can handle.

The data for this problem are given in the following table:

Table 3. BOD in the stream

One tailed t-test :

Upstream Downstream
6.782 9.063
5.809 8.381
6.849 8.660
6.879 8.405
7.014 9.248
7.321 8.735
5.986 9.772
6.628 8.545
6.822 8.063
6.448 8.001

1) A is the set of samples taken at the intake (upstream); and B is the set of samples taken downstream. Since the question is whether BOD increases downstream:

    H0: μA ≥ μB
    HA: μA < μB

2) Choose α. Let us use 5% for this example.
3) The observed t value is calculated.
4) The critical t value is obtained according to the degrees of freedom.

The resulting t test values are shown in this table:

Table 4. t-Test : Two-Sample Assuming Equal Variances

                                Upstream    Downstream
Mean                            6.6539      8.6874
Variance                        0.2124      0.2988
Observations                    10          10
Pooled Variance                 0.2556
Hypothesized Mean Difference    0
Degrees of freedom              18
t Stat                          -8.9941
P(T<t) one-tail                 2.22 x 10^-8
t Critical one-tail             1.7341
P(T<t) two-tail                 4.45 x 10^-8
t Critical two-tail             2.1009
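The pooled variance and t statistic in Table 4 can be recomputed from the data in Table 3 with a short sketch. The function name pooled_t is hypothetical, the critical value is copied from Table 4 rather than computed, and the comparison treats the problem as a lower-tailed test (upstream mean smaller than downstream mean), which is an assumption about how the hypotheses were set up.

```python
import math
import statistics

upstream = [6.782, 5.809, 6.849, 6.879, 7.014, 7.321, 5.986, 6.628, 6.822, 6.448]
downstream = [9.063, 8.381, 8.660, 8.405, 9.248, 8.735, 9.772, 8.545, 8.063, 8.001]

def pooled_t(a, b):
    """Two-sample t statistic assuming equal population variances."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    dof = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(a) - statistics.mean(b)) / se
    return t_stat, dof, pooled_var

t_stat, dof, pooled_var = pooled_t(upstream, downstream)
t_critical = 1.7341  # one-tail critical value copied from Table 4

print(f"pooled variance = {pooled_var:.4f}, t = {t_stat:.4f}, df = {dof}")
# Lower-tailed comparison: reject H0 only if t lies below -t_critical.
print("reject H0" if t_stat < -t_critical else "do not reject H0")
```

Small differences in the last decimal place relative to Table 4 are to be expected, since the tabulated data are rounded.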

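5) Make a decision. Is the effluent polluting the stream? The observed t (-8.9941) lies well beyond the one-tailed critical value in the lower tail (-1.7341), so H0 is rejected: mean BOD is significantly higher downstream of the outlet, which indicates that the effluent is adding organic matter to the stream.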

B) Two tailed Test

Let us assume that an induced bioremediation process is being conducted at a contaminated site. The researcher has obtained good cleanup rates by injecting a mixture of nutrients into the soil in order to maintain an abundant microbial community. Someone suggests using a cheaper mixture. The researcher tries one patch of land with the new mixture, and compares the degradation rates to those obtained from a patch treated with the expensive one to see if he can get the same degradation rates.

The data for this problem are shown in the following table:

Table 5. Degradation rates on treatment with different nutrients.

Two tailed t-test
Cheap Nutrient Expensive Nutrient
7.1031 9.6662
6.4085 10.1320
8.8819 9.0624
7.0094 8.8136
4.6715 9.2345
6.6135 9.9949
6.5877 9.4299
6.2849 8.8012
6.6789 9.9249
6.5542 8.1739

1) A is treated with the cheap nutrient; and B is treated with the expensive one.

    H0: μA = μB
    HA: μA ≠ μB

2) Choose α. We will use 5%, as in the previous example.
3) The observed t value is calculated.
4) The critical t value must be obtained according to the degrees of freedom.

Assuming that variances from the two sets are unequal, we obtain the following t test:

Table 6. t-Test : Two-Sample Assuming Unequal Variances

                                Cheap Nutrient    Expensive Nutrient
Mean                            6.6794            9.3233
Variance                        1.0476            0.3917
Observations                    10                10
Hypothesized Mean Difference    0
Degrees of freedom              15
t Stat                          -6.9691
P(T<t) one-tail                 2.25 x 10^-6
t Critical one-tail             1.7531
P(T<t) two-tail                 4.51 x 10^-6
t Critical two-tail             2.1315
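The unequal-variance statistic in Table 6 can be recomputed from the data in Table 5. The sketch below uses the Welch-Satterthwaite approximation for the degrees of freedom; that this is the formula behind the 15 degrees of freedom in Table 6 (after rounding) is an assumption, and the function name welch_t is hypothetical. The critical value is copied from Table 6 rather than computed.

```python
import math
import statistics

cheap = [7.1031, 6.4085, 8.8819, 7.0094, 4.6715, 6.6135, 6.5877, 6.2849, 6.6789, 6.5542]
expensive = [9.6662, 10.1320, 9.0624, 8.8136, 9.2345, 9.9949, 9.4299, 8.8012, 9.9249, 8.1739]

def welch_t(a, b):
    """Two-sample t statistic and approximate df without assuming equal variances."""
    n_a, n_b = len(a), len(b)
    # Squared standard error contributed by each sample.
    se2_a = statistics.variance(a) / n_a
    se2_b = statistics.variance(b) / n_b
    t_stat = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

t_stat, dof = welch_t(cheap, expensive)
t_critical = 2.1315  # two-tail critical value copied from Table 6

print(f"t = {t_stat:.4f}, approximate df = {dof:.1f}")
# Two-tailed comparison: reject H0 if t is extreme in either direction.
print("reject H0" if abs(t_stat) > t_critical else "do not reject H0")
```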

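5) Make a decision. Was the expensive nutrient actually necessary? The absolute value of the observed t (6.9691) exceeds the two-tailed critical value (2.1315), so H0 is rejected: the two mixtures do not give the same degradation rates, and the mean rate obtained with the cheap nutrient is lower.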


Limitations for Environmental Sampling

Although hypothesis tests are a very useful tool in general, they are sometimes not appropriate in the environmental field. The following cases illustrate some of the limitations of this type of test:

A) Multiple Comparisons

z and t tests are very useful when comparing two population means. However, they are not well suited to comparing several population means at the same time.

Suppose we are interested in comparing pollutant concentrations from three different wells with means μ1, μ2 and μ3. We could test the following hypothesis:

    H0: μ1 = μ2 = μ3
    HA: not all means are equal

We would need to conduct three different hypothesis tests, which are shown here:

Table 7. Hypothesis tests needed for testing three different populations
Test    H0          HA
1       μ1 = μ2     μ1 ≠ μ2
2       μ2 = μ3     μ2 ≠ μ3
3       μ1 = μ3     μ1 ≠ μ3

For each test, there is always the possibility of committing an error. Since we are conducting three such tests, the overall error probability would exceed the acceptable range, and we could not feel very confident about the final conclusion. Table 8 shows the resulting overall α if multiple t tests are conducted. Here k is the number of populations to be compared.

Table 8. Probability of committing a type I error by using multiple t tests to seek differences between all pairs of k means.

                       Level of Significance used in the t tests

Number of means (k)    0.20    0.10    0.05    0.02    0.01    0.001
2                      0.20    0.10    0.05    0.02    0.01    0.001
3                      0.41    0.23    0.13    0.05    0.03    0.003
4                      0.58    0.36    0.21    0.09    0.05    0.006
5                      0.71    0.47    0.23    0.13    0.07    0.009
10                     0.96    0.83    0.63    0.37    0.23    0.034
20                     1.00    0.98    0.92    0.71    0.52    0.109
∞                      1.00    1.00    1.00    1.00    1.00    1.00

Note : The particular values were derived from a table by Pearson (1942) by assuming equal population variances and large samples.
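The way the overall Type I error grows can be illustrated with a simplified calculation. The sketch below assumes that the pairwise tests are independent, which is not strictly true (the same samples appear in several pairs), so its output is only a rough guide and is not expected to reproduce the values in Table 8 exactly. The function name familywise_alpha is hypothetical.

```python
from math import comb

def familywise_alpha(k, alpha):
    """Chance of at least one Type I error across all pairwise tests of k means,
    under the simplifying assumption that the tests are independent."""
    n_tests = comb(k, 2)  # number of distinct pairs of means
    return 1 - (1 - alpha) ** n_tests

for k in (2, 3, 4, 5, 10):
    overall = familywise_alpha(k, 0.05)
    print(f"k = {k:2d}: {comb(k, 2):2d} tests, overall Type I error ~ {overall:.2f}")
```

Even under this simplification the pattern matches the table: the number of pairwise tests grows quickly with k, and the overall error probability grows with it.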

A better method for comparing several population means is an analysis of variance, abbreviated as ANOVA.

The ANOVA test is based on the variability between the sample means. This variability is measured in relation to the variability of the data values within the samples. These two variances are compared by means of the F ratio test.

When the variability between the sample means is large compared to the variability within the samples, it can be concluded that not all the population means are equal.
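A minimal sketch of the F ratio for a one-way ANOVA follows. The well concentrations are hypothetical and the function name one_way_f is introduced only for this illustration; the resulting F would still have to be compared with a tabulated critical value for the corresponding degrees of freedom.

```python
import statistics

def one_way_f(groups):
    """Return (F ratio, df between, df within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of the group means.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical pollutant concentrations from three wells.
wells = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_ratio, df_b, df_w = one_way_f(wells)
print(f"F = {f_ratio:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

A single F test replaces the set of pairwise t tests, so the Type I error stays at the chosen level instead of accumulating across comparisons.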

B) Multiple Constituents

In example A, we were only testing for BOD, so only one t test was necessary. If we had been trying to trace more than one pollutant, which is usually the case, we would have to carry out a separate test for each pollutant in order to determine if the effluent was similar to the receiving stream. Then we would have the same problem we encountered with multiple comparisons: the overall α would increase. Table 8 applies to this case too. The k value in this case would represent the number of pollutants instead of the number of populations.

C) Difficulty in meeting assumptions

The tests used in the testing of hypothesis, viz., t-tests and ANOVA have some fundamental assumptions that need to be met, for the test to work properly and yield good results. The main assumptions for the t-test and ANOVA are listed below.

The primary assumptions underlying the t-test are:

  1. The samples are drawn randomly from a population in which the data are normally distributed.
  2. In the case of a two sample t-test, σ1² = σ2². It is therefore assumed that s1² and s2² both estimate a common population variance, σ². This assumption is called the homogeneity of variances.
  3. In the case of a two sample t-test, the measurements in sample 1 are independent of those in sample 2.

Like the t-test, analysis of variance is based on a model that requires certain assumptions. Three primary assumptions of ANOVA are that:

  1. Each group is obtained randomly, with each observation independent of all other observations and the groups independent of each other.
  2. The samples represent populations in which the data are normally distributed.
  3. σ1² = σ2² = σ3² = ... = σk². The assumption of homogeneity of variances is similar to that discussed above under the t-test. The group variances are assumed to be estimates of a common variance, σ².

In actual experimental or sampling situations, the underlying populations are not likely to be exactly normally distributed with exactly equal variances. Both the t-test and ANOVA are quite robust and yield reliable results when some of the assumptions are not strictly met. For example, if n1 = n2 = ... = nk, ANOVA tends to be especially robust with respect to the assumption of homogeneity of variances. As the number of groups tested, k, increases, there is a greater effect on the value of the F-statistic. A reasonable departure from the assumption of population normality does not have a serious effect on the reliability of the F-statistic or the t-statistic. It is essential, however, that the assumption of independence be met: the analysis is not robust for non-independent measurements. These factors are to be taken into consideration while testing hypotheses.





Send comments or suggestions to:
Student Authors: Georgina Wilson González, gwilsong@vt.edu , and Karpagam Sankaran, ksankara@vt.edu
Faculty Advisor: Daniel Gallagher, dang@vt.edu
Copyright © 1997 Daniel Gallagher
Last Modified: 09-10-1997