The advice given in the kit concerning 'significance' in the difference between the average salaries of various groups can be seen as a useful rule of thumb. It is intended to be easy to understand and easy to apply.
However, the term 'significant' has a precise statistical meaning, which is about the probability of an observed difference of whatever size arising by chance when there is no real difference. The convention is that where the probability of the difference having arisen by chance is 5% or less (i.e. where the test shows p ≤ 0.05), this is described as 'statistically significant', and where the probability of a chance result is 1% or less (p ≤ 0.01) as 'highly statistically significant'.
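These conventional thresholds can be captured in a few lines of code. The sketch below is purely illustrative (the function name is invented; the cut-offs are those just described):

```python
def describe_significance(p_value):
    """Label a p-value using the conventional 5% and 1% thresholds."""
    if p_value <= 0.01:
        return "highly statistically significant"
    if p_value <= 0.05:
        return "statistically significant"
    return "not statistically significant"

print(describe_significance(0.03))   # statistically significant
print(describe_significance(0.005))  # highly statistically significant
```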
Statistical significance is a function of the size of the difference and the size of the samples. In practice, this means that where there is a genuine difference, however small, the difference will be 'statistically significant' if the sample sizes are large. However, the difference may not be what is meant by 'significant' in normal parlance.
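This interaction between sample size and significance can be demonstrated with simulated data. The sketch below uses scipy with entirely invented figures: the same genuine 1% pay gap (30,300 against 30,000, with a spread of 3,000) is comfortably detected with large samples, while with small samples the test will usually be unable to distinguish it from chance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def sample(mean, n):
    """Draw n hypothetical salaries around the given mean, spread 3,000."""
    return rng.normal(mean, 3000, n)

# Large samples: the small but genuine gap shows up as statistically significant
t_large, p_large = stats.ttest_ind(sample(30300, 5000), sample(30000, 5000))

# Small samples: the very same underlying gap will typically fail to reach significance
t_small, p_small = stats.ttest_ind(sample(30300, 20), sample(30000, 20))

print(f"n=5000 per group: p = {p_large:.4f}")
print(f"n=20 per group:   p = {p_small:.4f}")
```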
It also means that where a very large difference is observed – a difference that would be regarded as important or substantial – this could fall short of 'statistical significance' if the numbers are very small.
Thus, if you used statistical significance alone as a guide to where further investigation should be carried out, you could end up investigating insubstantial differences between large groups of men and women. Conversely, if you used 'effect size' – the size of the difference – as your sole guide, you could end up looking into differences that resulted from chance variations.
Judging what is significant
The equal pay kit suggests an 'effect size' of a 5% difference in the pay of men and women doing equal work, or a 3% difference where there is a pattern of differences favouring one sex or the other, as 'significant' and therefore justifying further investigation. (The way to calculate this is to see whether the difference between the salaries for the two groups is more than 5% – or 3% – of the lower of the two salaries.) This is a sensible rule of thumb, but it needs to be applied with judgement. If you have the resources to do so, you should also calculate statistical significance.
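The calculation described in brackets is straightforward to script. In this minimal sketch the function names are my own; the 5% and 3% thresholds are the kit's:

```python
def pay_gap_percent(mean_a, mean_b):
    """The gap between two mean salaries, as a percentage of the lower one."""
    lower = min(mean_a, mean_b)
    return abs(mean_a - mean_b) / lower * 100

def merits_investigation(mean_a, mean_b, pattern_favouring_one_sex=False):
    """Apply the kit's rule of thumb: more than 5%, or 3% where a pattern exists."""
    threshold = 3.0 if pattern_favouring_one_sex else 5.0
    return pay_gap_percent(mean_a, mean_b) > threshold

print(pay_gap_percent(22000, 20000))       # 10.0
print(merits_investigation(20800, 20000))  # False: a 4% gap, below the 5% threshold
print(merits_investigation(20800, 20000, pattern_favouring_one_sex=True))  # True
```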
Where numbers are very small, a difference of 3%, or even 5%, may not merit further investigation, although you might want to keep this particular type of work under careful review. And where a statistically significant difference is found, it may be worth looking into this even if the gap is less than 3%. You would certainly want to examine the size of the difference in any cases where it was found to be statistically significant.
In general it is preferable to investigate a non-significant difference rather than to fail to investigate a significant difference, so, unless there is a substantial resource cost, you should 'if in doubt, check it out'.
The most common test of statistical significance used to compare two means is the t-test. The t-test requires that certain assumptions about the data are satisfied – that the data are normally distributed (the classic 'bell curve'), and that the 'variances' of the two groups are equal. However, in practice it is considered sufficiently robust to give valid results even if these two assumptions are not fully satisfied, and there is also a version of the test (Welch's t-test) that does not assume equal variances.
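Both versions of the test are available through scipy's ttest_ind function; the equal_var argument switches between them. The salary figures below are invented for illustration:

```python
import numpy as np
from scipy import stats

# Illustrative salary samples (entirely hypothetical figures)
men = np.array([31200, 29800, 33500, 30400, 32100, 28900, 31800, 30700])
women = np.array([29400, 28100, 30900, 27800, 29900, 28600, 30200, 28000])

# Standard t-test, which assumes equal variances in the two groups
t_equal, p_equal = stats.ttest_ind(men, women)

# Welch's version, which drops the equal-variances assumption
t_welch, p_welch = stats.ttest_ind(men, women, equal_var=False)

print(f"equal variances assumed: t = {t_equal:.2f}, p = {p_equal:.4f}")
print(f"Welch's t-test:          t = {t_welch:.2f}, p = {p_welch:.4f}")
```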
There are also 'non-parametric' tests that can be used if it is likely that the assumptions are violated. These use the median rather than the mean to make comparisons between groups; for two groups, the Mann-Whitney U test is the most commonly used. They can be a useful check on the validity of the t-test.
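As a check of this kind, the Mann-Whitney U test (the usual non-parametric counterpart of the two-sample t-test) can be run in scipy on the same sort of data; the figures here are again invented:

```python
from scipy import stats

# Hypothetical salary samples for two groups
men = [31200, 29800, 33500, 30400, 32100, 28900, 31800, 30700]
women = [29400, 28100, 30900, 27800, 29900, 28600, 30200, 28000]

# Mann-Whitney compares the groups on ranks rather than on means
u_stat, p_value = stats.mannwhitneyu(men, women, alternative='two-sided')
print(f"U = {u_stat}, p = {p_value:.4f}")
```

If the t-test and the rank-based test point the same way, that adds confidence that the result is not an artefact of the t-test's assumptions.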
The t-test can only be used to compare the means of two groups, so it can be used to compare men and women, white and non-white, or disabled and non-disabled staff, but not, for example, to compare the mean pay of three different ethnic groups. A more complex analysis of variance (ANOVA) can be used for this purpose, but you do need to be able to understand and interpret the results properly. There are also non-parametric techniques, such as the Kruskal-Wallis test, for comparing more than two groups.
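Both one-way ANOVA and the Kruskal-Wallis test are available in scipy, as f_oneway and kruskal respectively. A minimal sketch, with three invented groups of salaries:

```python
from scipy import stats

# Hypothetical salary samples for three groups
group_a = [30100, 31400, 29800, 30900, 31800]
group_b = [28400, 29100, 27900, 28800, 29500]
group_c = [30500, 29900, 31200, 30200, 30800]

# One-way ANOVA: do the group means differ by more than chance would explain?
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Kruskal-Wallis: the rank-based equivalent, robust to non-normal data
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)

print(f"ANOVA:          F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```

Note that a significant result from either test only tells you that at least one group differs from the others; identifying which group requires follow-up (post-hoc) pairwise comparisons, which is part of interpreting the results properly.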