Fill the two by two table with integer values. A proportional prevalence value can be given optionally. Or:
Give a proportional sensitivity value in the ++ box and a proportional specificity value in the -- box. A proportional prevalence value can be given optionally. On top of that, if you want, you can give a population or sample size.
If you think there isn't a lot you can do with a 2*2 table, think again. The analysis of the diagnostic effectiveness of a test is quite complicated, and in a two by two table you can study all the intricacies which play a role in the development and application of diagnostic instruments. The analysis presents the situation of a test with two possible results (negative or positive, diseased or healthy, fail or pass) against the objective measurement of the outcome, also measured dichotomously. Outcome measurement might, for example, be that we wait a while for the disease to develop or not to develop, use the result of a highly valid laboratory procedure or confirmative surgery, or, to take an example outside medicine, see whether the pupil indeed follows the predicted career. The input for the procedure is simple: in a two by two table you classify the number of times a test made a correct positive prediction that the individual is affected by the problem, a correct negative prediction, an incorrect positive prediction, or an incorrect negative prediction.
Below, the various indicators that are presented in the output of this SISA procedure are discussed. In the discussion attention is given to various considerations in diagnostic test development theory. Most output is presented with estimates of the variance and standard errors. Often the continuity corrected Wilson (which is equivalent to Fleiss's quadratic confidence interval) and sometimes Wald's confidence intervals are also presented; for the data that is used in test development these are to be preferred above the ones directly based on the standard errors. If you want to use another confidence interval, note the percentage and the number of cases on which this percentage is based and use the one mean procedure.
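As an illustration of the continuity corrected Wilson interval mentioned above, the following is a minimal sketch in Python. It uses the closed form commonly attributed to Newcombe's description of the method; the function name `wilson_cc_interval` and its arguments are assumptions made for this example and are not taken from the SISA program, whose exact implementation may differ.

```python
import math

def wilson_cc_interval(successes, n, z=1.96):
    """Continuity-corrected Wilson score interval for a proportion.

    successes: number of cases with the property (e.g. true positives)
    n:         number of cases the proportion is based on
    z:         normal quantile (1.96 for a two-sided 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    q = 1.0 - p
    denom = 2.0 * (n + z * z)

    # Lower bound; defined as 0 when no successes were observed.
    if successes == 0:
        lower = 0.0
    else:
        root = math.sqrt(z * z - 2.0 - 1.0 / n + 4.0 * p * (n * q + 1.0))
        lower = max(0.0, (2.0 * n * p + z * z - 1.0 - z * root) / denom)

    # Upper bound; defined as 1 when every case was a success.
    if successes == n:
        upper = 1.0
    else:
        root = math.sqrt(z * z + 2.0 - 1.0 / n + 4.0 * p * (n * q - 1.0))
        upper = min(1.0, (2.0 * n * p + z * z + 1.0 + z * root) / denom)

    return lower, upper

# Example: a sensitivity of 50 out of 100 diseased individuals.
low, high = wilson_cc_interval(50, 100)
print(round(low, 3), round(high, 3))
```

The interval is not forced to be symmetric around the observed proportion, which is one reason it tends to behave better than intervals built directly from the standard error when proportions are close to zero or one.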
True positives, false positives, true negatives, false negatives. Note that the perspective of the test is taken, not the perspective of the (outcome) standard. Positive means that the test result indicates that a problem (disease) is present but that doesn't have to be necessarily true. A false positive for example is someone who tested positive but who is in fact problem free. Presented are respectively the numbers in this particular table and the proportions for each of the cells relative to the complete table.
Accuracy, number and proportion of all the observations in the table which have been classified correctly by the test.
Kappa is a measure of agreement and takes on the value zero if there is no more agreement between test and outcome than can be expected on the basis of chance. Kappa takes on the value 1 if there is perfect agreement, i.e. the test always correctly predicts the outcome. It is considered that Kappa values lower than 0.4 represent poor agreement, values between 0.4 and 0.75 fair to good agreement, and values higher than 0.75 excellent agreement. Negative Kappa indicates a problem in the application of the test. Kappa is dependent not only on the quality of the test, i.e., the inside of the table, but also on the prevalence of the disease in the population in which the test is applied; in other words, Kappa is also sensitive to the distribution of cases in the table margins. Basically what Kappa shows is that for the same sensitivity and specificity the agreement between test and outcome will decrease with a decreasing prevalence. In Kappa terms a test will perform worse in low prevalence populations.
Sensitivity. The probability that an individual who is diseased is indeed tested as diseased. You would want the sensitivity of a test to be high if having the "disease" is relatively serious and the "cure" is relatively inexpensive and easily available. One would expect the sensitivity to be larger than 0.5.
Specificity. The probability that an individual who is not diseased is tested as not diseased. You would want the specificity to be high if having the disease is not so serious and the "cure" is relatively expensive in money and other terms. Note that there is a tradeoff between specificity and sensitivity: high specificity mostly means low sensitivity, and vice versa. For example, a diagnostic instrument is based on some type of blood count and treatment is given if the blood count is below a certain level. If you give treatment at a relatively high blood count and thus give relatively many patients treatment (high sensitivity, low specificity), people who do not need it will get unnecessary treatment. If you are conservative and set the blood count level for inclusion in treatment low (low sensitivity, high specificity), people who require treatment will be missed. One would expect the specificity to be larger than 0.5.
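To make the dependence of Kappa on prevalence concrete, here is a small Python sketch. The cell names `tp`, `fp`, `fn`, `tn` and the helper `table_from_rates` are assumptions introduced for this example; the sensitivity of 0.6 and specificity of 0.95 are simply illustrative values.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of test result against outcome."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # proportion classified correctly
    # Agreement expected by chance, from the row and column margins.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

def table_from_rates(sens, spec, prev, n=10000):
    """Expected cell counts for given sensitivity, specificity, prevalence."""
    diseased = prev * n
    healthy = n - diseased
    tp = sens * diseased
    fn = diseased - tp
    tn = spec * healthy
    fp = healthy - tn
    return tp, fp, fn, tn

# Same test quality, two different prevalences.
for prev in (0.25, 0.05):
    kappa = cohen_kappa(*table_from_rates(0.6, 0.95, prev))
    print(f"prevalence {prev:.2f}: kappa = {kappa:.3f}")
```

With these illustrative values Kappa drops when the prevalence goes from 0.25 to 0.05, although the sensitivity and specificity are unchanged.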
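The tradeoff in the blood count example can be sketched in a few lines of Python. The blood count values below are invented purely for illustration, and the function name `sens_spec_at_threshold` is an assumption of this example; a test is called positive when the count is below the chosen threshold.

```python
def sens_spec_at_threshold(diseased_values, healthy_values, threshold):
    """Sensitivity and specificity when 'value < threshold' means positive."""
    tp = sum(1 for v in diseased_values if v < threshold)
    fn = len(diseased_values) - tp
    fp = sum(1 for v in healthy_values if v < threshold)
    tn = len(healthy_values) - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical blood counts, invented for this illustration only.
diseased = [3.1, 3.8, 4.2, 4.9, 5.5]
healthy = [4.6, 5.2, 5.9, 6.3, 7.0]

# A high threshold treats many people: high sensitivity, low specificity.
print(sens_spec_at_threshold(diseased, healthy, 6.0))
# A low threshold treats few people: low sensitivity, high specificity.
print(sens_spec_at_threshold(diseased, healthy, 4.5))
```

Moving the threshold only trades one kind of error for the other; it does not change how well the underlying measurement separates the two groups.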
Positive likelihood. Indicates how much more likely it is to get a positive test in the diseased as opposed to the non-diseased group. Some authors state that this likelihood gives the relationship between the pre-test probability (probability of having the disease before being tested) and the post-test probability (probability of having the disease after testing positive). That can't be correct for probabilities, because the ratio between the post-test and the pre-test probability is equal to the positive predictive accuracy divided by the prevalence. The likelihood does link the two when they are expressed as odds: the post-test odds equal the pre-test odds multiplied by the positive likelihood.
Negative likelihood. Indicates how much more likely it is to get a negative test in the non-diseased as opposed to the diseased group.
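A minimal Python sketch of the two likelihoods follows. The positive likelihood uses the usual formula, sensitivity / (1 - specificity). The negative likelihood is written the way it is worded above (non-diseased relative to diseased), i.e. specificity / (1 - sensitivity); note that many textbooks report the reciprocal, (1 - sensitivity) / specificity, so which form a given program prints is an assumption here. The function names are hypothetical. The second function shows the odds form of Bayes' rule, in which a likelihood ratio converts pre-test odds into post-test odds.

```python
def likelihood_ratios(sens, spec):
    """Positive and negative likelihood as described in the text above."""
    lr_positive = sens / (1 - spec)
    # Assumption: 'negative test in non-diseased vs diseased'.
    # The conventional negative likelihood ratio is 1 / lr_negative.
    lr_negative = spec / (1 - sens)
    return lr_positive, lr_negative

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Apply a likelihood ratio on the odds scale, return a probability."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

lr_pos, lr_neg = likelihood_ratios(0.6, 0.95)
print(round(lr_pos, 2), round(lr_neg, 3))
# Probability of disease after a positive test at 5% prevalence.
print(round(post_test_probability(0.05, lr_pos), 3))
```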
Diagnostic Odds Ratio. Often used as a measure of the discriminative power of the test. Has the value one if the test does not discriminate between diseased and not diseased. Values far above one mean that a test discriminates well. Values lower than one mean that there is something wrong in the application of the test.
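In terms of the four cells, the diagnostic odds ratio is the odds of a positive test among the diseased divided by the odds of a positive test among the non-diseased. A short Python sketch, with hypothetical cell names `tp`, `fp`, `fn`, `tn` and made-up counts:

```python
def diagnostic_odds_ratio(tp, fp, fn, tn):
    """(tp / fn) / (fp / tn), written as a cross-product ratio."""
    if fp == 0 or fn == 0:
        # The ratio is undefined with an empty error cell; a common
        # workaround is to add 0.5 to every cell before computing it.
        raise ValueError("fp and fn must be non-zero")
    return (tp * tn) / (fp * fn)

# Made-up table: 100 diseased (60 detected), 100 healthy (5 false alarms).
print(diagnostic_odds_ratio(60, 5, 40, 95))
# A test that does not discriminate at all gives a ratio of one.
print(diagnostic_odds_ratio(25, 25, 25, 25))
```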
Error Odds Ratio. Indicates if the probability of being wrongly classified is highest in the diseased or in the non-diseased group. If the error odds is higher than one the probability is highest in the diseased group (and the specificity of the test is better than the sensitivity), if the value is lower than one the probability of an incorrect classification is highest in the non-diseased group (and the sensitivity of the test is better than the specificity).
Youden's J. Is used to study the overall performance of a test. The J takes on the value 1 if a diagnostic test discriminates perfectly and without making any mistakes. If you want to minimize the probability of making an error you should maximize Youden's J. That, of course, is of only theoretical importance; such decisions should be taken on the basis of a cost-benefit analysis of treatment as opposed to no treatment.
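Youden's J is sensitivity plus specificity minus one. The sketch below, in Python, computes it and picks the cut-off with the highest J from a list of candidates. The candidate cut-offs and their sensitivity and specificity values are hypothetical and serve only to show the mechanics.

```python
def youden_j(sens, spec):
    """Youden's J: 0 for an uninformative test, 1 for a perfect test."""
    return sens + spec - 1

# Hypothetical candidates: (cut-off, sensitivity, specificity).
candidates = [
    (4.5, 0.6, 1.0),
    (6.0, 1.0, 0.4),
]

# Select the cut-off that maximizes J.
best = max(candidates, key=lambda c: youden_j(c[1], c[2]))
print("best cut-off by Youden's J:", best[0])
```

Maximizing J weighs a missed case and a false alarm equally, which is why a cost-benefit analysis may lead to a different cut-off.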
Positive predictive accuracy. In a table representative of the population it gives: 1) the post-test probability, the probability for an individual in the population who tested positive of having the disease; 2) of those who tested positive the fractions who were correctly and who were not-correctly classified.
Negative predictive accuracy. In a table representative of the population it gives: 1) the post-test probability, the probability for an individual in the population who tested negative of not having the disease; 2) of those who tested negative the fractions who were correctly and who were not-correctly classified.
The chi-square gives the probability of the relationship between the row and column variable being caused by chance. It might be the case that the fact that there are more observations in the top left and the bottom right cell is caused by a chance fluctuation. In that case the test would not be predictive of the outcome and the observed relationship would be spurious. If the probability (p) of the chi-square is low (customarily "low" mostly means less than 0.05) it is unlikely that the observed results are caused by chance. Use Pearson's chi-square if the total number of cases is relatively high, Yates' chi-square otherwise. Before one develops a program for making a diagnostic test it is advisable to calculate the required sample size to ensure that the chi-square for the table will show a statistically significant result.
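For a two by two table both chi-squares have a simple closed form, sketched below in Python using only the standard library. The function name and cell names are assumptions of this example. The p-value uses the fact that for one degree of freedom the upper tail of the chi-square distribution equals erfc(sqrt(chi2 / 2)).

```python
import math

def chi_square_2x2(tp, fp, fn, tn, yates=False):
    """Pearson or Yates-corrected chi-square (1 df) with its p-value."""
    n = tp + fp + fn + tn
    test_pos, test_neg = tp + fp, fn + tn   # row totals
    diseased, healthy = tp + fn, fp + tn    # column totals
    diff = abs(tp * tn - fp * fn)
    if yates:
        # Continuity correction: shrink the difference by n/2, not below 0.
        diff = max(0.0, diff - n / 2)
    chi2 = n * diff ** 2 / (test_pos * test_neg * diseased * healthy)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up table: 60 true positives, 5 false positives,
# 40 false negatives, 95 true negatives.
print(chi_square_2x2(60, 5, 40, 95))
print(chi_square_2x2(60, 5, 40, 95, yates=True))
```

The Yates version is always the smaller of the two, so it gives the more conservative p-value.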
Pearson's Correlation. Much the same story as for the Kappa.
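For two dichotomous variables Pearson's correlation reduces to what is often called the phi coefficient. A minimal Python sketch, with hypothetical cell names and a made-up table:

```python
import math

def phi_coefficient(tp, fp, fn, tn):
    """Pearson's correlation between test result and outcome (2x2 table)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (fn + tn) * (tp + fn) * (fp + tn))
    return numerator / denominator

# Made-up table used for illustration only.
print(round(phi_coefficient(60, 5, 40, 95), 3))
```

Like Kappa, the value depends on the table margins, so the same sensitivity and specificity can give different correlations at different prevalences.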
Bayesian approach. Diagnostic tests are often first developed in research (hospital) settings. Sometimes an experimental group is formed of available patients to be compared with a control group from the general population. These controls are often selected on the basis that it is unlikely that they have been exposed to the disease. For example, in developing a test for syphilis the outcome of the test in a group of people who show all the clinical manifestations of syphilis is compared with the outcome of the test in a group of nuns. Some kind of procedure will be applied to minimize error rates in both groups. However, in practice tests are often used in primary care or general screening environments. The prevalence of the disease in these circumstances will be very much lower than was the case in the experimental situation. Sensitivity and specificity of tests tend to be quite robust under these circumstances and to change little between high and low prevalence populations. However, positive and negative predictive accuracy in particular are very sensitive to the prevalence of the disease in the population in which the test is applied. Therefore an approach has been developed on the basis of Bayes' theorem which makes it possible to enter the prevalence of the disease as a piece of prior knowledge in the equation. Corrected positive and negative predictive accuracies can then be calculated for different populations. To do this in SISA you need to enter a proportion, between zero and one, in the 'prevalence' box. Note that a test needs to be optimized for the population in which it is applied and that practical and financial reasons will have to play a role besides clinical considerations. The Bayesian approach can be valuable to model the various practical options.
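The prevalence correction can be sketched in Python as a direct application of Bayes' theorem. The function name `predictive_values` is an assumption of this example, and the sensitivity and specificity of 0.9 are hypothetical values chosen only to show how strongly the predictive accuracies move with prevalence.

```python
def predictive_values(sens, spec, prev):
    """Positive and negative predictive accuracy for a given prevalence."""
    true_pos = sens * prev
    false_neg = (1 - sens) * prev
    true_neg = spec * (1 - prev)
    false_pos = (1 - spec) * (1 - prev)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Same hypothetical test in a research setting and in general screening.
for prev in (0.5, 0.1, 0.01):
    ppv, npv = predictive_values(0.9, 0.9, prev)
    print(f"prevalence {prev:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

With these hypothetical values the positive predictive accuracy falls from 0.9 at a prevalence of 0.5 to under 0.1 at a prevalence of 0.01, while sensitivity and specificity stay the same.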
Summarizing all of the above it seems that sensitivity and specificity and the two predictive accuracies are probably the most valuable of the indicators. Sensitivity and specificity give a good view of the quality of the test relatively independent of circumstances. The predictive accuracies give a view of what happens in different practical situations in terms of numbers and proportions tested with correct and incorrect results. Predictive accuracies also give the post test probability of having the disease, an essential piece of information to communicate to the patient together with his or her test result.
For a sample size calculation give either a proportional sensitivity value in the ++ box, or a proportional specificity value in the -- box. You must give a proportional prevalence value in the prevalence box. Give the maximum width of the confidence interval around the sensitivity or the specificity as a proportion in the bottom Number/Size box. Sample sizes are calculated according to Buderer's formula.
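Buderer's formula is usually written as n = z^2 * Se * (1 - Se) / (d^2 * prevalence) for sensitivity, and n = z^2 * Sp * (1 - Sp) / (d^2 * (1 - prevalence)) for specificity. The Python sketch below follows that form. It assumes that d is the precision on either side of the estimate (the half-width of the confidence interval); how the calculator interprets the value typed in the Number/Size box is not confirmed here, so halve or double the input if your convention differs. The function names are hypothetical.

```python
import math

def buderer_n_sensitivity(sens, prev, d, z=1.96):
    """Total sample size needed to estimate sensitivity within +/- d."""
    return math.ceil(z * z * sens * (1 - sens) / (d * d * prev))

def buderer_n_specificity(spec, prev, d, z=1.96):
    """Total sample size needed to estimate specificity within +/- d."""
    return math.ceil(z * z * spec * (1 - spec) / (d * d * (1 - prev)))

# Expected sensitivity 0.6, prevalence 0.05, precision of 0.1 either side.
print(buderer_n_sensitivity(0.6, 0.05, 0.1))
# Expected specificity 0.95 under the same prevalence and precision.
print(buderer_n_specificity(0.95, 0.05, 0.1))
```

The division by the prevalence (or by one minus the prevalence) scales the number of diseased (or non-diseased) individuals needed up to the total number that has to be sampled to find them.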
During an epidemic it is interesting to use the Diagnostics calculator to work through a scenario of using antibody testing to protect populations. Individuals with a positive antibody test could be offered an immunity passport as they are a lower risk. Such a passport would give the owner more freedom of movement. If an antibody test with passport were offered to a population, how would it work out? To take the example of Covid-19, antibody tests for Covid-19 available in 2020 had a sensitivity of about 60% and a specificity of about 95%.
In May 2020, the prevalence of individuals who had had Corona in Europe was estimated to be under 5%. We fill in the sensitivity of 0.6 in the ++ box, the specificity of 0.95 in the -- box, and 0.05 in the prevalence box:
For the general population 7.8% would test positive and would get their immunity passport; in 3% of the cases that would be justified, in 4.8% of the cases it wouldn't. The positive predictive accuracy is 38.7%, which is the percentage of passport holders who indeed have immunity. The rest would run, and be, a considerable risk.
This is the reason why practitioners prefer to test only symptomatic individuals, because the baseline prevalence will be higher among these groups and the outcome more acceptable. Or to test among groups who have been particularly exposed to the virus and might therefore have a higher base prevalence, such as healthcare workers. For healthcare workers it might be that if they test positive to an antibody test they are immune and run less risks in their work and are less of a risk to the patients. They might no longer need as much protection. Say, 25% of health care workers have previously been ill. Then we fill in 0.25 in the prevalence box:
The positive predictive accuracy is now 80%. These are the healthcare workers who maybe can do with less protection after a positive antibody test and are no longer a risk to others or themselves. The point is that, among all the health care workers who tested positive, we do not know which are the 80% who were correctly classified and which are the 20% who were not.
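The two scenarios above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, not of the calculator's code; the function name `passport_scenario` and the dictionary keys are assumptions of this example.

```python
def passport_scenario(sens, spec, prev):
    """Proportions of a tested population by outcome of an antibody test."""
    justified = sens * prev                 # positive test, truly immune
    unjustified = (1 - spec) * (1 - prev)   # positive test, not immune
    tested_positive = justified + unjustified
    return {
        "tested_positive": tested_positive,
        "justified": justified,
        "unjustified": unjustified,
        "ppv": justified / tested_positive,
    }

scenarios = {"general population": 0.05, "healthcare workers": 0.25}
for label, prev in scenarios.items():
    result = passport_scenario(0.6, 0.95, prev)
    print(label, {key: round(value, 3) for key, value in result.items()})
```

The unrounded share of unjustified passports in the general population is 4.75%, which the text reports as 4.8%.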
Grimes DA, Schulz KF. Uses and abuses of screening tests. Lancet 2002;359(9309):881-884.
Knottnerus JA, Weel C, Muris JW. Evaluation of diagnostic procedures. Br Med J 2002;324:477-480.
Irwig LM, Bossuyt PM, Glasziou PP, Gatsonis CA, Lijmer JG. Designing studies to ensure that estimates of test accuracy are transferable. Br Med J 2002;324:669-671.
Steurer J, Fischer JE, Bachmann LM, Koller M, ter Riet G. Communicating accuracy of tests to general practitioners: a controlled study. Br Med J 2002;324:824-826.
Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, Lijmer JG, Moher D, Rennie D, de Vet HC. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Br Med J 2003;326:41-44.