Simple Interactive Statistical Analysis

Categorical Table Analysis

Explanation.

The program is meant to provide a full analysis of a two by r (2xr) table. Mostly it concerns the description and analysis of differences in a dichotomous (percentage) outcome within various groups or empirical categories. The interest is then to study and compare the presence or absence of a certain characteristic between groups, for example the difference in the proportion of smokers between different age groups or geographical regions. The usual contingency table analysis instruments, such as chi-squares and various statistical tests, are available. Additionally, a few concepts from oneway analysis of variance are applied, such as basic multiple contrasts and Likelihood Ratio Chi-square (LRX) variance analysis. A 2xr table is shown below. This table has two columns and five rows; the counted integer numbers found in the yellow boxes are the data used for analysis.

Change to c->r orientation. Tables are in principle causally "directionless", and that is mostly the way things are presented in statistical programs. However, in reality age and gender come before smoking and exercise behaviour; to claim that the opposite causal direction is just as likely is nonsensical. We do not present new statistics, but we try to help the user by omitting certain information depending on the requested direction. The tables and statistics we present regardless of the requested direction are truly "directionless". The default direction of the analysis is that of the table below: some multi-categorical variable in the rows (age groups, municipal areas and the like) "explains" a dichotomous variable in the columns. The alternative can be obtained by checking c->r orientation. An example of this orientation can be found on the ordinal help page; it concerns a dichotomous variable in the columns which explains a categorical dependent variable in the rows, e.g. the difference between males and females on some Likert type attitude or opinion variable.

To do a similar analysis on data with a continuous dependent variable, such as weight in kilos or pounds, or length in cm or inches, do a oneway anova combined with the means procedure.

Table1. Example of a 2xr table:

Active and non-active respondents
	Active	Not-active
very young	10	24
young	15	15
intermediate	19	21
old	18	17
very old	22	14

The procedure concerns three webpages which follow on one another.

Table specification and data or table input
Check, edit and input data, and specify analysis
Results of the analysis

This help page is about how to edit the data and specify the analysis. Another help page explains how to input data or a table.

Descriptives.

Show tables gives the usual cross tables with counts and cell percentages, with row percentages in r->c orientation and column percentages in c->r orientation. If Chi-squares are also requested, a table of expected values is provided as well. Expected values are the counts you would get, given the marginal counts, if rows and columns were independent: each expected count equals its row total times its column total, divided by the overall total. To study the level of association in the table you can compare this table with the table of counts.
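As an illustration, the expected values and row percentages for Table 1 can be sketched in a few lines of Python. This is only a sketch of the standard calculation; the variable names are made up for the example and nothing here is taken from the program's own code.

```python
# Observed counts from Table 1: rows are age groups, columns are Active / Not-active.
counts = [[10, 24], [15, 15], [19, 21], [18, 17], [22, 14]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Expected count under independence: row total * column total / overall total.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Row percentages, as shown in r->c orientation.
row_pct = [[100 * c / rt for c in row] for row, rt in zip(counts, row_totals)]

for obs, exp_row, pct in zip(counts, expected, row_pct):
    print(obs, [round(e, 2) for e in exp_row], [round(p, 1) for p in pct])
```

Rows where the observed counts sit far from the expected counts are the rows that drive the association.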

The point probability is the probability of this unique single table occurring by chance. The probability mostly used in statistics concerns all tables which have the same or more association between rows and columns as the observed table. The rarely used point probability gives you the probability of only this table.
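A minimal sketch of a point probability, assuming it is the (multivariate) hypergeometric probability of the table with all row and column totals held fixed, as in Fisher's exact test. Whether the program uses exactly this definition is an assumption; the function names are hypothetical.

```python
from math import exp, lgamma


def log_factorial(k):
    # log(k!) via the log-gamma function, to avoid huge intermediate numbers.
    return lgamma(k + 1)


def point_probability(table):
    """Probability of exactly this table, given fixed row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    log_p = (
        sum(log_factorial(t) for t in row_totals)
        + sum(log_factorial(t) for t in col_totals)
        - log_factorial(n)
        - sum(log_factorial(c) for row in table for c in row)
    )
    return exp(log_p)


counts = [[10, 24], [15, 15], [19, 21], [18, 17], [22, 14]]
p = point_probability(counts)
print(p)
```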

Ordinal pairs presents an account of the numbers of pairs in the table. The different pairs form the basis of many analyses of ordinal association. Concordant pairs consist of individuals paired with other individuals who score both lower on the column and lower on the row variable. Discordant pairs consist of individuals paired with other individuals who are lower on the one, and higher on the other variable. Tied pairs are individuals paired with others who have the same score on either the rows or the columns.
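The pair counts can be sketched as follows for Table 1, assuming that both rows and columns are read as ordered from low to high. The function name is hypothetical; the program may split the tied pairs into further subtypes.

```python
def count_pairs(table):
    """Count concordant, discordant and tied pairs in an ordered r x c table."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # partner in a higher row
                for m in range(n_cols):
                    if m > j:                   # also higher on the columns
                        concordant += table[i][j] * table[k][m]
                    elif m < j:                 # lower on the columns
                        discordant += table[i][j] * table[k][m]
    n = sum(sum(row) for row in table)
    total = n * (n - 1) // 2
    tied = total - concordant - discordant      # same row and/or same column
    return concordant, discordant, tied, total


counts = [[10, 24], [15, 15], [19, 21], [18, 17], [22, 14]]
concordant, discordant, tied, total = count_pairs(counts)
print(concordant, discordant, tied, total)
```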

A Confidence Interval is given for the first column percentage, relative to the total number of cases in each row in r->c orientation and relative to the total for the first column in c->r orientation. The width of the Confidence Interval can be set under the table options lower down; 95 is the most often used value. You can, for example, compare two towns in the percentage of smokers by studying the confidence intervals. If the confidence intervals of two percentages do not overlap, the difference is significant at the given confidence level. If the confidence interval of a particular percentage includes another percentage, the difference between the two percentages is not significant. Use the t-test procedure to confirm your results for the difference between two proportions. Beware of chance capitalization: in comparing all these confidence intervals you are bound to find some significant differences purely on the basis of chance fluctuation. Please read the Bonferroni help page for a discussion of this topic, and consider applying the Bonferroni procedure. Note that the percentages in this procedure are not correlated.
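For illustration, a confidence interval for a single row proportion can be sketched with the simple normal (Wald) approximation. This is an assumption: the page does not say which interval method the program uses, and other methods give somewhat different limits for small rows. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, total, level=95.0):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / total
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 200.0)
    half_width = z * sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# First row of Table 1 (very young): 10 active out of 34 respondents.
lower, upper = wald_interval(10, 34)
print(round(100 * lower, 1), round(100 * upper, 1))
```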

Statistics.

Chisquare concerns the usual procedures to test whether rows and columns are independent. As the Likelihood Ratio Chi-square (LRX) is additive, it is also used for description in some other procedures. Use this Chi-square procedure to collect base data for LRX comparison and analysis. If the Chi-square procedure is requested together with the tables option, a table with expected values is presented. The difference between the expected values and the observed values found in the table of counts forms the basis of the Chi-square calculation. The larger the difference between the expected and observed counts, the greater the contribution of the cell to the Chi-square. The Chi-square is a nominal test, so it considers differences in distribution as they are, without regard to the order of the categories.
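The two statistics and the per-cell contributions can be sketched with the textbook formulas. The function name is hypothetical, and the sketch leaves out the p-value and any corrections the program may apply.

```python
from math import log


def chi_squares(table):
    """Pearson and Likelihood Ratio Chi-square plus per-cell Pearson contributions."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    lrx = 0.0
    contributions = []
    for row, rt in zip(table, row_totals):
        contrib_row = []
        for observed, ct in zip(row, col_totals):
            expected = rt * ct / n
            cell = (observed - expected) ** 2 / expected
            contrib_row.append(cell)
            pearson += cell
            if observed > 0:                    # a zero cell adds nothing to the LRX
                lrx += 2 * observed * log(observed / expected)
        contributions.append(contrib_row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return pearson, lrx, df, contributions


counts = [[10, 24], [15, 15], [19, 21], [18, 17], [22, 14]]
pearson, lrx, df, contributions = chi_squares(counts)
print(round(pearson, 3), round(lrx, 3), df)
```

In Table 1 the largest contributions come from the "very young" row, where the observed counts differ most from the expected ones.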

The t-test procedures can be used for multiple comparisons between sets of proportions. A proportion is the number in a cell divided by the total; multiply by 100 to get the percentage. In the default r->c orientation you can select t-tests for comparing each row with each other row, or for comparing each row with the sum of all other rows, for the proportion in the first column. The number of t-tests you get for the first option equals n.rows*(n.rows-1)/2, so for a seven row table you get at most 7*6/2=21 comparisons. For the second option the number of t-tests equals the number of rows. In the alternative c->r orientation you can compare, for each row, the proportion in the first column with the proportion in the second column. Beware of chance capitalization: in comparing all these differences in proportions you are bound to find some significant differences purely on the basis of chance fluctuation. Please read the Bonferroni help page for a discussion of this topic, and consider applying the Bonferroni procedure. Note that the proportions in this procedure are not correlated.
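The "each row with each other row" option can be sketched as below. The test statistic used here is a pooled two-proportion z-statistic, which is an assumption made for the example; the program's own t-test may use a different standard error. The function name and the `results` dictionary are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

labels = ["very young", "young", "intermediate", "old", "very old"]
counts = [[10, 24], [15, 15], [19, 21], [18, 17], [22, 14]]


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z-statistic and double-sided p-value for two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Assumes the pooled proportion is not exactly 0 or 1.
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Compare the first-column proportion of every row with every other row.
results = {}
for (i, a), (j, b) in combinations(enumerate(counts), 2):
    results[(labels[i], labels[j])] = two_proportion_z(a[0], sum(a), b[0], sum(b))

print(len(results))  # n.rows*(n.rows-1)/2 comparisons
```

With ten comparisons at the 5% level, a Bonferroni-style correction would test each one at 0.05/10.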

The ordinal statistics which follow now are less often used in a design where the interest is in the difference between various groups with regard to the absence or presence of a characteristic. However, for the table above they can tell you if activity levels increase or decrease with increasing age.

Goodman and Kruskal's Gamma and Kendall's Tau-a are based on the ordinal pairs, counted with the option above. You will get the sample standard deviations and p-values for the difference between the observed association and the expected (no) ordinal association of 0 (zero). Gamma is the difference between the number of concordant and discordant pairs divided by the sum of concordant and discordant pairs; Tau-a is the difference between the number of concordant and discordant pairs divided by the total number of pairs. Gamma usually gives a higher value than Tau-a and is (for other reasons as well) usually considered to be a more satisfactory measure of ordinal association. The p-values are supposed to approach the exact p-value for an ordinal association asymptotically, and the program shows that they generally do that reasonably well. But beware of small numbers: the p-values for Gamma and Tau-a then become too optimistic!
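The two definitions can be sketched for a two-column table such as Table 1. The function name is hypothetical, and the standard deviations and p-values reported by the program are not reproduced here.

```python
def gamma_and_tau_a(table):
    """Gamma and Tau-a for a table with ordered rows and two ordered columns."""
    first = [row[0] for row in table]
    second = [row[1] for row in table]
    concordant = discordant = 0
    for i in range(len(table)):
        # Lower row and first column, paired with higher row and second column.
        concordant += first[i] * sum(second[i + 1:])
        # Lower row and second column, paired with higher row and first column.
        discordant += second[i] * sum(first[i + 1:])
    n = sum(first) + sum(second)
    total_pairs = n * (n - 1) // 2
    gamma = (concordant - discordant) / (concordant + discordant)
    tau_a = (concordant - discordant) / total_pairs
    return gamma, tau_a


counts = [[10, 24], [15, 15], [19, 21], [18, 17], [22, 14]]
gamma, tau_a = gamma_and_tau_a(counts)
print(round(gamma, 4), round(tau_a, 4))
```

Both values are negative for Table 1 because, with "Active" as the first column, higher age rows hold relatively more cases in the first column. Swapping the columns flips the sign.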

Goodman and Kruskal's Lambda. An explanation can be found on the RxC help page.

Kolmogorov-Smirnov Two Sample Test assesses whether the largest proportional cumulative difference in a table could have been caused by chance fluctuation. In the table above this difference equals [0.14] (top right cell). The program echoes the Chi-square value belonging to the largest proportional difference (Chi-2= 3.673) and the p-value of the difference between the observed and the expected largest difference, with two degrees of freedom. The p-value in this example equals 0.15933, so the difference in ordering between active and non-active respondents may well have been caused by chance fluctuation. The probability value presented is single-sided. The literature considers the Kolmogorov-Smirnov test to have very little power, with a high chance of a type II error, i.e. of not finding a difference when there is one. Unless there are serious theoretical or other reasons for using the K-S, use of the Gamma test is preferable. This procedure is available ONLY in c->r orientation.
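The figures for Table 1 can be approximately retraced with the usual large-sample approximation, Chi-2 = 4*D*D*n1*n2/(n1+n2) with two degrees of freedom. That the program uses this formula is an assumption, although it fits the numbers quoted above. The variable names are made up for the sketch.

```python
from itertools import accumulate
from math import exp

counts = [[10, 24], [15, 15], [19, 21], [18, 17], [22, 14]]
first = [row[0] for row in counts]
second = [row[1] for row in counts]
n1, n2 = sum(first), sum(second)

# Cumulative proportions down the ordered rows, separately for each column.
cum1 = [c / n1 for c in accumulate(first)]
cum2 = [c / n2 for c in accumulate(second)]

# D is the largest absolute difference between the two cumulative distributions.
d = max(abs(a - b) for a, b in zip(cum1, cum2))

chi2 = 4 * d * d * n1 * n2 / (n1 + n2)
# For two degrees of freedom the upper tail of the Chi-square is exp(-x/2).
p_value = exp(-chi2 / 2)

print(round(d, 4), round(chi2, 3), round(p_value, 4))
```

Without intermediate rounding this gives a Chi-square a little below the 3.673 quoted above. Rounding D to 0.145 before squaring gives 3.673 and a p-value of about 0.1593, which suggests the quoted figures were computed from a rounded D.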

Ridit analysis has a strong descriptive nature. A Ridit test has the neutral value of 0.5, meaning no difference between two orderings. This is based on the notion that if two orderings "A" and "B" are the same, and one draws an individual "a" and an individual "b" from each of these orderings, the probability that "a" has a higher position in her ordering than "b" has in his ordering equals 0.5. If, however, the observations in "A" tend to be clustered in the higher positions, and the observations in "B" are clustered relatively lower, the probability of individual "a" having a higher position than "b" will increase above 0.5, up to the maximum probability of 1.0, with "a" certainly being higher than "b". Similarly, the Ridit for "b" will decrease below 0.5 towards 0, with "b" certainly being lower than "a". A Ridit of 0.8 then means that, after random selection, "a" has an 80% chance of having a higher position than "b". This procedure is available ONLY in c->r orientation.
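This probability interpretation can be sketched for Table 1 by treating the two columns as the orderings "A" and "B" and the rows as ordered positions. Counting a tie (same row) as half a win is the usual convention for a mean ridit and is assumed here; the function name is hypothetical and the program's own output may be organised differently.

```python
def mean_ridit(group, reference):
    """Chance that a random member of `group` sits in a higher row than a
    random member of `reference`, counting ties as one half."""
    n_group, n_reference = sum(group), sum(reference)
    below = 0       # reference cases in rows before the current row
    wins = 0.0
    for g, r in zip(group, reference):
        wins += g * (below + 0.5 * r)
        below += r
    return wins / (n_group * n_reference)


counts = [[10, 24], [15, 15], [19, 21], [18, 17], [22, 14]]
active = [row[0] for row in counts]
not_active = [row[1] for row in counts]

ridit_active = mean_ridit(active, not_active)
ridit_not_active = mean_ridit(not_active, active)
print(round(ridit_active, 4), round(ridit_not_active, 4))
```

The two values always sum to one, so a value above 0.5 for one column implies a value below 0.5 for the other.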

The Ridit procedure is based on:
Selvin S. A further note on the interpretation of ridit analysis. American Journal of Epidemiology 1977;105:16-20.
Fleiss JL. Ridit Analysis in Dental Clinical Studies. Journal of Dental Research 1979;58:2080-2084.

The clustered rows procedure can be used to calculate the intraclass correlation coefficient, or rho, in a table with multiple rows and dichotomous outcomes. The procedure can be used in two ways: for clustered samples, and for agreement studies with multiple judgements for each subject.

An example of a cluster sample is to estimate the prevalence of a bovine disease in a country by first sampling farms, and after that sampling a proportion of the animals on each farm. The rows are the data for the farms; the counts within the rows are the numbers of diseased and non-diseased animals on each farm. The true effective sample size is somewhere in between the number of farms, the conservative estimate, and the number of animals sampled, the Simple Random Sample estimate. To determine the correct sample size for analysis the intraclass correlation coefficient is used. On the basis of the intraclass correlation coefficient the design effect can be used to calculate the correct standard error. The rows in a cluster sample table are sometimes called random rows, as opposed to fixed rows. Fixed rows are the result of some a-priori classification. For example, the gender classification is fixed: we don't randomly select males and females as groups from an infinite number of gender groups.

Agreement studies consider that each row concerns the data of one subject, with a number of positive and negative judgements. If the judges judged the subjects in a similar way, each subject would have only positive judgements or only negative judgements, and kappa would be one. If the judgements fluctuated randomly around a similar mean for each subject, kappa would be zero. The z-test for kappa or rho tests the hypothesis that kappa or rho equals zero.
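A minimal sketch of these quantities, assuming the estimator has the form kappa = 1 - sum(x*(n-x)/n) / (N*(mean n - 1)*p*q), with x the positive count in a row, n the row total, N the number of rows and p the overall proportion positive. This form is written down from memory of the Fleiss and Cuzick approach and should be checked against the references below; the design effect uses the common 1 + (mean n - 1)*rho rule. Function names and the farm counts are invented for the example.

```python
def clustered_kappa(rows):
    """Kappa (or rho) for rows of (positive, negative) counts per cluster or subject."""
    n_rows = len(rows)
    sizes = [pos + neg for pos, neg in rows]
    total = sum(sizes)
    mean_size = total / n_rows
    p_bar = sum(pos for pos, _ in rows) / total
    q_bar = 1 - p_bar
    # Within-row disagreement; zero when every row is unanimous.
    within = sum(pos * neg / (pos + neg) for pos, neg in rows)
    return 1 - within / (n_rows * (mean_size - 1) * p_bar * q_bar)


def design_effect(rho, mean_size):
    """Variance inflation of a cluster sample relative to a simple random sample."""
    return 1 + (mean_size - 1) * rho


# Invented data: diseased and non-diseased animals on four farms.
farms = [(3, 7), (0, 10), (6, 4), (1, 9)]
rho = clustered_kappa(farms)
deff = design_effect(rho, mean_size=10)
effective_n = 40 / deff
print(rho, deff, effective_n)
```

When every row is unanimous, rho is one and the effective sample size shrinks to the number of rows, which is the conservative estimate described above.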

The Random/Clustered rows procedure is based on:
Fleiss JL, Cuzick J. The reliability of dichotomous judgments: Unequal numbers of judges per subject. Applied Psychological Measurement 1979;3:537-542.
Fleiss JL. Statistical methods for rates and proportions, 2nd edition. New York [etc.]: John Wiley 1982.

Table options.

Swap swaps the columns. Use this if you want to change the orientation for example for a C.I. table or multiple t-tests or if you want positive instead of negative outcomes in an ordinal test.

Order orders the rows. If you only check the order option it will order the row labels in an ascending order putting labels beginning with capital letters together. Check or uncheck the other options for different sorting. If you do not check any of the options there is no sorting and the table will be analysed as it is.

Reduce the table to a given number of rows. The procedure chops off the last rows; if you want the first rows to be chopped off, change the order so that the first rows become the last rows. A small LRX table analyses this procedure. It can be compared with the SPSS SELECT IF procedure: give the labels you want to drop a particularly high value, sort, and reduce them out. The total number of cases for the table will be reduced.

Combine rows which have the same label by summing the numbers of observations for each row. The procedure is case sensitive: if a label has capital letters and another label is the same but without capital letters, the two labels are considered to be different. A small LRX table analyses this procedure. It can be compared with the SPSS RECODE procedure; for example, you can easily sum 10 rows into three new ones by having only three differently named row labels. The total number of cases for the table will be the same before and after combining.

The table options are executed in the above order. So first the program swaps, then orders, reduces and lastly the program combines.
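The order of execution can be sketched as a small pipeline. This is only an illustration of the sequence described above: the function, its parameters and the sample rows are hypothetical, and the sorting shown is a plain ascending sort on the label, not the program's full set of ordering options.

```python
def apply_table_options(rows, swap=False, order=False, reduce_to=None, combine=False):
    """Apply swap, order, reduce and combine, in that order, to (label, a, b) rows."""
    if swap:
        rows = [(label, b, a) for label, a, b in rows]
    if order:
        # Plain string sort: labels starting with capital letters come first.
        rows = sorted(rows, key=lambda row: row[0])
    if reduce_to is not None:
        rows = rows[:reduce_to]                 # chops off the last rows
    if combine:
        merged = {}
        for label, a, b in rows:                # labels are compared case-sensitively
            old_a, old_b = merged.get(label, (0, 0))
            merged[label] = (old_a + a, old_b + b)
        rows = [(label, a, b) for label, (a, b) in merged.items()]
    return rows


sample = [("b", 1, 2), ("a", 3, 4), ("a", 5, 6), ("c", 7, 8)]
result = apply_table_options(sample, swap=True, order=True, reduce_to=3, combine=True)
print(result)
```

Because reducing comes before combining, the number of rows left after combining can be smaller than the number asked for in the reduce option.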

C.I. sets the width of confidence intervals used. 95% is most often used.

Further.

Weighted tables and weighting corrected Chi-squares are available. To this we will soon add weighted Confidence Intervals and weighting corrected t-tests. In the longer term we will provide weighting corrected Gamma, Tau and Ridits. To do a weighted analysis you will have to enter individual level data, one case per row, with case weights in the data input page.

The procedure is meant for relatively small tables. The number of rows is in principle limited to 120, but the limit might be lower depending on your browser and other settings. It is also considerably lower with weighted data, as more information has to be transferred.
