Thursday, May 22, 2008

Discriminant Analysis

Introduction
Discriminant function analysis, a.k.a. discriminant analysis or DA, is used to classify cases into the values of a categorical dependent, usually a dichotomy. If discriminant function analysis is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct. Discriminant function analysis is found in SPSS under Analyze, Classify, Discriminant. One gets DA or MDA from this same menu selection, depending on whether the specified grouping variable has two or more categories.

Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical dependent which has more than two categories, using as predictors a number of interval or dummy independent variables. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis.

Purposes

There are several purposes for DA and/or MDA:

  • To classify cases into groups using a discriminant prediction equation.
  • To test theory by observing whether cases are classified as predicted.
  • To investigate differences between or among groups.
  • To determine the most parsimonious way to distinguish among groups.
  • To determine the percent of variance in the dependent variable explained by the independents.
  • To determine the percent of variance in the dependent variable explained by the independents over and above the variance accounted for by control variables, using sequential discriminant analysis.
  • To assess the relative importance of the independent variables in classifying the dependent variable.
  • To discard variables which are little related to group distinctions.
  • To infer the meaning of MDA dimensions which distinguish groups, based on discriminant loadings.
Procedures and Assumption

Discriminant analysis has two steps: (1) an F test (Wilks' lambda) is used to test if the discriminant model as a whole is significant, and (2) if the F test shows significance, then the individual independent variables are assessed to see which differ significantly in mean by group and these are used to classify the dependent variable.

Discriminant analysis shares all the usual assumptions of correlation, requiring linear and homoscedastic relationships, and untruncated interval or near interval data. Like multiple regression, it also assumes proper model specification (inclusion of all important independents and exclusion of extraneous variables). DA also assumes the dependent variable is a true dichotomy since data which are forced into dichotomous coding are truncated, attenuating correlation.

DA is an earlier alternative to logistic regression, which is now frequently used in place of DA as it usually involves fewer violations of assumptions (independent variables needn't be normally distributed, linearly related, or have equal within-group variances), is robust, handles categorical as well as continuous variables, and has coefficients which many find easier to interpret. Logistic regression is preferred when data are not normal in distribution or group sizes are very unequal. However, discriminant analysis is preferred when the assumptions of linear regression are met since then DA has more stattistical power than logistic regression (less chance of type 2 errors - accepting a false null hypothesis). See also the separate topic on multiple discriminant function analysis (MDA) for dependents with more than two categories.


No comments: