|
IDL Analyst Reference Guide: Categorical and Discrete Data Analysis |
|
The IMSL_CAT_GLM function analyzes categorical data using logistic, Probit, Poisson, and other generalized linear models.
| Note This routine requires an IDL Analyst license. For more information, contact your ITT Visual Information Solutions sales or technical support representative. |
Result = IMSL_CAT_GLM(n_class, n_continuous, model, x [, CASE_ANALYSIS=variable] [, CLASS_VALS=variable] [, COVARIANCES=variable] [, COEF_STAT=variable] [, CRITERION=variable] [, /DOUBLE] [, EPS=value] [, IFIX=value] [, IFREQ=value] [, INDICIES_EFFECTS=array] [, INIT_EST=array] [, IPAR=value] [, ITMAX=value] [, LAST_STEP=variable] [, MAX_CLASS=value] [, MEANS=variable] [, N_CLASS_VALS=variable]
[, /NO_INTERCEPT] [, OBS_STATUS=variable] [, VAR_EFFECTS=array])
An integer value indicating the number of estimated coefficients in the model.
Model used to analyze the data. The six models are listed in Table 17-2.
| Note The lower bound of the response variable is 1 for model = 3 and is 0 for all other models. See the Discussion section for more information about these models. |
Number of classification variables.
Number of continuous variables.
Two-dimensional array of size n_observations by (n_class + n_continuous) + m containing data for the independent variables, dependent variable, and optional parameters, where n_observations is the number of observations.
The columns must be ordered such that the first n_class columns contain data for the class variables, the next n_continuous columns contain data for the continuous variables, and the next column contains the response variable. The final (and optional) m – 1 columns contain optional parameters, see keywords Ifreq, Ifix, and Ipar.
Named variable into which a two-dimensional array of size n_observations by 5 containing the case analysis is stored.
Case statistics are computed for all observations except where missing values prevent their computation.
Named variable into which a one-dimensional array of length:

containing the distinct values of the classification variables in ascending order is stored. The first N_Class_Vals(0) elements of Class_Vals contain the values for the first classification variables, the next N_Class_Vals(1) elements contain the values for the second classification variable, etc.
Named variable into which a two-dimensional array of size n_coefficients by n_coefficients containing the estimated asymptotic covariance matrix of the coefficients is stored. For Itmax = 0, this is the Hessian computed at the initial parameter estimates.
Named variable into which a two-dimensional array of size n_coefficients by 4 containing the parameter estimates and associated statistics is stored.
Named variable into which the optimized criterion is stored. The criterion to be maximized is a constant plus the log-likelihood.
If present and nonzero, double precision is used.
Convergence criterion. Convergence is assumed when maximum relative change in any coefficient estimate is less than Eps from one iteration to the next or when the relative change in the log-likelihood, criterion, from one iteration to the next is less than Eps/100.0. Default: Eps = 0.001
Column number Ifix in x containing a fixed parameter for each observation that is added to the linear response prior to computing the model parameter. The `fixed' parameter allows one to test hypothesis about the parameters via the log-likelihoods.
Column number Ifreq in x containing the frequency of response for each observation.
One-dimensional index array of length Var_Effects(0) + Var_Effects(1) + ... + Var_Effects(n_effects - 1). The first Var_Effects(0) elements give the column numbers of x for each variable in the first effect. The next Var_Effects(1) elements give the column numbers for each variable in the second effect. The last Var_Effects(n_effects - 1) elements give the column numbers for each variable in the last effect. Keywords Indicies_Effects and Var_Effects must be used together.
One-dimensional array of length n_coef_input containing initial estimates of parameters (n_coef_input can be completed by IMSL_REGRESSORS). By default, unweighted linear regression is used to obtain initial estimates.
Column number Ipar in x containing the value of the known distribution parameter for each observation, where x(i, Ipar) is the known distribution parameter associated with the i-th observation. The meaning of the distributional parameter depends upon model as shown in Table 17-3:
Default: When model ¹ 2, each observation is assumed to have a parameter value of 1. When model = 2, this parameter is not referenced.
Maximum number of iterations. Use Itmax = 0 to compute Hessian, stored in Covariances, and the Newton step, stored in Last_Step, at the initial estimates (The initial estimates must be input. Use keyword Init_Est). Default: Itmax = 30
Named variable into which an one-dimensional array of length n_coefficients containing the last parameter updates (excluding step halvings) is stored. For Itmax = 0, Last_Step contains the inverse of the Hessian times the gradient vector, all computed at the initial parameter estimates.
An upper bound on the sum of the number of distinct values taken on by each classification variable. Default: Max_Class = n_observations by n_class
Named variable into which an one-dimensional array containing the means of the design variables is stored. The array is of length n_coefficients if keyword No_Intercept is used, and n_coefficients - 1 otherwise.
Named variable into which an one-dimensional array of length n_class containing the number of values taken by each classification variable is stored; the i-th classification variable has N_Class_Vals(i) distinct values.
If present and nonzero, there is no intercept in the model. By default, the intercept is automatically included in the model.
Named variable into which an one-dimensional array of length n_observations indicating which observations are included in the extended likelihood is stored.
One-dimensional array of length n_effects containing the number of variables associated with each effect in the model, where n_effects is the number of effects (source of variation) in the model. Keywords Var_Effects and Indicies_Effects must be used together.
The IMSL_CAT_GLM function uses iteratively re-weighted least squares to compute (extended) maximum likelihood estimates in some generalized linear models involving categorized data. One of several models, including the probit, logistic, Poisson, logarithmic, and negative binomial models, may be fit.
Note that each row vector in the data matrix can represent a single observation; or, through the use of keyword Ifreq, each row can represent several observations. Also note that classification variables and their products are easily incorporated into the models via the usual regression-type specifications. The models available in IMSL_CAT_GLM are listed in Table 17-4.
|
Model
|
PDF of the Response Variable
|
Parameterization
|
|---|---|---|
|
|
f (y) = (ly exp (-l) ) / y!
|
l = N x exp (w + h)
|
|
|
![]() |
![]() |
|
|
![]() |
![]() |
|
|
![]() |
![]() |
|
|
![]() |
q = F (w + h)
|
|
|
![]() |
q = 1 - exp (-exp (w + h) )
|
Here, F denotes the cumulative normal distribution, N and S are known distribution parameters specified for each observation via the keyword Ipar, and w is an optional fixed parameter of the linear response, gi, specified for each observation. (If keyword Ifix is not used, then w is taken to be 0.) Since the log-log model (model = 5) probabilities are not symmetric with respect to 0.5, quantitatively, as well as qualitatively, different models result when the definitions of "success" and "failure" are interchanged in this distribution. In this model and all other models involving q, q is taken to be the probability of a "success".
The computations proceed as follows:

and, when model = 3, the linear relationship is given by:
while if model = 4, F-1 (q) = Xb. When computing initial estimates, standard modifications are made to prevent illegal operations such as division by zero. Regression estimates are obtained at this point, as well as later, by use of IMSL_MULTIREGRESS (Chapter 2, Regression).
denote the log of the probability of the i-th observation for coefficients b. In the least-squares model, the weight of the i-th observation is taken as the absolute value of the second derivative of:

with respect to:

(times the frequency of the observation), and the dependent variable is taken as the first derivative Y with respect to gi, divided by the square root of the weight times the frequency. The Newton step is given by:

where all derivatives are evaluated at the current estimate of g and
bn+1 = b – Db. This step is computed as the estimated regression coefficients in the least-squares model. Step halving is used when necessary to ensure a decrease in the criterion.

where:
is the value of gi when evaluated at the optimal:
The denominator of this expression is used as the "standard error of the residual" while the numerator is "raw" residual. Following Cook and Weisberg (1982), the influence of the i-th observation is assumed to be:
This is a one-step approximation to the change in estimates when the i-th observation is deleted. Here, the partial derivatives are with respect to b.
This example is from Prentice (1976) and involves mortality of beetles after five hours exposure to eight different concentrations of carbon disulphide. The table below lists the number of beetles exposed (N) to each concentration level of carbon disulphide (x, given as log dosage) and the number of deaths which result (y). The data is shown in Table 17-5:
|
Log Dosage
|
Number of Beetles Exposed
|
Number of Deaths
|
|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The number of deaths at each concentration level are fitted as a binomial response using logit (model = 3), probit (model = 4), and log-log (model = 5) models. Note that the log-log model yields a smaller absolute log likelihood (14.81) than the logit model (18.78) or the probit model (18.23). This is to be expected since the response curve of the log-log model has an asymmetric appearance, but both the logit and probit models are symmetric about q = 0.5.
.RUN
PRO print_results, cs, means, ca, crit, ls, cov
PRINT, ' Coefficient Satistics'
PRINT, ' Standard Asymptotic ', $
'Asymptotic'
PRINT, ' Coefficient Error Z-statistic ', $
'P-value'
PM, cs, FORMAT = '(4F13.2)'
PRINT
PRINT, 'Covariate Means = ', means, FORMAT = '(A18, F6.3)'
PRINT
PRINT, ' Case Analysis'
PRINT, ' Residual ', $
'Standardized'
PRINT, ' Predicted Residual Std. Error Leverage', $
' Residual'
PM, ca, FORMAT = '(5F12.3)'
PRINT
PRINT, 'Log-Likelihood = ', crit, FORMAT = '(A18, F9.5)'
PRINT
PRINT, ' Last Step'
PRINT, ls
PRINT
PRINT, 'Asymptotic Coefficient Covariance'
PM, cov, FORMAT = '(2F12.4)'
END
model = 3
nobs = 8
x = ([[1.690, 1.724, 1.755, 1.784, 1.811, 1.836, 1.861, 1.883],$
[6, 13, 18, 28, 52, 53, 61, 60], $
[59, 60, 62, 56, 63, 59, 62, 60]])
ncoef = IMSL_CAT_GLM(0, 1, model, x, Ipar = 2, Eps = 1.0e-3, $
Coef_Stat = cs, Covariances = cov, $
Criterion = crit, Means = means, $
Case_Analysis = ca, Last_Step = ls, Obs_Status = os)
print_results, cs, means, ca, crit, ls, cov
Coefficient Satistics
Standard Asymptotic Asymptotic
Coefficient Error Z-statistic P-value
-60.76 5.21 -11.66 0.00
34.30 2.92 11.76 0.00
Covariate Means = 1.793
Case Analysis
Residual Standardized
Predicted Residual Std. Error Leverage Residual
0.058 2.593 1.792 0.267 1.448
0.164 3.139 2.871 0.347 1.093
0.363 -4.498 3.786 0.311 -1.188
0.606 -5.952 3.656 0.232 -1.628
0.795 1.890 3.202 0.269 0.590
0.902 -0.195 2.288 0.238 -0.085
0.956 1.743 1.619 0.198 1.077
0.979 1.278 1.119 0.138 1.143
Log-Likelihood = -18.77818
Last Step
-3.67824e-08 1.04413e-05
Asymptotic Coefficient Covariance
27.1368 -15.1243
-15.1243 8.5052
STAT_TOO_MANY_HALVINGS—Too many step halvings. Convergence is assumed.
STAT_TOO_MANY_ITERATIONS—Too many iterations. Convergence is assumed.
STAT_TOO_FEW_COEF—Init_Est is used and "n_coef_input" = #. The model specified requires # coefficients.
STAT_MAX_CLASS_TOO_SMALL—The number of distinct values of the classification variables exceeds "Max_Class" = #.
STAT_INVALID_DATA_8—"N_Class_Values(#)" = #. The number of distinct values for each classification variable must be greater than one.
STAT_NMAX_EXCEEDED—The number of observations to be deleted has exceeded "lp_max" = #. Rerun with a different model or increase the workspace.
IDL Online Help (March 06, 2007)