Median chat – Gary C. White

Estimation of the overdispersion parameter, c, for the global model is one of the key issues in applying Program MARK to encounter data. The parametric bootstrap goodness-of-fit procedure was an attempt to develop a general procedure, but was found to be biased for Cormack-Jolly-Seber (CJS) data (White 2002). Program RELEASE provides useful goodness-of-fit (GOF) tests and estimates of c for the CJS data type, and programs ESTIMATE and BROWNIE provide similar capabilities for dead recovery data. However, most of the data types in MARK do not have a useful GOF procedure to assess the validity of the global model. The median chat procedure is an attempt to develop a general approach to the estimation of c.

Methodology

Likelihood theory leads to the deviance and its associated degrees of freedom (df) as a measure of the GOF of a model, with chat estimated as deviance/df. Deviance is defined as the difference between the -2 log likelihood of the model of interest and the -2 log likelihood of the saturated model. Asymptotically, the deviance is chi-square distributed. However, for finite sample sizes, the distribution of the deviance is not close enough to chi-square for deviance/df to be generally useful as an estimate of c. The median chat routine is an attempt to correct for the bias of the deviance chat.
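As a small illustration of the definition above, the deviance chat can be computed from the two -2 log likelihood values and the degrees of freedom. The function name and the numbers below are hypothetical and are not MARK output.

```python
def deviance_chat(neg2_log_lik_model, neg2_log_lik_saturated, df):
    """Return (deviance, chat), where chat = deviance / df."""
    if df <= 0:
        raise ValueError("df must be positive")
    # Deviance: -2 log likelihood of the model minus that of the saturated model.
    deviance = neg2_log_lik_model - neg2_log_lik_saturated
    return deviance, deviance / df


# Hypothetical values, for illustration only.
deviance, chat = deviance_chat(512.4, 450.0, 40)
print(f"deviance = {deviance:.1f}, deviance chat = {chat:.2f}")
```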

The median chat approach is to simulate data over a range of c values, obtaining a deviance chat = deviance/df for each simulated data set. A logistic regression is then performed to estimate the value of c at which 1/2 of the simulated deviance/df values are greater than the observed deviance/df, and hence 1/2 are less than the observed deviance/df. The procedure requires the user to specify the range of c values to simulate (lower and upper bounds, and the total number of points spanning these bounds), and the number of replicate simulations to generate at each c value. Note that the lower bound cannot be less than 1, as there is no biologically reasonable model to explain underdispersed data, and there is no way to generate underdispersed data in MARK. Typically, a small set of c values over a wide range should be used first, to find the approximate range in which the estimate of c will fall; a second run can then concentrate the simulated values of c around that likely value.

The logistic regression analysis is performed by MARK as a known fate model. Output consists of the estimated value of c and a SE that reflects the sampling variation of the estimated c, derived from the logistic regression analysis. These estimates are provided in a Notepad window preceding the known fate output. In addition, a graph of the observed proportions along with the predicted proportions based on the logistic regression model is provided.

The initial dialog box where the simulation parameters are specified also has a check box to request an Excel spreadsheet containing the simulated values. This spreadsheet is useful for additional analysis, if desired.

The two-sided 95% confidence interval on c is obtained by picking off the 0.025 and 0.975 probability values from the logistic regression function. In addition, because the lower confidence bound on c is often less than 1, a one-sided 95% confidence bound is also provided. This value is probably of more general use than the two-sided interval, given that c has a lower bound of 1. Note that all of the values derived from the logistic regression function have sampling variation, which is expressed for chat as a sampling SE. Replicate simulations will produce slightly different results. Thus, the user should consider running multiple sets of simulations and averaging the resulting estimates of chat and the confidence bounds across runs.
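Picking values off the logistic regression function amounts to inverting the fitted curve at a chosen probability. The sketch below assumes a fitted model of the form logit(P) = b0 + b1 * c, with P the probability that a simulated deviance/df exceeds the observed one, and assumes the one-sided bound is the upper bound taken at P = 0.95. The coefficients are hypothetical and the function name is not part of MARK.

```python
import math


def c_at_probability(b0, b1, p):
    """Invert logit(p) = b0 + b1 * c to get the c value at probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (math.log(p / (1.0 - p)) - b0) / b1


# Hypothetical fitted coefficients, for illustration only.
b0, b1 = -6.0, 3.0

median = c_at_probability(b0, b1, 0.5)       # the median chat itself
lower = c_at_probability(b0, b1, 0.025)      # two-sided lower bound
upper = c_at_probability(b0, b1, 0.975)      # two-sided upper bound
one_sided = c_at_probability(b0, b1, 0.95)   # assumed one-sided upper bound

print(f"median chat = {median:.3f}")
print(f"two-sided 95% interval = ({lower:.3f}, {upper:.3f})")
print(f"one-sided 95% upper bound = {one_sided:.3f}")
```

With these coefficients the two-sided lower bound falls below 1, which is the situation in which the one-sided bound is the more informative summary.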

The median chat approach appears to work well. In comparisons with the RELEASE estimator for the CJS data type, the median chat is biased high, by as much as 15% in one case with phi = 0.5 and 5 occasions. However, the sampling distribution of the median chat has a much smaller standard deviation than that of the chat estimated by RELEASE. As a result, the mean squared error (MSE) of the median chat is generally about 1/2 of the MSE of the RELEASE estimator. Thus, on average, the median chat is closer to truth than the RELEASE chat, even though the median chat is biased high.
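The trade-off between bias and sampling variation follows from the identity MSE = bias^2 + variance. The sketch below checks that identity on two made-up sets of estimates: one biased high with little spread, one unbiased with more spread. The numbers are hypothetical, chosen only to show the arithmetic, and are not results from MARK or RELEASE.

```python
import statistics


def mse_summary(estimates, truth):
    """Return bias, variance, and MSE of a set of estimates about a known truth."""
    bias = statistics.fmean(estimates) - truth
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean((e - truth) ** 2 for e in estimates)
    return {"bias": bias, "variance": variance, "mse": mse}


true_c = 1.0  # hypothetical true value of c

# Hypothetical estimates, for illustration only.
biased_low_spread = [1.10, 1.20, 1.15, 1.05, 1.25]
unbiased_high_spread = [0.70, 1.30, 1.00, 0.75, 1.25]

a = mse_summary(biased_low_spread, true_c)
b = mse_summary(unbiased_high_spread, true_c)

for label, s in (("biased, low spread", a), ("unbiased, high spread", b)):
    print(f"{label}: bias = {s['bias']:.3f}, "
          f"variance = {s['variance']:.4f}, MSE = {s['mse']:.4f}")
```

In this made-up case the biased estimator has the smaller MSE, which is the same kind of comparison the text draws between the median chat and the RELEASE chat.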

Available Data Types

Only a handful of the data types in MARK can be used with the median chat procedure. To see which data types are possible, open the Help tab and then the Data Types menu choice. In the list of data types, those marked with a # sign can be run through the median chat procedure.

Jolly-Seber Data Types. Although none of the Jolly-Seber data types are marked with a # sign, goodness of fit of all of these data types can be assessed via the Cormack-Jolly-Seber data type. This is because lack of fit in the Jolly-Seber data types can come only from the recaptures. Hence, you can use the chat value estimated from the CJS data type for any of the Jolly-Seber data types.

Huggins Closed Captures. Of the closed captures data types, only the Huggins data type can be used with the median chat. This is because the Huggins models condition on the number of unique animals captured, M(t + 1), so the median chat procedure uses the observed value of M(t + 1) to simulate data. However, because of this conditioning, the median chat procedure cannot be used with robust design data types that use the Huggins closed captures model: conditioning on M(t + 1) precludes survival rate estimation.

Known Fates. Although you can process known fate data with the median chat, this effort is not really assessing goodness-of-fit, because you can always create a known fate model where the deviance is zero, i.e., the saturated model. So, what you are really doing when you run known fate data through the median chat procedure is assessing how much structural lack of fit remains in your model compared to the saturated model.

No Individual Covariates. One of the current limitations of the median chat goodness-of-fit procedure is that individual covariates are not allowed. This is because the real parameters are passed to the simulator to generate the simulated data, a shortcut that avoids having to deal with the multitude of link functions in the true model.