Regression on a continuous variable is straightforward, but discretization (i.e., classifying the variable's value into one of a finite number of bins) is beneficial in many data analysis problems where it is important to understand the general trend in the variable simply and intuitively. Besides simplicity, discretization reduces dimensionality (and complexity) and suits models of discrete data.
Discretization may also improve accuracy, which is the measure most discretization strategies aim to maximize. Yet accuracy may tell only a limited part of the story, and the results may often be insignificant. This can be seen in the degenerate scenario of discretization into a single class (bin/group), for which accuracy is perfect but the model supplies no useful information. Adding more bins will not be favored by the accuracy measure, as it will presumably introduce errors into the model, but it will raise the amount of information in the discretization.
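The degenerate single-bin case can be checked numerically. The sketch below (plain Python; the data and names are illustrative) computes empirical accuracy and mutual information in bits, showing that a single-bin scheme attains perfect accuracy with zero information, while a two-bin scheme with some misclassifications is less accurate but more informative:

```python
import math
from collections import Counter

def mutual_information(pred, true):
    """Empirical mutual information (in bits) between two label sequences."""
    n = len(pred)
    pxy = Counter(zip(pred, true))          # joint counts of (predicted, true)
    px, py = Counter(pred), Counter(true)   # marginal counts
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def accuracy(pred, true):
    return sum(p == t for p, t in zip(pred, true)) / len(true)

# One bin: every prediction trivially matches, yet nothing is learned.
one_bin = ["b0"] * 8
print(accuracy(one_bin, one_bin), mutual_information(one_bin, one_bin))  # 1.0 0.0

# Two bins with two misclassifications: lower accuracy, positive information.
true2 = ["lo"] * 4 + ["hi"] * 4
pred2 = ["lo", "lo", "lo", "hi", "hi", "hi", "hi", "lo"]
print(accuracy(pred2, true2), mutual_information(pred2, true2))  # 0.75 ~0.19
```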
That is, there is a tradeoff between the accuracy of a discretization scheme and the amount of information it provides: an increase in the amount of information in the scheme often comes at the expense of its accuracy. For example, discretizing a continuous target variable, such as the work in process (WIP) of a production tool, using more levels increases the information but also raises the errors in the WIP discretization.
To trade off between the conflicting goals of maximizing accuracy and maximizing information, and to direct and evaluate discretization, an information measure (IM) is suggested, based on the mutual information (MI) between predictions and true decisions,

$$ IM = \sum_{x \in X} \sum_{y \in Y} \bigl(1 - \mathrm{Errs}(x, y)\bigr)\, p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}, $$

where Errs is a measure of the severity of the discretization error, namely a normalized 'discretization distance' between a predicted group x ∈ X and the true group y ∈ Y (i.e., the distance between (x, y) and (x, x), which represents the perfect decision), p(x, y) is the joint probability distribution of predicted and true groups, and p(x) and p(y) are its marginals.
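As a concrete reading of these definitions, the following sketch computes an IM of this general shape, assuming the MI terms are damped by (1 − Errs(x, y)) and that Errs is a normalized bin-index distance; the exact weighting and distance used in the paper may differ:

```python
import math

def im(joint, n_bins):
    """IM sketch: mutual-information terms damped by (1 - Errs).

    joint maps (predicted_bin, true_bin) -> p(x, y).
    Errs(x, y) is assumed here to be the normalized bin distance
    |x - y| / (n_bins - 1), one possible 'discretization distance'.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():         # accumulate the marginals
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    score = 0.0
    for (x, y), p in joint.items():
        if p == 0.0:
            continue
        errs = abs(x - y) / (n_bins - 1) if n_bins > 1 else 0.0
        score += (1.0 - errs) * p * math.log2(p / (px[x] * py[y]))
    return score

# A perfect two-bin scheme attains the full 1 bit of mutual information...
print(im({(0, 0): 0.5, (1, 1): 0.5}, n_bins=2))  # 1.0
# ...while a noisy scheme keeps only part of it; its cross-bin errors
# (Errs = 1 here) contribute nothing to the score.
print(im({(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}, n_bins=2))
```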
The superiority of IM over other performance measures is manifested in various scenarios: for example, when the balance among the classes (groups) changes, when the number of groups increases, or when the error severity, Errs, increases.
In addition, an unsupervised, IM-based discretization strategy, called IM-based split (IMS), is suggested. This strategy determines the number and positions of the discretization splits so as to increase the amount of information in the discretization while minimizing the error severity.
IMS takes into consideration the ability of a prediction model and 'tries' to balance the amount of information the model supplies against the damage its errors cause. This is done in a greedy fashion, by repeatedly finding splits of the predicted target variable that incrementally improve IM. The result is a discretization scheme that is not restricted to a fixed number of bins or to bins of equal width or equal instance frequency; instead, the number and widths of the bins are determined by the prediction ability of the model, as exploited by IMS.
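The greedy search described above can be sketched as follows (an illustrative implementation, not the paper's exact algorithm): predicted and true values of the target variable are discretized by a common set of cuts, each candidate cut is scored by an IM of the assumed weighted-MI form, and the best improving cut is added until no candidate helps:

```python
import bisect
import math

def discretize(value, cuts):
    """Map a continuous value to a bin index given sorted cut points."""
    return bisect.bisect_right(cuts, value)

def im_score(pairs, cuts):
    """Empirical IM of a cut scheme: MI terms weighted by 1 - normalized bin distance."""
    n_bins = len(cuts) + 1
    n = len(pairs)
    joint, px, py = {}, {}, {}
    for pred, true in pairs:
        x, y = discretize(pred, cuts), discretize(true, cuts)
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    score = 0.0
    for (x, y), c in joint.items():
        errs = abs(x - y) / (n_bins - 1) if n_bins > 1 else 0.0
        score += (1.0 - errs) * (c / n) * math.log2(c * n / (px[x] * py[y]))
    return score

def ims(pairs, candidates):
    """Greedily add the candidate cut that most improves IM; stop when none does."""
    cuts, best = [], 0.0  # a single bin carries zero information
    while True:
        trials = [(im_score(pairs, sorted(cuts + [c])), c)
                  for c in candidates if c not in cuts]
        if not trials:
            break
        score, c = max(trials)
        if score <= best + 1e-12:
            break
        best, cuts = score, sorted(cuts + [c])
    return cuts, best

# Predictions that track the true values: one cut at 0.5 recovers 1 bit of IM.
pairs = [(0.10, 0.15), (0.20, 0.10), (0.15, 0.20),
         (0.80, 0.90), (0.90, 0.85), (0.85, 0.80)]
print(ims(pairs, candidates=[0.5]))  # ([0.5], 1.0)
```

Note that the number of bins is not fixed in advance: it emerges from how many cuts keep improving the score, which is the property the text attributes to IMS.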
IMS is evaluated using real WIP data collected from a chain of tools in a Tower Semiconductor manufacturing fab.