With the growing dependency on data resources, whose volume and complexity are rapidly increasing, data quality (DQ) is of growing concern. This study examines the degradation of data accuracy over time due to changes in the real-world entities or behaviors that the data describes. Even if data is captured correctly, the real-world state may change over time; if the data is not updated accordingly, it no longer reflects the real-world state correctly and becomes inaccurate. Decisions that rely on inaccurate data are likely to be faulty, often with major negative consequences. The goal of this study is to develop a model that reflects DQ as a dynamic process and can help assess and predict accuracy degradation over time. Another important aspect addressed by this study is the substantial cost-benefit tradeoff associated with DQ improvement. Detecting and correcting accuracy defects, by comparing data to the corresponding real-world values, is expensive and resource-demanding. The model developed in this study can help assess whether the benefits of DQ improvement justify the associated costs, and recommend the optimal point in time at which data values should be evaluated and possibly reacquired.
The model takes a continuous-time Markov chain (CTMC) approach, assuming a finite number of states, each reflecting a possible value of a key data attribute. The probabilities of transitioning between states are known, stable, and independent of past transitions, and the time spent in each state is exponentially distributed. The model assumes a known damage in cases where the stored data state does not match the real-world value; the damage is state-dependent and described as a non-decreasing (possibly constant) function of time. Under these assumptions, it is possible to estimate the expected damage of a data record, given its current state and the time elapsed since the last transition. The decision whether or not to correct the data can then be made by comparing the potential benefit of correction (the elimination of potential damage) against the correction cost.
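To make the decision rule concrete, the following is a minimal sketch of the expected-damage calculation for the simplest possible case: a two-state CTMC with hypothetical transition rates `lam` (away from the recorded state) and `mu` (back to it), and a constant damage on mismatch. The function names, the two-state restriction, and the closed-form solution are illustrative assumptions, not the paper's actual implementation, which handles a general finite state space and time-dependent damage functions.

```python
import math

def mismatch_prob(lam, mu, t):
    """Probability that the real-world state no longer matches the
    recorded state after elapsed time t, for a two-state CTMC with
    rate lam out of the recorded state and rate mu back into it."""
    s = lam + mu
    return (lam / s) * (1.0 - math.exp(-s * t))

def expected_damage(lam, mu, t, damage):
    """Expected damage of relying on the stored value at time t,
    assuming a constant (time-independent) damage on mismatch."""
    return mismatch_prob(lam, mu, t) * damage

def optimal_reacquisition_time(lam, mu, damage, cost):
    """Earliest time t* at which the expected damage reaches the
    correction cost, i.e. the point where reacquiring the value
    starts to pay off.  Returns None if the expected damage never
    reaches the cost (correction is never justified)."""
    s = lam + mu
    limit = (lam / s) * damage  # long-run (t -> infinity) expected damage
    if cost >= limit:
        return None
    # Solve (lam/s) * (1 - exp(-s*t)) * damage = cost for t.
    return -math.log(1.0 - cost / limit) / s
```

For example, with a yearly rate `lam = 0.5` of leaving the recorded status, `mu = 0.1`, a damage of 1000 on acting upon a stale value, and a contact cost of 50, `optimal_reacquisition_time(0.5, 0.1, 1000, 50)` returns the elapsed time (in years) after which contacting the insurant becomes worthwhile; if the cost exceeded the long-run expected damage, the function would return `None` and the record would never be worth reacquiring on expected-value grounds.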
The model is evaluated with a real-world dataset that captures the workflow of a relevant business scenario. The dataset reflects the activity of a firm that handles insurance claims, where, as part of the claim-handling process, employees have to update the insurant's status if it has changed. Insurants often neglect to report their current status; hence, the dataset is subject to major inaccuracies, which often translate into major monetary losses for the company. Contacting insurants and updating their details is costly and time-consuming, and the contact decision is currently guided by heuristics based on employees' experience. The goal of the empirical evaluation is to assess the model's feasibility and potential contribution by evaluating its performance in predicting status transitions against the current heuristic, and the potential time and cost savings associated with such prediction.