The previous two articles in this series, "The Premise of Analysis: Data Quality" parts 1 and 2, introduced using Data Profiling to obtain statistical information about the data, and using Data Auditing to evaluate whether the data has quality problems; data quality can be audited from three aspects: integrity, accuracy, and consistency. This article introduces the last piece, data correction (Data Correcting).
Auditing the data helps us find its existing problems, and some of these problems can be corrected with suitable methods, improving the overall quality of the data. Data correction accomplishes this task by modifying the data, and corrections can be made from the following aspects:
Filling missing values
The simplest way to deal with missing records is to recover the data. In general, a missing statistical indicator can be recomputed from the raw data, while missing raw data can be recovered from the original data source or from backups. If the raw data is completely lost, it is basically irretrievable.
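As a minimal sketch of recomputing a lost aggregate from the raw data, assuming the raw access records are still available (the record layout, field names, and the `recover_from_raw` helper here are all illustrative assumptions, not part of the original series):

```python
from collections import Counter

# Hypothetical raw access records: each event carries the date it occurred on.
raw_events = [
    {"date": "2023-05-01", "url": "/home"},
    {"date": "2023-05-01", "url": "/about"},
    {"date": "2023-05-02", "url": "/home"},
]

# Daily pageview table in which one day's aggregated value has been lost.
daily_pageviews = {"2023-05-01": None, "2023-05-02": 1}

def recover_from_raw(daily, events):
    """Recompute any missing daily aggregate directly from the raw events."""
    counts = Counter(e["date"] for e in events)
    return {day: counts[day] if value is None else value
            for day, value in daily.items()}

recovered = recover_from_raw(daily_pageviews, raw_events)
# recovered["2023-05-01"] is recomputed as 2 from the two raw events on that day
```

This only works while the raw events are retained; once they are gone, the aggregate cannot be rebuilt, which is exactly the "basically irretrievable" case above.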
For missing field values, many data-mining methods repair the data with statistical techniques, which essentially estimate or predict the missing values. Common approaches are to average the non-missing values before and after the gap, or to fit the trend of the indicator with regression analysis and predict the missing value from it. These methods are appropriate only when the missing value cannot be retrieved or recalculated any other way and the indicator changes according to a regular pattern; for example, when one day's indicator value is lost, it can be predicted from the values of the preceding few days.

Most of the time in web analytics, however, missing values in the underlying log are difficult to predict, because the details of a visit are almost impossible to trace back, and access records with missing fields can significantly affect the calculation of some statistical indicators. The simplest approach is to discard such records, but directly filtering them out is acceptable only for access logs that do not require very precise data. For the website's operational and transaction data, which must be complete and accurate, records absolutely cannot be discarded directly. Filtering access-log records with missing or abnormal values should also rest on a statistical basis. The general principle: if the records with a missing or abnormal value in a less important field account for less than 1% (or 0.5%), you can choose to filter those records out; if the share is higher, you need to investigate further whether there is a problem with the logging itself.
Deleting duplicate records
The values of some fields in a data set must be unique: for example, the date field in indicator values aggregated by day, or the user ID in a user information table. Rules like these, which require an ID or key to be unique, can be applied to detect duplicate records.
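A minimal sketch of deduplication against such a uniqueness rule, keeping the first record seen for each key (the `deduplicate` helper and the sample user table are assumptions for illustration):

```python
def deduplicate(records, key):
    """Keep only the first record for each value of the unique key;
    later records whose key repeats are treated as duplicates and dropped."""
    seen = set()
    unique = []
    for rec in records:
        k = rec[key]
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique

users = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
    {"user_id": 1, "name": "alice"},  # duplicate user_id, will be dropped
]
deduped = deduplicate(users, "user_id")
# deduped keeps only the first two records
```

Keeping the first occurrence is one policy among several; depending on the audit findings you might instead keep the most recent record or merge the duplicates.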