International Journal of Information Technology and Computer Science (IJITCS)
ISSN: 2074-9007 (Print), ISSN: 2074-9015 (Online)
Published By: MECS Press
IJITCS Vol.9, No.9, Sep. 2017
A New Dynamic Data Cleaning Technique for Improving Incomplete Dataset Consistency
Full Text (PDF, 390KB), PP.60-68
This paper presents a new approach, Dynamic Data Cleaning (DDC), that aims to improve the consistency of incomplete datasets by identifying, reconstructing and removing inconsistent data objects before further data analysis. The proposed DDC approach consists of three methods: Identify Normal Object (INO), Reconstruct Normal Object (RNO) and Dataset Quality Measure (DQM). First, the INO method divides the incomplete dataset into normal objects and abnormal objects (outliers) based on the degree of missing attribute values in each individual object. Second, the RNO method reconstructs the missing attribute values in each normal object from the closest object under a distance metric, and removes the inconsistent data objects (outliers) that have a higher degree of missing data. Finally, the DQM method measures the consistency and inconsistency among the objects in the improved dataset, with and without outliers. Experimental results show that the proposed DDC approach is suitable for identifying and reconstructing incomplete data objects, improving dataset consistency from a lower to a higher level without user knowledge.
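The INO and RNO steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the missing-value threshold or name the distance metric, so the `max_missing_ratio` parameter, the use of `None` as the missing-value marker, the Euclidean distance over attributes observed in both objects, and all function names are assumptions made for illustration. DQM is left out because the abstract does not define how consistency is computed.

```python
import math

MISSING = None  # assumed marker for a missing attribute value


def identify_normal_objects(dataset, max_missing_ratio=0.5):
    """INO-style split (threshold is an assumption): objects whose share of
    missing attributes exceeds max_missing_ratio are treated as outliers."""
    normal, outliers = [], []
    for obj in dataset:
        ratio = sum(v is MISSING for v in obj) / len(obj)
        (outliers if ratio > max_missing_ratio else normal).append(list(obj))
    return normal, outliers


def _distance(a, b):
    """Euclidean distance over attributes observed in both objects
    (assumed metric); infinite when no attribute is shared."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def reconstruct_normal_objects(normal):
    """RNO-style fill: copy each missing value from the closest normal
    object that has that attribute observed."""
    repaired = []
    for i, obj in enumerate(normal):
        fixed = list(obj)
        for j, value in enumerate(obj):
            if value is not MISSING:
                continue
            donors = [o for k, o in enumerate(normal)
                      if k != i and o[j] is not MISSING]
            if donors:
                fixed[j] = min(donors, key=lambda o: _distance(obj, o))[j]
        repaired.append(fixed)
    return repaired


# Toy dataset: the last object is missing two of three attributes.
data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [9.0, 8.0, 7.0],
    [None, None, 5.0],
]
normal, outliers = identify_normal_objects(data)
cleaned = reconstruct_normal_objects(normal)
```

In this toy run the last object is set aside as an outlier, and the missing second attribute of `[1.1, None, 3.1]` is copied from its nearest neighbour `[1.0, 2.0, 3.0]`. Copying a single nearest value is the simplest reading of "reconstructs ... by the closest object"; the paper may differ in detail.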
Cite This Paper
Sreedhar Kumar S, Meenakshi Sundaram S, "A New Dynamic Data Cleaning Technique for Improving Incomplete Dataset Consistency", International Journal of Information Technology and Computer Science(IJITCS), Vol.9, No.9, pp. 60-68, 2017. DOI: 10.5815/ijitcs.2017.09.06