A hybrid Technique for Cleaning Missing and Misspelling Arabic Data in Data Warehouse

PP. 17-25

Author(s)

Mohammed Abdullah Al-Hagery 1,*, Latifah Abdullah Alreshoodi 1, Maram Abdullah Almutairi 1, Suha Ibrahim Alsharekh 1, Emtenan Saad Alkhowaiter 1

1. Department of Computer Science, College of Computer, Qassim University, KSA

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2019.07.03

Received: 26 Jun. 2018 / Revised: 10 Dec. 2018 / Accepted: 19 May 2019 / Published: 8 Jul. 2019

Index Terms

Data Cleaning, Missing Data, Arabic Misspelling, Data Quality, Data Consistency

Abstract

Real-world datasets accumulated over a number of years tend to be incomplete, inconsistent, and noisy, which in turn causes inconsistency in data warehouses. Data owners hold hundreds of millions to billions of records written in different languages, so the need for comprehensive, efficient techniques to maintain data consistency and increase data quality grows continuously. Data cleaning is known to be a complex and difficult task, especially for data written in Arabic, a complex language in which various types of unclean data can occur, for example missing values, dummy values, redundant values, inconsistent values, misspellings, and noisy data. The ultimate goal of this paper is to improve data quality by cleaning Arabic datasets of various types of errors, producing data that supports better analysis and highly accurate results. This, in turn, leads to the discovery of correct knowledge patterns and to accurate decision-making. The proposed approach is based on merging different algorithms, which ensures that reliable methods are used for data cleansing. It cleans Arabic datasets through multi-level cleaning using the Arabic Misspelling Detection and Correction Model (AMDCM) and Decision Tree Induction (DTI). The approach can solve the problems of Arabic misspelling, cryptic values, dummy values, and the unification of naming styles. A sample of data before and after cleaning is presented.
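As a rough illustration of the kind of cleaning the abstract describes, the sketch below flags dummy values, unifies a few common Arabic spelling variants, and corrects a misspelled value against a reference vocabulary using edit distance. It is not the authors' AMDCM: the vocabulary, the dummy-value list, the function names, and the `max_distance` threshold are all assumptions made for this example, and the decision-tree stage that fills in missing values is not shown.

```python
import re

# Hypothetical reference vocabulary of valid city names (illustrative only).
VOCABULARY = ["الرياض", "جدة", "بريدة", "مكة"]

# Hypothetical placeholder strings treated as dummy values.
DUMMY_VALUES = {"", "-", "?", "xxx", "n/a", "null"}

# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")


def normalize(text):
    """Unify common spelling variants so equivalent names compare equal."""
    text = DIACRITICS.sub("", text.strip())
    # Alef with madda / hamza above / hamza below -> bare alef.
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)
    # Alef maqsura -> ya.
    return text.replace("\u0649", "\u064A")


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def clean_value(raw, vocabulary=VOCABULARY, max_distance=1):
    """Return a cleaned value, or None when the value is missing or dummy."""
    if raw is None or raw.strip().lower() in DUMMY_VALUES:
        return None
    value = normalize(raw)
    best = min(vocabulary, key=lambda w: edit_distance(value, normalize(w)))
    if edit_distance(value, normalize(best)) <= max_distance:
        return best
    return value  # leave unknown words untouched rather than guess
```

Values returned as `None` would then be candidates for a separate imputation step, such as the decision-tree-based one the abstract mentions. A small `max_distance` keeps the correction conservative, so a valid word that is simply absent from the vocabulary is left unchanged.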

Cite This Paper

Mohammed Abdullah Al-Hagery, Latifah Abdullah Alreshoodi, Maram Abdullah Almutairi, Suha Ibrahim Alsharekh, Emtenan Saad Alkhowaiter, "A hybrid Technique for Cleaning Missing and Misspelling Arabic Data in Data Warehouse", International Journal of Information Technology and Computer Science (IJITCS), Vol. 11, No. 7, pp. 17-25, 2019. DOI: 10.5815/ijitcs.2019.07.03
