Imtiaz Hussain Khan; Muazzam A. Siddiqui; Kamal M. Jambi; Muhammad Imran; Abobakr A. Bagais

Query Optimization in Arabic Plagiarism Detection: An Empirical Study

Full Text (PDF, 444KB), PP.73-79

Author(s)

Imtiaz Hussain Khan ¹,*, Muazzam A. Siddiqui ², Kamal M. Jambi ¹, Muhammad Imran ¹, Abobakr A. Bagais ¹

1. Department of Computer Science, King Abdulaziz University, Jeddah – KSA

2. Department of Information Systems, King Abdulaziz University, Jeddah – KSA

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2015.01.07

Received: 1 May 2014 / Revised: 15 Sep. 2014 / Accepted: 20 Nov. 2014 / Published: 8 Dec. 2014

Index Terms

Arabic Plagiarism Detection, Query Generation, Query Optimization, Document Similarity, Arabic Natural Language Processing

Abstract

This article describes ongoing research aimed at developing a plagiarism detection system for Arabic documents. We developed several heuristics for generating effective queries for document retrieval from the Web. The performance of these heuristics was evaluated empirically against a sizeable corpus in terms of precision, recall, and F-measure. We found that a systematic combination of heuristics greatly improves the performance of the document retrieval system.
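The three evaluation measures named above have standard definitions, which the following minimal sketch illustrates. It is not the authors' evaluation code: the function name `precision_recall_f`, the document identifiers, and the assumption that retrieval results are compared as sets of document IDs are all hypothetical, chosen only for illustration.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query.

    retrieved: document IDs returned by the retrieval step (assumed input).
    relevant:  document IDs that are true sources (assumed gold standard).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # correctly retrieved documents

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # F-measure here is the balanced harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: 4 documents retrieved, 3 true sources, 2 in common.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

In this made-up example, two of the four retrieved documents are true sources, so precision is 2/4 and recall is 2/3. A query heuristic that retrieves more documents may raise recall while lowering precision, which is why the F-measure is reported alongside both.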

Cite This Paper

Imtiaz H. Khan, Muazzam A. Siddiqui, Kamal M. Jambi, Muhammad Imran, Abobakr A. Bagais, "Query Optimization in Arabic Plagiarism Detection: An Empirical Study", International Journal of Intelligent Systems and Applications (IJISA), vol. 7, no. 1, pp. 73-79, 2015. DOI: 10.5815/ijisa.2015.01.07

International Journal of Intelligent Systems and Applications (IJISA)