Kazem Taghandiki

Work place: Department of Computer Engineering, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran

E-mail: taghandiky@gmail.com

Research Interests: Information Systems, Data Mining, Information Retrieval, Multimedia Information System, Data Structures and Algorithms

Biography

Kazem Taghandiki is a graduate student in Software Engineering at the University of Isfahan. He received his B.S. in computer engineering from Birjand University in 2013. His areas of interest include data mining, information retrieval, and the Semantic Web. He is currently working on his thesis in the area of noisy hyperlink removal.

Author Articles
A Supervised Approach for Automatic Web Documents Topic Extraction Using Well-Known Web Design Features

By Kazem Taghandiki, Ahmad Zaeri, Amirreza Shirani

DOI: https://doi.org/10.5815/ijmecs.2016.11.03, Pub. Date: 8 Nov. 2016

The aim of this paper is to propose an efficient method for identifying the topics of web documents, which is often considered one of the open challenges in many information retrieval systems. Most previous work has focused on analyzing the entire text using time-consuming methods, and much of it has used unsupervised approaches to identify the main topic of a document. In this paper, by contrast, we exploit the most widely used Hyper-Text Markup Language (HTML) features to extract topics from web documents using a supervised approach.
Using an interactive crawler, we first analyze the HTML structure of 5000 webpages in order to identify the most widely used HTML features. In the next step, the selected features of 1500 webpages are extracted using the same crawler.
Users assign suitable topics to each web document in a supervised learning process. A topic modeling technique is applied to the extracted features to build four classifiers (C4.5, Decision Tree, Naïve Bayes, and Maximum Entropy), which are trained and tested separately on our data. The results of the classifiers are compared and the most accurate classifier is selected. To examine our approach at a larger scale, a new set of 3500 web documents is evaluated using the selected classifier. Results show that the proposed system achieves a recognition rate of 71.8%.
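
The feature-extraction step described in the abstract might look roughly like the following Python sketch. It is only an illustration, not the authors' crawler: the abstract does not say which HTML features were selected, so the feature set here (title, h1, h2, anchor text, and the meta keywords tag) and the names FeatureExtractor and extract_features are assumptions made for the example. It uses only the standard library's html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical feature set: the abstract does not list the features
# that were actually selected from the 5000 analyzed pages.
TEXT_TAGS = {"title", "h1", "h2", "a"}


class FeatureExtractor(HTMLParser):
    """Collects the text found inside a few common HTML elements."""

    def __init__(self):
        super().__init__()
        self.features = {tag: [] for tag in TEXT_TAGS}
        self.features["meta_keywords"] = []
        self._stack = []  # open tags of interest, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._stack.append(tag)
        elif tag == "meta":
            attr = dict(attrs)
            # Attribute values can be None, so fall back to "".
            if (attr.get("name") or "").lower() == "keywords":
                content = attr.get("content") or ""
                self.features["meta_keywords"] += [
                    k.strip() for k in content.split(",") if k.strip()
                ]

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when it sits inside one of the tracked tags.
        if text and self._stack:
            self.features[self._stack[-1]].append(text)


def extract_features(html):
    """Return a dict mapping each feature name to a list of strings."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.features
```

The resulting dictionaries, paired with user-assigned topic labels, could then be turned into training instances for whichever classifier is being compared.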

Other Articles