International Journal of Information Technology and Computer Science (IJITCS)

ISSN: 2074-9007 (Print), ISSN: 2074-9015 (Online)

Published By: MECS Press

IJITCS Vol.6, No.9, Aug. 2014

An Approach for Indexing Web Data Sources

PP.52-58

Saidi Imene, Nait Bahloul Safia

Index Terms

Information Retrieval, Indexing Techniques, Data Mining, MapReduce


Web information sources such as forums, blogs, and news articles are becoming increasingly large and diverse. Although advances in technology are improving techniques for handling the large volumes of data generated, these sources remain heterogeneous in structure (semi-structured or unstructured) and in nature (text or images). Software solutions are therefore needed to prepare the data and to access these sources in a homogeneous way. In this paper, we present an approach for indexing heterogeneous data sources. Our objective is to offer techniques for efficient indexing of web sources by storing only the necessary information. We propose automatic indexing for semi-structured or unstructured sources (e.g., XML files, HTML files) and annotation for other sources (e.g., images and videos embedded in a page). We present our indexing algorithms and propose the use of the MapReduce model to build a scalable inverted index. Experiments on a real-world corpus show that our approach achieves good performance.
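To illustrate the general idea of building an inverted index with the MapReduce model, the following is a minimal single-process sketch in Python. It is not the authors' algorithm: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`), the tokenization rule, and the two-document corpus are all hypothetical and chosen only for illustration. A real deployment would run the map and reduce steps on a distributed framework.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per token in the document."""
    # Assumed tokenization: lowercase runs of letters and digits.
    for term in re.findall(r"[a-z0-9]+", text.lower()):
        yield term, doc_id


def shuffle(pairs):
    """Group the emitted pairs by term, as the framework would between phases."""
    groups = defaultdict(list)
    for term, doc_id in pairs:
        groups[term].append(doc_id)
    return groups


def reduce_phase(term, doc_ids):
    """Reduce step: collapse the doc ids for a term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(corpus):
    """Run map, shuffle, and reduce sequentially over an in-memory corpus."""
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


# Hypothetical two-document corpus for illustration only.
corpus = {
    "d1": "Indexing web sources",
    "d2": "Web forums and blogs",
}
index = build_inverted_index(corpus)
```

Here `index` maps each term to the sorted list of documents that contain it, so a lookup such as `index["web"]` returns the posting list for that term. Because each map call depends on a single document and each reduce call on a single term, both steps can in principle be distributed across workers.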

Cite This Paper

Saidi Imene, Nait Bahloul Safia, "An Approach for Indexing Web Data Sources", International Journal of Information Technology and Computer Science (IJITCS), vol.6, no.9, pp.52-58, 2014. DOI: 10.5815/ijitcs.2014.09.07



