International Journal of Information Technology and Computer Science (IJITCS)

ISSN: 2074-9007 (Print), ISSN: 2074-9015 (Online)

Published By: MECS Press

IJITCS Vol.6, No.9, Aug. 2014

An Approach for Indexing Web Data Sources

Full Text (PDF, 360KB), PP.52-58



Saidi Imene, Nait Bahloul Safia

Index Terms

Information Retrieval, Indexing Techniques, Data Mining, MapReduce


Web information sources such as forums, blogs, and news articles are becoming increasingly large and diverse. Although advances in technology are improving the techniques available for handling the large volumes of data generated, these sources remain heterogeneous both in structure (semi-structured or unstructured) and in nature (text or images). Software solutions are therefore needed to prepare the data and to access these sources in a homogeneous way. In this paper we present an approach for indexing heterogeneous data sources. Our objective is to offer techniques for efficient indexing of web sources by storing only the necessary information. We propose automatic indexing for semi-structured or unstructured sources (e.g., XML and HTML files) and annotation for other sources (e.g., images and videos embedded in a page). We present our indexing algorithms and propose using the MapReduce model to build a scalable inverted index. Experiments on a real-world corpus show that our approach achieves good performance.
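To make the idea of building an inverted index with the MapReduce model concrete, the following is a minimal single-process sketch in Python. It is not the authors' algorithm: the function names (`tokenize`, `map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`), the tokenization rule, and the sample documents are all assumptions made for illustration, and the shuffle step that a real MapReduce framework performs across machines is simulated here with an in-memory dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumed tokenization: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def map_phase(doc_id, text):
    # Map: emit one (term, doc_id) pair per term occurrence.
    for term in tokenize(text):
        yield term, doc_id


def shuffle(pairs):
    # Shuffle: group emitted values by key (term).
    # A real framework does this across the cluster; here it is in memory.
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    # Reduce: turn the list of doc ids for a term into a posting list
    # of (doc_id, term_frequency) tuples sorted by doc_id.
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, sorted(counts.items())


def build_inverted_index(docs):
    # Run map over every document, shuffle, then reduce each term.
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "Web forums and blogs",
    "d2": "Indexing web blogs, blogs everywhere",
}
index = build_inverted_index(docs)
```

In this sketch, looking up `index["blogs"]` gives the posting list for that term, with one `(doc_id, term_frequency)` entry per document containing it. Because each map call depends on a single document and each reduce call on a single term, both phases can in principle be distributed, which is the property that makes the model attractive for large corpora.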

Cite This Paper

Saidi Imene, Nait Bahloul Safia, "An Approach for Indexing Web Data Sources", International Journal of Information Technology and Computer Science (IJITCS), vol.6, no.9, pp.52-58, 2014. DOI: 10.5815/ijitcs.2014.09.07


