International Journal of Information Technology and Computer Science (IJITCS)

ISSN: 2074-9007 (Print), ISSN: 2074-9015 (Online)

Published By: MECS Press

IJITCS Vol.8, No.11, Nov. 2016

Relevant XML Documents - Approach Based on Vectors and Weight Calculation of Terms

Full Text (PDF, 571KB), PP.16-25

Abdeslem DENNAI, Mohammed Yacine DENNAI, Sidi Mohammed BENSLIMANE

Index Terms

Semi-structured web document; term weighting; term frequency; TF-IDF and logic frequency


Three classes of documents circulate on the web, distinguished by how their data is organized: unstructured documents (.doc, .html, .pdf, ...), semi-structured documents (.xml, .owl, ...), and structured documents (database tables, for example). A semi-structured document is organized around tags that are either predefined or defined by its author.
However, many studies classify documents by their textual content alone and underestimate their structure. In this paper, we propose a representation of semi-structured web documents based on weighted vectors, which allows their content to be exploited in subsequent processing. The weight of a term is calculated using the normal frequency for a single document, and TF-IDF (Term Frequency - Inverse Document Frequency) and logical (Boolean) frequency for a set of documents. To assess and demonstrate the relevance of the proposed approach, we carry out several experiments on different corpora.
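As a rough illustration of the three weighting schemes named above, the following minimal Python sketch builds weighted term vectors from small XML documents. It is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes raw counts for the normal frequency, the common tf * log(N / df) form for TF-IDF, and a 0/1 presence flag for the Boolean frequency. The function names and sample documents are hypothetical, and the sketch only uses the text inside the tags, ignoring the tag structure itself.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def terms(xml_text):
    """Lowercased whitespace tokens taken from all text nodes of an XML document."""
    root = ET.fromstring(xml_text)
    return " ".join(root.itertext()).lower().split()


def term_frequency(tokens):
    """Normal frequency: raw count of each term in one document."""
    return Counter(tokens)


def boolean_weights(tokens, vocabulary):
    """Boolean frequency: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return {t: 1 if t in present else 0 for t in vocabulary}


def tf_idf(corpus_tokens):
    """TF-IDF per document, assuming the form tf * log(N / df)."""
    n = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in corpus_tokens:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in df})
    return vectors


# Hypothetical two-document corpus, for illustration only.
docs = [
    "<doc><title>xml retrieval</title><body>xml term weighting</body></doc>",
    "<doc><title>term frequency</title><body>boolean frequency</body></doc>",
]
corpus = [terms(d) for d in docs]
vocab = sorted(set().union(*corpus))
weights = tf_idf(corpus)
```

Under these assumptions, a term that appears in every document (here "term") receives a TF-IDF weight of zero, while a term concentrated in one document (here "xml") receives a positive weight in that document only.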

Abdeslem DENNAI, Mohammed Yacine DENNAI, Sidi Mohammed BENSLIMANE, "Relevant XML Documents - Approach Based on Vectors and Weight Calculation of Terms", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.11, pp.16-25, 2016. DOI: 10.5815/ijitcs.2016.11.03


