International Journal of Information Technology and Computer Science (IJITCS)

ISSN: 2074-9007 (Print), ISSN: 2074-9015 (Online)

Published By: MECS Press

IJITCS Vol.8, No.8, Aug. 2016

Typology for Linguistic Pattern in English-Hindi Journalistic Text Reuse

Full Text (PDF, 1076KB), PP.75-86



Aarti Kumar, Sujoy Das

Index Terms

Paraphrasing; typology; linguistic transformation; lexical; cross-lingual; journalistic text reuse


Linking and tracking news stories that cover the same events but are written in different languages is a challenging task. In natural languages the same information may be expressed in multiple ways, and newspapers exploit this feature to make news stories more appealing. The same news story is often presented in different ways, both within one language and across languages, while the gist normally remains the same. This diversity of linguistic expression is a major challenge in identifying and tracking news stories covering the same events across languages, but doing so can yield rich and valuable resources, since comparable and parallel corpora can be generated from the linked stories. For Indian languages, resources for Natural Language Processing and Information Retrieval tasks are limited, and identifying comparable and parallel documents would offer a potential source for deriving bilingual dictionaries and training statistical Machine Translation systems. Paraphrasing is the most common way of reproducing news stories, and translated text is also a type of paraphrase. Before monolingual or bilingual news stories can be linked, these paraphrase types need to be identified and classified so that researchers can devise techniques for these problems. English and Hindi differ not only in script but also in grammar and vocabulary. A number of paraphrase typologies have been built from the perspective of Natural Language Processing or for specific applications but, to the best of the authors' knowledge, no typology has been reported for English-Hindi cross-language text reuse. In this paper a typology is formulated for cross-lingual journalistic text reuse in English-Hindi. The typology reveals the levels of difficulty in English-Hindi mapping and should help in devising techniques for linking and tracking English-Hindi stories.

Cite This Paper

Aarti Kumar, Sujoy Das, "Typology for Linguistic Pattern in English-Hindi Journalistic Text Reuse", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.8, pp.75-86, 2016. DOI: 10.5815/ijitcs.2016.08.09

