Workplace: National Institute of Technology (NIT), Raipur (C.G.), India
E-mail: tjaiswal_1207@yahoo.com
Biography
Tarun Jaiswal is a Research Scholar at the Department of Computer Applications, National Institute of Technology, Raipur. His research interests include Machine Learning, Artificial Intelligence and IoT. He has published several research papers in leading national and international journals and has presented his work at various conferences. He has also worked on several machine learning projects, which have given him practical experience in applying theoretical concepts to real-world problems.
By Sushma Jaiswal, Harikumar Pallthadka, Rajesh P. Chinchewadi, Tarun Jaiswal
DOI: https://doi.org/10.5815/ijisa.2024.02.05, Pub. Date: 8 Apr. 2024
Deep learning has substantially improved image captioning. The Transformer, a neural network architecture originally built for natural language processing, excels at image captioning and other computer vision applications. This paper reviews Transformer-based image captioning methods in detail. In traditional image captioning, convolutional neural networks (CNNs) extracted image features while recurrent neural networks (RNNs) or LSTM networks generated captions. This approach often suffers from information bottlenecks and struggles to capture long-range dependencies. The Transformer architecture revolutionized natural language processing with its attention mechanism and parallel processing. Researchers have leveraged the Transformer's success in language modeling to address image captioning. Transformer-based image captioning systems outperform previous methods in accuracy and efficiency by integrating visual and textual information into a single model. This paper discusses how the Transformer's self-attention mechanisms and positional encodings are adapted for image captioning, and covers Vision Transformers (ViTs) and CNN-Transformer hybrid models. We also discuss pre-training, fine-tuning, and reinforcement learning techniques for improving caption quality. Challenges, trends, and future directions for Transformer-based image captioning are also examined; open challenges include multimodal fusion, visual-text alignment, and caption interpretability. We expect future research to address these issues and to apply Transformer-based image captioning to domains such as medical imaging and remote sensing. This paper covers how Transformer-based approaches have changed image captioning and their potential to revolutionize multimodal interpretation and generation, advancing artificial intelligence and human-computer interaction.
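The scaled dot-product self-attention that the abstract credits for replacing recurrent caption decoders can be sketched as follows. This is a minimal NumPy illustration of the generic mechanism, not code from the reviewed paper; the matrix names (`Wq`, `Wk`, `Wv`) and toy dimensions are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of feature vectors.

    Each row of X may be a word embedding or an image-patch embedding,
    which is why the same mechanism serves both language and vision."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise relevance of every position to every other
    weights = softmax(scores, axis=-1)   # attention distribution per query position
    return weights @ V                   # weighted mixture of value vectors

# toy example: 4 "patches", embedding dimension 8 (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one attended vector per input position
```

Because every position attends to every other in a single matrix product, the model captures long-range dependencies in parallel, avoiding the sequential bottleneck of RNN and LSTM decoders described above.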