Generating Bangla Image Captions with Deep Learning Techniques
Date:
Introduction
In this talk, we explore our work on generating Bangla image captions, covering the introduction, materials, methods, results, and discussions.
What is Image Captioning?
Image captioning involves generating textual descriptions from images. Bangla image captioning bridges the gap between visual content and Bengali text using computer vision and natural language processing (NLP).
Proposed System
We developed a system for Bangla image captioning using EfficientNetB4 and ResNet-50 pretrained models for image feature extraction. The datasets used are:
- Flickr30k: 31,000 images.
- BanglaView: 158,000 captions, nearly 4 times larger than previous datasets.
Dataset Highlights
- Flickr30k: Contains 31,000 images.
- BanglaView:
- Vocabulary: ~25,000 words.
- 5 captions per image, with a maximum caption length of 67 words.
- Total captions: 158,000.
Train-Test Splitting
- Training set: 90% of the dataset.
- Testing set: 10% of the dataset, including over 3,000 images and 15,000 captions.
System Architecture
- Images are passed through pretrained CNN models (EfficientNetB4 or ResNet-50) for feature extraction.
- Captions are embedded into vectors using a Gated Recurrent Unit (GRU).
- Features and embeddings are combined and processed through a dense neural layer to generate captions.
Training Performance
- Training was performed over 10 epochs using CPU due to limited GPU availability.
- Significant reduction in loss and improvements in accuracy were observed during training.
Test Results
- BLEU scores obtained from unseen test data indicate excellent caption generation quality.
Examples of Generated Captions
For an image from the Flickr30k dataset:
- EfficientNetB4 Output: “একজন লোক মাইক্রোফোনে গান গাইছে” (A man is singing into a microphone).
- ResNet-50 Output: Also generated accurate captions.
Comparative Study
Our test scores are in the mid-range compared to other works. However:
- Only 10 epochs were performed.
- Training on the large BanglaView dataset took ~30 hours using CPU.
Applications
This work has several applications, including:
- Assisting visually impaired Bengali speakers.
- Enhancing photo search capabilities.
- Enabling robot interactions using Bengali captions.