Generating Bangla Image Captions with Deep Learning Techniques

Date: December 14, 2024

Introduction

In this talk, we present our work on generating Bangla image captions, covering the materials, methods, results, and discussion.

What is Image Captioning?

Image captioning involves generating textual descriptions from images. Bangla image captioning bridges the gap between visual content and Bengali text using computer vision and natural language processing (NLP).

Proposed System

We developed a system for Bangla image captioning using EfficientNetB4 and ResNet-50 pretrained models for image feature extraction. The datasets used are:

Flickr30k: 31,000 images.
BanglaView: 158,000 captions, nearly 4 times larger than previous datasets.

Dataset Highlights

Flickr30k: Contains 31,000 images.
BanglaView:
- Vocabulary: ~25,000 words.
- 5 captions per image, with a maximum caption length of 67 words.
- Total captions: 158,000.
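Statistics like the ones above can be derived directly from the caption file. The following is a minimal sketch, assuming captions are stored as a mapping from image id to a list of caption strings and that words are separated by whitespace; the function name, the toy image ids, and the toy captions are hypothetical and are not taken from BanglaView itself.

```python
from collections import Counter


def caption_stats(captions_by_image):
    """Return (vocabulary size, longest caption in words, total captions)."""
    word_counts = Counter()
    max_len = 0
    total = 0
    for captions in captions_by_image.values():
        for caption in captions:
            words = caption.split()  # whitespace tokenization (an assumption)
            word_counts.update(words)
            max_len = max(max_len, len(words))
            total += 1
    return len(word_counts), max_len, total


# Toy data for illustration only (hypothetical ids and captions).
toy = {
    "img_001.jpg": ["একজন লোক গান গাইছে", "একজন লোক মাইক্রোফোনে গান গাইছে"],
    "img_002.jpg": ["একটি কুকুর দৌড়াচ্ছে"],
}
vocab_size, max_len, total = caption_stats(toy)
print(vocab_size, max_len, total)
```

The maximum caption length matters downstream because it typically sets the padded sequence length fed to the recurrent layer.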

Train-Test Splitting

Training set: 90% of the dataset.
Testing set: 10% of the dataset, which comes to over 3,000 images and over 15,000 captions.
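A 90/10 split of this kind can be sketched as follows. This is an illustration under assumptions, not the exact procedure used in the work: it splits by image id (so that all captions of an image land on the same side), uses a fixed seed, and runs on placeholder ids; `split_by_image` and the id format are hypothetical.

```python
import random


def split_by_image(image_ids, test_fraction=0.1, seed=42):
    """Shuffle image ids reproducibly and return (train_ids, test_ids)."""
    ids = sorted(image_ids)             # sort first so the result is order-independent
    random.Random(seed).shuffle(ids)    # local RNG, does not touch global random state
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Placeholder ids standing in for 31,000 images.
image_ids = [f"img_{i:05d}.jpg" for i in range(31000)]
train_ids, test_ids = split_by_image(image_ids)
print(len(train_ids), len(test_ids), len(test_ids) * 5)
```

With 31,000 images and 5 captions per image, a 10% test share gives 3,100 images and 15,500 captions, consistent with the figures quoted above.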

System Architecture

Images are passed through pretrained CNN models (EfficientNetB4 or ResNet-50) for feature extraction.
Captions are embedded into vectors and encoded with a Gated Recurrent Unit (GRU).
Features and embeddings are combined and processed through a dense neural layer to generate captions.
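The three steps above can be sketched as a single forward pass in NumPy. This is a simplified illustration, not the trained model: the weights are random, the CNN stage is replaced by a random feature vector, all layer sizes are toy values, and combining the two branches by addition is an assumption (concatenation is another common choice). The GRU uses the standard update-gate, reset-gate, and candidate-state formulation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the toy example is repeatable


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_model(feature_dim, vocab_size, embed_dim, hidden_dim):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    def mat(*shape):
        return rng.normal(0.0, 0.1, size=shape)

    gru = []
    for _ in range(3):  # update gate, reset gate, candidate state
        gru += [mat(hidden_dim, embed_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim)]
    return {
        "embedding": mat(vocab_size, embed_dim),
        "gru": gru,
        "W_img": mat(hidden_dim, feature_dim),
        "b_img": np.zeros(hidden_dim),
        "W_out": mat(vocab_size, hidden_dim),
        "b_out": np.zeros(vocab_size),
    }


def gru_encode(embedded, params):
    """Run a GRU over a (T, E) sequence and return the final hidden state (H,)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros(bz.shape[0])
    for x in embedded:
        z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h) + bh)   # candidate state
        h = (1.0 - z) * h + z * h_cand                 # blend old and new state
    return h


def next_word_probs(image_features, token_ids, model):
    """Probability of each vocabulary word being the next caption word."""
    img = np.maximum(0.0, model["W_img"] @ image_features + model["b_img"])  # dense + ReLU
    txt = gru_encode(model["embedding"][token_ids], model["gru"])            # caption branch
    merged = img + txt                                  # combine branches (assumed: addition)
    logits = model["W_out"] @ merged + model["b_out"]   # dense output layer
    exp = np.exp(logits - logits.max())                 # numerically stable softmax
    return exp / exp.sum()


model = init_model(feature_dim=16, vocab_size=10, embed_dim=8, hidden_dim=12)
features = rng.normal(size=16)                 # stand-in for CNN image features
probs = next_word_probs(features, np.array([1, 4, 7]), model)
print(probs.shape, int(np.argmax(probs)))
```

To generate a full caption, a step like this is usually repeated: the most probable word is appended to the partial caption and the pass is run again until an end token or the maximum caption length is reached.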

Training Performance

Training ran for 10 epochs on a CPU because GPU availability was limited.
Loss fell significantly and accuracy improved over the course of training.

Test Results

BLEU scores were computed on the unseen test data to measure caption quality. How they compare with other works is discussed in the Comparative Study section.
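For readers unfamiliar with the metric, BLEU measures how many n-grams of a generated caption also appear in the reference captions, with a penalty for captions that are too short. Below is a minimal, unsmoothed sentence-level sketch written from the standard definition; the talk does not state which BLEU variant or tool was used, so this should not be read as the evaluation code behind the reported scores. The function names and the example captions are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU for one tokenized candidate against tokenized references."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip each candidate n-gram count by its maximum count in any reference.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "একজন লোক মাইক্রোফোনে গান গাইছে".split()
exact = sentence_bleu(reference, [reference])                       # identical caption
shorter = sentence_bleu("একজন লোক গান গাইছে".split(), [reference], max_n=2)
print(exact, round(shorter, 4))
```

An identical caption scores 1.0, while the shorter caption loses points both for the missing bigrams and through the brevity penalty. Published results are normally corpus-level scores aggregated over the whole test set rather than averages of sentence-level scores.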

Examples of Generated Captions

For an image from the Flickr30k dataset:

EfficientNetB4 Output: “একজন লোক মাইক্রোফোনে গান গাইছে” (A man is singing into a microphone).
ResNet-50 Output: The model also generated an accurate caption for this image.

Comparative Study

Our test scores fall in the mid-range compared to other works. However, two constraints should be kept in mind:

Only 10 epochs of training were performed.
Training on the large BanglaView dataset took ~30 hours on a CPU.

Applications

This work has several applications, including:

Assisting visually impaired Bengali speakers.
Enhancing photo search capabilities.
Enabling robot interactions using Bengali captions.

Sajeeb Kumar Ray