Region Captioning

Overview

This project implements a Region Captioning System that bridges the gap between Computer Vision and Natural Language Processing using Deep Learning.

Unlike traditional image captioning models that describe an entire image with a single sentence, Region Captioning focuses on generating fine-grained, natural language descriptions for specific regions within an image, offering a more localized and contextual understanding.

Approach

The project follows a progressive learning and implementation pipeline:

  1. Theoretical Learning: Studied Neural Networks, CNNs, RNNs, and multimodal architectures.
  2. Region Proposal: Used Faster R-CNN to detect regions of interest (bounding boxes); see the sketch after this list.
  3. Feature Extraction (Visual Encoder): Implemented a VGG-19-based CNN to extract region-level features from each detected bounding box.
  4. Text Encoding (Language Encoder): Built an RNN/GRU-based model to convert textual captions into numerical embeddings.
  5. Multimodal Embedding Space: Mapped both image and text features into a shared vector space where similar image–text pairs lie close together.
  6. Alignment Training: Used a max-margin ranking loss to align image regions with corresponding caption fragments.
  7. Generative Extension: Leveraged the learned embeddings to train a caption generator that produces region-wise captions for unseen images.
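As a concrete illustration of the region-proposal step, here is a minimal sketch using a pre-trained Faster R-CNN from torchvision; the image path, score threshold, and ResNet-50 FPN backbone are illustrative assumptions rather than the project's exact configuration.

```python
# Sketch: region proposal with a pre-trained Faster R-CNN (torchvision).
# The image path and the 0.7 score threshold are illustrative choices.
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("example.jpg").convert("RGB")    # hypothetical image path
tensor = F.to_tensor(image)                         # C x H x W float tensor in [0, 1]

with torch.no_grad():
    detections = model([tensor])[0]                 # dict with boxes, labels, scores

keep = detections["scores"] > 0.7                   # keep confident detections only
regions = detections["boxes"][keep]                 # (N, 4) boxes as (x1, y1, x2, y2)
```

Each kept box can then be cropped from the image and passed to the visual encoder described below.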

Dataset and Preprocessing

Source: Flickr30k Entities (an extension of Flickr30k that adds region-level bounding boxes linked to caption phrases)

The Flickr30k dataset contains about 31,000 real-world images, each paired with 5 human-written English captions describing the scene.

Preprocessing Pipeline

1. Parsing

  • Extracted image paths, captions, and bounding box coordinates from annotations.
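A hedged sketch of what this parsing step might look like for the Flickr30k Entities annotations, which store bounding boxes in PASCAL VOC-style XML files; the exact file layout and tag names are assumptions about the public release of the dataset.

```python
# Sketch: reading bounding boxes from a Flickr30k Entities-style XML file.
# The annotation path and tag names are assumptions about the dataset layout.
import xml.etree.ElementTree as ET

def parse_boxes(xml_path):
    """Return a list of (entity_id, [xmin, ymin, xmax, ymax]) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        entity_id = obj.findtext("name")            # phrase/entity identifier
        bndbox = obj.find("bndbox")
        if bndbox is None:                          # some phrases have no box
            continue
        coords = [int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((entity_id, coords))
    return boxes
```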

2. Image Preprocessing

  • Resized and padded images for uniform input size.
  • Applied random cropping, rotation, and color jitter for data augmentation.
  • Converted images to tensors normalized to the range [-1, 1].
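A minimal sketch of such a preprocessing and augmentation pipeline with torchvision transforms; the crop size, rotation angle, and jitter strengths below are illustrative defaults, not the project's tuned values.

```python
# Sketch: image preprocessing and augmentation with torchvision transforms.
# Sizes and augmentation strengths are illustrative assumptions.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                        # resize shorter side
    transforms.RandomCrop(224, padding=4),         # pad + random crop
    transforms.RandomRotation(10),                 # small random rotation
    transforms.ColorJitter(0.2, 0.2, 0.2),         # brightness / contrast / saturation
    transforms.ToTensor(),                         # tensor in [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],     # rescale [0, 1] -> [-1, 1]
                         std=[0.5, 0.5, 0.5]),
])
```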

3. Caption Preprocessing

  • Tokenized sentences using nltk.
  • Constructed a vocabulary with <pad>, <start>, <end>, and <unk> special tokens.
  • Created word–index mappings and saved as vocab.pkl.
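A sketch of this tokenization and vocabulary-building step; the build_vocab helper, the min_count threshold, and the pickled dictionary layout are illustrative assumptions.

```python
# Sketch: caption tokenization with nltk and vocabulary construction.
# build_vocab, min_count, and the vocab.pkl layout are illustrative.
import pickle
from collections import Counter
import nltk   # requires: nltk.download("punkt")

SPECIALS = ["<pad>", "<start>", "<end>", "<unk>"]

def build_vocab(captions, min_count=4):
    counter = Counter()
    for caption in captions:
        counter.update(nltk.tokenize.word_tokenize(caption.lower()))
    words = [w for w, c in counter.items() if c >= min_count]
    word2idx = {w: i for i, w in enumerate(SPECIALS + words)}   # <pad> gets index 0
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

# word2idx, idx2word = build_vocab(all_captions)   # all_captions: list of strings
# with open("vocab.pkl", "wb") as f:
#     pickle.dump({"word2idx": word2idx, "idx2word": idx2word}, f)
```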

4. Data Loader

  • Returned paired (image, caption) tensors.
  • Implemented collate_fn to handle variable-length captions dynamically during batching.
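A sketch of such a collate_fn, assuming each dataset item is an (image_tensor, caption_id_tensor) pair and that <pad> has index 0.

```python
# Sketch: collate_fn that pads variable-length captions within a batch.
# Assumes samples are (image, caption_ids) pairs and <pad> has index 0.
import torch

def collate_fn(batch):
    batch.sort(key=lambda pair: len(pair[1]), reverse=True)   # longest captions first
    images, captions = zip(*batch)

    images = torch.stack(images, dim=0)                       # (B, C, H, W)
    lengths = [len(cap) for cap in captions]
    padded = torch.zeros(len(captions), max(lengths), dtype=torch.long)
    for i, cap in enumerate(captions):
        padded[i, : lengths[i]] = cap                          # copy real tokens
    return images, padded, lengths

# loader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate_fn)
```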

Model Architecture

Alignment Model

A. Encoder (AlignmentEncoderCNN)

  • Based on VGG-19, pre-trained on ImageNet.
  • Extracts deep region-level feature maps.
  • Uses Adaptive Average Pooling (7×7) for uniform feature size.
  • Passes pooled features through an FC layer → embedding vector.
  • L2 normalization ensures fair similarity comparison with text vectors.
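A minimal sketch of an encoder along these lines; the embedding size and the exact layer boundaries are assumptions, not the project's exact implementation.

```python
# Sketch: VGG-19 region encoder with adaptive pooling, FC projection, and L2 norm.
# embed_size and the layer split are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class AlignmentEncoderCNN(nn.Module):
    def __init__(self, embed_size=256):
        super().__init__()
        vgg = torchvision.models.vgg19(pretrained=True)
        self.features = vgg.features                   # convolutional backbone
        self.pool = nn.AdaptiveAvgPool2d((7, 7))       # uniform 7x7 feature maps
        self.fc = nn.Linear(512 * 7 * 7, embed_size)   # project to the shared space

    def forward(self, regions):                        # regions: (B, 3, H, W) crops
        x = self.features(regions)
        x = self.pool(x).flatten(1)
        x = self.fc(x)
        return F.normalize(x, p=2, dim=1)              # L2-normalize the embeddings
```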

B. Decoder (AlignmentDecoderRNN)

  • Uses Embedding + GRU + Linear layer.
  • Encodes caption sequences into text embeddings.
  • The final GRU hidden state represents the sentence meaning.
  • Also L2-normalized for cosine similarity computation.
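A matching sketch of the GRU-based text encoder; the vocabulary and hidden sizes, and the assumption that captions arrive sorted by decreasing length (as in the collate_fn sketch), are illustrative.

```python
# Sketch: Embedding + GRU + Linear text encoder whose final hidden state
# becomes the L2-normalized sentence embedding. Sizes are illustrative.
import torch.nn as nn
import torch.nn.functional as F

class AlignmentDecoderRNN(nn.Module):
    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.gru = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, embed_size)        # map to the shared space

    def forward(self, captions, lengths):                   # captions: (B, T) token ids
        packed = nn.utils.rnn.pack_padded_sequence(
            self.embed(captions), lengths, batch_first=True)   # skip <pad> positions
        _, hidden = self.gru(packed)                         # hidden: (1, B, hidden_size)
        sentence = self.fc(hidden.squeeze(0))
        return F.normalize(sentence, p=2, dim=1)             # L2-normalize
```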

Intuitive Explanation

The alignment model learns a shared space in which visually similar regions and semantically similar text lie close together. If a caption fragment describes a region (e.g., “a brown dog”), the region embedding and the text embedding end up close to each other in this space.

Mathematical Formulation

Image–Sentence Matching Score:

$$S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^{\top} s_t$$

Where:

  • v_i: Image region vector (region i of image k)
  • s_t: Word or phrase vector (fragment t of sentence l)
  • g_k, g_l: The sets of region and sentence fragments for image k and sentence l
  • S_kl: Similarity between image k and sentence l

Loss Function (Max-Margin Ranking Loss):

$$\mathcal{C}(\theta) = \sum_{k} \Big[ \sum_{l} \max(0,\, S_{kl} - S_{kk} + \Delta) + \sum_{l} \max(0,\, S_{lk} - S_{kk} + \Delta) \Big]$$

Here Δ is the margin. The loss pushes the score of each correct image–caption pair (S_kk) above the scores of all mismatched pairs by at least the margin.
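In code, the same loss can be sketched over a batch score matrix S of shape (B, B), where S[k, k] holds the matched-pair scores; the margin value below is illustrative.

```python
# Sketch: max-margin ranking loss over an image-sentence score matrix.
# scores[k, l] = similarity of image k and sentence l; margin is illustrative.
import torch

def ranking_loss(scores, margin=0.1):
    diag = scores.diag().view(-1, 1)                       # S_kk, matched-pair scores
    cost_s = (margin + scores - diag).clamp(min=0)         # wrong sentences for image k
    cost_im = (margin + scores - diag.t()).clamp(min=0)    # wrong images for sentence l
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0)                   # ignore the correct pairs
    cost_im = cost_im.masked_fill(mask, 0)
    return cost_s.sum() + cost_im.sum()

# image_emb, text_emb: L2-normalized (B, D) outputs of the two encoders
# loss = ranking_loss(image_emb @ text_emb.t())
```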

Applications

| Domain | Application |
| --- | --- |
| Autonomous Systems | Scene understanding for perception & navigation |
| Medical Imaging | Describing localized anatomical regions |
| Surveillance | Detecting and describing activities in CCTV frames |
| Image Retrieval | Text-based search for image regions |
| Assistive Tech | Scene narration for visually impaired users |

Results

Team Members

  • Aman Karki
  • Ritika Varshney
  • Nidhi Soni

Mentors

  • Shaan Vora
  • Rajas Daryapurkar
  • Roushni Sareen
  • Devansh Palan
  • Sameh Nadeem