Summarizing Video Content with a Single Image Using Large Language Models

EasyChair Preprint 14970

6 pages · Date: September 21, 2024

Abstract

Generating thumbnails for news videos plays an important role in efficiently conveying their content. Prior techniques mostly handle this task by selecting one keyframe as a representative image. However, this approach cannot effectively handle a video whose key content is distributed across multiple frames. In this paper, we propose summarizing a news video by composing its key contents into a single image that serves as a thumbnail. To achieve this, our method first extracts text from each scene in the video using OCR, speech recognition, and existing image captioning models. We then group these texts by similarity and leverage large language models to score the significance of each group. Next, for each group, a keyframe is selected by jointly considering importance and content quality. Finally, we compose the objects in these keyframes into a single image in a non-overlapping manner and apply diffusion-based generative models for further quality refinement. Experiments on real-world news videos demonstrate that our method effectively extracts key video content and generates natural and informative thumbnails.

Keyphrases: large language models, video semantic analysis, video thumbnail generation

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:14970,
  author    = {Shutaro Tamada and Chunzhi Gu and Shigeru Kuriyama},
  title     = {Summarizing Video Content with a Single Image Using Large Language Models},
  howpublished = {EasyChair Preprint 14970},
  year      = {EasyChair, 2024}}