Building a Video Summarizer Using AI

With recent advances in AI technology, such as the models offered by OpenAI, it has become possible to automate tasks that were previously too tedious to perform manually. One example is a video summarizer. Summarizing video content used to rely mainly on human vision and hearing; with AI models such as GPT-4o and Whisper, this task can now be automated.

We will use the following technologies: OpenAI, Node.js, React, and GridDB. This blog will teach you how to create a basic web application for uploading a video and receiving a summary of its content.

Getting Started

Source code can be found here:

$ git clone https://github.com/griddbnet/Blogs --branch voice_summarizer

This project runs on Ubuntu 20.04 LTS. These are the mandatory stack requirements for this project:

OpenAI Key

To access any OpenAI services, we need a valid key. Go to this link and create a new OpenAI key.

OpenAI keys are issued on a per-project basis, so first create a project in the OpenAI platform and enable the models that the project will use. For this project, we need the gpt-4o and whisper models.

The OpenAI key will be saved in the .env file; make sure it is not committed to version control by adding .env to .gitignore.

Node.js

This project runs on the Node.js platform. You need to install it from here. For this project, we will use the nvm version manager and Node.js v16.20.2 LTS.

# installs nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

# download and install Node.js
nvm install 16

# verifies the right Node.js version is in the environment
node -v # should print `v16.20.2`

# verifies the right NPM version is in the environment
npm -v # should print `8.19.4`

To connect Node.js to the GridDB database, you need the griddb-node-api npm package, which is a Node.js binding developed using the GridDB C Client and the Node Addon API.

FFmpeg

This project utilizes the fluent-ffmpeg npm package, which requires FFmpeg to be installed on the system. For Ubuntu, you can use the following command to install it:

sudo apt update
sudo apt install ffmpeg

For more installation information, please go to the FFmpeg official website.

GridDB

To save the video summary and video data, we will use the GridDB database. Please refer to the guide for detailed installation instructions. We will use Ubuntu 20.04 LTS here.

Run GridDB and check if the service is running. Use this command:

sudo systemctl status gridstore

If it is not running, start the database with this command:

sudo systemctl start gridstore

Run the Project

Pull the Source Code

To run the project, you need to clone the code from this repository.

Run this command to pull the source code.

git clone https://github.com/junwatu/video-summarizer-nodejs-griddb.git

Change the directory to the app folder and install any project dependencies using this command:

cd video-summarizer-nodejs-griddb
npm install

Setup .env

This project needs a few environment variables. Copy the env.example file to the .env file.

cp env.example .env

You need to fill in the environment variables in this file:

OPENAI_API_KEY=sk-....
VITE_API_URL=http://localhost:3000

It is important to note that the project must be restarted every time you change the VITE_API_URL environment variable.
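On the client side, Vite exposes environment variables prefixed with VITE_ through import.meta.env at build time. A minimal sketch of how the React client might read this value (the exact client code in the project may differ):

// Base URL of the backend API, injected by Vite at build time
const API_URL = import.meta.env.VITE_API_URL; // e.g. http://localhost:3000

// Build the URL of the summaries route described later in this post
const summariesUrl = `${API_URL}/summaries`;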

Start the Project

Run this command to run the project:

npm run start:build

Open Web App

Open the web app in a browser. The default URL is http://localhost:3000. Upload any video; a short video is recommended for fast processing. You can also use the video in the test/video folder. Depending on the video duration, processing may take a minute or more.

How It Works

The user flow for this web app is simple: open the app, upload a video, wait for processing, and receive the summary result. The app uses the GPT-4o and Whisper models from OpenAI to summarize the uploaded video. Two models are needed because OpenAI models cannot process video directly; they can, however, process images and audio files. On the Node.js side, we use the fluent-ffmpeg npm package to split the video into image frames and an audio file.

These are the primary preparation steps for a video before we feed it into the OpenAI models for summarization.
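Before walking through each step, here is a simplified sketch of how the functions described below fit together; the file paths and variable names are illustrative, not the project's exact code:

// 1. Sample frames from the video (here, one frame every 2 seconds)
const framePaths = await extractFrames(videoPath, 2, './frames');
// 2. Encode each frame as base64 so it can be sent to GPT-4o
const base64Frames = await Promise.all(framePaths.map(imageToBase64));
// 3. Extract the audio track as MP3
await extractAudio(videoPath, './audio/audio.mp3');
// 4. Transcribe the audio with Whisper
const audioTranscription = await transcribeAudio('./audio/audio.mp3');
// 5. Summarize frames plus transcription with GPT-4o
const summaryResponse = await createVideoSummarization(base64Frames, audioTranscription);
// The summary text is in summaryResponse.choices[0].message.content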

1. Video Processing

While it’s not possible to directly send a video to the API, GPT-4o can understand videos if you sample frames and then provide them as images. It performs better at this task than the earlier GPT-4 Turbo model.

This function, extractFrames(), will extract images from the video file and save them in the frames folder.

import fs from 'fs'
import path from 'path'
import ffmpeg from 'fluent-ffmpeg'

// Function to extract image frames from a video at a fixed interval
export function extractFrames(videoPath, secondsPerFrame, outputFolder) {
    return new Promise((resolve, reject) => {
        const frameRate = 1 / secondsPerFrame
        const framePattern = path.join(outputFolder, 'frame-%03d.png')

        ffmpeg(videoPath)
            .outputOptions([`-vf fps=${frameRate}`])
            .output(framePattern)
            .on('end', () => {
                fs.readdir(outputFolder, (err, files) => {
                    if (err) {
                        reject(err)
                    } else {
                        const framePaths = files.map(file => path.join(outputFolder, file))
                        resolve(framePaths)
                    }
                })
            })
            .on('error', reject)
            .run()
    })
}

In the extractFrames function, the secondsPerFrame parameter defines how many seconds should elapse between each frame extracted from the video; a short usage sketch follows the list below.

Here’s how it works:

Frame Rate Calculation: The frame rate is calculated as the reciprocal of secondsPerFrame, i.e., frameRate = 1 / secondsPerFrame. This means:

  • If secondsPerFrame is 1, the frame rate is 1 frame per second.
  • If secondsPerFrame is 0.5, the frame rate is 2 frames per second.
  • If secondsPerFrame is 2, the frame rate is 0.5 frames per second (one frame every 2 seconds).
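For example, with a hypothetical input path, these calls show how the sampling interval maps to the fps value passed to ffmpeg:

await extractFrames('./uploads/sample.mp4', 1, './frames')   // -vf fps=1   (1 frame per second)
await extractFrames('./uploads/sample.mp4', 0.5, './frames') // -vf fps=2   (2 frames per second)
await extractFrames('./uploads/sample.mp4', 2, './frames')   // -vf fps=0.5 (1 frame every 2 seconds)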

2. Image Processing

The GPT-4o model can directly process images and take intelligent actions based on the image. We can provide images in two formats:

  • Base64 Encoded
  • URL

In this project, we will use base64 encoding for the images. The function imageToBase64() reads an image file and converts it into a base64-encoded string.

// Read an image file and return its contents as a base64-encoded string
export function imageToBase64(imagePath) {
    return new Promise((resolve, reject) => {
        fs.readFile(imagePath, (err, data) => {
            if (err) {
                reject(err)
            } else {
                const base64String = data.toString('base64')
                resolve(base64String)
            }
        })
    })
}
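For instance, a single extracted frame can be converted and then embedded as a data URL when it is sent to the model (the frame path here is illustrative):

const base64Frame = await imageToBase64('./frames/frame-001.png')
// Used later as: { type: 'image_url', image_url: { url: `data:image/png;base64,${base64Frame}`, detail: 'low' } }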

3. Audio Extraction

For better summarization context, we also provide the audio to the OpenAI model. To extract audio from the video, we again use the fluent-ffmpeg npm package. The resulting audio is saved in MP3 format in the audio directory.

// Function to extract audio from video
export function extractAudio(videoPath, audioPath) {
    return new Promise((resolve, reject) => {
        ffmpeg(videoPath)
            .output(audioPath)
            .audioBitrate('32k')
            .on('end', resolve)
            .on('error', reject)
            .run()
    })
}
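A hypothetical call looks like this; fluent-ffmpeg infers the MP3 output format from the file extension of audioPath:

await extractAudio('./uploads/sample.mp4', './audio/sample.mp3')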

4. Audio Transcription

After extracting the audio, we need to transcribe it into text using the speech-to-text model Whisper.

// Transcribe an audio file to text with the whisper-1 model
// (`openai` is the client instance shown in the next section)
async function transcribeAudio(filePath) {
    try {
        const transcription = await openai.audio.transcriptions.create({
            file: fs.createReadStream(filePath),
            model: 'whisper-1'
        })
        return transcription.text
    } catch (error) {
        throw new Error(`Transcription failed: ${error.message}`)
    }
}

The transcribeAudio() function transcribes an audio file to text using the whisper-1 AI model. For more information about how this speech-to-text model works, please read here.
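A hypothetical call, using the audio file extracted earlier; note that the function wraps any API error in its own message:

try {
    const audioTranscription = await transcribeAudio('./audio/sample.mp3')
    console.log(audioTranscription)
} catch (error) {
    console.error(error.message) // "Transcription failed: ..."
}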

The code for video processing, image processing, and audio extraction can be found in the file libs/videoProcessing.js.

Video Summarization Process

The video summary is created by inputting both the visual and audio transcription elements of the video into the model simultaneously. By providing both of these inputs, the model is expected to produce a more accurate summary as it can perceive the entire video at once.

// Generate a summary with visual and audio transcription
import OpenAI from "openai";

const openai = new OpenAI({
    // eslint-disable-next-line no-undef
    apiKey: process.env.OPENAI_API_KEY
});

async function createVideoSummarization(frames, audioTranscription) {
    const frameObjects = frames.map(frame => ({
        type: 'image_url',
        image_url: {
            // Frames are extracted as PNG files, so send them as PNG data URLs
            url: `data:image/png;base64,${frame}`,
            detail: 'low'
        }
    }));

    const response = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [
            {
                role: "system",
                content: "You are generating a video summary. Please provide a summary of the video. Respond in Markdown."
            },
            {
                role: "user",
                content: [
                    {
                        type: 'text',
                        text: "These are the frames from the video."
                    },
                    ...frameObjects,
                    {
                        type: 'text',
                        text: `The audio transcription is: ${audioTranscription}`
                    }
                ]
            }
        ],
        temperature: 0
    });
    console.log(response.choices[0].message.content);
    return response;
}

export { createVideoSummarization }

The content parameter is an array that may contain text and images. The text prompt asks the model to summarize the video, the image frames supply the visual content, and the audio transcription provides additional context. You can look into the Chat API documentation for more information about the parameters.

Save Video Summary to GridDB

The GridDB database is used to store the video summary, video file path, and audio transcription. The code that saves this data resides in the griddbservices.js file.

export async function saveData({ filename, audioTranscription, summary }) {
    const id = generateRandomID();
    const videoFilename = String(filename);
    const audioToText = String(audioTranscription);
    const videoSummary = String(summary);

    const packetInfo = [parseInt(id), videoFilename, audioToText, videoSummary];
    const saveStatus = await GridDB.insert(packetInfo, collectionDb);
    return saveStatus;
}

There are three important fields here, which are:

Parameter           Type     Description
filename            String   The name of the video file.
audioTranscription  String   The transcription of the audio from the video.
summary             String   A summary of the video content.

The saveData function is a wrapper for saving data to GridDB. The actual code that writes to the database is in the libs/griddb.cjs file.
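The details of libs/griddb.cjs are not shown here, but a minimal sketch of what the connection and insert might look like with griddb-node-api follows; the connection settings and container name are assumptions for a local GridDB installation, not the project's exact values:

const griddb = require('griddb-node-api');

// Connect to a local GridDB cluster (assumed FIXED_LIST settings)
const factory = griddb.StoreFactory.getInstance();
const store = factory.getStore({
    notificationMember: '127.0.0.1:10001',
    clusterName: 'myCluster',
    username: 'admin',
    password: 'admin'
});

// Container schema mirroring the fields passed by saveData()
const conInfo = new griddb.ContainerInfo({
    name: 'videoSummaries', // assumed container name
    columnInfoList: [
        ['id', griddb.Type.INTEGER],
        ['videoFilename', griddb.Type.STRING],
        ['audioToText', griddb.Type.STRING],
        ['videoSummary', griddb.Type.STRING]
    ],
    type: griddb.ContainerType.COLLECTION,
    rowKey: true
});

// Insert one row, e.g. [id, videoFilename, audioToText, videoSummary]
async function insert(row, containerInfo) {
    const container = await store.putContainer(containerInfo);
    await container.put(row);
    return { status: 'ok' };
}

module.exports = { insert, conInfo };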

Get All Summaries

All summary data can be accessed via the /summaries route. The default URL is:

http://localhost:3000/summaries

The response is JSON data, which is very easy to process on the client if you need further features or enhancements for the project.
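For example, a client can fetch the stored summaries like this (a minimal sketch; the exact field names depend on the server response):

const response = await fetch('http://localhost:3000/summaries');
const summaries = await response.json();
console.log(summaries); // array of saved video summaries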

Limitation

This project is a prototype and has been tested only with MP4 videos no longer than 5 minutes.

If you have any questions about the blog, please create a Stack Overflow post here: https://stackoverflow.com/questions/ask?tags=griddb. Make sure you use the “griddb” tag so our engineers can quickly reply to your questions.
