Voice-Based Image Generation Using Imagen 4 and ElevenLabs

Project Overview

A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can record their voice describing an image they want to create, and the system will transcribe their speech and generate a corresponding image.

What Problem We Solved

Traditional image generation tools require users to type detailed prompts, which can be:

  • Time-consuming for complex descriptions.
  • Limiting for users with typing difficulties.
  • Less natural than speaking.

This project makes AI image generation more accessible through voice interaction.

Architecture & Tech Stack

This diagram shows a pipeline for converting speech into images and storing the result in the GridDB database:

  1. User speaks into a microphone.
  2. Speech recording captures the audio.
  3. Audio is sent to ElevenLabs (Scriber-1) for speech-to-text transcription.
  4. The transcribed text becomes a prompt for Imagen 4 API, which generates an image.
  5. Three pieces of data are saved into the database:
    • The audio reference
    • The generated image
    • The prompt text

Frontend Stack

In this project, we will use Next.js as the frontend framework.

Backend Stack

There is no dedicated backend code because all the services used are external APIs. There are three main APIs:

1. Speech-to-Text API

This project utilizes the ElevenLabs API for speech-to-text transcription. The API endpoint is https://api.elevenlabs.io/v1/speech-to-text. ElevenLabs also provides a JavaScript SDK for easier API integration. You can see the SDK documentation for more details.

2. Image Generation API

This project uses the Imagen 4 API from fal. The model is hosted at https://fal.ai/models/fal-ai/imagen4/preview. Fal provides a JavaScript SDK for easier API integration. You can see the SDK documentation for more details.
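
Both SDKs can be installed with npm. The package names below are the ones currently published by each provider and may change over time; check the SDK documentation if installation fails:

npm install @elevenlabs/elevenlabs-js @fal-ai/client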

3. Database API

We will use the GridDB Cloud version in this project, so there is no need for a local installation. Please read the next section for how to set up GridDB Cloud.

Prerequisites

Node.js

This project is built using Next.js, which requires Node.js version 18 or higher. You can download and install Node.js from https://nodejs.org/en.

GridDB

Sign Up for GridDB Cloud Free Plan

If you would like to sign up for a GridDB Cloud Free instance, you can do so at the following link: https://form.ict-toshiba.jp/download_form_griddb_cloud_freeplan_e.

After successfully signing up, you will receive a free instance along with the necessary details to access the GridDB Cloud Management GUI, including the GridDB Cloud Portal URL, Contract ID, Login, and Password.

GridDB WebAPI URL

Go to the GridDB Cloud Portal and copy the WebAPI URL from the Clusters section.

GridDB Username and Password

Go to the GridDB Users section of the GridDB Cloud portal and create or copy the username for GRIDDB_USERNAME. The password is set when the user is first created; use this as the GRIDDB_PASSWORD.

For more details on getting started with GridDB Cloud, please follow this quick start guide.

IP Whitelist

When running this project, please ensure that the IP address where the project is running is whitelisted. Otherwise, requests will fail with a 403 Forbidden status code.

You can use a website like What Is My IP Address to find your public IP address.

To whitelist the IP, go to the GridDB Cloud Admin and navigate to the Network Access menu.
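
After whitelisting your IP, you can verify that the WebAPI URL and credentials work. Below is a minimal TypeScript sketch that assumes the GRIDDB_WEBAPI_URL, GRIDDB_USERNAME, and GRIDDB_PASSWORD values described in this section and calls the Web API's checkConnection endpoint:

// Minimal connectivity check (sketch); run with the environment variables set
const baseUrl = process.env.GRIDDB_WEBAPI_URL!;
const auth = Buffer.from(
  `${process.env.GRIDDB_USERNAME}:${process.env.GRIDDB_PASSWORD}`
).toString('base64');

const response = await fetch(`${baseUrl}/checkConnection`, {
  headers: { Authorization: `Basic ${auth}` }
});

// 200 means the credentials and IP whitelist are in order;
// 403 usually means the current IP has not been whitelisted yet.
console.log('GridDB connection status:', response.status);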

ElevenLabs

You need an ElevenLabs account and API key to use this project. You can sign up for an account at https://elevenlabs.io/signup.

After signing up, go to the Account section, and create and copy your API key.

Imagen 4 API

You need a fal.ai API key to use the Imagen 4 API in this project. You can sign up for an account at https://fal.ai.

After signing up, go to the Account section, and create and copy your API key.

How to Run

1. Clone the repository

Clone the repository from https://github.com/junwatu/speech-image-gen to your local machine.

git clone https://github.com/junwatu/speech-image-gen.git
cd speech-image-gen
cd apps

2. Install dependencies

Install all project dependencies using npm.

npm install

3. Set up environment variables

Copy the .env.example file to .env and fill in the values:

# Copy this file to .env.local and add your actual API keys
# Never commit .env.local to version control

# Fal.ai API Key for Imagen 4
# Get your key from: https://fal.ai/dashboard
FAL_KEY=

# ElevenLabs API Key for Speech-to-Text
# Get your key from: https://elevenlabs.io/app/speech-synthesis
ELEVENLABS_API_KEY=

GRIDDB_WEBAPI_URL=
GRIDDB_PASSWORD=
GRIDDB_USERNAME=

Please see the Prerequisites section before running the project.

4. Run the project

Run the project using the following command:

npm run dev

5. Open the application

Open the application in your browser at http://localhost:3000. You also need to allow the browser to access your microphone.

Implementation Details

Speech Recording

The user speaks into the microphone and the audio is recorded, then sent to the ElevenLabs API for speech-to-text transcription. Please note that only English is supported.

The code to save the recording file is in the main page.tsx. It uses the browser's native HTML5 MediaRecorder API to record the audio. Below is the code snippet:

const startRecording = useCallback(async () => {
  try {
    setError(null);
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
        sampleRate: 44100
      }
    });

    // Try different MIME types based on browser support
    let mimeType = 'audio/webm;codecs=opus';
    if (!MediaRecorder.isTypeSupported(mimeType)) {
      mimeType = 'audio/webm';
      if (!MediaRecorder.isTypeSupported(mimeType)) {
        mimeType = 'audio/mp4';
        if (!MediaRecorder.isTypeSupported(mimeType)) {
          mimeType = ''; // Let browser choose
        }
      }
    }

    const mediaRecorder = new MediaRecorder(stream, {
      ...(mimeType && { mimeType })
    });

    mediaRecorderRef.current = mediaRecorder;
    audioChunksRef.current = [];
    recordingStartTimeRef.current = Date.now();

    mediaRecorder.ondataavailable = (event) => {
      if (event.data.size > 0) {
        audioChunksRef.current.push(event.data);
      }
    };

    mediaRecorder.onstop = async () => {
      const duration = Date.now() - recordingStartTimeRef.current;
      const audioBlob = new Blob(audioChunksRef.current, {
        type: mimeType || 'audio/webm'
      });
      const audioUrl = URL.createObjectURL(audioBlob);

      const recording: AudioRecording = {
        blob: audioBlob,
        url: audioUrl,
        duration,
        timestamp: new Date()
      };

      setCurrentRecording(recording);
      await transcribeAudio(recording);
      stream.getTracks().forEach(track => track.stop());
    };

    mediaRecorder.start(1000); // Collect data every second
    setIsRecording(true);
  } catch (error) {
    setError('Failed to access microphone. Please check your permissions and try again.');
  }
}, []);

The audio processing flow is as follows:

  1. User clicks record button → startRecording() is called.
  2. Requests microphone access via getUserMedia().
  3. Creates MediaRecorder with optimal settings.
  4. Collects audio data in chunks.
  5. When stopped, creates an audio blob and triggers transcription.

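A matching stop handler is also needed; below is a minimal sketch, assuming the same mediaRecorderRef and setIsRecording state used in startRecording() above:

const stopRecording = useCallback(() => {
  const mediaRecorder = mediaRecorderRef.current;
  if (mediaRecorder && mediaRecorder.state !== 'inactive') {
    // This fires the onstop handler above, which builds the blob
    // and triggers transcription.
    mediaRecorder.stop();
  }
  setIsRecording(false);
}, []);
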
The audio data is saved in the public/uploads/audio folder. Below is the code snippet that saves the audio file:

import { mkdir, writeFile } from 'fs/promises';
import { join } from 'path';
// generateRandomID() is a small helper defined elsewhere in the project

export async function saveAudioToFile(audioBlob: Blob, extension: string = 'webm'): Promise<string> {
  // Create uploads directory if it doesn't exist
  const uploadsDir = join(process.cwd(), 'public', 'uploads', 'audio');
  await mkdir(uploadsDir, { recursive: true });

  // Generate unique filename
  const filename = `${generateRandomID()}.${extension}`;
  const filePath = join(uploadsDir, filename);

  // Convert blob to buffer and save file
  const arrayBuffer = await audioBlob.arrayBuffer();
  const buffer = Buffer.from(arrayBuffer);
  await writeFile(filePath, buffer);

  // Return relative path for storage in database
  return `/uploads/audio/${filename}`;
}

The full code for the saveAudioToFile() function is in the app/lib/audio-storage.ts file.
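
As a usage illustration, a server-side handler that receives the uploaded recording could call it like this. The field name and extension logic here are assumptions; the actual wiring lives in the repository:

// Sketch: receiving the uploaded recording in a route handler (names are illustrative)
const formData = await request.formData();
const audioFile = formData.get('audio') as File;

// Derive the file extension from the MIME type chosen by MediaRecorder
const extension = audioFile.type.includes('mp4') ? 'mp4' : 'webm';

// A File is also a Blob, so it can be passed directly;
// only the returned relative path is stored in the database
const audioPath = await saveAudioToFile(audioFile, extension);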

Speech to Text Transcription

The recorded audio is sent to the ElevenLabs API for speech-to-text transcription. The code that sends the audio to the ElevenLabs API is in the transcribeAudio() function. The full code is in the lib/elevenlabs-client.ts file.

// Main transcription function
export async function transcribeAudio(
  client: ElevenLabsClient,
  audioBuffer: Buffer,
  modelId: ElevenLabsModel = ELEVENLABS_MODELS.SCRIBE_V1
) {
  try {
    const result = await client.speechToText.convert({
      audio: audioBuffer,
      model_id: modelId,
    }) as TranscriptionResponse;

    return {
      success: true,
      text: result.text,
      language_code: result.language_code,
      language_probability: result.language_probability,
      words: result.words || [],
      additional_formats: result.additional_formats || []
    };
  } catch (error) {
    console.error('ElevenLabs transcription error:', error);
    return {
      success: false,
      error: error instanceof Error ? error.message : 'Unknown error'
    };
  }
}

Transcription Route

The transcribeAudio() function is called when the /api/transcribe route is accessed. This route only accepts the POST method and processes the audio file sent in the request body. The ELEVENLABS_API_KEY environment variable from the .env file is used in the route to initialize the ElevenLabs client.

import { NextRequest, NextResponse } from 'next/server';
// ElevenLabs JS SDK; the package name depends on the SDK version in use
import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js';

export async function POST(request: NextRequest) {
  // Get audio file from form data
  const formData = await request.formData();
  const audioFile = formData.get('audio') as File;

  // Convert to buffer
  const arrayBuffer = await audioFile.arrayBuffer();
  const audioBuffer = Buffer.from(arrayBuffer);

  // Initialize ElevenLabs client with the API key from the environment
  const apiKey = process.env.ELEVENLABS_API_KEY;
  const elevenlabs = new ElevenLabsClient({ apiKey });

  // Convert audio to text
  const result = await elevenlabs.speechToText.convert({
    file: new Blob([audioBuffer]),
    modelId: "scribe_v1",
    languageCode: "en",
    tagAudioEvents: true,
    diarize: false,
  });

  return NextResponse.json({
    transcription: result.text,
    language_code: result.languageCode,
    language_probability: result.languageProbability,
    words: result.words
  });
}

The route will return the following JSON object:

{
  "transcription": "Transcribed text from the audio",
  "language_code": "en",
  "language_probability": 0.99,
  "words": [
    {
      "start": 0.0,
      "end": 1.0,
      "word": "Transcribed",
      "probability": 0.99
    },
    // ... more words
  ]
}
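
On the client side, the route can be called with a FormData body containing the recorded blob. The sketch below is illustrative; only the 'audio' field name mirrors the route shown above:

async function requestTranscription(recording: AudioRecording): Promise<string> {
  const formData = new FormData();
  formData.append('audio', recording.blob, 'recording.webm');

  const response = await fetch('/api/transcribe', {
    method: 'POST',
    body: formData
  });
  if (!response.ok) {
    throw new Error(`Transcription failed with status ${response.status}`);
  }

  const { transcription } = await response.json();
  return transcription;
}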

The transcribed text serves as the input prompt for image generation using Imagen 4 from fal.ai, which creates high-quality images based on the provided text description.

Image Generation

The Fal API endpoint used is fal-ai/imagen4/preview. You must have a Fal API key to use this endpoint and set it as FAL_KEY in the .env file. Please see the Imagen 4 API section above for how to get the API key.

The Fal Imagen 4 image generation API is called directly in the /api/generate-image route. The route will create the image using the subscribe() method from the @fal-ai/client SDK package.

import { NextRequest, NextResponse } from 'next/server';
import { fal } from '@fal-ai/client';

export async function POST(request: NextRequest) {
  const { prompt, style = 'photorealistic' } = await request.json();

  // Configure fal client
  fal.config({
    credentials: process.env.FAL_KEY || ''
  });

  // Generate image using fal.ai Imagen 4
  const result = await fal.subscribe("fal-ai/imagen4/preview", {
    input: {
      prompt: prompt,
      // Add style to prompt if needed
      ...(style !== 'photorealistic' && {
        prompt: `${prompt}, ${style} style`
      })
    },
    logs: true,
    onQueueUpdate: (update) => {
      if (update.status === "IN_PROGRESS") {
        update.logs.map((log) => log.message).forEach(console.log);
      }
    },
  });

  // Extract image URLs from the result
  const images = result.data?.images || [];
  const imageUrls = images.map((img: any) => img.url || img);

  return NextResponse.json({
    images: imageUrls,
    prompt: prompt,
    style: style,
    requestId: result.requestId
  });
}

The route will return JSON with the following structure:

{
  "images": [
    "https://v3.fal.media/files/panda/YCl2K_C4yG87sDH_riyJl_output.png"
  ],
  "prompt": "Floating red jerry can on the blue sea, wide shot, side view",
  "style": "photorealistic",
  "requestId": "8a0e13db-5760-48d4-9acd-5c793b14e1ee"
}

The image data, along with the prompt and audio file path, will be saved into the GridDB database.
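
A hypothetical client-side call to the save route could look like the following; the request body field names are assumptions, and the actual contract is defined in app/api/save-data/route.ts:

// Sketch: persist a finished generation (image URL, prompt, audio path) via the save route
async function saveGeneration(imageUrl: string, prompt: string, audioPath: string) {
  const response = await fetch('/api/save-data', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ imageUrl, prompt, audioPath })
  });
  if (!response.ok) {
    throw new Error(`Saving generation failed with status ${response.status}`);
  }
  return response.json();
}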

Database Operation

We use the GridDB Cloud version to save the generated image, the prompt, and the audio file path. It’s easy to use and accessible through a Web API. The container name for this project is genvoiceai.

Save Data to GridDB

Before we can save data to the database, we need to define the data schema. This project uses the following structure:

export interface GridDBData {
  id: string | number;
  images: Blob;        // Stored as base64 string
  prompts: string;     // Text prompt
  audioFiles: string;  // File path to audio file
}

In real-world applications, best practice is to separate binary files from their references. However, for simplicity in this example, we store the image directly in the database as a base64-encoded string. Before saving to the database, the image needs to be converted to base64 format:

// Convert image blob to base64 string for GridDB storage
const imageBuffer = await imageBlob.arrayBuffer();
const imageBase64 = Buffer.from(imageBuffer).toString('base64');
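
When the rows are read back, for example in the Saved Data Viewer, the stored base64 string can be turned back into a displayable image source. A minimal sketch, assuming PNG output as in the fal.ai response above:

// Rebuild a data URL from the stored base64 string for use in an <img> tag
const imageDataUrl = `data:image/png;base64,${imageBase64}`;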

Please look into the lib/griddb.ts file for the implementation details. The insertData() function performs the actual database insertion.

async function insertData({ data, containerName = 'genvoiceai' }) {
  const row = [
    parseInt(data.id.toString()),  // ID as integer
    data.images,                   // Base64 image string
    data.prompts,                  // Text prompt
    data.audioFiles                // Audio file path
  ];

  const path = `/containers/${containerName}/rows`;
  return await makeRequest(path, [row], 'PUT');
}
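
The makeRequest() helper is not shown here. A plausible shape, assuming Basic authentication against the GRIDDB_WEBAPI_URL from the environment, is sketched below; the real implementation in lib/griddb.ts may differ in details:

// Sketch of the Web API helper used by insertData()
async function makeRequest(path: string, body: unknown, method: 'GET' | 'POST' | 'PUT' = 'POST') {
  const baseUrl = process.env.GRIDDB_WEBAPI_URL!;
  const auth = Buffer.from(
    `${process.env.GRIDDB_USERNAME}:${process.env.GRIDDB_PASSWORD}`
  ).toString('base64');

  const response = await fetch(`${baseUrl}${path}`, {
    method,
    headers: {
      Authorization: `Basic ${auth}`,
      'Content-Type': 'application/json'
    },
    body: method === 'GET' ? undefined : JSON.stringify(body)
  });

  if (!response.ok) {
    throw new Error(`GridDB request failed: ${response.status} ${await response.text()}`);
  }

  // Some GridDB Web API calls return an empty body on success
  const text = await response.text();
  return text ? JSON.parse(text) : null;
}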

Get Data from GridDB

To get data from the database, send a GET request to the /api/save-data route. This route uses SQL queries to retrieve either a specific entry or the most recent entries from the database.

// For specific ID
query = {
  type: 'sql-select',
  stmt: `SELECT * FROM genvoiceai WHERE id = ${parseInt(id)}`
};

// For recent entries
query = {
  type: 'sql-select', 
  stmt: `SELECT * FROM genvoiceai ORDER BY id DESC LIMIT ${parseInt(limit)}`
};

For detailed code implementation, please look into the app/api/save-data/route.ts file.
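
From the frontend, the saved generations can then be fetched with a simple GET request. The limit query parameter below mirrors the recent-entries query above and is an assumption about the route's parameters:

// Sketch: load the most recent saved generations for the Saved Data Viewer
async function loadSavedGenerations(limit = 10) {
  const response = await fetch(`/api/save-data?limit=${limit}`);
  if (!response.ok) {
    throw new Error(`Loading saved data failed with status ${response.status}`);
  }
  return response.json();
}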

Server Routes

This project uses Next.js serverless functions to handle API requests. This means there is no separate backend code to handle APIs, as they are integrated directly into the Next.js application.

The routes used by the frontend are as follows:

Route                   Method   Description
/api/generate-image     POST     Generate images using fal.ai Imagen 4
/api/transcribe         POST     Convert audio to text using ElevenLabs
/api/save-data          POST     Save image, prompt, and audio data to GridDB
/api/save-data          GET      Retrieve saved data from GridDB
/api/audio/[filename]   GET      Serve audio files from the uploads directory
User Interface

The main entry point of the frontend is the page.tsx file. The user interface is built with Next.js as a single-page application with several key sections:

  • Voice Recording Section: Large microphone button for audio recording.
  • Transcribed Text Display: Shows the converted speech-to-text with language detection. You can also edit the prompt here before generating the image.
  • Style Selection: A dropdown menu that allows users to choose different image generation styles, including photorealistic, artistic, anime, and abstract styles.
  • Generated Images Grid: Displays created images with download/save options.
  • Saved Data Viewer: Shows previously saved generations from the database.

The saved data is displayed in the Saved Data Viewer section; you can show and hide it by clicking the Show Saved button at the top right. Each saved entry includes the image, the prompt used to generate it, the audio reference, and the request ID. You can also play the audio and download the image.

Future Enhancements

This project is a basic demo and can be further enhanced with additional features, such as:

  • User authentication and authorization for saved data.
  • Image editing or customization options.
  • Integration with other AI models for image generation.
  • Speech recognition improvements for different languages. Currently, it supports only English.

If you have any questions about the blog, please create a Stack Overflow post at https://stackoverflow.com/questions/ask?tags=griddb. Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.
