Project Overview
A modern web application that transforms spoken descriptions into high-quality images using cutting-edge AI technologies. Users can record their voice describing an image they want to create, and the system will transcribe their speech and generate a corresponding image.
What Problem We Solved
Traditional image generation tools require users to type detailed prompts, which can be:
- Time-consuming for complex descriptions.
- Limiting for users with typing difficulties.
- Less natural than speaking.
This project makes AI image generation more accessible through voice interaction.
Architecture & Tech Stack
This diagram shows a pipeline for converting speech into images and storing the result in the GridDB database:
- User speaks into a microphone.
- Speech recording captures the audio.
- Audio is sent to ElevenLabs (Scribe v1) for speech-to-text transcription.
- The transcribed text becomes a prompt for Imagen 4 API, which generates an image.
- The data saved into the database are:
  - The audio reference
  - The image
  - The prompt text
Frontend Stack
In this project, we will use Next.js as the frontend framework.
Backend Stack
There is no dedicated backend codebase because all functionality is provided by external APIs. There are three main APIs:
1. Speech-to-Text API
This project utilizes the ElevenLabs API for speech-to-text transcription. The API endpoint is https://api.elevenlabs.io/v1/speech-to-text. ElevenLabs also provides a JavaScript SDK for easier API integration. You can see the SDK documentation for more details.
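As a quick illustration, initializing the SDK client might look like the sketch below. The package name is an assumption based on the official ElevenLabs JavaScript SDK; check the SDK documentation for the exact import in the version you install.

```ts
// Minimal sketch; the package name may differ depending on the SDK version you install
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

// Reads the ElevenLabs API key from the environment (ELEVENLABS_API_KEY in .env)
const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });
```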
2. Image Generation API
This project uses the Imagen 4 API from fal. The API is hosted at https://fal.ai/models/fal-ai/imagen4/preview. Fal provides a JavaScript SDK for easier API integration. You can see the SDK documentation for more details.
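For reference, configuring the fal client (the @fal-ai/client package used later in this project) is a one-liner:

```ts
import { fal } from "@fal-ai/client";

// Read the Fal API key from the environment (FAL_KEY in .env)
fal.config({ credentials: process.env.FAL_KEY || "" });
```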
3. Database API
We will use the GridDB Cloud version in this project. So there is no need for local installation. Please read the next section on how to set up GridDB Cloud.
Prerequisites
Node.js
This project is built using Next.js, which requires Node.js version 16 or higher. You can download and install Node.js from https://nodejs.org/en.
GridDB
Sign Up for GridDB Cloud Free Plan
If you would like to sign up for a GridDB Cloud Free instance, you can do so at the following link: https://form.ict-toshiba.jp/download_form_griddb_cloud_freeplan_e.
After successfully signing up, you will receive a free instance along with the necessary details to access the GridDB Cloud Management GUI, including the GridDB Cloud Portal URL, Contract ID, Login, and Password.
GridDB WebAPI URL
Go to the GridDB Cloud Portal and copy the WebAPI URL from the Clusters section. This value will be used as the GRIDDB_WEBAPI_URL environment variable.
GridDB Username and Password
Go to the GridDB Users section of the GridDB Cloud portal and create or copy the username to use as GRIDDB_USERNAME. The password is set when the user is first created; use this value as GRIDDB_PASSWORD.
For more details on getting started with GridDB Cloud, please follow this quick start guide.
IP Whitelist
When running this project, please ensure that the IP address of the machine running the project is whitelisted. Failure to do so will result in a 403 Forbidden response.
You can use a website like What Is My IP Address to find your public IP address.
To whitelist the IP, go to the GridDB Cloud Admin and navigate to the Network Access menu.
ElevenLabs
You need an ElevenLabs account and API key to use this project. You can sign up for an account at https://elevenlabs.io/signup.
After signing up, go to the Account section, and create and copy your API key.
Imagen 4 API
You need an Imagen 4 API key to use this project. You can sign up for an account at https://fal.ai.
After signing up, go to the Account section, and create and copy your API key.
How to Run
1. Clone the repository
Clone the repository from https://github.com/junwatu/speech-image-gen to your local machine.
git clone https://github.com/junwatu/speech-image-gen.git
cd speech-image-gen
cd apps
2. Install dependencies
Install all project dependencies using npm.
npm install
3. Set up environment variables
Copy the .env.example file to .env and fill in the values:
# Copy this file to .env.local and add your actual API keys
# Never commit .env.local to version control
# Fal.ai API Key for Imagen 4
# Get your key from: https://fal.ai/dashboard
FAL_KEY=
# ElevenLabs API Key for Speech-to-Text
# Get your key from: https://elevenlabs.io/app/speech-synthesis
ELEVENLABS_API_KEY=
GRIDDB_WEBAPI_URL=
GRIDDB_PASSWORD=
GRIDDB_USERNAME=
Please see the Prerequisites section before running the project.
4. Run the project
Run the project using the following command:
npm run dev
5. Open the application
Open the application in your browser at http://localhost:3000. You also need to allow the browser to access your microphone.
Implementation Details
Speech Recording
The user will speak into the microphone and the audio will be recorded. The audio will be sent to the ElevenLabs API for speech-to-text transcription. Please note that the only supported language is English.
The code to save the recording file is in the main page.tsx. It uses the native HTML5 MediaRecorder API to record the audio. Below is the code snippet:
const startRecording = useCallback(async () => {
try {
setError(null);
const stream = await navigator.mediaDevices.getUserMedia({
audio: {
echoCancellation: true,
noiseSuppression: true,
sampleRate: 44100
}
});
// Try different MIME types based on browser support
let mimeType = 'audio/webm;codecs=opus';
if (!MediaRecorder.isTypeSupported(mimeType)) {
mimeType = 'audio/webm';
if (!MediaRecorder.isTypeSupported(mimeType)) {
mimeType = 'audio/mp4';
if (!MediaRecorder.isTypeSupported(mimeType)) {
mimeType = ''; // Let browser choose
}
}
}
const mediaRecorder = new MediaRecorder(stream, {
...(mimeType && { mimeType })
});
mediaRecorderRef.current = mediaRecorder;
audioChunksRef.current = [];
recordingStartTimeRef.current = Date.now();
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
audioChunksRef.current.push(event.data);
}
};
mediaRecorder.onstop = async () => {
const duration = Date.now() - recordingStartTimeRef.current;
const audioBlob = new Blob(audioChunksRef.current, {
type: mimeType || 'audio/webm'
});
const audioUrl = URL.createObjectURL(audioBlob);
const recording: AudioRecording = {
blob: audioBlob,
url: audioUrl,
duration,
timestamp: new Date()
};
setCurrentRecording(recording);
await transcribeAudio(recording);
stream.getTracks().forEach(track => track.stop());
};
mediaRecorder.start(1000); // Collect data every second
setIsRecording(true);
} catch (error) {
setError('Failed to access microphone. Please check your permissions and try again.');
}
}, []);
The audio processing flow is as follows:
- User clicks the record button → startRecording() is called.
- Requests microphone access via getUserMedia().
- Creates a MediaRecorder with optimal settings.
- Collects audio data in chunks.
- When stopped, creates an audio blob and triggers transcription.
The audio data will be saved in the public/uploads/audio folder. Below is the code snippet to save the audio file:
import { mkdir, writeFile } from 'fs/promises';
import { join } from 'path';

export async function saveAudioToFile(audioBlob: Blob, extension: string = 'webm'): Promise<string> {
// Create uploads directory if it doesn't exist
const uploadsDir = join(process.cwd(), 'public', 'uploads', 'audio');
await mkdir(uploadsDir, { recursive: true });
// Generate unique filename
const filename = `${generateRandomID()}.${extension}`;
const filePath = join(uploadsDir, filename);
// Convert blob to buffer and save file
const arrayBuffer = await audioBlob.arrayBuffer();
const buffer = Buffer.from(arrayBuffer);
await writeFile(filePath, buffer);
// Return relative path for storage in database
return `/uploads/audio/${filename}`;
}
The full code for the saveAudioToFile() function is in the app/lib/audio-storage.ts file.
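The saved files are later served through the /api/audio/[filename] route listed in the Server Routes section. A minimal sketch of such a route handler, assuming the files live under public/uploads/audio, could be:

```ts
// app/api/audio/[filename]/route.ts — illustrative sketch, not the project's exact code
import { readFile } from 'fs/promises';
import { join } from 'path';
import { NextResponse } from 'next/server';

export async function GET(
  _request: Request,
  { params }: { params: { filename: string } }
) {
  // A real implementation should validate params.filename to prevent path traversal
  const filePath = join(process.cwd(), 'public', 'uploads', 'audio', params.filename);
  try {
    const file = await readFile(filePath);
    return new NextResponse(file, {
      headers: { 'Content-Type': 'audio/webm' }, // webm is assumed as the default format
    });
  } catch {
    return NextResponse.json({ error: 'Audio file not found' }, { status: 404 });
  }
}
```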
Speech to Text Transcription
The recorded audio is sent to the ElevenLabs API for speech-to-text transcription. The code that sends the audio to the ElevenLabs API is in the transcribeAudio() function. The full code is in the lib/elevenlabs-client.ts file.
// Main transcription function
export async function transcribeAudio(
client: ElevenLabsClient,
audioBuffer: Buffer,
modelId: ElevenLabsModel = ELEVENLABS_MODELS.SCRIBE_V1
) {
try {
const result = await client.speechToText.convert({
audio: audioBuffer,
model_id: modelId,
}) as TranscriptionResponse;
return {
success: true,
text: result.text,
language_code: result.language_code,
language_probability: result.language_probability,
words: result.words || [],
additional_formats: result.additional_formats || []
};
} catch (error) {
console.error('ElevenLabs transcription error:', error);
return {
success: false,
error: error instanceof Error ? error.message : 'Unknown error'
};
}
}
Transcription Route
The transcribeAudio() function is called when accessing the /api/transcribe route. This route only accepts the POST method and processes the audio file sent in the request body. The ELEVENLABS_API_KEY environment variable from the .env file is used in the route to initialize the ElevenLabs client.
export async function POST(request: NextRequest) {
// Get audio file from form data
const formData = await request.formData();
const audioFile = formData.get('audio') as File;
// Convert the uploaded file to a Blob for the SDK
const arrayBuffer = await audioFile.arrayBuffer();
const audioBlob = new Blob([arrayBuffer], { type: audioFile.type });
// Initialize the ElevenLabs client with the API key from the environment
const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });
// Convert audio to text
const result = await elevenlabs.speechToText.convert({
file: audioBlob,
modelId: "scribe_v1",
languageCode: "en",
tagAudioEvents: true,
diarize: false,
});
return NextResponse.json({
transcription: result.text,
language_code: result.languageCode,
language_probability: result.languageProbability,
words: result.words
});
}
The route will return the following JSON object:
{
"transcription": "Transcribed text from the audio",
"language_code": "en",
"language_probability": 0.99,
"words": [
{
"start": 0.0,
"end": 1.0,
"word": "Transcribed",
"probability": 0.99
},
// ... more words
]
}
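On the frontend, the route can be called with a simple fetch request. The sketch below is illustrative (the helper name is hypothetical); the 'audio' field name matches what the route reads from the form data:

```ts
// Hypothetical client-side helper; AudioRecording is the type used in startRecording() above
async function requestTranscription(recording: AudioRecording): Promise<string> {
  const formData = new FormData();
  formData.append('audio', recording.blob, 'recording.webm');

  const response = await fetch('/api/transcribe', {
    method: 'POST',
    body: formData,
  });
  const data = await response.json();
  return data.transcription; // the transcribed text shown (and editable) in the UI
}
```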
The transcribed text serves as the input prompt for image generation using Imagen 4 from fal.ai, which creates high-quality images based on the provided text description.
Image Generation
The Fal API endpoint used is fal-ai/imagen4/preview. You must have a Fal API key to use this endpoint and set it as FAL_KEY in the .env file. Please see the Imagen 4 API section in the Prerequisites for how to get the API key.
The Fal Imagen 4 image generation API is called directly in the /api/generate-image route. The route will create the image using the subscribe() method from the @fal-ai/client SDK package.
export async function POST(request: NextRequest) {
const { prompt, style = 'photorealistic' } = await request.json();
// Configure fal client
fal.config({
credentials: process.env.FAL_KEY || ''
});
// Generate image using fal.ai Imagen 4
const result = await fal.subscribe("fal-ai/imagen4/preview", {
input: {
prompt: prompt,
// Add style to prompt if needed
...(style !== 'photorealistic' && {
prompt: `${prompt}, ${style} style`
})
},
logs: true,
onQueueUpdate: (update) => {
if (update.status === "IN_PROGRESS") {
update.logs.map((log) => log.message).forEach(console.log);
}
},
});
// Extract image URLs from the result
const images = result.data?.images || [];
const imageUrls = images.map((img: any) => img.url || img);
return NextResponse.json({
images: imageUrls,
prompt: prompt,
style: style,
requestId: result.requestId
});
}
The route will return JSON with the following structure:
{
"images": [
"https://v3.fal.media/files/panda/YCl2K_C4yG87sDH_riyJl_output.png"
],
"prompt": "Floating red jerry can on the blue sea, wide shot, side view",
"style": "photorealistic",
"requestId": "8a0e13db-5760-48d4-9acd-5c793b14e1ee"
}
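Calling this route from the frontend follows the same pattern as the transcription call. A hedged sketch (the helper name is hypothetical) might be:

```ts
// Hypothetical client-side helper for the /api/generate-image route
async function requestImageGeneration(prompt: string, style = 'photorealistic'): Promise<string[]> {
  const response = await fetch('/api/generate-image', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, style }),
  });
  const data = await response.json();
  return data.images; // URLs of the generated images
}
```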
The image data, along with the prompt and audio file path, will be saved into the GridDB database.
Database Operation
We use the GridDB Cloud version to save the generated image, the prompt, and the audio file path. It’s easy to use and accessible via a Web API. The container (database) name for this project is genvoiceai.
Save Data to GridDB
Before saving data to the database, we need to define the data schema or structure. We will use the following schema for this project:
export interface GridDBData {
id: string | number;
images: Blob; // Stored as base64 string
prompts: string; // Text prompt
audioFiles: string; // File path to audio file
}
In real-world applications, best practice is to separate binary files from their references. However, for simplicity in this example, we store the image directly in the database as a base64-encoded string. Before saving to the database, the image needs to be converted to base64 format:
// Convert image blob to base64 string for GridDB storage
const imageBuffer = await imageBlob.arrayBuffer();
const imageBase64 = Buffer.from(imageBuffer).toString('base64');
Please look into the lib/griddb.ts file for the implementation details. The insertData() function performs the actual database insertion.
async function insertData({ data, containerName = 'genvoiceai' }) {
const row = [
parseInt(data.id.toString()), // ID as integer
data.images, // Base64 image string
data.prompts, // Text prompt
data.audioFiles // Audio file path
];
const path = `/containers/${containerName}/rows`;
return await makeRequest(path, [row], 'PUT');
}
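The insertData() function delegates the HTTP call to a makeRequest() helper that is not shown above. A minimal sketch of such a helper, assuming the GridDB Cloud WebAPI with HTTP basic authentication and the environment variables from the Prerequisites section, could look like this:

```ts
// Hypothetical helper; the actual implementation lives in lib/griddb.ts
async function makeRequest(path: string, body: unknown, method: 'POST' | 'PUT' = 'POST') {
  const baseUrl = process.env.GRIDDB_WEBAPI_URL;
  // GridDB Cloud WebAPI uses HTTP basic authentication
  const auth = Buffer.from(
    `${process.env.GRIDDB_USERNAME}:${process.env.GRIDDB_PASSWORD}`
  ).toString('base64');

  const response = await fetch(`${baseUrl}${path}`, {
    method,
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Basic ${auth}`,
    },
    body: JSON.stringify(body),
  });

  if (!response.ok) {
    throw new Error(`GridDB WebAPI request failed with status ${response.status}`);
  }
  // Some GridDB WebAPI endpoints return an empty body on success
  const text = await response.text();
  return text ? JSON.parse(text) : null;
}
```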
Get Data from GridDB
To get data from the database, you can send a GET request to the /api/save-data route. This route uses SQL queries to get specific entries or all data from the database.
// For specific ID
query = {
type: 'sql-select',
stmt: `SELECT * FROM genvoiceai WHERE id = ${parseInt(id)}`
};
// For recent entries
query = {
type: 'sql-select',
stmt: `SELECT * FROM genvoiceai ORDER BY id DESC LIMIT ${parseInt(limit)}`
};
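With a helper like the makeRequest() sketch shown earlier, these query objects can be posted to the WebAPI's SQL endpoint. The /sql path below is an assumption based on common GridDB WebAPI conventions, not a confirmed detail of this project:

```ts
// Illustrative only; the /sql endpoint path is an assumption about the GridDB WebAPI
const query = {
  type: 'sql-select',
  stmt: 'SELECT * FROM genvoiceai ORDER BY id DESC LIMIT 10',
};

// The WebAPI expects an array of statements in the request body
const results = await makeRequest('/sql', [query], 'POST');
```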
For detailed code implementation, please look into the app/api/save-data/route.ts file.
Server Routes
This project uses Next.js serverless functions to handle API requests. This means there is no separate backend code to handle APIs, as they are integrated directly into the Next.js application.
The routes used by the frontend are as follows:
| Route | Method | Description |
|---|---|---|
| /api/generate-image | POST | Generate images using fal.ai Imagen 4 |
| /api/transcribe | POST | Convert audio to text using ElevenLabs |
| /api/save-data | POST | Save image, prompt, and audio data to GridDB |
| /api/save-data | GET | Retrieve saved data from GridDB |
| /api/audio/[filename] | GET | Serve audio files from uploads directory |
User Interface
The main entry point of the frontend is the page.tsx file. The user interface is built with Next.js as a single-page application with several key sections:
- Voice Recording Section: Large microphone button for audio recording.
- Transcribed Text Display: Shows the converted speech-to-text with language detection. You can also edit the prompt here before generating the image.
- Style Selection: A dropdown menu that allows users to choose different image generation styles, including photorealistic, artistic, anime, and abstract styles.
- Generated Images Grid: Displays created images with download/save options.
- Saved Data Viewer: Shows previously saved generations from the database.
The saved data is displayed in the Saved Data Viewer section; you can show and hide it by clicking the Show Saved button in the top right. Each saved entry includes the image, the prompt used to generate it, the audio reference, and the request ID. You can also play the audio and download the image.
Future Enhancements
This project is a basic demo and can be further enhanced with additional features, such as:
- User authentication and authorization for saved data.
- Image editing or customization options.
- Integration with other AI models for image generation.
- Speech recognition improvements for different languages. Currently, it supports only English.
If you have any questions about this blog, please create a Stack Overflow post at https://stackoverflow.com/questions/ask?tags=griddb.
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.