Speech Models

Vox uses OpenAI's Whisper models for local speech recognition. This guide explains the available models and how to choose the right one for your needs.

Understanding Speech Models

[Screenshot: Speech Models screen]

Access speech models from Settings → Speech.

What Are Whisper Models?

Whisper is OpenAI's open-source automatic speech recognition (ASR) system. Vox runs these models locally on your device, ensuring:

  • Privacy: Audio never leaves your device
  • Offline capability: Works without an internet connection once a model is downloaded
  • Speed: No network latency
  • Cost: No per-minute charges

Privacy First

All speech recognition happens on your device. Your voice data is never sent to external servers (unless you enable AI Enhancement).

Available Models

Vox offers five Whisper model variants, each balancing speed and accuracy differently:

Fastest

  • Size: ~75MB
  • Speed: Lowest latency (<50ms)
  • Accuracy: Good for clear speech
  • Best for: Quick commands, short phrases, testing

The smallest and fastest model. Ideal for users who prioritize speed over accuracy or have limited disk space.

Fast

  • Size: ~150MB
  • Speed: Very low latency (~50ms)
  • Accuracy: Better than Fastest
  • Best for: Daily use with clear speech

A good middle ground between speed and quality. Suitable for most casual transcription needs.

Balanced

  • Size: ~480MB
  • Speed: Moderate latency
  • Accuracy: Good general-purpose accuracy
  • Best for: Most users, general transcription

Recommended for most users. Provides excellent accuracy for everyday use without requiring excessive resources.

Accurate

  • Size: ~1.5GB
  • Speed: Higher latency than Balanced
  • Accuracy: High accuracy for complex speech
  • Best for: Professional transcription, technical content, accents

Higher accuracy for challenging audio conditions, technical terminology, and various accents.

Best

  • Size: ~3GB
  • Speed: Slowest, with significant CPU usage
  • Accuracy: Maximum accuracy
  • Best for: Critical transcription, multi-language, noisy environments

The largest and most accurate model. Use when transcription quality is paramount and system resources allow.

Downloading Models

First-Time Setup

[Screenshot: Models before download]

When you first install Vox, no models are downloaded. You must download at least one model to use Vox.

To download a model:

  1. Navigate to Settings → Speech
  2. Click Download next to your chosen model
  3. Wait for the download to complete
  4. The button changes to "Downloaded" when ready

[Screenshot: Downloaded models]

First Model Recommendation

Start with Balanced for the best balance of quality and performance. You can always download additional models later.

Downloading Multiple Models

You can download multiple models and switch between them:

  1. Download different models for different use cases
  2. Test each model with the Test Local Model button
  3. Vox uses the currently selected model (marked with a checkmark)
  4. Switch between models anytime without re-downloading

Download Requirements

  • Internet connection: Required for initial download
  • Disk space: Ensure sufficient space for your chosen model
  • Time: Downloads typically take 1-10 minutes depending on model size and connection speed
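
The disk space and time requirements above come down to simple arithmetic. The following is a rough sketch only: the function names are hypothetical, the 10% headroom is an arbitrary assumption, and the sizes are the approximate figures listed in this guide rather than exact download sizes.

```python
import shutil

# Approximate model sizes in megabytes, taken from the model list in this guide.
MODEL_SIZES_MB = {
    "Fastest": 75,
    "Fast": 150,
    "Balanced": 480,
    "Accurate": 1500,
    "Best": 3000,
}


def estimate_download_seconds(size_mb: float, mbps: float) -> float:
    """Rough download time: megabytes * 8 bits per byte / megabits per second."""
    if mbps <= 0:
        raise ValueError("connection speed must be positive")
    return size_mb * 8 / mbps


def has_enough_space(size_mb: float, free_bytes: int, headroom: float = 1.1) -> bool:
    """Check free space against the model size plus some headroom (assumed 10%)."""
    return free_bytes >= size_mb * 1_000_000 * headroom


if __name__ == "__main__":
    # Free space on the current volume; shutil.disk_usage is in the standard library.
    free = shutil.disk_usage(".").free
    for name, size in MODEL_SIZES_MB.items():
        minutes = estimate_download_seconds(size, mbps=50) / 60
        fits = "fits" if has_enough_space(size, free) else "does not fit"
        print(f"{name}: about {minutes:.1f} min at 50 Mbps, {fits}")
```

At an assumed 50 Mbps, the Best model (~3GB) works out to about 8 minutes, which is in line with the 1-10 minute range quoted above.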

System Requirements

Vox has different system requirements depending on your operating system:

macOS

Requirement | Minimum | Recommended
OS Version | macOS 15 (Sequoia) | macOS 15+ (Sequoia or later)
Processor | Apple Silicon (M1) or Intel | Apple Silicon (M2 or newer)
RAM | 4 GB | 8 GB or more
Storage | 500 MB - 4 GB | 4 GB free space
Permissions | Microphone + Accessibility | -

Apple Silicon Performance

Vox runs significantly faster on Apple Silicon (M1/M2/M3) than on Intel Macs, thanks to optimized Neural Engine support.

Windows

Requirement | Minimum | Recommended
OS Version | Windows 10 (64-bit) | Windows 11
Processor | x64 processor | Modern multi-core processor
RAM | 4 GB | 8 GB or more
Storage | 500 MB - 4 GB | 4 GB free space
Permissions | Microphone access | -
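
The minimums in the two tables can be expressed as a small check. This is an illustrative sketch only: the function name `meets_minimum` is hypothetical, RAM has to be supplied by the caller because the Python standard library has no portable way to read it, and Vox itself may perform different checks.

```python
import platform

MIN_RAM_GB = 4          # minimum RAM listed for both macOS and Windows
MIN_MACOS_MAJOR = 15    # macOS 15 (Sequoia)
MIN_WINDOWS_MAJOR = 10  # Windows 10 (64-bit)


def meets_minimum(system: str, os_major: int, ram_gb: float, is_64bit: bool = True) -> bool:
    """Return True if a machine satisfies the minimum requirements in the tables.

    `system` uses the values returned by platform.system():
    "Darwin" for macOS and "Windows" for Windows.
    """
    if ram_gb < MIN_RAM_GB:
        return False
    if system == "Darwin":
        return os_major >= MIN_MACOS_MAJOR
    if system == "Windows":
        return is_64bit and os_major >= MIN_WINDOWS_MAJOR
    return False  # Linux, iOS, and Android are not supported yet


if __name__ == "__main__":
    system = platform.system()
    if system == "Darwin":
        # platform.mac_ver() returns a tuple whose first item looks like "15.1"
        major = int(platform.mac_ver()[0].split(".")[0] or 0)
    elif system == "Windows":
        # platform.version() looks like "10.0.22631" on Windows 10 and 11
        major = int(platform.version().split(".")[0] or 0)
    else:
        major = 0
    # RAM is passed in by hand here (8 GB is just an example value).
    print(meets_minimum(system, major, ram_gb=8))
```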

Windows Performance

Performance varies based on processor. Modern CPUs (Intel 10th gen+, AMD Ryzen 3000+) provide better transcription speed.

Coming Soon

Linux, iOS, and Android support is planned for future releases. See roadmap →

Testing Models

Test Local Model

After downloading a model, verify it works correctly:

  1. Click Test Local Model
  2. Speak a test phrase when prompted
  3. Review the transcription result
  4. Look for the success message and the transcribed text, for example: "Yeah. This is just a test. I laughing"

The test verifies:

  • Model is properly downloaded and installed
  • Audio pipeline is working
  • Transcription accuracy meets your needs

Test with Real Content

Test with phrases similar to your actual use case (technical terms, names, etc.) to gauge accuracy.

Choosing the Right Model

Decision Matrix

Model | Size | Speed | Accuracy | Best For
Fastest | 75MB | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | Testing, simple commands
Fast | 150MB | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | Daily use, clear speech
Balanced | 480MB | ⚡⚡⚡ | ⭐⭐⭐⭐ | Recommended for most users
Accurate | 1.5GB | ⚡⚡ | ⭐⭐⭐⭐⭐ | Professional work, technical content
Best | 3GB | ⚡ | ⭐⭐⭐⭐⭐ | Critical transcription, complex audio

Consider Your Use Case

Choose Fastest or Fast if you:

  • Need instant transcription results
  • Transcribe short, simple phrases
  • Have limited disk space
  • Speak clearly in quiet environments

Choose Balanced if you:

  • Want a good all-around experience
  • Transcribe both short and long content
  • Need reliable accuracy without sacrificing too much speed
  • Are unsure which model to pick (start here!)

Choose Accurate if you:

  • Work with technical terminology
  • Speak with an accent or in multiple languages
  • Transcribe in environments with background noise
  • Need high accuracy for professional work

Choose Best if you:

  • Require maximum transcription accuracy
  • Work with complex, multi-language content
  • Transcribe critical documents or legal content
  • Have a powerful computer with plenty of resources
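
The checklists above can be read as a simple decision rule. The sketch below is one possible encoding, not something Vox exposes: the function and parameter names are hypothetical, and the precedence (accuracy needs are checked before speed and disk constraints) is an assumption.

```python
def suggest_model(
    *,
    needs_maximum_accuracy: bool = False,
    technical_or_accented_speech: bool = False,
    background_noise: bool = False,
    short_simple_phrases: bool = False,
    limited_disk_space: bool = False,
) -> str:
    """Map the checklists in this section to one of the five model names."""
    if needs_maximum_accuracy:
        return "Best"
    if technical_or_accented_speech or background_noise:
        return "Accurate"
    if limited_disk_space:
        return "Fastest"  # smallest download
    if short_simple_phrases:
        return "Fast"
    return "Balanced"  # the guide's default recommendation


if __name__ == "__main__":
    print(suggest_model(background_noise=True))
    print(suggest_model())
```

Calling it with no arguments falls through to Balanced, matching the advice to start there when unsure.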

Model Performance Requirements

All models work on any computer that runs Vox, but performance varies:

For Fastest, Fast, Balanced:

  • Any Mac running macOS 15 or later, or any modern Windows PC
  • 8 GB RAM recommended (4 GB minimum)
  • Standard performance expectations

For Accurate:

  • Mac from 2020 or later, or a Windows PC with at least 8 GB RAM
  • 16 GB RAM recommended
  • May be slower on older hardware

For Best:

  • Apple Silicon Mac or modern Windows PC
  • 16 GB or more RAM recommended
  • Expect noticeable processing time on transcriptions

Apple Silicon Advantage

Macs with Apple Silicon (M1, M2, M3 chips) run Whisper models significantly faster than Intel Macs due to their Neural Engine.

Model Performance

Processing Time Examples

Approximate transcription times for a 10-second recording:

Model | Intel Mac (2019) | M1/M2 Mac | M3 Mac
Fastest | 0.5s | 0.2s | 0.1s
Fast | 1s | 0.5s | 0.3s
Balanced | 2s | 1s | 0.5s
Accurate | 5s | 2.5s | 1.5s
Best | 10s | 4s | 2s

Times are approximate and vary based on audio complexity. Performance on Windows PCs with equivalent specifications is comparable.
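
One way to read these numbers is as a real-time factor: processing time divided by audio length, where a value below 1.0 means the model transcribes faster than you speak. The sketch below copies the times from the table. The helper names are hypothetical, and scaling linearly to other recording lengths is an assumption that may not hold in practice.

```python
RECORDING_SECONDS = 10  # length of the recording used for the table

# Processing times in seconds, copied from the table above.
PROCESSING_SECONDS = {
    "Fastest": {"Intel Mac (2019)": 0.5, "M1/M2 Mac": 0.2, "M3 Mac": 0.1},
    "Fast": {"Intel Mac (2019)": 1.0, "M1/M2 Mac": 0.5, "M3 Mac": 0.3},
    "Balanced": {"Intel Mac (2019)": 2.0, "M1/M2 Mac": 1.0, "M3 Mac": 0.5},
    "Accurate": {"Intel Mac (2019)": 5.0, "M1/M2 Mac": 2.5, "M3 Mac": 1.5},
    "Best": {"Intel Mac (2019)": 10.0, "M1/M2 Mac": 4.0, "M3 Mac": 2.0},
}


def real_time_factor(model: str, machine: str) -> float:
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    return PROCESSING_SECONDS[model][machine] / RECORDING_SECONDS


def estimate_seconds(model: str, machine: str, audio_seconds: float) -> float:
    """Scale the table linearly to another recording length (an assumption)."""
    return PROCESSING_SECONDS[model][machine] * audio_seconds / RECORDING_SECONDS


if __name__ == "__main__":
    for model in PROCESSING_SECONDS:
        rtf = real_time_factor(model, "Intel Mac (2019)")
        print(f"{model}: real-time factor {rtf:.2f} on an Intel Mac (2019)")
```

By this measure, Best on a 2019 Intel Mac sits at a real-time factor of 1.0, so a one-minute dictation would take roughly a minute to process under the linear assumption.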

Accuracy Comparison

Example transcription quality with technical terms:

Original speech: "Initialize the TypeScript interface with async await handlers"

Model | Transcription
Fastest | "Initialize the typescript interface with a sync away handlers"
Fast | "Initialize the TypeScript interface with a sync await handlers"
Balanced | "Initialize the TypeScript interface with async await handlers" ✓
Accurate | "Initialize the TypeScript interface with async await handlers" ✓
Best | "Initialize the TypeScript interface with async await handlers" ✓
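
To put a number on the differences in this example, you can compute a word error rate (WER): the word-level edit distance between the original speech and the transcription, divided by the number of words spoken. This is a standard ASR metric, shown here as a minimal sketch. It is not something the guide says Vox reports, and this version ignores letter case, so "typescript" and "TypeScript" count as the same word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the transcription
                curr[j - 1] + 1,     # extra word in the transcription
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "Initialize the TypeScript interface with async await handlers"
    results = {
        "Fastest": "Initialize the typescript interface with a sync away handlers",
        "Fast": "Initialize the TypeScript interface with a sync await handlers",
        "Balanced": "Initialize the TypeScript interface with async await handlers",
    }
    for model, text in results.items():
        print(f"{model}: WER {word_error_rate(spoken, text):.3f}")
```

For the eight-word sentence above, this gives 0.375 for Fastest (three word errors), 0.25 for Fast (two), and 0.0 for Balanced.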

AI Enhancement

For even better accuracy, enable AI Enhancement to post-process transcriptions with large language models.

Audio Retention

[Screenshot: Audio Retention setting]

Configure how many recent audio recordings Vox keeps on disk:

Default: 10 recordings

Why keep audio:

  • Review transcriptions for accuracy
  • Test different models on the same audio
  • Add missed words to your dictionary
  • Debug transcription issues

Adjust retention:

  • Increase if you frequently review past recordings
  • Decrease to save disk space
  • Set to 0 to disable audio retention entirely
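
Conceptually, a retention limit of N means keeping the N most recent recordings and deleting the rest. The sketch below illustrates that idea under stated assumptions: it is not Vox's implementation, the function name is hypothetical, and the `.wav` extension and the use of file modification time to decide recency are guesses made for the example.

```python
from pathlib import Path


def prune_recordings(folder: Path, keep: int = 10) -> list[Path]:
    """Delete all but the `keep` most recently modified .wav files.

    Returns the paths that were deleted. keep=0 removes every recording.
    """
    if keep < 0:
        raise ValueError("keep must be zero or greater")
    # Newest first, so everything past index `keep` is stale.
    recordings = sorted(
        folder.glob("*.wav"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    stale = recordings[keep:]
    for path in stale:
        path.unlink()
    return stale
```

Run it only against a scratch folder while experimenting, since it deletes files.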

Privacy Note

Audio recordings are stored locally in Vox's application folder. They are never uploaded unless you explicitly enable AI Enhancement features.

Switching Models

You can change which model Vox uses at any time:

  1. Navigate to Settings → Speech
  2. Click on a different downloaded model
  3. The model with a checkmark is active
  4. Your next recording will use the new model

No restart required - the change takes effect immediately.

Managing Disk Space

Checking Model Storage

On macOS, models are stored in:

~/Library/Application Support/Vox/models/
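
If you want to see how much space each model takes, a short script can total the contents of that folder. This is a sketch that assumes the macOS path above. How Vox names or lays out files inside the folder is not documented here, so the script simply reports each top-level entry.

```python
from pathlib import Path

# macOS location given in this guide.
MODELS_DIR = Path.home() / "Library" / "Application Support" / "Vox" / "models"


def entry_size(path: Path) -> int:
    """Size in bytes of a file, or of all files beneath a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def model_storage(models_dir: Path = MODELS_DIR) -> dict[str, int]:
    """Map each top-level entry in the models folder to its size in bytes."""
    if not models_dir.is_dir():
        return {}
    return {p.name: entry_size(p) for p in sorted(models_dir.iterdir())}


if __name__ == "__main__":
    usage = model_storage()
    for name, size in usage.items():
        print(f"{name}: {size / 1_000_000:.0f} MB")
    print(f"Total: {sum(usage.values()) / 1_000_000:.0f} MB")
```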

Removing Models

To free up disk space:

  1. Navigate to Settings → Speech
  2. Find models you no longer need
  3. Click the trash icon next to the model
  4. Confirm deletion

You can re-download models at any time without penalty.

Storage Tips

  • Keep only the models you actively use
  • Balanced model is a good single-model choice
  • Download larger models only when needed
  • Audio retention takes minimal space (configurable)

Troubleshooting

Model Download Failed

Solution:

  1. Check your internet connection
  2. Ensure sufficient disk space
  3. Try downloading a smaller model first
  4. Restart Vox and try again

Test Local Model Fails

Solution:

  1. Verify microphone permission is granted
  2. Check the selected microphone: System Settings → Sound → Input on macOS, or Settings → System → Sound on Windows
  3. Try a different model
  4. Restart Vox

Poor Transcription Quality

Solutions:

  1. Upgrade to a larger model: Try Accurate or Best
  2. Check audio quality: Speak clearly, reduce background noise
  3. Add custom words: Use the Dictionary feature
  4. Enable AI Enhancement: Post-process with AI for better results

Model Takes Too Long to Process

Solutions:

  1. Downgrade to a smaller model: Try Fast or Balanced
  2. Shorten recordings: Break long dictation into smaller chunks
  3. Close other apps: Free up CPU resources
  4. Check system activity: Ensure your computer isn't under heavy load

Model Using Too Much CPU/Memory

Solutions:

  1. Switch to a smaller model (Fastest or Fast)
  2. Close background applications
  3. Reduce audio retention to free resources
  4. Consider upgrading your hardware if you need larger models

Advanced Topics

Model Architecture

Vox uses quantized versions of Whisper models, which provide:

  • Efficient inference across platforms
  • Reduced memory footprint
  • Accuracy maintained relative to the original models
  • Apple Silicon Neural Engine acceleration

Language Support

All Whisper models support multiple languages including:

  • English, Spanish, French, German, Italian, Portuguese
  • Chinese, Japanese, Korean
  • And 90+ other languages

Configure speech languages in Settings → General → Languages.

Custom Models

Currently, Vox supports only the five built-in Whisper variants. Custom model support may be added in future versions.
