Building a Local-First Pipeline: yt-dlp, ffmpeg, and Whisper in Electron
The Pipeline Problem
Falavra does something that sounds simple: you paste a YouTube URL, and it gives you a transcript. Behind that simple interaction, four separate tools have to work in sequence, passing data between them, reporting progress, handling failures, and cleaning up after themselves. All of it running locally on your machine, no cloud APIs, no server round-trips.
The pipeline looks like this:
YouTube URL → yt-dlp (extract audio) → ffmpeg (convert to 16kHz WAV) → sherpa-onnx Whisper (transcribe) → SQLite → UI
Each arrow is a boundary where things can fail. The URL might point to a private video. The audio format might be unexpected. The Whisper model might not be downloaded yet. The user might cancel mid-transcription. Every one of those scenarios needs handling, and the user needs to see meaningful progress through the whole thing.
This is the architecture story of how I built that pipeline inside Electron.
The Pipeline Orchestrator
The core of Falavra's architecture is pipeline.ts, a job queue that processes one transcription at a time. This constraint exists because Whisper transcription is CPU-intensive. Running two Whisper instances in parallel on a consumer MacBook would either OOM the process or slow both jobs to a crawl. Sequential processing is not a limitation -- it is a deliberate architectural decision.
The orchestrator manages a simple state machine for each job:
type JobStatus = 'queued' | 'downloading' | 'converting' | 'diarizing' | 'transcribing' | 'completed' | 'failed' | 'cancelled';
When a job enters the queue, the orchestrator checks whether another job is already running. If not, the new job starts processing immediately; if one is, the new job waits in the queue and the UI shows its position. When the active job completes (or fails, or gets cancelled), the orchestrator automatically picks up the next queued job.
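As a sketch, that queue logic looks something like the following. The class and method names, and the reduced status union, are illustrative rather than Falavra's actual implementation:

```typescript
// A sketch of the sequential job queue. Names and the reduced status
// union are illustrative, not Falavra's actual code.
type SketchStatus = 'queued' | 'downloading' | 'completed' | 'failed';

interface Job {
  id: string;
  url: string;
  status: SketchStatus;
}

class PipelineQueue {
  private jobs: Job[] = [];
  private active: Job | null = null;

  // run() would drive a job through download -> convert -> transcribe.
  constructor(private run: (job: Job) => Promise<void>) {}

  enqueue(url: string): Job {
    const job: Job = { id: String(this.jobs.length + 1), url, status: 'queued' };
    this.jobs.push(job);
    void this.pump();
    return job;
  }

  // 1-based position among jobs still waiting; 0 if not queued.
  position(id: string): number {
    return this.jobs.filter((j) => j.status === 'queued')
      .findIndex((j) => j.id === id) + 1;
  }

  private async pump(): Promise<void> {
    if (this.active) return; // one job at a time
    const next = this.jobs.find((j) => j.status === 'queued');
    if (!next) return;
    this.active = next;
    next.status = 'downloading'; // first stage begins
    try {
      await this.run(next);
      next.status = 'completed';
    } catch {
      next.status = 'failed'; // a failed job never blocks the queue
    } finally {
      this.active = null;
      void this.pump(); // automatically pick up the next queued job
    }
  }
}
```

The finally branch is what makes the automatic pickup reliable: whatever happens to one job, the next queued job gets its turn.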
This pattern is deliberately simple. I considered more sophisticated approaches -- priority queues, concurrent downloads with sequential transcription, job persistence across app restarts. I shipped none of them. The sequential queue handles the actual use case: someone pasting a few YouTube URLs and letting Falavra work through them. Over-engineering the queue would have delayed shipping by weeks for scenarios that rarely occur.
Stage Separation
Each pipeline stage lives in its own service file: ytdlp.ts, ffmpeg.ts, whisper.ts. This separation is not just organizational cleanliness -- it enables independent error handling, independent testing, and makes the pipeline extensible.
The yt-dlp stage takes a URL and produces an audio file. It spawns yt-dlp as a child process with specific flags to extract audio only, targeting the best available audio format. The key detail here is parsing yt-dlp's stdout for progress information. yt-dlp outputs download progress in a semi-structured format that needs regex parsing to extract percentage values.
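A minimal sketch of that parsing, assuming yt-dlp's typical "[download]  42.3% of ..." output lines (the exact regex here is my own, not Falavra's):

```typescript
// Extracts the percentage from a yt-dlp progress line, e.g.
//   "[download]  42.3% of 10.00MiB at 1.23MiB/s ETA 00:05"
// Returns null for lines that carry no progress information.
const PROGRESS_RE = /\[download\]\s+(\d+(?:\.\d+)?)%/;

function parseYtdlpProgress(line: string): number | null {
  const match = PROGRESS_RE.exec(line);
  return match ? parseFloat(match[1]) : null;
}
```

Each parsed percentage then feeds the progress mapping described later, as the download stage's internal 0-100%.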
The ffmpeg stage takes whatever audio format yt-dlp produced and converts it to a 16kHz mono WAV file. This specific format is what Whisper expects. The conversion is usually fast -- a few seconds for a typical video -- but it is a necessary normalization step. Without it, you would need to handle every audio codec that YouTube might serve, and Whisper's accuracy drops with non-standard sample rates.
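The conversion itself boils down to a handful of standard ffmpeg flags; a plausible argument list looks like this (the helper and paths are placeholders, not Falavra's code):

```typescript
// Builds the ffmpeg argument list for Whisper-ready audio. The flags are
// standard ffmpeg options; the helper itself is a sketch.
function buildFfmpegArgs(input: string, output: string): string[] {
  return [
    '-i', input,          // whatever audio format yt-dlp produced
    '-ar', '16000',       // resample to 16 kHz, the rate Whisper expects
    '-ac', '1',           // downmix to mono
    '-c:a', 'pcm_s16le',  // 16-bit PCM, i.e. a plain WAV stream
    '-y',                 // overwrite any stale output
    output,
  ];
}
```

In the main process this would be handed to a child-process spawn of the bundled ffmpeg binary.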
The Whisper stage loads the sherpa-onnx model and runs inference on the WAV file. This is the longest stage by far. A 10-minute video takes roughly 30-60 seconds to transcribe depending on the model size and hardware. sherpa-onnx provides progress callbacks, which feed directly into the progress mapping system.
Progress Mapping
Users need a single progress bar that moves smoothly from 0% to 100%. But the three stages have wildly different durations. Download speed depends on network. Conversion is nearly instant. Transcription dominates. If I mapped each stage to a third of the progress bar, the conversion stage would flash by while the transcription stage would crawl.
The solution is weighted progress mapping:
// Standard pipeline
const PROGRESS_MAP = {
  download:   { start: 0,  end: 30 },  // 0-30%
  convert:    { start: 30, end: 50 },  // 30-50%
  transcribe: { start: 50, end: 100 }  // 50-100%
};

// With diarization enabled
const PROGRESS_MAP_DIARIZED = {
  download:   { start: 0,  end: 25 },  // 0-25%
  convert:    { start: 25, end: 40 },  // 25-40%
  diarize:    { start: 40, end: 55 },  // 40-55%
  transcribe: { start: 55, end: 100 }  // 55-100%
};
Each stage reports its own internal progress as 0-100%, and the pipeline orchestrator maps that to the appropriate segment of the overall progress bar. The conversion stage gets a larger allocation than its actual duration warrants, which creates a smoother perceived experience. The progress bar never appears stuck, even if the network download is fast.
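The mapping itself is a small linear interpolation; a sketch using segments shaped like the PROGRESS_MAP entries above (the helper is mine, not Falavra's):

```typescript
// Maps a stage's internal 0-100% onto that stage's slice of the overall
// progress bar. Segment mirrors the shape of the PROGRESS_MAP entries.
interface Segment {
  start: number;
  end: number;
}

function mapProgress(segment: Segment, stagePercent: number): number {
  // Clamp defensively: tools occasionally report slightly out-of-range values.
  const clamped = Math.min(100, Math.max(0, stagePercent));
  return segment.start + ((segment.end - segment.start) * clamped) / 100;
}
```

So a conversion stage reporting 50% internally would surface as 40% overall in the standard pipeline.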
Bundled Binaries
Falavra bundles yt-dlp and ffmpeg rather than requiring users to install them globally. This was a non-negotiable decision. Asking non-technical users to install Homebrew and then run brew install yt-dlp ffmpeg would kill the onboarding experience.
The binary resolution logic lives in binary-paths.ts:
function getBinaryPath(name: string): string {
  const isDev = !app.isPackaged;
  const basePath = isDev
    ? path.join(__dirname, '..', '..', 'binaries')
    : path.join(process.resourcesPath, 'binaries');
  return path.join(basePath, process.platform, name);
}
In development, binaries live in <project-root>/binaries/<platform>/. In production, they are in process.resourcesPath/binaries/<platform>/. A setup script (npm run setup) downloads the correct platform-specific binaries. This script runs once after cloning the repo and again when binary versions need updating.
The split between dev and production paths is critical. During development, you want binaries that are easy to update and test. In production, process.resourcesPath is the correct Electron location for bundled assets -- it survives code signing and notarization, and it is read-only, which prevents accidental corruption.
Data Storage Architecture
Falavra uses four distinct storage mechanisms, each chosen for its specific use case:
Settings (electron-store): App preferences like default model size, theme, window position. electron-store writes JSON files to userData. It is the right tool for key-value configuration that rarely changes and never needs querying.
Transcripts (SQLite via better-sqlite3): The transcript library lives in userData/falavra.db. SQLite is synchronous in better-sqlite3, which simplifies the main process code considerably. No async/await chains for database operations -- just direct function calls that return results.
Full-text search (SQLite FTS5): This is one of the decisions I am most happy with. The FTS5 virtual table indexes title, channel name, and the full transcript text:
CREATE VIRTUAL TABLE transcripts_fts USING fts5(
  title, channel, transcript,
  content='transcripts',
  content_rowid='id'
);
Search across thousands of transcripts returns results in milliseconds. No Elasticsearch, no external search service, no network latency. FTS5 supports phrase matching, prefix queries, and boolean operators out of the box. For a desktop app, this is more than sufficient.
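Because this is an external-content table, the index has to be kept in sync with the base table; a typical approach (the trigger name and query below are illustrative, with column names following the schema above) looks like this:

```sql
-- Illustrative: an insert trigger keeps the external-content index in sync
-- (updates and deletes need matching triggers), followed by a ranked search.
CREATE TRIGGER transcripts_ai AFTER INSERT ON transcripts BEGIN
  INSERT INTO transcripts_fts(rowid, title, channel, transcript)
  VALUES (new.id, new.title, new.channel, new.transcript);
END;

SELECT transcripts.*
FROM transcripts
JOIN transcripts_fts ON transcripts_fts.rowid = transcripts.id
WHERE transcripts_fts MATCH '"progress bar"'  -- phrase query
ORDER BY rank;                                -- FTS5's built-in BM25 ranking
```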
Models (filesystem): Whisper models are stored in userData/models/sherpa-onnx-whisper-{size}/. They range from 40MB (tiny) to 1.5GB (large-v3). These are downloaded on demand, and the model management UI shows download progress and disk usage per model.
Temp files (os.tmpdir): Each pipeline job creates temporary files in os.tmpdir()/falavra/. The downloaded audio, the converted WAV, and any intermediate files live here during processing. After each job completes (success or failure), the temp directory for that job is cleaned up.
This last point matters more than you might think. An hour of YouTube audio produces roughly 115MB of WAV data at 16kHz. If you process ten videos and skip cleanup, that is over a gigabyte of orphaned temp files. The cleanup runs in a finally block so it executes even on job failure or cancellation.
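The cleanup pattern can be sketched as a wrapper that guarantees removal regardless of outcome (the helper name is illustrative; the point is the finally block):

```typescript
import { mkdtempSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import * as path from 'node:path';

// Creates a per-job temp directory, runs the job's stages inside it, and
// removes it no matter how the job ends. Sketch, not Falavra's actual code.
async function withJobTempDir<T>(
  jobId: string,
  work: (dir: string) => Promise<T>,
): Promise<T> {
  const dir = mkdtempSync(path.join(tmpdir(), `falavra-${jobId}-`));
  try {
    return await work(dir); // download, convert, transcribe in here
  } finally {
    // Runs on success, failure, and cancellation alike.
    rmSync(dir, { recursive: true, force: true });
  }
}
```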
IPC Bridge Architecture
Electron's process model requires explicit communication between the main process (Node.js, where the pipeline runs) and the renderer process (React, where the UI lives). Falavra uses a typed IPC bridge defined in preload/index.ts using contextBridge.exposeInMainWorld.
The bridge exposes three categories of operations:
// Pipeline operations
transcribeUrl(url: string, options: TranscribeOptions): void;
cancelJob(jobId: string): void;
onPipelineProgress(callback: (event: ProgressEvent) => void): void;

// Library operations
getTranscripts(options: QueryOptions): Promise<Transcript[]>;
searchTranscripts(query: string): Promise<Transcript[]>;
exportTranscript(id: string, format: ExportFormat): Promise<string>;

// Model operations
downloadModel(modelId: string): void;
setActiveModel(modelId: string): Promise<void>;
onModelDownloadProgress(callback: (event: ModelProgressEvent) => void): void;
The pattern here is deliberate: commands are fire-and-forget (the renderer tells the main process to do something), queries return promises (the renderer asks the main process for data), and progress events use callbacks (the main process pushes updates to the renderer).
This separation means the renderer never blocks waiting for a long operation. The transcription might take two minutes, but the UI remains responsive the entire time because it is receiving progress events rather than awaiting a promise.
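The three-category pattern can be sketched with the IPC dependency injected, so the shape is visible outside Electron. The channel names and the IpcLike interface here are my own; a real preload script would pass ipcRenderer, whose listeners also receive an event object as their first argument:

```typescript
// Minimal shape of the IPC surface the bridge needs. In Electron this
// would be ipcRenderer; here it is abstracted so the pattern is testable.
interface IpcLike {
  send(channel: string, ...args: unknown[]): void;
  invoke(channel: string, ...args: unknown[]): Promise<unknown>;
  on(channel: string, listener: (payload: unknown) => void): void;
}

function createBridge(ipc: IpcLike) {
  return {
    // Commands: fire-and-forget, no return value
    transcribeUrl: (url: string): void => ipc.send('pipeline:transcribe', url),
    cancelJob: (jobId: string): void => ipc.send('pipeline:cancel', jobId),
    // Queries: promise-returning request/response
    searchTranscripts: (query: string) => ipc.invoke('library:search', query),
    // Events: the main process pushes updates to the renderer
    onPipelineProgress: (cb: (event: unknown) => void): void =>
      ipc.on('pipeline:progress', cb),
  };
}
```

In the actual preload script, the returned object would be handed to contextBridge.exposeInMainWorld so the renderer sees only these typed functions, never raw IPC.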
Renderer State with Zustand
The renderer uses three Zustand stores, each responsible for a distinct domain:
useTranscriptionStore manages the job queue state. It tracks which jobs are queued, which is active, what the current progress is. When the main process sends a progress event, the store updates and the UI re-renders.
useLibraryStore manages the transcript library. Fetching transcripts, searching, pagination, the currently selected transcript for viewing. This store handles the optimistic updates when a new transcription completes -- the transcript appears in the library immediately without requiring a refetch.
useModelStore manages Whisper model state. Which models are downloaded, which is active, download progress for in-flight model downloads. Switching models triggers a re-initialization of the Whisper engine in the main process.
Three stores instead of one is an intentional choice. Each store has its own lifecycle and update frequency. The transcription store updates multiple times per second during active processing. The library store updates once per completed job. The model store updates rarely. Keeping them separate means progress updates during transcription do not trigger unnecessary re-renders in the library view.
Error Handling Across Stages
Each pipeline stage has its own failure modes:
yt-dlp fails when videos are private, geo-blocked, age-restricted, or removed. It also fails silently sometimes -- returning a zero-byte file instead of an error. The pipeline checks file size after download and treats zero-byte results as failures.
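The zero-byte check itself is tiny; a sketch (the function name is illustrative):

```typescript
import { statSync } from 'node:fs';

// Rejects yt-dlp's silent-failure mode: a clean exit that leaves an
// empty file behind. Throws if the file is missing or zero bytes.
function assertNonEmptyDownload(filePath: string): void {
  const { size } = statSync(filePath); // throws if the file does not exist
  if (size === 0) {
    throw new Error(`Download produced an empty file: ${filePath}`);
  }
}
```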
ffmpeg fails on corrupted audio or unsupported codecs. This is rare because yt-dlp typically produces well-formed audio, but it happens with certain live stream recordings or DRM-protected content.
Whisper model loading fails when model files are corrupted (interrupted downloads) or when the model directory is missing. The pipeline validates model files before starting transcription.
Every stage wraps its execution in try/catch, reports the error through the progress event system (with a human-readable message), and the pipeline marks the job as failed. The UI displays the error and automatically moves to the next queued job.
The critical design decision here is that stage failures do not crash the pipeline. A failed download does not prevent the next queued URL from being processed. Isolation between jobs is as important as isolation between stages.
Lessons Learned
Temp file cleanup is non-negotiable. I learned this the hard way during development when my SSD filled up after a batch testing session. The finally block pattern for cleanup is not optional -- it is critical infrastructure.
Sequential job processing is a feature. Early in development, I attempted parallel downloads with sequential transcription. The complexity of managing partially-downloaded files, coordinating queue positions, and handling cancellation of in-flight downloads was not worth the marginal time savings. One job at a time, start to finish.
SQLite FTS5 is remarkably capable. Before implementing it, I considered shipping a search service or using a JavaScript full-text search library. FTS5 handles everything Falavra needs with zero additional dependencies and sub-millisecond query times on thousands of documents.
The pipeline pattern is reusable. The stage-by-stage architecture with progress mapping, error isolation, and sequential processing is not specific to transcription. I have already identified places in DropVox where the same pattern could simplify the architecture. If you are building any desktop app that orchestrates multiple tools, this pattern is worth adopting.
Binary bundling is table stakes for desktop apps. The gap between "developer experience" and "user experience" is enormous. As a developer, I have Homebrew and can install anything. My users do not. Bundling binaries adds complexity to the build process but removes it from the user's life, and that is always the right trade-off.
If you are building local-first desktop apps or working with AI pipelines, I would love to hear about your architecture decisions. Find me on LinkedIn or check out my other projects at helrabelo.dev.