Shipping DropVox 1.0: From Python Prototype to Native Swift

12 min read
dropvox · swift · swiftui · whisperkit · macos · indie-hacking · rewrite

Three weeks ago I wrote about building DropVox as a Python prototype: a menu bar app that transcribes voice messages locally using Whisper. It worked. I used it every day. My wife's WhatsApp voice messages no longer piled up unread while I was in meetings.

Then I rewrote the entire thing in Swift.

This is the story of that rewrite, what I learned, and how DropVox went from a weekend hack to a commercial macOS product in 24 days.

Why Rewrite at All?

The Python version worked fine for me. But "works fine for me" is not a product. Here's what was limiting the prototype:

rumps hit its ceiling fast. The library is wonderful for simple menu bar apps, but the moment I wanted anything beyond a dropdown menu, I was fighting the framework. The floating drop zone I'd dreamed about since day one? Impossible with rumps. Drag-and-drop? Requires PyObjC bindings that are fragile and poorly documented. Custom UI beyond the menu? Not happening.

Distribution was painful. Shipping a Python app to non-developers means bundling a Python runtime, managing dependencies, dealing with code signing complications, and hoping py2app doesn't break on the next macOS update. Every release was a gamble.

The performance overhead was noticeable. Python's GIL meant the UI occasionally stuttered during model loading. The app consumed more memory than it should have. Startup was slow because the entire Python runtime had to initialize before the menu bar icon appeared.

I wanted real macOS integration. Native notifications with actions. Proper keyboard shortcuts. Accessibility support. Share Extension support. The kind of polish that makes an app feel like it belongs on macOS, not like it's visiting.

On January 22, seven days after shipping the prototype, I started the rewrite.

The New Stack

The technology choices were deliberate:

Language:     Swift 5.9+
UI:           SwiftUI
AI Engine:    WhisperKit (by Argmax)
Target:       macOS 14+ Sonoma
Architecture: Actor-based concurrency
Distribution: Developer ID + Notarization

Why WhisperKit Over OpenAI Whisper

The Python prototype used OpenAI's reference Whisper implementation. For Swift, I needed something native. WhisperKit from Argmax is a Swift package that runs Whisper models using Apple's Core ML framework. The difference is substantial.

WhisperKit leverages the Neural Engine on Apple Silicon. The same transcription that took 30 seconds in Python now completes in under 10 seconds on an M1 MacBook. The models are optimized for Apple hardware specifically, and the memory footprint is dramatically lower.

The integration is clean:

import WhisperKit

// Load the base model, preferring the Neural Engine for the audio encoder
let whisperKit = try await WhisperKit(
    model: "openai_whisper-base",
    computeOptions: .init(audioEncoderCompute: .cpuAndNeuralEngine)
)

// transcribe(audioPath:) returns one result per audio segment;
// join the segments into a single string
let result = try await whisperKit.transcribe(audioPath: fileURL.path)
let transcription = result.map { $0.text }.joined(separator: " ")

Five model sizes are available, from Tiny at 75MB to Large at 3GB. Each step up trades speed for accuracy. For voice messages, Base hits the sweet spot. For longer recordings or heavy accents, users can step up to Small or Medium.
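
As a sketch, the five tiers map naturally onto a Swift enum. The identifiers follow WhisperKit's "openai_whisper-*" naming; the sizes for the intermediate tiers are rough assumptions on my part, anchored by the 75MB and 3GB figures above:

```swift
import Foundation

/// The five Whisper model tiers, smallest to largest.
enum WhisperModel: String, CaseIterable {
    case tiny, base, small, medium, large

    /// Identifier string passed to WhisperKit's initializer.
    var identifier: String { "openai_whisper-\(rawValue)" }

    /// Approximate download size, useful for warning users before
    /// fetching a multi-gigabyte model. Intermediate values are estimates.
    var approximateSizeMB: Int {
        switch self {
        case .tiny:   return 75
        case .base:   return 145
        case .small:  return 480
        case .medium: return 1500
        case .large:  return 3000
        }
    }
}
```

Keeping the tier list in one enum means the settings UI, the downloader, and the transcription engine all agree on what models exist.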

Thirteen Languages Out of the Box

Whisper's multilingual support carried over to WhisperKit. DropVox 1.0 ships with 13 languages: English, Portuguese, Spanish, French, German, Italian, Dutch, Japanese, Korean, Chinese, Russian, Arabic, and Hindi.

This matters to me personally. My family speaks Portuguese. My work is in English. Having both languages transcribed accurately without switching settings was non-negotiable.

Building the Features I Always Wanted

The Floating Drop Zone

Remember the drag-and-drop dream I deferred in the Python version? It's real now.

Press Cmd+D and a translucent floating window appears on screen. Drag any audio file onto it. That's it. The window accepts drops from Finder, WhatsApp desktop, Telegram, Safari downloads, anywhere.

struct DropZoneView: View {
    @State private var isDragging = false

    var body: some View {
        ZStack {
            RoundedRectangle(cornerRadius: 16)
                .fill(.ultraThinMaterial)
                .overlay(
                    RoundedRectangle(cornerRadius: 16)
                        .strokeBorder(
                            isDragging ? Color.accentColor : Color.secondary,
                            style: StrokeStyle(lineWidth: 2, dash: [8])
                        )
                )
            // ... content
        }
        .onDrop(of: [.audio, .fileURL], isTargeted: $isDragging) { providers in
            handleDrop(providers)
        }
    }
}

SwiftUI's .onDrop modifier made this surprisingly straightforward. The window floats above all other windows and can be repositioned anywhere. When you don't need it, Cmd+D hides it again.

Clipboard Paste Support

This was the feature I didn't know I needed until I built it. Press Cmd+V while DropVox is focused, and it checks the clipboard for audio file paths. If you've copied an audio file in Finder, DropVox grabs it and starts transcribing.

The workflow for WhatsApp is now: right-click a voice message in WhatsApp desktop, copy it, switch to DropVox, Cmd+V, done.
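
The actual pasteboard read uses NSPasteboard on macOS; the filtering step behind it can be sketched as a pure function (the names here are mine, chosen for illustration):

```swift
import Foundation

/// Audio extensions DropVox accepts, compared case-insensitively.
let supportedAudioExtensions: Set<String> = ["opus", "mp3", "m4a", "wav"]

/// Filters pasted file URLs down to the audio files DropVox can
/// transcribe. On macOS the URLs would come from
/// NSPasteboard.general.readObjects(forClasses: [NSURL.self], options: nil).
func audioURLs(from pasted: [URL]) -> [URL] {
    pasted.filter { supportedAudioExtensions.contains($0.pathExtension.lowercased()) }
}
```

Anything that isn't a supported audio file is silently ignored, so pasting a mixed selection from Finder just transcribes the audio and skips the rest.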

Transcription History

The Python version was ephemeral by design: transcribe, copy to clipboard, forget. That sounded good in theory but was terrible in practice. I constantly found myself re-transcribing the same files because I'd pasted the text somewhere and lost it.

DropVox 1.0 keeps a searchable history of all transcriptions. Each entry shows the filename, language, model used, duration, and the full text. Search works across all fields, so I can find "that message about dinner" from three days ago without remembering the filename.

History is stored locally in a SQLite database. Nothing leaves the machine.
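
The search behavior can be sketched with an in-memory stand-in for the store (the real one persists to SQLite; the type and field names here mirror what the history list shows, but are illustrative):

```swift
import Foundation

/// One saved transcription, mirroring the fields shown in the history list.
struct HistoryEntry {
    let filename: String
    let language: String
    let model: String
    let duration: TimeInterval
    let text: String
}

/// In-memory sketch of the history store, demonstrating
/// search-across-all-fields. The shipping version backs this with SQLite.
struct HistoryStore {
    private(set) var entries: [HistoryEntry] = []

    mutating func add(_ entry: HistoryEntry) { entries.append(entry) }

    /// Case-insensitive match against filename, language, model, or text.
    func search(_ query: String) -> [HistoryEntry] {
        let q = query.lowercased()
        return entries.filter {
            $0.filename.lowercased().contains(q)
                || $0.language.lowercased().contains(q)
                || $0.model.lowercased().contains(q)
                || $0.text.lowercased().contains(q)
        }
    }
}
```

Matching on every field is what makes "that message about dinner" findable without remembering which file it came from.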

Format Support

DropVox handles the audio formats people actually encounter:

  • .opus -- WhatsApp voice messages
  • .mp3 -- The universal format
  • .m4a -- iPhone voice memos and recordings
  • .wav -- Lossless audio from professional tools

The .opus support was critical. WhatsApp uses Opus encoding for voice messages, and most transcription tools don't handle it natively. DropVox converts it transparently before processing.
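
The dispatch logic for that conversion is simple to sketch. Core Audio doesn't open Ogg-contained Opus files directly, which is why the conversion step exists; the converter itself is abstracted away here as a placeholder closure, not DropVox's actual implementation:

```swift
import Foundation

/// WhatsApp's .opus files use an Ogg container, which AVFoundation does
/// not read natively, so they are converted (e.g. to WAV) before
/// transcription.
func needsConversion(_ url: URL) -> Bool {
    url.pathExtension.lowercased() == "opus"
}

/// Returns the URL to hand to the transcription engine, converting first
/// when required. `convertToWAV` stands in for the real decoder.
func prepareForTranscription(
    _ url: URL,
    convertToWAV: (URL) throws -> URL
) rethrows -> URL {
    needsConversion(url) ? try convertToWAV(url) : url
}
```

Formats Core Audio already understands (.mp3, .m4a, .wav) pass straight through with no temporary files.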

Architecture Decisions

Actor-Based Concurrency

Swift's actor model solved the threading headaches I had in Python. The transcription engine runs on its own actor, preventing data races without manual lock management:

actor TranscriptionEngine {
    private var whisperKit: WhisperKit?
    private var currentModel: WhisperModel = .base

    func transcribe(_ audioURL: URL, language: String?) async throws -> TranscriptionResult {
        guard let whisper = whisperKit else {
            throw TranscriptionError.modelNotLoaded
        }

        let result = try await whisper.transcribe(audioPath: audioURL.path)
        return TranscriptionResult(
            text: result.map { $0.text }.joined(separator: " "),
            language: language ?? "auto",
            duration: audioURL.audioDuration
        )
    }

    func loadModel(_ model: WhisperModel) async throws {
        whisperKit = try await WhisperKit(model: model.identifier)
        currentModel = model
    }
}

The actor keyword serializes all access to the engine's mutable state, so model loading never corrupts an in-flight transcription and there are no data races. No semaphores. No dispatch queues. No manual locking.

Protocol-Driven Design

Every major component is defined by a protocol. The transcription engine, the license validator, the history store, the notification service. Each can be swapped out for a mock in tests or an alternative implementation.

protocol TranscriptionProvider {
    func transcribe(_ url: URL, language: String?) async throws -> TranscriptionResult
    func loadModel(_ model: WhisperModel) async throws
    var isReady: Bool { get }
}

protocol LicenseValidator {
    func validate(_ key: String) async throws -> LicenseStatus
    var currentStatus: LicenseStatus { get }
}

This paid off immediately when I needed to test the license validation flow without hitting the server. Inject a mock validator, simulate expired licenses, test grace periods. All without network calls.
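
A mock for that flow is a few lines (the protocol is repeated here so the sketch is self-contained; LicenseStatus's cases are assumed from context):

```swift
import Foundation

enum LicenseStatus: Equatable {
    case valid, expired, free
}

/// The app's protocol, repeated so this sketch compiles on its own.
protocol LicenseValidator {
    func validate(_ key: String) async throws -> LicenseStatus
    var currentStatus: LicenseStatus { get }
}

/// Test double: returns a preset status without any network calls,
/// letting tests simulate valid, expired, or free states directly.
struct MockLicenseValidator: LicenseValidator {
    let stubbedStatus: LicenseStatus

    var currentStatus: LicenseStatus { stubbedStatus }

    func validate(_ key: String) async throws -> LicenseStatus {
        stubbedStatus
    }
}
```

Injecting `MockLicenseValidator(stubbedStatus: .expired)` exercises the expiry UI without a server in sight.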

Monetization: Free Tier + One-Time Purchase

I deliberated on pricing for longer than I'd like to admit. The model I landed on:

Free tier: 3 transcriptions per day, 60-second maximum duration per file. Enough to genuinely use the app and decide if it's worth paying for. Not a crippled trial that frustrates users into paying.

Pro license: $9.99 USD / R$49.90 BRL. One-time purchase. Unlimited transcriptions, unlimited duration, all model sizes, all languages.
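
Enforcing those limits reduces to a pure function, which keeps the gate trivially testable (the names here are illustrative, not DropVox's actual code):

```swift
import Foundation

/// Free-tier limits described above.
let freeDailyLimit = 3
let freeMaxDuration: TimeInterval = 60

enum TierDecision: Equatable {
    case allowed
    case dailyLimitReached
    case fileTooLong
}

/// Pro users always pass; free users are checked against today's count
/// and the per-file duration cap.
func checkFreeTier(isPro: Bool, transcriptionsToday: Int, duration: TimeInterval) -> TierDecision {
    guard !isPro else { return .allowed }
    if transcriptionsToday >= freeDailyLimit { return .dailyLimitReached }
    if duration > freeMaxDuration { return .fileTooLong }
    return .allowed
}
```

Because the decision is a value, the UI can show a specific message for each case instead of a generic "upgrade" wall.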

Why One-Time, Not Subscription

DropVox runs entirely on the user's hardware. There are no server costs per user. No API calls to bill for. No cloud infrastructure scaling with usage. Charging monthly for software that uses zero ongoing resources felt dishonest.

One-time pricing also aligns with the indie ethos. When someone pays $9.99, they own it. No anxiety about canceling. No "am I still getting value?" calculations every month.

Multi-Currency Pricing

The BRL price isn't a direct conversion. $9.99 USD would be roughly R$60 at current rates. I priced it at R$49.90 because purchasing power in Brazil is different, and I'd rather have more Brazilian users at a fair local price than fewer users at a "correct" exchange rate.

License Validation

The license system validates online when possible but includes a 7-day offline grace period. If someone's internet goes down or they're traveling without connectivity, the app keeps working. Licenses are tied to a machine identifier, with a generous activation limit for people who own multiple Macs.

actor LicenseManager {
    private var cachedStatus: LicenseStatus?
    private var lastValidation: Date?

    func checkLicense() async -> LicenseStatus {
        // If we validated recently and it was valid, trust the cache
        if let cached = cachedStatus,
           let lastCheck = lastValidation,
           cached == .valid,
           Date().timeIntervalSince(lastCheck) < gracePeriod {
            return .valid
        }

        // Try online validation
        do {
            let status = try await validateOnline()
            cachedStatus = status
            lastValidation = Date()
            return status
        } catch {
            // Network failure: use grace period
            if let lastCheck = lastValidation,
               Date().timeIntervalSince(lastCheck) < gracePeriod {
                return cachedStatus ?? .free
            }
            return .free
        }
    }
}

Code Signing and Distribution

This was the part I dreaded most. Apple's code signing and notarization process has a reputation for being painful, and the reputation is earned. But it's non-negotiable for a commercial macOS app.

The setup:

  1. Apple Developer ID certificate for signing the app outside the App Store
  2. Notarization through Apple's service to prove the app isn't malware
  3. Stapling the notarization ticket to the app so it works offline
  4. GitHub Actions for automated builds on every tag

The CI/CD pipeline builds a universal binary (Intel + Apple Silicon), signs it, submits for notarization, waits for approval, staples the ticket, creates a DMG, and publishes a GitHub release. The entire flow runs in about 8 minutes.

# Simplified GitHub Actions workflow
- name: Build
  run: xcodebuild -scheme DropVox -configuration Release

- name: Sign
  # Hardened runtime and a secure timestamp are required for notarization
  run: codesign --force --options runtime --timestamp --sign "$DEVELOPER_ID" DropVox.app

- name: Notarize
  run: xcrun notarytool submit DropVox.zip --apple-id "$APPLE_ID" --team-id "$TEAM_ID" --password "$NOTARY_PASSWORD" --wait

- name: Staple
  run: xcrun stapler staple DropVox.app

Getting this pipeline working was a full day of debugging. Apple's error messages are cryptic. The documentation assumes you're using Xcode's GUI. But once it works, every release is automatic. Tag a commit, push, wait for the GitHub Action, done.

The Timeline

Here's how 24 days broke down:

  • Days 1-3: Project setup, SwiftUI menu bar app skeleton, WhisperKit integration
  • Days 4-6: Core transcription flow, model selection, language support
  • Days 7-9: Floating drop zone, clipboard integration, keyboard shortcuts
  • Days 10-12: Transcription history, search, SQLite persistence
  • Days 13-15: License system, free tier limits, payment integration
  • Days 16-18: Code signing, notarization, GitHub Actions CI/CD
  • Days 19-21: Polish, edge cases, .opus handling, error states
  • Days 22-24: Website updates, documentation, launch preparation

Twenty-four days from first Swift commit to a signed, notarized, commercially available macOS application. Not a prototype. Not an MVP with asterisks. A real product.

What I Learned

Rewrites Are Sometimes the Right Call

The conventional wisdom is "never rewrite." And for large systems with many contributors, that's usually correct. But for a single-developer project where the original was a deliberate prototype? The rewrite was the product development.

The Python version proved the concept. The Swift version is the product. They're different things with different goals. Trying to evolve the Python version into a commercial product would have taken longer than rewriting.

SwiftUI is Ready for Menu Bar Apps

SwiftUI still has rough edges for complex layouts, but for menu bar apps and utility windows, it's excellent. The declarative syntax maps perfectly to the kind of UI DropVox needs. Settings views, popover menus, the floating drop zone, all came together cleanly.

The state management with @Observable (new in iOS 17 / macOS 14) is a massive improvement over the old @ObservableObject pattern. Less boilerplate, better performance, more intuitive.
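
A minimal sketch of the pattern (the type and property names are illustrative, not DropVox's actual view model):

```swift
import Observation

// With the @Observable macro (macOS 14+), plain stored properties are
// tracked automatically: any SwiftUI view that reads them re-renders
// when they change. No @Published, no objectWillChange.send().
@Observable
final class TranscriptionViewModel {
    var isTranscribing = false
    var statusText = "Idle"

    func begin() {
        isTranscribing = true
        statusText = "Transcribing…"
    }
}
```

A view just reads `viewModel.statusText` directly; the macro handles the change tracking that @ObservableObject forced you to wire up by hand.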

Apple's Developer Experience Has Gaps

Code signing documentation is scattered across multiple guides that contradict each other. Notarization errors reference internal Apple IDs that mean nothing to developers. The tooling assumes Xcode GUI usage, making CI/CD setup an archaeological exercise.

This is the kind of friction that keeps indie developers from shipping native macOS apps. It's solvable, but Apple could make it dramatically better.

Local AI Is a Legitimate Product Category

When I started DropVox, "local AI" felt like a niche concern for privacy-conscious developers. But the response to the Python prototype showed real demand from non-technical users who simply don't want their voice messages on someone else's server.

With Apple Silicon's Neural Engine, local AI isn't just a privacy feature. It's a performance feature. WhisperKit on an M-series Mac is faster than most cloud APIs because there's no network round trip. The model runs where the data already is.

What's Next

DropVox 1.0 is shipped, but the roadmap is far from empty:

Share Extension. The feature that could change everything. Right-click an audio file in any app, share to DropVox, get the transcription back. No switching apps, no drag-and-drop, just a system-level integration.

App Store consideration. Direct distribution gives me full control over the experience, but the App Store provides discovery. I'm watching how v1.0 performs before committing to the App Store review process and its 30% cut.

Speaker diarization. Identifying who said what in a multi-person recording. WhisperKit doesn't support this yet, but the underlying models are getting close.

Real-time transcription. Live microphone input with streaming transcription. The latency requirements are tight, but WhisperKit's performance on Apple Silicon makes it feasible.

Try It

DropVox is available at dropvox.app. The free tier gives you 3 transcriptions per day. If you send or receive voice messages regularly and care about where your audio data goes, give it a shot.

The Pro license is $9.99 -- less than one month of any cloud transcription service, and it's yours permanently.


This is the follow-up to my January 15 post about building the Python prototype. If you're curious about the origin story, start there. If you have questions about the Swift rewrite or indie macOS development, find me on GitHub.