hyperframes-media

Summary

Generate speech, transcribe audio with timestamps, and remove video backgrounds for transparent overlays.

Three CLI commands (tts, transcribe, remove-background) that each download and cache their own model on first run; no API keys required
Text-to-speech supports 54 voices (American and British English, Spanish, French, Hindi, Italian, Japanese, Portuguese, Mandarin) with speed control; the language is detected automatically from the voice ID prefix
Transcription produces word-level timestamps in normalized JSON; accepts audio, video, SRT/VTT, and OpenAI response input, with configurable Whisper model sizes and explicit language selection to prevent silent translation errors
Background removal outputs VP9 WebM with alpha channel (or ProRes/PNG) for transparent overlays; optional --background-output flag creates a hole-cut inverse layer for compositing text or graphics between subject and background

SKILL.md

HyperFrames Media Preprocessing

Three CLI commands that produce assets for compositions: tts (speech), transcribe (timestamps), and remove-background (transparent video). Each downloads a model on first run and caches it under ~/.cache/hyperframes/. Drop the output into the project, then reference it from the composition HTML — see the hyperframes skill for the audio/video element conventions.
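
Because every command downloads its model on first use, it can help to check what is already cached before working offline. A minimal sketch, assuming only the cache path given above; `hf_cache_dir` and `hf_cache_report` are hypothetical helper names, not part of the hyperframes CLI, and the layout inside the directory is not assumed:

```shell
# Hypothetical helpers; only the ~/.cache/hyperframes path comes from this document.
hf_cache_dir() {
  printf '%s\n' "${HOME}/.cache/hyperframes"
}

# Print the total size of the cache, or a note if nothing has been downloaded yet.
hf_cache_report() {
  dir="$(hf_cache_dir)"
  if [ -d "$dir" ]; then
    du -sh "$dir"
  else
    echo "no cache yet at $dir (models download on first run)"
  fi
}

hf_cache_report
```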

Text-to-Speech (`tts`)

Generate speech audio locally with Kokoro-82M. No API key.

npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list                       # all 54 voices
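
The voice ID itself carries the language, which is why no separate language option appears in the commands above. A minimal sketch of that mapping, assuming Kokoro's published naming convention (first letter for the language or accent, second letter `f` or `m` for the speaker's gender, then an underscore and a name); `voice_language` is a hypothetical helper for illustration, and the CLI's own detection may differ:

```shell
# Hypothetical helper: guess the language of a Kokoro voice ID from its prefix.
# Assumption: IDs look like <language letter><f|m>_<name>, e.g. af_nova, bf_emma.
voice_language() {
  case "$1" in
    a[fm]_*) echo "American English" ;;
    b[fm]_*) echo "British English" ;;
    e[fm]_*) echo "Spanish" ;;
    f[fm]_*) echo "French" ;;
    h[fm]_*) echo "Hindi" ;;
    i[fm]_*) echo "Italian" ;;
    j[fm]_*) echo "Japanese" ;;
    p[fm]_*) echo "Portuguese" ;;
    z[fm]_*) echo "Mandarin" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

voice_language af_nova   # prints: American English
voice_language bf_emma   # prints: British English
```

Run `npx hyperframes tts --list` for the authoritative set of voice IDs rather than relying on this table.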