hyperframes-media
Summary
Generate speech, transcribe audio with timestamps, and remove video backgrounds for transparent overlays.
- Three CLI commands (tts, transcribe, remove-background) that each download and cache their own model on first run; no API keys required
- Text-to-speech supports 54 multilingual voices (American, British, Spanish, French, Hindi, Italian, Japanese, Portuguese, Mandarin) with speed control; auto-detects language from voice prefix
- Transcription produces word-level timestamps in normalized JSON; supports multiple input formats (audio, video, SRT/VTT, OpenAI responses) with configurable Whisper model sizes and explicit language selection to prevent silent translation errors
- Background removal outputs VP9 WebM with alpha channel (or ProRes/PNG) for transparent overlays; optional --background-output flag creates a hole-cut inverse layer for compositing text or graphics between subject and background
SKILL.md
HyperFrames Media Preprocessing
Three CLI commands that produce assets for compositions: tts (speech), transcribe (timestamps), and remove-background (transparent video). Each downloads a model on first run and caches it under ~/.cache/hyperframes/. Drop the output into the project, then reference it from the composition HTML — see the hyperframes skill for the audio/video element conventions.
Text-to-Speech (tts)
Generate speech audio locally with Kokoro-82M. No API key.
npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list # all 54 voices
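When a project has several script files, the file-input form above can be repeated per file. The following is a minimal dry-run sketch: `tts_batch_commands` is a hypothetical helper (not part of hyperframes) that only prints the commands it would run, using the `--voice` and `--output` flags shown above. It assumes each script ends in `.txt` and that the output should sit next to it with a `.wav` extension.

```shell
# Hypothetical helper: print one `hyperframes tts` command per script file.
# Usage: tts_batch_commands <voice> <script.txt>...
# Nothing is executed; the function only writes the commands to stdout.
tts_batch_commands() {
  voice="$1"
  shift
  for script in "$@"; do
    # Assumption: scripts end in .txt; swap the extension for .wav.
    out="${script%.txt}.wav"
    printf 'npx hyperframes tts "%s" --voice %s --output "%s"\n' \
      "$script" "$voice" "$out"
  done
}

# Example: preview the commands for two narration scripts.
tts_batch_commands bf_emma intro.txt outro.txt
```

Review the printed lines first, then pipe them to `sh` to run them. File names containing double quotes would need extra escaping that this sketch does not handle.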
Voice Selection
Match voice to content. Default is af_heart.
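Voice IDs such as af_heart, af_nova, and bf_emma follow a prefix pattern, which is what language auto-detection keys on. The sketch below illustrates one plausible reading of that pattern. It assumes the Kokoro naming convention, where the first letter encodes the language and the second letter is f or m. `voice_language` is a hypothetical helper, and the letter-to-language table is an assumption that should be checked against the output of `npx hyperframes tts --list`.

```shell
# Hypothetical helper: guess the language of a voice ID from its prefix.
# Assumed convention: <language letter><f|m>_<name>, e.g. af_heart.
voice_language() {
  case "$1" in
    a[fm]_*) echo "American English" ;;
    b[fm]_*) echo "British English" ;;
    e[fm]_*) echo "Spanish" ;;
    f[fm]_*) echo "French" ;;
    h[fm]_*) echo "Hindi" ;;
    i[fm]_*) echo "Italian" ;;
    j[fm]_*) echo "Japanese" ;;
    p[fm]_*) echo "Portuguese" ;;
    z[fm]_*) echo "Mandarin" ;;
    *) echo "unknown"; return 1 ;;  # prefix not in the assumed table
  esac
}

voice_language af_heart   # the default voice
voice_language bf_emma
```

A lookup like this can be handy for choosing a voice whose language matches the script text before generating audio.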
