Glossary

Key terms & concepts

Definitions for the technical concepts behind MaineCoon — from streaming generation to agentic inference.

Agentic Inference Framework

An inference system with autonomous controllers that manage narrative, memory, and pacing.

MaineCoon's inference framework uses three independent agents: the Director (plans prompts and corrects drift), the Cache Manager (manages KV-cache retention and clearing), and the Buffer Controller (balances the generation-ahead buffer against interaction responsiveness). Together they enable thousand-second-scale generation without quality collapse.

Social World Model →

Audio-Visual Autoregressive Model

A model that generates each output chunk conditioned on all previously generated chunks.

Autoregressive models generate sequentially — each new chunk depends on the history of prior chunks. For MaineCoon, this means each sub-second audio-visual segment is conditioned on all previous segments, maintaining temporal coherence. The challenge is that errors compound over time, which MaineCoon addresses through self-resampling training and agentic drift correction.
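The sequential dependency described above can be sketched as a simple loop. This is a minimal illustration, not MaineCoon's implementation: `Chunk`, `generate_autoregressive`, and `toy_step` are hypothetical names, and the toy step only records how much history it was given.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, int]  # hypothetical stand-in for one sub-second audio-visual segment


def generate_autoregressive(
    step: Callable[[List[Chunk]], Chunk],
    num_chunks: int,
) -> List[Chunk]:
    """Generate chunks one at a time, each conditioned on all earlier chunks."""
    history: List[Chunk] = []
    for _ in range(num_chunks):
        chunk = step(history)   # the step function sees the full history so far
        history.append(chunk)   # the new chunk becomes context for the next step
    return history


def toy_step(history: List[Chunk]) -> Chunk:
    # Stand-in for a real model: records the size of the context it received.
    return {"index": len(history), "context_len": len(history)}


chunks = generate_autoregressive(toy_step, 4)
```

Because every step reads the accumulated history, any flaw in an early chunk is visible to all later steps, which is why errors can compound.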

Audio-Visual Sync capability →

Buffer Controller

Regulates the generation-ahead buffer to balance smooth playback with interaction speed.

Because MaineCoon generates faster than playback speed, a buffer of pre-generated content accumulates. The Buffer Controller keeps this ahead-buffer within an optimal window — enough to prevent stutter, but small enough that user interactions (new prompts) take effect within reasonable latency.
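One plausible way to keep the ahead-buffer inside a window is a two-threshold (hysteresis) rule. The sketch below assumes this design; the class name, the method, and the 1.0 s and 3.0 s bounds are illustrative choices, not values from MaineCoon.

```python
from dataclasses import dataclass


@dataclass
class BufferController:
    """Keeps the generation-ahead buffer between low_s and high_s seconds."""

    low_s: float = 1.0   # assumed lower bound: below this, playback risks stutter
    high_s: float = 3.0  # assumed upper bound: above this, new prompts show up late

    def should_generate(self, ahead_s: float, generating: bool) -> bool:
        # Hysteresis: resume when the buffer runs low, pause when it is full,
        # and otherwise keep doing whatever we were already doing.
        if ahead_s <= self.low_s:
            return True
        if ahead_s >= self.high_s:
            return False
        return generating


ctrl = BufferController()
```

Using two thresholds instead of one avoids rapid on/off toggling when the buffer hovers near a single target value.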

Cache Manager

Manages KV-cache retention — keeping identity anchors while clearing stale context.

The Cache Manager controls which generated frames remain in the model's KV-cache (attention memory). It retains character appearance frames, scene-establishing shots, and key dialogue moments as long-term anchors, while periodically clearing stale context and applying statistical anchors to correct global appearance drift.
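The retention side of this policy can be sketched as "keep every anchor, plus a recent window of everything else". This is an assumption about how such a policy might look: `CacheEntry`, `evict_stale`, and `recent_window` are hypothetical names, and the statistical-anchor correction is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheEntry:
    frame_id: int
    is_anchor: bool = False  # e.g. a character-appearance or scene-establishing frame


def evict_stale(entries: List[CacheEntry], recent_window: int) -> List[CacheEntry]:
    """Keep all anchors plus the most recent non-anchor entries.

    Assumes `entries` is ordered from oldest to newest.
    """
    non_anchor_ids = [e.frame_id for e in entries if not e.is_anchor]
    # Guard recent_window == 0 explicitly, since a [-0:] slice returns the whole list.
    keep_recent = set(non_anchor_ids[-recent_window:]) if recent_window > 0 else set()
    return [e for e in entries if e.is_anchor or e.frame_id in keep_recent]


entries = [
    CacheEntry(0, is_anchor=True),
    CacheEntry(1),
    CacheEntry(2),
    CacheEntry(3, is_anchor=True),
    CacheEntry(4),
    CacheEntry(5),
]
kept = evict_stale(entries, recent_window=2)
```

Here frames 1 and 2 are dropped as stale context, while anchors 0 and 3 survive regardless of age.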

Digital Human

AI-generated human-like characters for video communication and interaction.

Digital humans are AI-rendered characters that look, speak, and behave like real people. Platforms like HeyGen and Synthesia provide turnkey digital human video production. MaineCoon operates at a deeper layer — the real-time generative engine that can power next-generation digital human experiences with live streaming and interaction.

Compare vs HeyGen →

Director

The cognitive controller that plans narrative beats and corrects quality drift.

The Director is MaineCoon's narrative and quality-control agent. It uses a planner to generate structured prompts (visual description + dialogue + ambient audio) beat-by-beat, and an observer to monitor output for quality drift. When drift is detected, it triggers forward-fix correction on the next frame without interrupting the stream.
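The planner/observer loop can be sketched as follows, under the assumption that drift is summarized as a single score compared against a threshold. `BeatPrompt`, `run_director`, `drift_of`, and the `corrective` flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BeatPrompt:
    visual: str
    dialogue: str
    ambient_audio: str
    corrective: bool = False  # set when the previous beat showed drift


def run_director(
    beats: List[BeatPrompt],
    drift_of: Callable[[BeatPrompt], float],
    threshold: float,
) -> List[BeatPrompt]:
    """Issue prompts beat by beat, flagging the beat after any detected drift."""
    issued: List[BeatPrompt] = []
    fix_next = False
    for beat in beats:
        prompt = BeatPrompt(beat.visual, beat.dialogue, beat.ambient_audio, fix_next)
        issued.append(prompt)                      # the stream is never paused
        fix_next = drift_of(prompt) > threshold    # observer result feeds the next beat
    return issued


beats = [BeatPrompt("a", "hi", "cafe"), BeatPrompt("b", "hm", "cafe"), BeatPrompt("c", "ok", "cafe")]
toy_drift = {"a": 0.1, "b": 0.9, "c": 0.1}
issued = run_director(beats, lambda p: toy_drift[p.visual], threshold=0.5)
```

The correction is applied forward: drift observed on beat "b" changes the prompt for beat "c" rather than regenerating "b".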

DPO + ROPD

Domain-aware preference optimization with reinforced online policy distillation.

MaineCoon's post-training stage uses Domain-Aware Preference Optimization (DPO) to train specialized expert models for different social scenarios (dance, dialogue, wide shots), then Reinforced Online Policy Distillation (ROPD) to unify them into a single deployable streaming policy. This balances scenario-specific quality with deployment efficiency.

FPS (Frames Per Second)

Generation throughput — how many video frames the model produces per second.

FPS measures how many complete video frames the model generates per second during inference. Real-time playback typically requires 24–30 FPS. MaineCoon achieves 47.5 FPS on a single H100 and 30+ FPS on RTX Pro 6000 — exceeding real-time requirements and building a generation-ahead buffer for smooth streaming.
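The link between generation throughput and buffer growth is simple arithmetic. The helper below is a hypothetical illustration that assumes continuous generation and steady playback; only the 47.5 FPS figure and the 24 to 30 FPS playback range come from the text above.

```python
def buffer_gain_per_second(gen_fps: float, playback_fps: float) -> float:
    """Seconds of content added to the ahead-buffer per wall-clock second."""
    if playback_fps <= 0:
        raise ValueError("playback_fps must be positive")
    # One wall-clock second produces gen_fps / playback_fps seconds of content
    # while playback consumes exactly one second of it.
    return gen_fps / playback_fps - 1.0


gain_at_24 = buffer_gain_per_second(47.5, 24.0)
gain_at_30 = buffer_gain_per_second(47.5, 30.0)
```

At 24 FPS playback the buffer grows by roughly 0.98 seconds of content per second; at 30 FPS playback it grows by roughly 0.58 seconds per second.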

Low Latency capability →

Interactive Control (Mid-Stream)

Changing generation behavior via new prompts while output is actively streaming.

Interactive control allows users to inject new instructions — tone changes, dialogue, questions — while the model is actively generating. The model adapts the ongoing stream without resetting, enabling conversational dynamics impossible with batch generation models.
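A minimal sketch of mid-stream control, assuming new prompts take effect at chunk boundaries. The function name and the `injections` mapping (chunk index to new prompt) are hypothetical, and a string stands in for each generated chunk.

```python
from typing import Dict, List, Tuple


def stream_with_control(
    initial_prompt: str,
    num_chunks: int,
    injections: Dict[int, str],
) -> List[Tuple[int, str]]:
    """Return (chunk index, active prompt) pairs for a stream with injected prompts."""
    active = initial_prompt
    out: List[Tuple[int, str]] = []
    for i in range(num_chunks):
        if i in injections:
            active = injections[i]  # switch prompts without restarting the stream
        out.append((i, active))
    return out


timeline = stream_with_control("greet the viewer", 4, {2: "ask a question"})
```

The chunk index keeps counting across the switch, which mirrors the idea that the stream adapts rather than resets.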

Interactive Control capability →

KV-Cache

Key-value attention cache storing computed attention states from prior generation steps.

In autoregressive generation, each new chunk requires attending to all prior chunks. KV-cache stores the computed key-value pairs from previous steps to avoid recomputation. MaineCoon's Cache Manager strategically retains or clears cache entries to balance generation speed, memory usage, and long-horizon consistency.
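To make the caching idea concrete, here is a small single-head attention sketch in pure Python. It is a generic illustration of scaled dot-product attention over cached keys and values, not MaineCoon's code; `KVCache` and `attend` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


class KVCache:
    """Stores keys and values from earlier steps so they are computed only once."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)


def attend(query: Vector, cache: KVCache) -> Vector:
    """Scaled dot-product attention of one query over every cached entry."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cache.keys]
    peak = max(scores)                                # subtract the max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, cache.values)) / total
        for d in range(dim)
    ]


cache = KVCache()
cache.append([1.0, 0.0], [2.0, 3.0])
out_one = attend([1.0, 0.0], cache)
cache.append([1.0, 0.0], [4.0, 5.0])
out_two = attend([1.0, 0.0], cache)
```

Each new step only appends one key/value pair and reads the rest from the cache. Removing entries from `keys` and `values` is what a retention policy would do to bound memory.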

Lip Sync

Alignment between generated speech audio and visible mouth movements.

Lip sync quality is a primary indicator of AI video realism in social contexts. Because MaineCoon generates audio and video jointly in each autoregressive chunk, lip movements naturally align with speech — unlike pipelines that generate video first and add audio separately.

Audio-Visual Sync capability →

Representation Alignment

Cross-modal distillation from a frozen V-JEPA 2 encoder to accelerate AV training.

Joint audio-visual training is slow to converge. MaineCoon introduces a frozen pre-trained V-JEPA 2 visual encoder as a distillation target, helping the model learn cross-modal semantic structure faster. This acts as both a training accelerator and stabilizer for audio-visual alignment.
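The source does not state the exact alignment objective. A common choice for this kind of feature distillation is a cosine-similarity loss between the model's features and the frozen encoder's features, so the sketch below assumes that form. `cosine` and `alignment_loss` are hypothetical helpers, and plain lists stand in for feature tensors.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) between student and frozen-teacher features."""
    if len(student_feats) != len(teacher_feats) or not student_feats:
        raise ValueError("feature lists must be non-empty and the same length")
    # The teacher is frozen, so its features are treated as constants.
    pairs = zip(student_feats, teacher_feats)
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(student_feats)


aligned = alignment_loss([[1.0, 0.0]], [[2.0, 0.0]])
orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

The loss is zero when student and teacher features point in the same direction and grows as they diverge, regardless of feature magnitude.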

Self-Resampling

Training technique that exposes the model to its own imperfect outputs as context.

During training, models typically use clean ground-truth frames as history context. But at inference, the model only has its own generated frames — creating a train-inference gap. Self-resampling feeds the model degraded versions of its own outputs during training, teaching it to stay stable even when history contains noise and drift.
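The idea of training on imperfect history can be sketched as below. This is a simplified illustration that resembles scheduled sampling rather than a confirmed description of MaineCoon's procedure: `build_training_history`, `model_step`, and `resample_prob` are hypothetical, and floats stand in for frames.

```python
import random
from typing import Callable, List


def build_training_history(
    ground_truth: List[float],
    model_step: Callable[[List[float]], float],
    resample_prob: float,
    rng: random.Random,
) -> List[float]:
    """Build a history that mixes clean frames with the model's own predictions."""
    history: List[float] = []
    for target in ground_truth:
        if history and rng.random() < resample_prob:
            history.append(model_step(history))  # imperfect, self-generated context
        else:
            history.append(target)               # clean ground-truth context
    return history


truth = [1.0, 2.0, 3.0]
clean = build_training_history(truth, lambda h: -1.0, 0.0, random.Random(0))
noisy = build_training_history(truth, lambda h: -1.0, 1.0, random.Random(0))
```

With `resample_prob` at 0 the history is pure ground truth, which reproduces the train-inference gap. Raising it exposes the model to the kind of context it will actually see at inference.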

SocialVideo Bench

Catnip's benchmark for social-interaction video across 7 scenarios and 9 metrics.

SocialVideo Bench is the first benchmark designed specifically for social-interaction video generation. It covers 7 scenarios (dense speech, two-person interaction, musical performance, emotional acting, dance, creative challenges, social memes) and 9 metrics (visual quality, motion, audio, alignment, consistency, etc.). MaineCoon scores 0.934 overall, surpassing all 7 compared baselines.

Benchmarks →

Streaming Generation

Incremental output produced and consumed simultaneously, rather than waiting for complete generation.

Streaming generation means the model produces output incrementally — chunk by chunk — while the user consumes it in real time. In text, this is how ChatGPT outputs tokens one at a time. In video, each chunk contains synchronized audio and visual frames. MaineCoon compresses the generation unit to sub-second chunks, enabling first-frame latency under 3 seconds.
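The produce-while-consuming pattern maps naturally onto a Python generator. The sketch below is a generic illustration with hypothetical names; integers stand in for audio-visual chunks, and a shared log records the order of events.

```python
from typing import Iterator, List


def generate_chunks(n: int, log: List[str]) -> Iterator[int]:
    """Yield chunks one at a time instead of returning them all at the end."""
    for i in range(n):
        log.append(f"produce {i}")
        yield i


def play(chunks: Iterator[int], log: List[str]) -> None:
    """Consume each chunk as soon as it is available."""
    for chunk in chunks:
        log.append(f"consume {chunk}")


events: List[str] = []
play(generate_chunks(3, events), events)
```

The log alternates between produce and consume entries, showing that playback of chunk 0 starts before chunk 1 exists. A batch model would log every produce entry first.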

Real-Time Streaming capability →

Sub-Second Latency

Streamed chunks arrive less than one second apart once generation has started.

Despite the name, the sub-second figure applies to the interval between chunks, not to time-to-first-frame. MaineCoon delivers the first frame in under 3 seconds after a prompt is submitted, and subsequent chunks arrive at sub-second intervals. This matters for social interaction, where delays break the illusion of live conversation.

Low Latency capability →

World Model

AI systems that simulate environments — physical, game, or social.

World models are generative systems that simulate environments internally. Physical world models (e.g., for robotics) simulate object physics. Game world models simulate exploration. Social world models — a new category — simulate human-centric social dynamics with real-time multimodal output. MaineCoon represents the rendering breakthrough for social world models.

Social World Model →

Experience MaineCoon live

Input a prompt and watch real-time streaming audio-visual generation on the official platform.

Try Experience Platform →

Read Technical Report