Audio as Optional Prop: Adding Human Commentary to AI Content
February 2, 2026
Feature spec for audio as optional prop on content primitives - attach human commentary to any component
The Problem
I'm generating a lot of AI-assisted content now: articles, documentation, feature specs, all written in my IDE with AI coding assistants. (See Why I Write Everything in My IDE Now for the full workflow.)
The velocity is amazing, but it creates a new problem: readers encounter AI-synthesized content and may want to know why I thought it was worth creating. Audio commentary can help with this by adding human context, reasoning, and transparency without slowing down the creation loop.
Evolution: From VoiceNote Component to Audio as Optional Prop
Initially, I imagined a standalone VoiceNote component: a special callout box with an audio player and transcript. Like a blockquote, but with voice.
But then I realized: audio isn't a content type, it's metadata. I don't want to insert special "voice note boxes" that interrupt the flow. I want to attach audio to the content that already exists.
- `<CodeBlock audio="...">` - "Here's what this code does"
- `<Heading audio="...">` - "Here's what this section is about"
This is better because:
- Semantic integrity - Headings stay headings, code stays code
- Natural breakpoints - Audio at headings acts like chapter markers
- No flow interruption - Audio enhances existing content, doesn't fragment it
- Composable - Any component can have audio, not just special "voice note" components
So this spec evolved from "VoiceNote component" to "audio as optional prop."
What Audio Commentary Provides
Transparency and context:
- Process visibility - "Here's why I curated this"
- Human reasoning - "Why this matters to me"
- Decision context - "What made this worth shipping"
It's not about explaining the text. It's about explaining why the text exists and why it matters.
Why This Is Different
Current content formats force you to choose:
- Blogs - Static text only, one voice throughout
- Video essays - Full commitment to video format, can't skim
- Podcasts - Audio-only, completely separate from written content
This approach is a hybrid: comprehensive AI-generated text plus inline human reasoning attached to specific content. Best of both:
- Readers get comprehensive, well-synthesized content
- AI does the synthesis, human shares the reasoning and curation
- Perfect transparency workflow - clearly marked AI synthesis vs human reasoning
The Workflow Evolution
Phase 1: Basic Implementation
- Add audio props to CodeBlock and Heading components
- Create basic AudioPlayer component
- Record on phone, transcribe with Whisper, add audio prop manually
- Works but has friction
Phase 2: Friction Elimination
- Always-hot mic setup at desk
- System-wide hotkey (e.g., `Cmd+Shift+V`) triggers recording
- Auto-transcribes, auto-titles, drops files in `/content/audio/inbox/`
- VSCode command to insert the audio prop from the inbox into the current component
- Takes "I should record this" → actually recording from 10% to 90%
Phase 3: Multi-Modal Composition
- Audio + code (explain why code is ugly)
- Audio + images (context for screenshots)
- Audio + debug visualizations (thought vs reality)
- Multiple audio clips per page for different sections
- Threading for longer explanations across components
Phase 4: Content Modes
- Read Mode - Text only, no audio
- Commentary Mode - All audio players visible
- Audio Tour Mode - Auto-plays audio as you scroll
- Lets different users consume content their way
The "Director's Commentary" Track
Every article could have a toggle at the top to switch between consumption modes:
- [📖 Read Mode] - Just the text, no audio
- [🎙️ Commentary Mode] - Shows all audio players
- [▶️ Audio Tour Mode] - Auto-plays audio as you scroll past components
This lets different users consume the content differently:
- Speed readers: Text only
- Deep learners: Commentary mode
- Multitaskers: Audio tour (listen while doing dishes) - auto-play sketched below
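Audio tour mode is the only mode that needs real machinery. Here's a rough sketch of the auto-play trigger, assuming an IntersectionObserver around each audio-bearing component. The hook name, threshold, and pause-on-exit behavior are guesses to tune, and browsers may block autoplay until the user has interacted with the page (hence the catch):

```tsx
// Rough sketch: auto-play a clip when its component scrolls into view,
// pause it again when the component scrolls out.
import { useEffect, useRef } from "react";

export function useAudioTour(enabled: boolean) {
  const wrapperRef = useRef<HTMLDivElement>(null);
  const audioRef = useRef<HTMLAudioElement>(null);

  useEffect(() => {
    const wrapper = wrapperRef.current;
    const audio = audioRef.current;
    if (!enabled || !wrapper || !audio) return;

    const observer = new IntersectionObserver(
      ([entry]) => {
        // Play once the component is mostly on screen; pause when it leaves.
        if (entry.isIntersecting) {
          audio.play().catch(() => {}); // autoplay may be blocked pre-gesture
        } else {
          audio.pause();
        }
      },
      { threshold: 0.6 },
    );

    observer.observe(wrapper);
    return () => observer.disconnect();
  }, [enabled]);

  return { wrapperRef, audioRef };
}
```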
Component Specification
Audio on CodeBlock
```tsx
<CodeBlock
  language="typescript"
  audio="/audio/why-this-code.mp3"
  audioDuration="0:45"
>
  {codeString}
</CodeBlock>
```

The audio player appears inline with the code block. When played, it provides context about implementation decisions, trade-offs, or "why this code is ugly but necessary."
Audio on Heading
```tsx
<Heading
  level={2}
  audio="/audio/section-intro.mp3"
  audioDuration="1:20"
>
  The Core Architecture
</Heading>
```

The audio player appears next to or below the heading. It acts as a chapter marker: "here's what this section is about and why it matters."
Shared Audio Props
```typescript
interface AudioProps {
  audio?: string;           // Path to audio file (mp3/wav/etc.)
  audioDuration?: string;   // Display duration (e.g., "0:45", "2:30")
  audioWaveform?: string;   // Optional: path to waveform image/data
  audioTranscript?: string; // Optional: full transcript text
}
```

These props get added to existing primitives (CodeBlock, Heading). When present, the component renders an inline audio player.

Note on audioDuration: the browser can read the duration from the audio file itself once it loads. audioDuration is just an optimization for displaying the duration immediately on initial render (SSR) and preventing layout shift. It's entirely optional; you can omit it and let the player figure it out after the audio loads.
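As a sketch of that fallback, a hook can fill in the display once the browser fires loadedmetadata. The hook name and time formatting here are illustrative:

```tsx
// Illustrative fallback: show the audioDuration prop immediately when given;
// otherwise fill the display in once the browser has parsed the file header.
// Note: with preload="none", metadata only arrives once playback starts,
// which is exactly why the prop helps for SSR and layout stability.
import { useEffect, useRef, useState } from "react";

function useDisplayDuration(initial?: string) {
  const audioRef = useRef<HTMLAudioElement>(null);
  const [display, setDisplay] = useState(initial ?? "--:--");

  useEffect(() => {
    const el = audioRef.current;
    if (!el || initial) return; // the prop wins; skip the metadata wait
    const onLoaded = () => {
      const secs = Math.round(el.duration);
      setDisplay(`${Math.floor(secs / 60)}:${String(secs % 60).padStart(2, "0")}`);
    };
    el.addEventListener("loadedmetadata", onLoaded);
    return () => el.removeEventListener("loadedmetadata", onLoaded);
  }, [initial]);

  return { audioRef, display };
}
```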
Visual Design
Audio player should be minimal and integrated with existing component styling:
- Icon indicator - 🎙️ or speaker icon to signal audio is available
- Inline player - Appears within or adjacent to component, not as separate callout
- Minimal controls - Play/pause, progress bar, speed control, duration
- Theme integration - Uses existing theme colors and styling
- Transcript toggle - Optional: show/hide full transcript text
Audio Player Features
- Play/Pause button - Primary control
- Progress bar - Show position in audio, allow seeking
- Playback speed - 0.5x, 0.75x, 1x, 1.25x, 1.5x, 2x
- Duration display - Show total time and current position
- Optional waveform visualization - If available, show audio waveform
- Keyboard controls - Space to play/pause, arrow keys to seek (sketched below)
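A sketch of the keyboard wiring; the key choices are the conventional ones, not a decided spec:

```tsx
// Illustrative keyboard handler: space toggles play/pause, arrows seek ±5s.
// Attach via onKeyDown on the player's focusable wrapper so it doesn't
// capture keystrokes for the whole page.
import type { KeyboardEvent } from "react";

function handlePlayerKeys(e: KeyboardEvent, audio: HTMLAudioElement): void {
  if (e.key === " ") {
    e.preventDefault(); // keep the page from scrolling on space
    if (audio.paused) audio.play();
    else audio.pause();
  } else if (e.key === "ArrowRight") {
    audio.currentTime = Math.min(audio.currentTime + 5, audio.duration);
  } else if (e.key === "ArrowLeft") {
    audio.currentTime = Math.max(audio.currentTime - 5, 0);
  }
}
```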
Usage Examples
Audio + Code
```tsx
<CodeBlock
  language="typescript"
  audio="/audio/why-i-hate-this-code.mp3"
>
  // The ugly code in question
  function messyButNecessary() {
    // Yeah so this function is ugly as heck,
    // but here's why I had to do it this way...
  }
</CodeBlock>
```

Audio + Section Heading
```tsx
<Heading
  level={2}
  audio="/audio/the-key-insight.mp3"
  audioDuration="1:15"
>
  The Key Innovation
</Heading>
<Paragraph>
  [AI-generated explanation of the innovation...]
</Paragraph>
```

The audio provides: "This is where it clicked for me. I was stuck thinking about 'X' but it's actually 'Y'. That reframe changed everything."
Implementation Details
Three pieces to make this work:
1. Shared AudioPlayer Component
A reusable <AudioPlayer> component that handles all playback logic (a minimal sketch follows the list):
- Play/pause state management
- Progress bar with seeking
- Playback speed controls (0.5x, 1x, 1.5x, 2x)
- Duration display and current time
- Keyboard shortcuts (space for play/pause, arrows for seek)
- Optional transcript toggle
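A minimal sketch of that core, using the AudioPlayerProps interface defined just below. The markup and control layout are placeholders, not the final design; keyboard wiring is as sketched in the features section above:

```tsx
// Minimal playback core: play/pause, cycling speed control, duration,
// optional transcript. Progress bar, waveform, and styling omitted.
import { useRef, useState } from "react";

const SPEEDS = [0.5, 0.75, 1, 1.25, 1.5, 2];

export function AudioPlayer({ src, duration, transcript }: AudioPlayerProps) {
  const audioRef = useRef<HTMLAudioElement>(null);
  const [playing, setPlaying] = useState(false);
  const [speed, setSpeed] = useState(1);

  const togglePlay = () => {
    const el = audioRef.current;
    if (!el) return;
    if (playing) el.pause();
    else el.play();
    setPlaying(!playing);
  };

  const cycleSpeed = () => {
    const next = SPEEDS[(SPEEDS.indexOf(speed) + 1) % SPEEDS.length];
    if (audioRef.current) audioRef.current.playbackRate = next;
    setSpeed(next);
  };

  return (
    <div>
      {/* preload="none" defers the network request until first play */}
      <audio
        ref={audioRef}
        src={src}
        preload="none"
        onEnded={() => setPlaying(false)}
      />
      <button onClick={togglePlay}>{playing ? "Pause" : "Play"}</button>
      <button onClick={cycleSpeed}>{speed}x</button>
      {duration && <span>{duration}</span>}
      {transcript && (
        <details>
          <summary>Transcript</summary>
          {transcript}
        </details>
      )}
    </div>
  );
}
```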
```typescript
interface AudioPlayerProps {
  src: string;          // Path to audio file
  duration?: string;    // Display duration
  waveform?: string;    // Optional waveform visualization
  transcript?: string;  // Optional transcript text
}
```

2. Optional Audio Props on Primitives
Each primitive (CodeBlock, Heading, etc.) gets extended with optional audio props:
```typescript
interface AudioProps {
  audio?: string;           // Path to audio file
  audioDuration?: string;   // Display duration
  audioWaveform?: string;   // Optional waveform
  audioTranscript?: string; // Optional transcript
}

// CodeBlock extends its existing props
interface CodeBlockProps extends AudioProps {
  language: string;
  children: string;
  // ...existing props
}

// Heading extends its existing props
interface HeadingProps extends AudioProps {
  level: 1 | 2 | 3 | 4 | 5 | 6;
  children: React.ReactNode;
  // ...existing props
}
```

3. Integration Per Component
Each primitive decides where to render the AudioPlayer:
```tsx
// CodeBlock renders the audio player at the bottom
export function CodeBlock({
  audio,
  audioDuration,
  audioTranscript,
  language,
  children,
}: CodeBlockProps) {
  return (
    <div>
      <pre><code>{children}</code></pre>
      {audio && (
        <AudioPlayer
          src={audio}
          duration={audioDuration}
          transcript={audioTranscript}
        />
      )}
    </div>
  );
}

// Heading renders the audio player inline after the text
export function Heading({
  audio,
  audioDuration,
  level,
  children,
}: HeadingProps) {
  const Tag = `h${level}` as keyof JSX.IntrinsicElements;
  return (
    <div>
      <Tag>{children}</Tag>
      {audio && (
        <AudioPlayer
          src={audio}
          duration={audioDuration}
        />
      )}
    </div>
  );
}
```

The positioning is the only custom part: CodeBlock might put the player at the bottom, Heading might put it inline. The playback logic is handled entirely by the shared AudioPlayer.
Phase 1: Add Audio Props to Primitives
- Create shared AudioPlayer component in `src/components/AudioPlayer/`
- Add optional audio props to the CodeBlock component
- Add optional audio props to Heading primitive
- HTML5 audio element with custom controls
- Theme-integrated styling (use theme colors, spacing, radii)
- Responsive design for mobile
Phase 2: Always-On Recording Workflow
The killer feature that makes audio commentary actually usable at scale:
- Global hotkey - `Cmd+Shift+V` triggers recording from anywhere
- Auto-transcription - Whisper runs in the background, generates a transcript
- Smart titling - AI generates a preliminary title from the first 10 words
- Inbox staging - Saves to `/content/voice-notes/inbox/`
- File management - Moves audio to `/public/audio/` with proper naming
- AI categorization - Optional suggestion of which article it relates to

Friction elimination: the current workflow is Think → Open app → Record → Save → Transcribe → File → Insert. The target workflow is Think → Hit hotkey → Talk → Done. The difference between "I should record this" and actually recording it is literally one keypress.
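As a rough sketch of the transcribe-and-stage step, assuming Node and the same Whisper CLI used in Step 4 of the content workflow below. The helper name, file layout, and the `--output_dir`/`--output_format` flags are assumptions to verify against your install:

```typescript
// ingest-voice-note.ts -- hypothetical Phase 2 inbox step: transcribe a
// fresh recording, title it from the first words, stage both in the inbox.
import { execSync } from "node:child_process";
import { mkdirSync, readFileSync, renameSync } from "node:fs";
import { basename, join } from "node:path";

const INBOX = "content/voice-notes/inbox";

export function ingestRecording(audioPath: string): void {
  mkdirSync(INBOX, { recursive: true });

  // Transcribe straight into the inbox so the .txt lands next to the audio.
  execSync(
    `whisper "${audioPath}" --model base --output_dir "${INBOX}" --output_format txt`,
  );

  const stem = basename(audioPath).replace(/\.[^.]+$/, "");
  const transcript = readFileSync(join(INBOX, `${stem}.txt`), "utf8");

  // "Smart titling", naive version: slugify the first 10 words.
  const slug = transcript
    .split(/\s+/)
    .slice(0, 10)
    .join(" ")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/(^-|-$)/g, "");

  renameSync(audioPath, join(INBOX, `${slug || stem}.mp3`));
}
```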
Phase 3: Enhanced Features
- Waveform generation and visualization
- Auto-sync text highlighting as audio plays
- Timestamps for jumping to specific sections
- Download transcript option
- Share audio clip functionality
Use Cases
1. AI-Generated Document Commentary
Scenario: You generate a comprehensive 5,000-word document with AI about your Timeline component architecture. You add audio at key sections to guide readers through your thinking.
```tsx
<Heading level={2} audio="/audio/aha-moment.mp3">
  The Key Innovation
</Heading>
<Paragraph>
  [AI-generated explanation of the Timeline component...]
</Paragraph>
<Paragraph>
  Audio provides: "This is where it clicked for me. I was stuck thinking
  about this as a layout problem, but it's actually a data structure problem.
  That reframe changed everything."
</Paragraph>
```

2. Tutorial Walkthroughs
Scenario: Technical tutorial with code examples. Audio on CodeBlocks explains "why" decisions were made, not just "what" the code does.
3. Roadmap Context
Scenario: Feature roadmap document. Audio on headings adds personal context about priorities, trade-offs, and decision-making process.
4. Content Curation
Scenario: AI synthesizes research from multiple sources. Audio commentary adds "this source is particularly valuable because..." or "notice how these three ideas connect..."
Technical Considerations
Audio Format & Compression
- Format: MP3 (best browser compatibility) or WebM (smaller file sizes)
- Bitrate: 64kbps for voice is sufficient (significantly smaller than music)
- Mono vs Stereo: Mono for voice (half the file size)
- Target size: 30-60 seconds = ~250-500KB, 2-3 minutes = ~1-1.5MB (example encode command below)
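For encoding, something like the following ffmpeg invocation matches those targets: mono (`-ac 1`) at 64 kbps (`-b:a 64k`). Verify the resulting sizes on your own recordings:

```bash
# Mono, 64 kbps MP3 from a raw recording
ffmpeg -i raw-recording.wav -ac 1 -b:a 64k -codec:a libmp3lame why-this-code.mp3
```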
Accessibility
- Transcript text is always visible (audio enhancement, not replacement)
- Full keyboard navigation support
- ARIA labels for screen readers
- Visual indicators when audio is playing
- Respect prefers-reduced-motion: disable animations (snippet below)
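The reduced-motion check is a one-liner, shown here from TypeScript; a CSS @media block works just as well:

```tsx
// Skip waveform/progress animations for users who request reduced motion.
const prefersReducedMotion =
  typeof window !== "undefined" &&
  window.matchMedia("(prefers-reduced-motion: reduce)").matches;
```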
Performance
- Lazy load audio files (don't preload until user interaction)
- Cache audio files in browser
- Show loading state when fetching audio
- Progressive loading for longer audio clips
Mobile Experience
- Larger touch targets for controls
- Simplified UI on small screens
- Handle background audio (continue playing when scrolling)
- Respect system audio settings and volume
Content Workflow
Step 1: Generate Base Content
Use AI to create comprehensive document on topic. Let it be thorough - that's what it's good at.
Step 2: Identify Commentary Points
Read through and mark spots where you want to add your reasoning:
- "Here's why this idea matters to me"
- "This is the moment it clicked - let me explain"
- "Why I decided to include this / why I curated this"
- "The real-world reason this exists"
- "What made this worth documenting and shipping"
Step 3: Record Audio Commentary
- Open voice memos on phone or use desktop recorder
- Record spontaneously (conversational, not scripted)
- Keep it short: 30-90 seconds per clip
- Name files descriptively: `schema-insight.mp3`, `aha-moment.mp3`
Step 4: Transcribe & Add Audio Prop
- Run Whisper locally: `whisper audio.mp3 --model base`
- Copy transcript text (optional, for accessibility)
- Add audio prop to relevant component (CodeBlock or Heading)
- Lightly edit the transcript for readability if included
Step 5: Review & Publish
Listen to each audio clip in context. Does it flow? Does it add value? Adjust placement or re-record if needed.
Success Metrics
How do we know this feature is working?
- Engagement: Do people actually play the audio? Track play rates.
- Completion: Do they listen all the way through? Track completion rates.
- Time on page: Does voice commentary increase time spent on content?
- Feedback: Direct comments about audio commentary - helpful or distracting?
- Personal satisfaction: Does this make AI-generated content feel more authentic and valuable?
Open Questions
- How many audio clips per page before it becomes overwhelming?
- Should we show a "total audio commentary time" at the top of articles?
- How do we handle audio in article excerpts/previews?
- Should threading support automatic numbering ("Part 1 of 3")?
- Can audio commentary be searched/indexed for discovery?
- What's the right UX for audio tour mode auto-play behavior?
Ideal Future State
- Composable across all content types
- Zero-friction capture workflow
- AI handles synthesis, human adds the "why"
- Built into the development environment itself
Next Steps
- Build basic AudioPlayer component (MVP - just play/pause + progress bar)
- Add audio props to CodeBlock and Heading primitives
- Test in one document (maybe this spec or a technical article)
- Record 2-3 test audio clips and add them to components
- Get feedback (does this feel natural? does it add value?)
- Iterate on design and UX based on real usage
- Add enhanced features (waveform, transcript toggle, etc.)
- Document the workflow for future content creation
- Write blog post about the pattern and open source it