sionrui/openspec/changes/refactor-voice-provider/specs/voice-clone/spec.md
2026-01-27 01:39:08 +08:00

Voice Clone Capability Specification

ADDED Requirements

Requirement: Provider Abstraction Layer

The system SHALL provide a unified provider abstraction layer for voice cloning services, supporting multiple vendors through a common interface.

Scenario: Get provider by type

  • GIVEN the system is configured with multiple voice clone providers
  • WHEN requesting a provider by type
  • THEN the system SHALL return the corresponding provider instance
  • AND the provider SHALL implement the VoiceCloneProvider interface

Scenario: Provider not found

  • GIVEN the system is configured with a default provider
  • WHEN requesting a non-existent provider type
  • THEN the system SHALL fall back to the default provider
  • AND log a warning message
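The two scenarios above can be sketched as follows. This is a minimal illustration, not the project's actual code: the method names cloneVoice() and synthesize() appear elsewhere in this spec, but their signatures, the getType() method, and the ProviderRegistry class are assumptions made for the example.

```java
import java.util.Map;
import java.util.logging.Logger;

// Assumed shape of the common interface. Only the method names cloneVoice()
// and synthesize() come from the spec; the signatures are hypothetical.
interface VoiceCloneProvider {
    String getType();
    String cloneVoice(String audioUrl);
    byte[] synthesize(String text, String voiceId);
}

// Hypothetical lookup helper: resolves a provider by type and falls back
// to the default provider (with a warning) when the type is unknown.
class ProviderRegistry {
    private static final Logger LOG = Logger.getLogger(ProviderRegistry.class.getName());

    private final Map<String, VoiceCloneProvider> providers;
    private final String defaultType;

    ProviderRegistry(Map<String, VoiceCloneProvider> providers, String defaultType) {
        if (!providers.containsKey(defaultType)) {
            throw new IllegalArgumentException("Default provider is not registered: " + defaultType);
        }
        this.providers = Map.copyOf(providers);
        this.defaultType = defaultType;
    }

    VoiceCloneProvider get(String type) {
        if (type == null) {
            return providers.get(defaultType);
        }
        VoiceCloneProvider provider = providers.get(type);
        if (provider == null) {
            LOG.warning("Unknown voice provider '" + type + "', using default '" + defaultType + "'");
            return providers.get(defaultType);
        }
        return provider;
    }
}
```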

Requirement: Voice Cloning

The system SHALL support voice cloning through the provider interface, accepting an audio file URL and returning a unique voice ID.

Scenario: Successful voice cloning with CosyVoice

  • GIVEN a valid CosyVoice provider is configured
  • WHEN submitting a voice clone request with an audio URL
  • THEN the system SHALL return a voice ID
  • AND the voice ID SHALL be usable for subsequent TTS synthesis

Scenario: Voice cloning failure

  • GIVEN the provider API is unavailable or returns an error
  • WHEN submitting a voice clone request
  • THEN the system SHALL throw a VOICE_TTS_FAILED exception
  • AND log the error details for debugging
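One way to satisfy the failure scenario is to wrap the provider call, log the cause, and rethrow with the VOICE_TTS_FAILED code. The sketch below assumes a hypothetical VoiceServiceException class and models the provider call as a plain function from audio URL to voice ID; neither is taken from the codebase.

```java
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical exception type carrying the error code named in the spec.
class VoiceServiceException extends RuntimeException {
    final String code;

    VoiceServiceException(String code, String message, Throwable cause) {
        super(message, cause);
        this.code = code;
    }
}

class VoiceCloneService {
    static final String VOICE_TTS_FAILED = "VOICE_TTS_FAILED";
    private static final Logger LOG = Logger.getLogger(VoiceCloneService.class.getName());

    // Stand-in for the provider's clone call: audio URL in, voice ID out.
    private final Function<String, String> cloneCall;

    VoiceCloneService(Function<String, String> cloneCall) {
        this.cloneCall = cloneCall;
    }

    String cloneVoice(String audioUrl) {
        try {
            String voiceId = cloneCall.apply(audioUrl);
            if (voiceId == null || voiceId.isBlank()) {
                throw new IllegalStateException("Provider returned no voice ID");
            }
            return voiceId;
        } catch (RuntimeException e) {
            // Log the underlying error for debugging, then surface one error code.
            LOG.log(Level.SEVERE, "Voice clone failed for " + audioUrl, e);
            throw new VoiceServiceException(VOICE_TTS_FAILED, "Voice clone failed", e);
        }
    }
}
```

Keeping the original exception as the cause preserves the provider's error details in the log while callers only need to handle a single exception type.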

Requirement: Text-to-Speech Synthesis

The system SHALL support TTS synthesis through cloned voices or system voices, accepting text input and returning audio data.

Scenario: TTS with cloned voice

  • GIVEN a valid voice ID from a previous clone operation
  • WHEN submitting a TTS request with text and voice ID
  • THEN the system SHALL return audio data in the specified format
  • AND the audio SHALL match the cloned voice characteristics

Scenario: TTS with system voice

  • GIVEN a system voice ID is configured
  • WHEN submitting a TTS request with text and system voice ID
  • THEN the system SHALL return audio data using the system voice
  • AND the audio SHALL match the system voice characteristics

Scenario: TTS with reference audio (file URL)

  • GIVEN a reference audio URL and transcription text
  • WHEN submitting a TTS request with a file URL
  • THEN the system SHALL perform on-the-fly voice cloning
  • AND return audio data matching the reference voice
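The three scenarios reduce to two request shapes: synthesis by voice ID (cloned or system) and synthesis from reference audio plus its transcription. A request type along the following lines could make that distinction explicit. The TtsRequest class and all of its field names are hypothetical and only illustrate the idea.

```java
// Hypothetical request type; names are illustrative, not from the codebase.
final class TtsRequest {
    enum Mode { VOICE_ID, REFERENCE_AUDIO }

    final String text;
    final String voiceId;        // cloned or system voice ID, or null
    final String fileUrl;        // reference audio URL, or null
    final String referenceText;  // transcription of the reference audio, or null

    private TtsRequest(String text, String voiceId, String fileUrl, String referenceText) {
        requireNonBlank(text, "text");
        this.text = text;
        this.voiceId = voiceId;
        this.fileUrl = fileUrl;
        this.referenceText = referenceText;
    }

    // Covers both "TTS with cloned voice" and "TTS with system voice".
    static TtsRequest withVoiceId(String text, String voiceId) {
        requireNonBlank(voiceId, "voiceId");
        return new TtsRequest(text, voiceId, null, null);
    }

    // Covers "TTS with reference audio (file URL)".
    static TtsRequest withReferenceAudio(String text, String fileUrl, String referenceText) {
        requireNonBlank(fileUrl, "fileUrl");
        requireNonBlank(referenceText, "referenceText");
        return new TtsRequest(text, null, fileUrl, referenceText);
    }

    Mode mode() {
        return voiceId != null ? Mode.VOICE_ID : Mode.REFERENCE_AUDIO;
    }

    private static void requireNonBlank(String value, String name) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException(name + " is required");
        }
    }
}
```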

Requirement: Configuration Management

The system SHALL support multi-provider configuration through a unified configuration structure.

Scenario: Configure multiple providers

  • GIVEN the application configuration file
  • WHEN configuring multiple voice providers
  • THEN each provider SHALL have an independent enabled flag
  • AND the system SHALL only use enabled providers

Scenario: Default provider selection

  • GIVEN the configuration specifies a default-provider
  • WHEN no provider is explicitly specified
  • THEN the system SHALL use the default provider
  • AND fall back to cosyvoice if no default is configured

Scenario: Backward compatibility

  • GIVEN existing configuration using yudao.cosyvoice.*
  • WHEN the system starts
  • THEN the system SHALL automatically migrate to the new configuration structure
  • AND existing functionality SHALL remain unchanged
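The configuration scenarios can be illustrated on a flat property map. The spec names the legacy prefix yudao.cosyvoice.* and a default-provider key, but it does not name the new structure, so the yudao.voice.* keys below are assumptions chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class VoiceConfigMigration {
    static final String LEGACY_PREFIX = "yudao.cosyvoice.";
    // Assumed new layout; the spec does not define the actual key names.
    static final String NEW_ROOT = "yudao.voice.";
    static final String NEW_COSYVOICE_PREFIX = NEW_ROOT + "providers.cosyvoice.";

    // Copies legacy keys into the new layout. Legacy keys are kept, and an
    // explicitly configured new key always wins over a migrated value.
    static Map<String, String> migrate(Map<String, String> props) {
        Map<String, String> out = new LinkedHashMap<>(props);
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (e.getKey().startsWith(LEGACY_PREFIX)) {
                String newKey = NEW_COSYVOICE_PREFIX + e.getKey().substring(LEGACY_PREFIX.length());
                out.putIfAbsent(newKey, e.getValue());
            }
        }
        return out;
    }

    // Uses the configured default provider, or cosyvoice when none is set.
    static String defaultProvider(Map<String, String> props) {
        String configured = props.get(NEW_ROOT + "default-provider");
        return (configured == null || configured.isBlank()) ? "cosyvoice" : configured;
    }

    // A provider is usable only when its own enabled flag is true.
    static boolean isEnabled(Map<String, String> props, String type) {
        return Boolean.parseBoolean(props.get(NEW_ROOT + "providers." + type + ".enabled"));
    }
}
```

Leaving the legacy keys in place is one way to keep existing functionality unchanged while new code reads only the new structure.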

Requirement: Provider Factory

The system SHALL provide a factory component for managing provider instances and resolving providers by type.

Scenario: Factory resolves provider

  • GIVEN the factory is initialized with provider configurations
  • WHEN calling factory.getProvider("cosyvoice")
  • THEN the factory SHALL return the CosyVoiceProvider instance
  • AND cache the instance for subsequent requests

Scenario: Factory returns default

  • GIVEN the factory is configured with a default provider
  • WHEN calling factory.getProvider(null)
  • THEN the factory SHALL return the default provider instance
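A factory with lazy creation and per-type caching might look like the sketch below. The getProvider() name and the "cosyvoice" key come from the scenarios; the VoiceCloneProviderFactory class name, the use of Supplier for construction, and the reduced one-method interface are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Reduced stand-in for the provider interface, enough for this sketch.
interface VoiceCloneProvider {
    String getType();
}

// Hypothetical factory: creates each provider at most once and reuses it.
class VoiceCloneProviderFactory {
    private final Map<String, Supplier<VoiceCloneProvider>> creators;
    private final Map<String, VoiceCloneProvider> cache = new ConcurrentHashMap<>();
    private final String defaultType;

    VoiceCloneProviderFactory(Map<String, Supplier<VoiceCloneProvider>> creators, String defaultType) {
        if (!creators.containsKey(defaultType)) {
            throw new IllegalArgumentException("Unknown default provider: " + defaultType);
        }
        this.creators = Map.copyOf(creators);
        this.defaultType = defaultType;
    }

    VoiceCloneProvider getProvider(String type) {
        // A null or unregistered type resolves to the default provider.
        String key = (type == null || !creators.containsKey(type)) ? defaultType : type;
        // computeIfAbsent runs the creator once per key, then serves the cached instance.
        return cache.computeIfAbsent(key, k -> creators.get(k).get());
    }
}
```

ConcurrentHashMap.computeIfAbsent applies the creator atomically per key, so concurrent callers asking for the same type receive the same instance.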

MODIFIED Requirements

Requirement: Voice Creation Flow

The voice creation process SHALL use the provider abstraction layer instead of calling the CosyVoice client directly.

Scenario: Create voice with CosyVoice

  • GIVEN a user uploads a voice audio file
  • WHEN creating a voice configuration through the API
  • THEN the system SHALL:
    1. Validate that the file exists and belongs to the voice category
    2. Call provider.cloneVoice() with the audio URL
    3. Store the returned voiceId in the database
    4. Return a success response with the voice configuration ID
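The four steps can be sketched end to end with in-memory stand-ins. Everything here is hypothetical scaffolding: file storage and the database are plain maps, provider.cloneVoice() is modeled as a function, and the category value "voice" is an assumed literal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class VoiceCreationFlow {
    // Hypothetical file record: where the audio lives and its category.
    static final class FileRecord {
        final String url;
        final String category;

        FileRecord(String url, String category) {
            this.url = url;
            this.category = category;
        }
    }

    private final Map<Long, FileRecord> files;                          // stand-in for file storage
    private final Function<String, String> cloneVoice;                  // stand-in for provider.cloneVoice()
    private final Map<Long, String> voiceIdsByConfig = new HashMap<>(); // stand-in for the database
    private long nextConfigId = 1;

    VoiceCreationFlow(Map<Long, FileRecord> files, Function<String, String> cloneVoice) {
        this.files = files;
        this.cloneVoice = cloneVoice;
    }

    long createVoice(long fileId) {
        // 1. Validate that the file exists and belongs to the voice category.
        FileRecord file = files.get(fileId);
        if (file == null || !"voice".equals(file.category)) {
            throw new IllegalArgumentException("File " + fileId + " is missing or not a voice file");
        }
        // 2. Clone the voice through the provider, using the audio URL.
        String voiceId = cloneVoice.apply(file.url);
        // 3. Store the returned voiceId.
        long configId = nextConfigId++;
        voiceIdsByConfig.put(configId, voiceId);
        // 4. Return the voice configuration ID.
        return configId;
    }

    String storedVoiceId(long configId) {
        return voiceIdsByConfig.get(configId);
    }
}
```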

Scenario: Create voice with transcription

  • GIVEN a voice configuration is created without transcription
  • WHEN the user triggers transcription
  • THEN the system SHALL:
    1. Fetch the audio file URL
    2. Call the transcription service
    3. Store the transcription text
    4. Update the voice configuration

Requirement: Voice Preview

The voice preview functionality SHALL work with both cloned voices (voiceId) and reference audio (file URL).

Scenario: Preview cloned voice

  • GIVEN a voice configuration with a valid voiceId
  • WHEN requesting a preview with custom text
  • THEN the system SHALL call provider.synthesize() with the voiceId
  • AND return audio data in Base64 format

Scenario: Preview with reference audio

  • GIVEN a voice configuration without a voiceId but with an audio file
  • WHEN requesting a preview
  • THEN the system SHALL call provider.synthesize() with the file URL
  • AND use the stored transcription as reference text
  • AND return audio data in Base64 format
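The preview logic amounts to choosing between the two synthesis paths and Base64-encoding the result. In the sketch below, the PreviewSynthesizer interface is a hypothetical stand-in whose two methods represent the two ways of calling provider.synthesize(); the real interface may expose a single method with a richer request object.

```java
import java.util.Base64;

// Hypothetical stand-in for the two ways provider.synthesize() is invoked.
interface PreviewSynthesizer {
    byte[] withVoiceId(String text, String voiceId);
    byte[] withReferenceAudio(String text, String fileUrl, String referenceText);
}

class VoicePreview {
    // Prefers the cloned voice ID and falls back to reference audio plus
    // the stored transcription when no voice ID is available.
    static String preview(PreviewSynthesizer synthesizer, String text,
                          String voiceId, String fileUrl, String transcription) {
        byte[] audio;
        if (voiceId != null && !voiceId.isBlank()) {
            audio = synthesizer.withVoiceId(text, voiceId);
        } else if (fileUrl != null && !fileUrl.isBlank()) {
            audio = synthesizer.withReferenceAudio(text, fileUrl, transcription);
        } else {
            throw new IllegalArgumentException("Voice configuration has neither voiceId nor audio file");
        }
        // Return the audio bytes as a Base64 string.
        return Base64.getEncoder().encodeToString(audio);
    }
}
```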

REMOVED Requirements

None. This change only adds and refactors requirements; it removes none.