# Voice Clone Capability Specification

## ADDED Requirements

### Requirement: Provider Abstraction Layer
The system SHALL provide a unified provider abstraction layer for voice cloning services, supporting multiple vendors through a common interface.

#### Scenario: Get provider by type
- **GIVEN** the system is configured with multiple voice clone providers
- **WHEN** requesting a provider by type
- **THEN** the system SHALL return the corresponding provider instance
- **AND** the provider SHALL implement the `VoiceCloneProvider` interface

#### Scenario: Provider not found
- **GIVEN** the system is configured with a default provider
- **WHEN** requesting a non-existent provider type
- **THEN** the system SHALL fall back to the default provider
- **AND** log a warning message
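
The two scenarios above could be satisfied by something like the following minimal sketch. Only the `VoiceCloneProvider` name and the `cloneVoice()` / `synthesize()` calls appear in this specification; `getType()`, the parameter lists, and the `ProviderRegistry` helper are assumptions made for illustration.

```java
import java.util.Map;
import java.util.logging.Logger;

/** Common contract for vendor adapters. Signatures are assumed, not specified. */
interface VoiceCloneProvider {
    /** Provider type key, for example "cosyvoice" (hypothetical accessor). */
    String getType();

    /** Clones a voice from an audio file URL and returns the vendor's voice ID. */
    String cloneVoice(String audioUrl);

    /** Synthesizes speech for the text with the given voice ID and returns audio bytes. */
    byte[] synthesize(String text, String voiceId);
}

/** Hypothetical lookup helper: resolves by type and falls back to the default provider. */
final class ProviderRegistry {
    private static final Logger LOG = Logger.getLogger(ProviderRegistry.class.getName());

    private final Map<String, VoiceCloneProvider> providers;
    private final VoiceCloneProvider defaultProvider;

    ProviderRegistry(Map<String, VoiceCloneProvider> providers, VoiceCloneProvider defaultProvider) {
        this.providers = Map.copyOf(providers);
        this.defaultProvider = defaultProvider;
    }

    VoiceCloneProvider get(String type) {
        // Immutable maps reject null keys, so guard before the lookup.
        VoiceCloneProvider provider = (type == null) ? null : providers.get(type);
        if (provider == null) {
            LOG.warning("Unknown voice clone provider type '" + type + "', using the default provider");
            return defaultProvider;
        }
        return provider;
    }
}
```

Returning the default instead of throwing keeps callers simple, at the cost of hiding a misconfigured provider type behind a log line; the warning is what makes that case observable.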

### Requirement: Voice Cloning
The system SHALL support voice cloning through the provider interface, accepting an audio file URL and returning a unique voice ID.

#### Scenario: Successful voice cloning with CosyVoice
- **GIVEN** a valid CosyVoice provider is configured
- **WHEN** submitting a voice clone request with an audio URL
- **THEN** the system SHALL return a voice ID
- **AND** the voice ID SHALL be usable for subsequent TTS synthesis

#### Scenario: Voice cloning failure
- **GIVEN** the provider API is unavailable or returns an error
- **WHEN** submitting a voice clone request
- **THEN** the system SHALL throw a `VOICE_TTS_FAILED` exception
- **AND** log the error details for debugging
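
One way to read the failure scenario is as a thin wrapper that logs the provider error and rethrows it with the `VOICE_TTS_FAILED` code. The sketch below assumes a hypothetical `VoiceServiceException` type and models the provider call as a plain `Function`; neither is defined by this specification.

```java
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical exception type carrying an error code such as VOICE_TTS_FAILED. */
final class VoiceServiceException extends RuntimeException {
    private final String code;

    VoiceServiceException(String code, String message, Throwable cause) {
        super(message, cause);
        this.code = code;
    }

    String getCode() {
        return code;
    }
}

/** Wraps a clone call so that provider failures surface as VOICE_TTS_FAILED. */
final class VoiceCloneService {
    static final String VOICE_TTS_FAILED = "VOICE_TTS_FAILED";
    private static final Logger LOG = Logger.getLogger(VoiceCloneService.class.getName());

    // Stands in for provider.cloneVoice(audioUrl); an assumption for this sketch.
    private final Function<String, String> cloneCall;

    VoiceCloneService(Function<String, String> cloneCall) {
        this.cloneCall = cloneCall;
    }

    String cloneVoice(String audioUrl) {
        try {
            return cloneCall.apply(audioUrl);
        } catch (RuntimeException e) {
            // Log the original error for debugging, then rethrow with the error code.
            LOG.log(Level.SEVERE, "Voice clone failed for audio URL " + audioUrl, e);
            throw new VoiceServiceException(VOICE_TTS_FAILED, "Voice cloning failed", e);
        }
    }
}
```

Keeping the original exception as the cause preserves the vendor's error details for whoever reads the log.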

### Requirement: Text-to-Speech Synthesis
The system SHALL support TTS synthesis through cloned voices or system voices, accepting text input and returning audio data.

#### Scenario: TTS with cloned voice
- **GIVEN** a valid voice ID from a previous clone operation
- **WHEN** submitting a TTS request with text and voice ID
- **THEN** the system SHALL return audio data in the specified format
- **AND** the audio SHALL match the cloned voice characteristics

#### Scenario: TTS with system voice
- **GIVEN** a system voice ID is configured
- **WHEN** submitting a TTS request with text and system voice ID
- **THEN** the system SHALL return audio data using the system voice
- **AND** the audio SHALL match the system voice characteristics

#### Scenario: TTS with reference audio (file URL)
- **GIVEN** a reference audio URL and transcription text
- **WHEN** submitting a TTS request with a file URL
- **THEN** the system SHALL perform on-the-fly voice cloning
- **AND** return audio data matching the reference voice

### Requirement: Configuration Management
The system SHALL support multi-provider configuration through a unified configuration structure.

#### Scenario: Configure multiple providers
- **GIVEN** the application configuration file
- **WHEN** configuring multiple voice providers
- **THEN** each provider SHALL have an independent `enabled` flag
- **AND** the system SHALL only use enabled providers

#### Scenario: Default provider selection
- **GIVEN** the configuration specifies a `default-provider`
- **WHEN** no provider is explicitly specified
- **THEN** the system SHALL use the default provider
- **AND** fall back to `cosyvoice` if no default is configured
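
The selection rules in the two scenarios above can be sketched as a small settings object. This is an illustration under assumptions: the `VoiceProviderSettings` class and its method names are hypothetical, and the way the values are bound from the configuration file is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for the multi-provider settings described above. */
final class VoiceProviderSettings {
    /** Provider used when no default-provider is configured. */
    static final String FALLBACK_PROVIDER = "cosyvoice";

    private final String defaultProvider;        // value of default-provider, may be null
    private final Map<String, Boolean> enabled;  // provider type -> enabled flag

    VoiceProviderSettings(String defaultProvider, Map<String, Boolean> enabled) {
        this.defaultProvider = defaultProvider;
        this.enabled = new HashMap<>(enabled);
    }

    /** Explicit request first, then the configured default, then "cosyvoice". */
    String resolveType(String requested) {
        if (requested != null && !requested.isEmpty()) {
            return requested;
        }
        if (defaultProvider != null && !defaultProvider.isEmpty()) {
            return defaultProvider;
        }
        return FALLBACK_PROVIDER;
    }

    /** A provider without an explicit enabled flag is treated as disabled. */
    boolean isEnabled(String type) {
        return Boolean.TRUE.equals(enabled.get(type));
    }
}
```

Treating a missing `enabled` flag as disabled is a design choice of this sketch; the specification only requires that each provider has its own flag and that disabled providers are not used.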

#### Scenario: Backward compatibility
- **GIVEN** existing configuration using `yudao.cosyvoice.*`
- **WHEN** the system starts
- **THEN** the system SHALL automatically migrate it to the new configuration structure
- **AND** existing functionality SHALL remain unchanged

### Requirement: Provider Factory
The system SHALL provide a factory component for managing provider instances and resolving providers by type.

#### Scenario: Factory resolves provider
- **GIVEN** the factory is initialized with provider configurations
- **WHEN** calling `factory.getProvider("cosyvoice")`
- **THEN** the factory SHALL return the CosyVoiceProvider instance
- **AND** cache the instance for subsequent requests

#### Scenario: Factory returns default
- **GIVEN** the factory is configured with a default provider
- **WHEN** calling `factory.getProvider(null)`
- **THEN** the factory SHALL return the default provider instance
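
A factory with the caching and default behavior described above might look like the following sketch. The `getProvider` name comes from the scenarios; the `VoiceCloneProviderFactory` class name, the `Supplier`-based creators, and the trimmed-down one-method `VoiceCloneProvider` are assumptions made to keep the example short.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Trimmed to one method for this sketch; the real interface has more operations. */
interface VoiceCloneProvider {
    String getType();
}

/** Hypothetical factory: creates providers lazily and caches one instance per type. */
final class VoiceCloneProviderFactory {
    private final Map<String, Supplier<VoiceCloneProvider>> creators;
    private final Map<String, VoiceCloneProvider> cache = new ConcurrentHashMap<>();
    private final String defaultType;

    VoiceCloneProviderFactory(Map<String, Supplier<VoiceCloneProvider>> creators, String defaultType) {
        if (defaultType == null || !creators.containsKey(defaultType)) {
            throw new IllegalArgumentException("Default provider is not configured: " + defaultType);
        }
        this.creators = Map.copyOf(creators);
        this.defaultType = defaultType;
    }

    VoiceCloneProvider getProvider(String type) {
        // A null or unknown type resolves to the default provider.
        String key = (type != null && creators.containsKey(type)) ? type : defaultType;
        // computeIfAbsent runs the creator at most once per key, then reuses the instance.
        return cache.computeIfAbsent(key, k -> creators.get(k).get());
    }
}
```

With this shape, `factory.getProvider("cosyvoice")` and `factory.getProvider(null)` return the same cached instance when `cosyvoice` is the default type.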

## MODIFIED Requirements

### Requirement: Voice Creation Flow
The voice creation process SHALL use the provider abstraction layer instead of calling the CosyVoice client directly.

#### Scenario: Create voice with CosyVoice
- **GIVEN** a user uploads a voice audio file
- **WHEN** creating a voice configuration through the API
- **THEN** the system SHALL:
  1. Validate that the file exists and belongs to the voice category
  2. Call `provider.cloneVoice()` with the audio URL
  3. Store the returned `voiceId` in the database
  4. Return a success response with the voice configuration ID
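
The four steps above can be sketched as follows. Everything except the `cloneVoice` call and the `voiceId` field is an assumption: file storage and the database are replaced by in-memory maps, the category value `"voice"` is a guess at how the voice category is encoded, and the class names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical view of an uploaded file: its URL and its category. */
final class AudioFile {
    final String url;
    final String category;

    AudioFile(String url, String category) {
        this.url = url;
        this.category = category;
    }
}

/** Sketch of the creation flow with in-memory stand-ins for storage. */
final class VoiceCreationFlow {
    private final Map<Long, AudioFile> files;                              // stands in for file storage
    private final Function<String, String> cloneVoice;                     // stands in for provider.cloneVoice()
    private final Map<Long, String> voiceIdByConfigId = new HashMap<>();   // stands in for the database
    private long nextConfigId = 1;

    VoiceCreationFlow(Map<Long, AudioFile> files, Function<String, String> cloneVoice) {
        this.files = files;
        this.cloneVoice = cloneVoice;
    }

    long createVoice(long fileId) {
        // 1. Validate that the file exists and belongs to the voice category.
        AudioFile file = files.get(fileId);
        if (file == null || !"voice".equals(file.category)) {
            throw new IllegalArgumentException("File " + fileId + " is missing or not a voice file");
        }
        // 2. Clone the voice through the provider, using the audio URL.
        String voiceId = cloneVoice.apply(file.url);
        // 3. Store the returned voiceId.
        long configId = nextConfigId++;
        voiceIdByConfigId.put(configId, voiceId);
        // 4. Return the voice configuration ID to the caller.
        return configId;
    }

    String storedVoiceId(long configId) {
        return voiceIdByConfigId.get(configId);
    }
}
```

Validation happens before the provider call so that an invalid upload never triggers a vendor request.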

#### Scenario: Create voice with transcription
- **GIVEN** a voice configuration is created without transcription
- **WHEN** the user triggers transcription
- **THEN** the system SHALL:
  1. Fetch the audio file URL
  2. Call the transcription service
  3. Store the transcription text
  4. Update the voice configuration

### Requirement: Voice Preview
The voice preview functionality SHALL work with both cloned voices (voiceId) and reference audio (file URL).

#### Scenario: Preview cloned voice
- **GIVEN** a voice configuration with a valid `voiceId`
- **WHEN** requesting a preview with custom text
- **THEN** the system SHALL call `provider.synthesize()` with the voiceId
- **AND** return audio data in Base64 format

#### Scenario: Preview with reference audio
- **GIVEN** a voice configuration without `voiceId` but with an audio file
- **WHEN** requesting a preview
- **THEN** the system SHALL call `provider.synthesize()` with the file URL
- **AND** use the stored transcription as reference text
- **AND** return audio data in Base64 format
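
The choice between the two preview paths, and the Base64 encoding of the result, could look like the sketch below. The `SynthesisRequest` and `VoicePreview` types and their fields are hypothetical; the specification does not define the parameters of `provider.synthesize()`.

```java
import java.util.Base64;

/** Hypothetical request object: either a voiceId or a file URL plus reference text. */
final class SynthesisRequest {
    final String text;
    final String voiceId;        // set when previewing a cloned voice
    final String fileUrl;        // set when previewing from reference audio
    final String referenceText;  // stored transcription, used with fileUrl

    SynthesisRequest(String text, String voiceId, String fileUrl, String referenceText) {
        this.text = text;
        this.voiceId = voiceId;
        this.fileUrl = fileUrl;
        this.referenceText = referenceText;
    }
}

final class VoicePreview {
    /** Prefers the cloned voice; falls back to reference audio when no voiceId exists. */
    static SynthesisRequest buildRequest(String text, String voiceId, String fileUrl, String transcription) {
        if (voiceId != null && !voiceId.isEmpty()) {
            return new SynthesisRequest(text, voiceId, null, null);
        }
        if (fileUrl != null && !fileUrl.isEmpty()) {
            return new SynthesisRequest(text, null, fileUrl, transcription);
        }
        throw new IllegalStateException("Voice configuration has neither a voiceId nor an audio file");
    }

    /** Encodes the synthesized audio bytes for the API response. */
    static String toBase64(byte[] audio) {
        return Base64.getEncoder().encodeToString(audio);
    }
}
```

The bytes returned by `provider.synthesize()` for the built request would then be passed through `VoicePreview.toBase64` before being placed in the response.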

## REMOVED Requirements

None. This change only adds and refactors functionality; nothing is removed.