feat: feature optimization

2026-01-27 01:39:08 +08:00
parent bf12e70339
commit 24f66c8e81
24 changed files with 1570 additions and 133 deletions
--- a/openspec/changes/refactor-voice-provider/specs/voice-clone/spec.md
+++ b/openspec/changes/refactor-voice-provider/specs/voice-clone/spec.md
@@ -0,0 +1,132 @@
+# Voice Clone Capability Specification
+
+## ADDED Requirements
+
+### Requirement: Provider Abstraction Layer
+The system SHALL provide a unified provider abstraction layer for voice cloning services, supporting multiple vendors through a common interface.
+
+#### Scenario: Get provider by type
+- **GIVEN** the system is configured with multiple voice clone providers
+- **WHEN** requesting a provider by type
+- **THEN** the system SHALL return the corresponding provider instance
+- **AND** the provider SHALL implement the `VoiceCloneProvider` interface
+
+#### Scenario: Provider not found
+- **GIVEN** the system is configured with a default provider
+- **WHEN** requesting a non-existent provider type
+- **THEN** the system SHALL fall back to the default provider
+- **AND** log a warning message
+
+### Requirement: Voice Cloning
+The system SHALL support voice cloning through the provider interface, accepting an audio file URL and returning a unique voice ID.
+
+#### Scenario: Successful voice cloning with CosyVoice
+- **GIVEN** a valid CosyVoice provider is configured
+- **WHEN** submitting a voice clone request with an audio URL
+- **THEN** the system SHALL return a voice ID
+- **AND** the voice ID SHALL be usable for subsequent TTS synthesis
+
+#### Scenario: Voice cloning failure
+- **GIVEN** the provider API is unavailable or returns an error
+- **WHEN** submitting a voice clone request
+- **THEN** the system SHALL throw a `VOICE_TTS_FAILED` exception
+- **AND** log the error details for debugging
+
+### Requirement: Text-to-Speech Synthesis
+The system SHALL support TTS synthesis through cloned voices or system voices, accepting text input and returning audio data.
+
+#### Scenario: TTS with cloned voice
+- **GIVEN** a valid voice ID from a previous clone operation
+- **WHEN** submitting a TTS request with text and voice ID
+- **THEN** the system SHALL return audio data in the specified format
+- **AND** the audio SHALL match the cloned voice characteristics
+
+#### Scenario: TTS with system voice
+- **GIVEN** a system voice ID is configured
+- **WHEN** submitting a TTS request with text and system voice ID
+- **THEN** the system SHALL return audio data using the system voice
+- **AND** the audio SHALL match the system voice characteristics
+
+#### Scenario: TTS with reference audio (file URL)
+- **GIVEN** a reference audio URL and transcription text
+- **WHEN** submitting a TTS request with the file URL
+- **THEN** the system SHALL perform on-the-fly voice cloning
+- **AND** return audio data matching the reference voice
+
+### Requirement: Configuration Management
+The system SHALL support multi-provider configuration through a unified configuration structure.
+
+#### Scenario: Configure multiple providers
+- **GIVEN** the application configuration file
+- **WHEN** configuring multiple voice providers
+- **THEN** each provider SHALL have an independent `enabled` flag
+- **AND** the system SHALL only use enabled providers
+
+#### Scenario: Default provider selection
+- **GIVEN** the configuration specifies a `default-provider`
+- **WHEN** no provider is explicitly specified
+- **THEN** the system SHALL use the default provider
+- **AND** fall back to `cosyvoice` if no default is configured
+
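The selection rules in the two scenarios above can be sketched as a small resolver. This is a minimal illustration, not the project's code: `VoiceProviderSettings`, `resolveType`, and `isEnabled` are hypothetical names, and only the `enabled` flag, the `default-provider` key, and the `cosyvoice` fallback come from the spec.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical settings holder; only `enabled`, `default-provider`, and the
// "cosyvoice" fallback are taken from the spec. All other names are assumed.
final class VoiceProviderSettings {
    static final String FALLBACK_PROVIDER = "cosyvoice";

    private final String defaultProvider;                 // may be null or blank
    private final Map<String, Boolean> enabledByProvider; // provider type -> enabled flag

    VoiceProviderSettings(String defaultProvider, Map<String, Boolean> enabledByProvider) {
        this.defaultProvider = defaultProvider;
        this.enabledByProvider = new LinkedHashMap<>(enabledByProvider);
    }

    // A provider counts as enabled only if its flag is explicitly true.
    boolean isEnabled(String type) {
        return Boolean.TRUE.equals(enabledByProvider.get(type));
    }

    // An explicit type wins; otherwise default-provider; otherwise "cosyvoice".
    String resolveType(String requestedType) {
        if (requestedType != null && !requestedType.trim().isEmpty()) {
            return requestedType;
        }
        if (defaultProvider != null && !defaultProvider.trim().isEmpty()) {
            return defaultProvider;
        }
        return FALLBACK_PROVIDER;
    }
}
```

The spec does not say what should happen when the resolved provider is disabled, so the sketch exposes `isEnabled` and leaves that decision to the caller.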
+#### Scenario: Backward compatibility
+- **GIVEN** existing configuration using `yudao.cosyvoice.*`
+- **WHEN** the system starts
+- **THEN** the system SHALL automatically migrate it to the new configuration structure
+- **AND** existing functionality SHALL remain unchanged
+
+### Requirement: Provider Factory
+The system SHALL provide a factory component for managing provider instances and resolving providers by type.
+
+#### Scenario: Factory resolves provider
+- **GIVEN** the factory is initialized with provider configurations
+- **WHEN** calling `factory.getProvider("cosyvoice")`
+- **THEN** the factory SHALL return the CosyVoiceProvider instance
+- **AND** cache the instance for subsequent requests
+
+#### Scenario: Factory returns default
+- **GIVEN** the factory is configured with a default provider
+- **WHEN** calling `factory.getProvider(null)`
+- **THEN** the factory SHALL return the default provider instance
+
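One way the factory scenarios above (lookup by type, caching, `null` returning the default) and the earlier "Provider not found" fallback could fit together is sketched below. The method signatures on `VoiceCloneProvider`, the class name `VoiceCloneProviderFactory`, and the use of `Supplier` for lazy creation are assumptions; the spec names only the interface, `cloneVoice()`, `synthesize()`, and `factory.getProvider(...)`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import java.util.logging.Logger;

// Assumed interface shape; the spec does not give parameter or return types.
interface VoiceCloneProvider {
    String cloneVoice(String audioUrl);
    byte[] synthesize(String text, String voiceId);
}

// Hypothetical factory: creates providers lazily and caches one instance per type.
final class VoiceCloneProviderFactory {
    private static final Logger LOG =
            Logger.getLogger(VoiceCloneProviderFactory.class.getName());

    private final String defaultType;
    private final Map<String, Supplier<VoiceCloneProvider>> creators;
    private final Map<String, VoiceCloneProvider> cache = new ConcurrentHashMap<>();

    VoiceCloneProviderFactory(String defaultType,
                              Map<String, Supplier<VoiceCloneProvider>> creators) {
        if (!creators.containsKey(defaultType)) {
            throw new IllegalArgumentException(
                    "No creator registered for default provider: " + defaultType);
        }
        this.defaultType = defaultType;
        this.creators = creators;
    }

    VoiceCloneProvider getProvider(String type) {
        String resolved = type;
        if (resolved == null) {
            // No type requested: use the default provider.
            resolved = defaultType;
        } else if (!creators.containsKey(resolved)) {
            // Unknown type: warn and fall back to the default provider.
            LOG.warning("Unknown voice provider '" + type
                    + "', falling back to '" + defaultType + "'");
            resolved = defaultType;
        }
        // Create on first use, then reuse the cached instance.
        return cache.computeIfAbsent(resolved, key -> creators.get(key).get());
    }
}
```

In a Spring application the same role could be played by injected beans keyed by provider type; the map of suppliers here only keeps the sketch self-contained.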
+## MODIFIED Requirements
+
+### Requirement: Voice Creation Flow
+The voice creation process SHALL use the provider abstraction layer instead of calling the CosyVoice client directly.
+
+#### Scenario: Create voice with CosyVoice
+- **GIVEN** a user uploads a voice audio file
+- **WHEN** creating a voice configuration through the API
+- **THEN** the system SHALL:
+  1. Validate that the file exists and belongs to the voice category
+  2. Call `provider.cloneVoice()` with the audio URL
+  3. Store the returned `voiceId` in the database
+  4. Return a success response with the voice configuration ID
+
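The four steps of "Create voice with CosyVoice" can be sketched as a single method. Everything here is illustrative: `VoiceCreationService`, `AudioFile`, the `"voice"` category string, and the in-memory maps standing in for file storage and the database are assumptions, and `cloneVoice` is passed in as a plain function in place of `provider.cloneVoice()`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical service; the maps stand in for file storage and the database.
final class VoiceCreationService {
    // Hypothetical file record: URL plus the category it was uploaded under.
    static final class AudioFile {
        final String url;
        final String category;
        AudioFile(String url, String category) {
            this.url = url;
            this.category = category;
        }
    }

    private final Map<Long, AudioFile> files;          // stand-in for file storage
    private final Function<String, String> cloneVoice; // stand-in for provider.cloneVoice(audioUrl)
    private final Map<Long, String> voiceIdByConfigId = new HashMap<>(); // stand-in for the database
    private long nextConfigId = 1;

    VoiceCreationService(Map<Long, AudioFile> files, Function<String, String> cloneVoice) {
        this.files = files;
        this.cloneVoice = cloneVoice;
    }

    long createVoice(long fileId) {
        // Step 1: the file must exist and belong to the voice category.
        AudioFile file = files.get(fileId);
        if (file == null || !"voice".equals(file.category)) {
            throw new IllegalArgumentException(
                    "File " + fileId + " is missing or not in the voice category");
        }
        // Step 2: clone the voice from the audio URL.
        String voiceId = cloneVoice.apply(file.url);
        // Step 3: persist the returned voiceId.
        long configId = nextConfigId++;
        voiceIdByConfigId.put(configId, voiceId);
        // Step 4: return the new voice configuration ID.
        return configId;
    }

    String voiceIdOf(long configId) {
        return voiceIdByConfigId.get(configId);
    }
}
```

Validation runs before the provider call, so an invalid file never triggers a clone request.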
+#### Scenario: Create voice with transcription
+- **GIVEN** a voice configuration is created without transcription
+- **WHEN** the user triggers transcription
+- **THEN** the system SHALL:
+  1. Fetch the audio file URL
+  2. Call the transcription service
+  3. Store the transcription text
+  4. Update the voice configuration
+
+### Requirement: Voice Preview
+The voice preview functionality SHALL work with both cloned voices (voiceId) and reference audio (file URL).
+
+#### Scenario: Preview cloned voice
+- **GIVEN** a voice configuration with a valid `voiceId`
+- **WHEN** requesting a preview with custom text
+- **THEN** the system SHALL call `provider.synthesize()` with the voiceId
+- **AND** return audio data in Base64 format
+
+#### Scenario: Preview with reference audio
+- **GIVEN** a voice configuration without `voiceId` but with an audio file
+- **WHEN** requesting a preview
+- **THEN** the system SHALL call `provider.synthesize()` with the file URL
+- **AND** use the stored transcription as reference text
+- **AND** return audio data in Base64 format
+
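The two preview scenarios above differ only in which input drives synthesis, which could be sketched as one branch. The spec routes both through `provider.synthesize()`; the sketch splits that into two hypothetical methods on an assumed `Synthesizer` interface purely to make the branches visible. `VoicePreview` and all parameter names are assumptions.

```java
import java.util.Base64;

// Hypothetical helper showing the branch between cloned-voice and reference-audio preview.
final class VoicePreview {
    // Assumed seam over the provider; the spec names a single synthesize() call.
    interface Synthesizer {
        byte[] withVoiceId(String text, String voiceId);
        byte[] withReference(String text, String fileUrl, String referenceText);
    }

    static String preview(Synthesizer synthesizer, String text,
                          String voiceId, String fileUrl, String transcription) {
        byte[] audio;
        if (voiceId != null && !voiceId.isEmpty()) {
            // Cloned voice: synthesize directly from the stored voiceId.
            audio = synthesizer.withVoiceId(text, voiceId);
        } else if (fileUrl != null && !fileUrl.isEmpty()) {
            // Reference audio: the stored transcription serves as reference text.
            audio = synthesizer.withReference(text, fileUrl, transcription);
        } else {
            throw new IllegalStateException(
                    "Voice configuration has neither a voiceId nor an audio file");
        }
        // Both branches return the audio as a Base64 string.
        return Base64.getEncoder().encodeToString(audio);
    }
}
```

Preferring `voiceId` when both inputs are present is a choice made for this sketch; the spec covers only the cases where exactly one of them is available.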
+## REMOVED Requirements
+
+None. This change is additive and refactoring only.