
When will CosyVoice be supported? #29

Open
try2020-code opened this issue Jul 11, 2024 · 4 comments

@try2020-code

It seems to be a recently added speech synthesis model. It works really well. Hoping it can be added.

@ikesnowy (Contributor)

The docs don't seem to have an HTTP API for this, so I'll have to dig into the Java SDK source code.

@ikesnowy (Contributor)

OK, turns out it's a WebSocket API, it's not even HTTP.

@oldmanpushcart commented Aug 20, 2024

@ikesnowy

I happen to have implemented this in Java for a hobby project of mine. Let me share the rough protocol format dashscope uses for the WebSocket interaction. I hope this helps you with developing the .NET version of the SDK.

Overview

The dashscope WSS endpoint is: wss://dashscope.aliyuncs.com/api-ws/v1/inference/

Once the WebSocket channel is opened, the server starts a task bound to that channel. The task-id is generated by the client; it only needs to be unique within a reasonable time window, so a UUID is recommended.

The following uses cosyvoice-v1 in full-duplex mode as an example.
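Before the frames below, here is a minimal Java (11+) sketch of how the channel itself might be opened. The DASHSCOPE_API_KEY environment variable name and the bearer-style Authorization header are assumptions on my part, not something covered above, so check the official docs for the exact authentication header.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.UUID;

public class CosyVoiceWsSketch {

    static final String ENDPOINT = "wss://dashscope.aliyuncs.com/api-ws/v1/inference/";

    public static void main(String[] args) {
        String apiKey = System.getenv("DASHSCOPE_API_KEY"); // hypothetical env var name
        String taskId = UUID.randomUUID().toString();       // client-generated task id

        // Open the WebSocket channel; an empty listener is enough for this sketch.
        WebSocket ws = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .header("Authorization", "bearer " + apiKey) // assumed auth header
                .buildAsync(URI.create(ENDPOINT), new WebSocket.Listener() {})
                .join();

        // Next: send the run-task / continue-task / finish-task frames shown below.
    }
}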

Client requests

After the WebSocket channel is opened, the client sends three kinds of requests in order: start task, data transfer (multiple times), finish task.

Start task

  • Direction: client -> server
  • Instruction: start task
{
  "header": {
    "action": "run-task",
    "task_id": "439e0616-2f5b-44e0-8872-0002a066a49c",
    "streaming": "duplex"
  },
  "payload": {
    "model": "cosyvoice-v1",
    "task_group": "audio",
    "task": "tts",
    "function": "SpeechSynthesizer",
    "input": {},
    "parameters": {
      "voice": "longxiaochun",
      "volume": 50,
      "text_type": "PlainText",
      "sample_rate": 0,
      "rate": 1.0,
      "phoneme_timestamp_enabled": false,
      "format": "Default",
      "pitch": 1.0,
      "word_timestamp_enabled": false
    }
  }
}
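For reference, a hedged Java sketch (continuing the connection sketch above) of how the run-task frame could be built and sent as a single text message; it only keeps a subset of the parameters from the JSON above, and the text block syntax needs Java 15+.

static void sendRunTask(java.net.http.WebSocket ws, String taskId) {
    // Mirror the run-task JSON shown above, with the client-generated task id filled in.
    String runTask = """
            {
              "header": { "action": "run-task", "task_id": "%s", "streaming": "duplex" },
              "payload": {
                "model": "cosyvoice-v1",
                "task_group": "audio",
                "task": "tts",
                "function": "SpeechSynthesizer",
                "input": {},
                "parameters": { "voice": "longxiaochun", "volume": 50, "text_type": "PlainText", "format": "Default" }
              }
            }
            """.formatted(taskId);
    ws.sendText(runTask, true).join(); // 'true' marks the end of this text message
}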

Data transfer (multiple times)

  • Direction: client -> server
  • Instruction: transfer data
  • Note: if there are multiple segments or sentences to synthesize, send this request once per piece of text
{
  "header": {
    "action": "continue-task",
    "task_id": "439e0616-2f5b-44e0-8872-0002a066a49c",
    "streaming": "duplex"
  },
  "payload": {
    "model": "cosyvoice-v1",
    "task_group": "audio",
    "task": "tts",
    "function": "SpeechSynthesizer",
    "input": {
      "text": "今天天气怎么样?"
    },
    "parameters": {
      "voice": "longxiaochun",
      "volume": 50,
      "text_type": "PlainText",
      "sample_rate": 0,
      "rate": 1.0,
      "phoneme_timestamp_enabled": false,
      "format": "Default",
      "pitch": 1.0,
      "word_timestamp_enabled": false
    }
  }
}
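A matching sketch for the continue-task frame, called once per sentence while the channel is open; note the text is interpolated naively here, so real code should JSON-escape it.

static void sendContinueTask(java.net.http.WebSocket ws, String taskId, String text) {
    // NOTE: 'text' is spliced in without escaping; escape it properly in real code.
    String continueTask = """
            {
              "header": { "action": "continue-task", "task_id": "%s", "streaming": "duplex" },
              "payload": {
                "model": "cosyvoice-v1",
                "task_group": "audio",
                "task": "tts",
                "function": "SpeechSynthesizer",
                "input": { "text": "%s" },
                "parameters": { "voice": "longxiaochun", "text_type": "PlainText" }
              }
            }
            """.formatted(taskId, text);
    ws.sendText(continueTask, true).join();
}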

Finish task

  • Direction: client -> server
  • Instruction: finish task
{
  "header": {
    "action": "finish-task",
    "task_id": "439e0616-2f5b-44e0-8872-0002a066a49c",
    "streaming": "duplex"
  },
  "payload": {
    "input": {}
  }
}
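And a sketch for the finish-task frame, which tells the server no more text is coming; after this the server flushes any remaining audio and then ends the task.

static void sendFinishTask(java.net.http.WebSocket ws, String taskId) {
    String finishTask = """
            {
              "header": { "action": "finish-task", "task_id": "%s", "streaming": "duplex" },
              "payload": { "input": {} }
            }
            """.formatted(taskId);
    ws.sendText(finishTask, true).join();
}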

Server responses

After the client starts the task, the server begins responding to the speech synthesis request, also with three kinds of messages: task started, result generated (multiple times), task finished/failed.

Task started

{
  "header": {
    "task_id": "439e0616-2f5b-44e0-8872-0002a066a49c",
    "event": "task-started",
    "attributes": {}
  },
  "payload": {}
}

Result generated (multiple times)

The data received falls into two kinds: result-generated responses and audio data chunks; there is no strict ordering between the two.

Kind 1: result-generated response (JSON)

For cosyvoice-v1 this carries no meaningful content, but other models use it to return sentence- and word-level timestamps, pitch, phoneme information, etc., which is handy for building features like subtitles.

{
  "header": {
    "task_id": "439e0616-2f5b-44e0-8872-0002a066a49c",
    "event": "result-generated",
    "attributes": {}
  },
  "payload": {
    "output": {
      "sentence": {
        "words": []
      }
    },
    "usage": null
  }
}
Kind 2: audio data chunk (ByteBuffer)

This is a binary audio chunk; you can save it to a file, or feed it straight into an audio output stream for playback.
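As an illustration, a hedged sketch of a java.net.http WebSocket.Listener that collects the binary frames into memory; you could just as well pipe them straight into an audio player.

import java.io.ByteArrayOutputStream;
import java.net.http.WebSocket;
import java.nio.ByteBuffer;
import java.util.concurrent.CompletionStage;

class AudioCollectingListener implements WebSocket.Listener {
    final ByteArrayOutputStream audio = new ByteArrayOutputStream();

    @Override
    public CompletionStage<?> onBinary(WebSocket ws, ByteBuffer data, boolean last) {
        byte[] chunk = new byte[data.remaining()];
        data.get(chunk);
        audio.write(chunk, 0, chunk.length);   // append this audio chunk
        ws.request(1);                         // ask for the next frame
        return null;
    }
}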

Task finished/failed

The task can end in one of two states: success or failure.

Task finished
{
  "header": {
    "task_id": "439e0616-2f5b-44e0-8872-0002a066a49c",
    "event": "task-finished",
    "attributes": {}
  },
  "payload": {
    "output": null,
    "usage": {
      "characters": 15
    }
  }
}
Task failed
{
  "header": {
    "task_id": "439e0616-2f5b-44e0-8872-0002a066a49c",
    "event": "task-failed",
    "error_code": "InvalidParameter",
    "error_message": "request timeout after 23 seconds.",
    "attributes": {}
  },
  "payload": {}
}
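Finally, a sketch of handling these control events on the text channel of the same listener; the substring check on the event field is just a stand-in for a proper JSON parser, only meant to show the flow.

import java.net.http.WebSocket;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

class EventListener implements WebSocket.Listener {
    final CompletableFuture<Void> done = new CompletableFuture<>();
    final StringBuilder buf = new StringBuilder();

    @Override
    public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
        buf.append(data);
        if (last) {
            String message = buf.toString();
            buf.setLength(0);
            if (message.contains("\"task-finished\"")) {
                done.complete(null);                 // success: all audio has been sent
            } else if (message.contains("\"task-failed\"")) {
                // surface error_code / error_message to the caller
                done.completeExceptionally(new RuntimeException(message));
            }
            // "task-started" / "result-generated" need no action for cosyvoice-v1
        }
        ws.request(1); // ask for the next frame
        return null;
    }
}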

@ikesnowy (Contributor)

Thanks, this is very helpful. I'll find some time over the weekend to look into it.
