How to get embeddings of audio data streaming from microphone. #56

Open
gaushh opened this issue Apr 17, 2021 · 4 comments

@gaushh

gaushh commented Apr 17, 2021

I am using Resemblyzer to create embeddings for speaker diarization.
It works fine when a whole wave file is loaded into Resemblyzer.
Now I want to try real-time speaker diarization on audio streamed from the microphone with pyaudio (in the form of chunks).
A chunk is essentially a frame of fixed size (100 ms in my case).
How do I get a separate embedding for each chunk using Resemblyzer?
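A minimal sketch of what reading one such chunk might look like (assuming 16 kHz mono int16 capture with pyaudio; the parameter values here are illustrative, not taken from this issue):

    import numpy as np
    import pyaudio

    RATE = 16000
    CHUNK = RATE // 10  # 100 ms = 1600 samples at 16 kHz

    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)

    # Each read returns 100 ms of int16 samples; scale to float32 in [-1, 1],
    # which is the waveform representation Resemblyzer works with.
    data = stream.read(CHUNK)
    chunk = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0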

gaushh changed the title from "How to get embeddings of audio data streaming from pyaudio." to "How to get embeddings of audio data streaming from microphone." on Apr 17, 2021
@CorentinJ
Contributor

The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding. If you have that already, that's great.

Take a look at embed_utterance(). Partial embeddings are created by forwarding chunks of the mel spectrogram of the audio. These chunks are extracted from the audio at specific locations predetermined by compute_partial_slices. You can copy the code in embed_utterance() and call compute_partial_slices with a very large number to know where to split chunks in your streaming audio. Forward a chunk to get a single partial embedding.
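
A rough sketch of that approach (not code from the repository: it assumes the compute_partial_slices(n_samples, rate, min_coverage) signature and that it returns (wav_slices, mel_slices) the way embed_utterance() uses them; the buffering and the push_chunk() helper are made up for illustration):

    import numpy as np
    import torch

    from resemblyzer import VoiceEncoder
    from resemblyzer.audio import wav_to_mel_spectrogram
    from resemblyzer.hparams import sampling_rate

    encoder = VoiceEncoder("cpu")

    # Precompute the partial-slice layout once, passing a very large sample count
    # so the slices cover far more audio than the stream will ever deliver.
    wav_slices, mel_slices = VoiceEncoder.compute_partial_slices(
        60 * 60 * sampling_rate,  # one hour of audio, i.e. "a very large number"
        1.3,                      # rate: partial embeddings per second
        0.75,                     # minimum coverage of the last partial
    )
    frames_per_partial = mel_slices[0].stop - mel_slices[0].start

    stream_buffer = np.zeros(0, dtype=np.float32)  # samples received so far
    next_partial = 0                               # index of the next window to embed

    def push_chunk(chunk):
        """Append one chunk of float32 16 kHz audio and embed every window it completes."""
        global stream_buffer, next_partial
        stream_buffer = np.concatenate((stream_buffer, chunk))
        partial_embeds = []
        while (next_partial < len(wav_slices) and
               wav_slices[next_partial].stop <= len(stream_buffer)):
            # Enough audio has arrived for one more partial window.
            wav_window = stream_buffer[wav_slices[next_partial]]
            mel = wav_to_mel_spectrogram(wav_window)[:frames_per_partial]
            with torch.no_grad():
                mels = torch.from_numpy(mel[None, ...]).to(encoder.device)
                partial_embeds.append(encoder(mels)[0].cpu().numpy())
            next_partial += 1
        return partial_embeds

Each call then returns zero or more partial embeddings, one per window (roughly 1.6 s of audio with the default settings) that the newly received chunk completes; those vectors are what you would feed to a clustering step.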

@gaushh
Author

gaushh commented Apr 19, 2021

> The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding.

To do that, I'm using the code provided by Google for streaming speech recognition on an audio stream.

I am getting embeddings, but I believe I'm doing something wrong, since the clustering algorithm produces only a single class (cluster) when I try to perform speaker diarization on the extracted embeddings.

Here's what my code looks like:

    import wave

    import numpy as np
    import pyaudio
    from six.moves import queue

    from resemblyzer import preprocess_wav, VoiceEncoder
    from pathlib import Path

    from links_clustering.links_cluster import LinksCluster

    # Audio recording parameters
    RATE = 16000
    CHUNK = int(RATE)  # one second of audio per buffer
    FORMAT = pyaudio.paInt16
    CHANNELS = 1

    encoder = VoiceEncoder("cpu")
    links_cluster = LinksCluster(0.5, 0.5, 0.5)

    class MicrophoneStream(object):
        """Opens a recording stream as a generator yielding the audio chunks."""

        def __init__(self, rate, chunk):
            self._rate = rate
            self._chunk = chunk

            # Create a thread-safe buffer of audio data
            self._buff = queue.Queue()
            self.closed = True

        def __enter__(self):
            self._audio_interface = pyaudio.PyAudio()
            self._audio_stream = self._audio_interface.open(
                format=FORMAT,
                # The API currently only supports 1-channel (mono) audio
                # https://goo.gl/z757pE
                channels=CHANNELS,
                rate=self._rate,
                input=True,
                frames_per_buffer=self._chunk,
                # Run the audio stream asynchronously to fill the buffer object.
                # This is necessary so that the input device's buffer doesn't
                # overflow while the calling thread makes network requests, etc.
                stream_callback=self._fill_buffer,
            )

            self.closed = False

            return self

        def __exit__(self, type, value, traceback):
            self._audio_stream.stop_stream()
            self._audio_stream.close()
            self.closed = True
            # Signal the generator to terminate so that the client's
            # streaming_recognize method will not block the process termination.
            self._buff.put(None)
            self._audio_interface.terminate()

        def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
            """Continuously collect data from the audio stream, into the buffer."""
            self._buff.put(in_data)
            return None, pyaudio.paContinue

        def generator(self):
            while not self.closed:
                # Use a blocking get() to ensure there's at least one chunk of
                # data, and stop iteration if the chunk is None, indicating the
                # end of the audio stream.
                chunk = self._buff.get()
                if chunk is None:
                    return
                data = [chunk]
                # Now consume whatever other data's still buffered.
                while True:
                    try:
                        chunk = self._buff.get(block=False)
                        if chunk is None:
                            return
                        data.append(chunk)
                    except queue.Empty:
                        break
                yield b"".join(data)


    def main():
        with MicrophoneStream(RATE, CHUNK) as stream:
            audio_generator = stream.generator()
            for content in audio_generator:
                # Raw paInt16 bytes: decode as int16 and scale to float32 in [-1, 1].
                numpy_array = np.frombuffer(content, dtype=np.int16).astype(np.float32) / 32768.0
                wav = preprocess_wav(numpy_array)
                _, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)
                predicted_cluster = links_cluster.predict(cont_embeds[0])
                print("predicted_cluster :", predicted_cluster)
                print("------------")


    def write_frame(file_name, data):
        # Helper (not called from main) for dumping raw frames to a .wav file.
        wf = wave.open(file_name, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(pyaudio.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(data)
        wf.close()


    if __name__ == "__main__":
        main()

@milind-soni

How do you avoid losing information when you split a file into chunks?

@MichaelScofield123

> I am using Resemblyzer to create embeddings for speaker diarization. It works fine when a whole wave file is loaded into Resemblyzer. Now I want to try real-time speaker diarization on audio streamed from the microphone with pyaudio (in the form of chunks). A chunk is essentially a frame of fixed size (100 ms in my case). How do I get a separate embedding for each chunk using Resemblyzer?

I am also trying to implement this. Have you implemented it, or do you have any good suggestions? My email is [email protected]; I hope to hear from you. Thanks.
