How to get embeddings of audio data streaming from microphone. #56

Open
gaushh opened this issue Apr 17, 2021 · 4 comments

@gaushh

gaushh commented Apr 17, 2021

I am using Resemblyzer to create embeddings for speaker diarization.
It works fine when a whole wave file is loaded into Resemblyzer.
Now I want to try real-time speaker diarization on audio streamed from the microphone with pyaudio (in the form of chunks).
A chunk is essentially a frame of fixed size (100 ms in my case).
How do I get a separate embedding for each chunk using Resemblyzer?
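A minimal sketch of what reading one such chunk might look like (assuming 16 kHz mono int16 capture with pyaudio; the parameter values here are illustrative, not taken from this issue):

    import numpy as np
    import pyaudio

    RATE = 16000
    CHUNK = RATE // 10  # 100 ms = 1600 samples at 16 kHz

    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)

    # Each read returns 100 ms of int16 samples; scale to float32 in [-1, 1],
    # which is the waveform representation Resemblyzer works with.
    data = stream.read(CHUNK)
    chunk = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0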

gaushh changed the title from "How to get embeddings of audio data streaming from pyaudio." to "How to get embeddings of audio data streaming from microphone." on Apr 17, 2021
@CorentinJ
Contributor

The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding. If you have that already, that's great.

Take a look at embed_utterance(). Partial embeddings are created by forwarding chunks of the mel spectrogram of the audio. These chunks are extracted from the audio at specific locations predetermined by compute_partial_slices. You can copy the code in embed_utterance() and call compute_partial_slices with a very large number to know where to split chunks in your streaming audio. Forward a chunk to get a single partial embedding.
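
A rough sketch of that approach (not code from the repository: it assumes the compute_partial_slices(n_samples, rate, min_coverage) signature and that it returns (wav_slices, mel_slices) the way embed_utterance() uses them; the buffering and the push_chunk() helper are made up for illustration):

    import numpy as np
    import torch

    from resemblyzer import VoiceEncoder
    from resemblyzer.audio import wav_to_mel_spectrogram
    from resemblyzer.hparams import sampling_rate

    encoder = VoiceEncoder("cpu")

    # Precompute the partial-slice layout once, passing a very large sample count
    # so the slices cover far more audio than the stream will ever deliver.
    wav_slices, mel_slices = VoiceEncoder.compute_partial_slices(
        60 * 60 * sampling_rate,  # one hour of audio, i.e. "a very large number"
        1.3,                      # rate: partial embeddings per second
        0.75,                     # minimum coverage of the last partial
    )
    frames_per_partial = mel_slices[0].stop - mel_slices[0].start

    stream_buffer = np.zeros(0, dtype=np.float32)  # samples received so far
    next_partial = 0                               # index of the next window to embed

    def push_chunk(chunk):
        """Append one chunk of float32 16 kHz audio and embed every window it completes."""
        global stream_buffer, next_partial
        stream_buffer = np.concatenate((stream_buffer, chunk))
        partial_embeds = []
        while (next_partial < len(wav_slices) and
               wav_slices[next_partial].stop <= len(stream_buffer)):
            # Enough audio has arrived for one more partial window.
            wav_window = stream_buffer[wav_slices[next_partial]]
            mel = wav_to_mel_spectrogram(wav_window)[:frames_per_partial]
            with torch.no_grad():
                mels = torch.from_numpy(mel[None, ...]).to(encoder.device)
                partial_embeds.append(encoder(mels)[0].cpu().numpy())
            next_partial += 1
        return partial_embeds

Each call then returns zero or more partial embeddings, one per window (roughly 1.6 s of audio with the default settings) that the newly received chunk completes; those vectors are what you would feed to a clustering step.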

@gaushh
Author

gaushh commented Apr 19, 2021

> The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding.

To do that, I'm using the code provided by Google for streaming speech recognition on an audio stream.

I am getting embeddings, but I believe I'm doing something wrong, since the clustering algorithm produces only a single class (cluster) when I try to perform speaker diarization on the extracted embeddings.

Here's what my code looks like:

    import wave

    import numpy as np
    import pyaudio
    from six.moves import queue

    from resemblyzer import preprocess_wav, VoiceEncoder
    from pathlib import Path

    from links_clustering.links_cluster import LinksCluster

    # Audio recording parameters
    RATE = 16000
    CHUNK = int(RATE)  # one second of audio per buffer
    FORMAT = pyaudio.paInt16
    CHANNELS = 1

    encoder = VoiceEncoder("cpu")
    links_cluster = LinksCluster(0.5, 0.5, 0.5)

    class MicrophoneStream(object):
        """Opens a recording stream as a generator yielding the audio chunks."""

        def __init__(self, rate, chunk):
            self._rate = rate
            self._chunk = chunk

            # Create a thread-safe buffer of audio data
            self._buff = queue.Queue()
            self.closed = True

        def __enter__(self):
            self._audio_interface = pyaudio.PyAudio()
            self._audio_stream = self._audio_interface.open(
                format=FORMAT,
                # The API currently only supports 1-channel (mono) audio
                # https://goo.gl/z757pE
                channels=CHANNELS,
                rate=self._rate,
                input=True,
                frames_per_buffer=self._chunk,
                # Run the audio stream asynchronously to fill the buffer object.
                # This is necessary so that the input device's buffer doesn't
                # overflow while the calling thread makes network requests, etc.
                stream_callback=self._fill_buffer,
            )

            self.closed = False

            return self

        def __exit__(self, type, value, traceback):
            self._audio_stream.stop_stream()
            self._audio_stream.close()
            self.closed = True
            # Signal the generator to terminate so that the client's
            # streaming_recognize method will not block the process termination.
            self._buff.put(None)
            self._audio_interface.terminate()

        def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
            """Continuously collect data from the audio stream, into the buffer."""
            self._buff.put(in_data)
            return None, pyaudio.paContinue

        def generator(self):
            while not self.closed:
                # Use a blocking get() to ensure there's at least one chunk of
                # data, and stop iteration if the chunk is None, indicating the
                # end of the audio stream.
                chunk = self._buff.get()
                if chunk is None:
                    return
                data = [chunk]
                # Now consume whatever other data's still buffered.
                while True:
                    try:
                        chunk = self._buff.get(block=False)
                        if chunk is None:
                            return
                        data.append(chunk)
                    except queue.Empty:
                        break
                yield b"".join(data)


    def main():
        with MicrophoneStream(RATE, CHUNK) as stream:
            audio_generator = stream.generator()
            for content in audio_generator:
                # Raw paInt16 bytes: decode as int16 and scale to float32 in [-1, 1].
                numpy_array = np.frombuffer(content, dtype=np.int16).astype(np.float32) / 32768.0
                wav = preprocess_wav(numpy_array)
                _, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)
                predicted_cluster = links_cluster.predict(cont_embeds[0])
                print("predicted_cluster :", predicted_cluster)
                print("------------")


    def write_frame(file_name, data):
        # Helper (not called from main) for dumping raw frames to a .wav file.
        wf = wave.open(file_name, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(pyaudio.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(data)
        wf.close()


    if __name__ == "__main__":
        main()

@milind-soni

How do you avoid losing information when you split a file into chunks?

@MichaelScofield123

> I am using Resemblyzer to create embeddings for speaker diarization. It works fine when a whole wave file is loaded into Resemblyzer. Now I want to try real-time speaker diarization on audio streamed from the microphone with pyaudio (in the form of chunks). A chunk is essentially a frame of fixed size (100 ms in my case). How do I get a separate embedding for each chunk using Resemblyzer?

I am also trying to implement this. Have you implemented it, or do you have any good suggestions? My email is [email protected]; I hope to hear from you. Thanks.
