Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend whitening tests #3531

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

JoeZiminski
Copy link
Collaborator

@JoeZiminski JoeZiminski commented Nov 12, 2024

Further to #3505, this PR extends the whitening tests to check the values are correct for both int16 and float32, as well as checking most of the parameters. Also, some functions are factored out in whiten.py to help with testing.

For the most part the approach is to generate some test data with known covariance, then compute whitening and check that a) the covariance matrix is recovered correctly b) the whitened data is indeed white.

Some points for discussion @yger would be great to get your thoughts:

  1. apply_mean=False by default, I don't think (?) there is any reason not to remove the mean prior to computing the covariance, which will reduce bias. Indeed the whitened data is not really white when apply_mean=False but is when apply_mean=True. Should this be default True?

  2. For regularizing, regularize_kwargs["assume_centered"] = True is always set in the code, though the default apply_mean=False as above. I added an error in this case, to force the user to switch to apply_mean=True. Also, in case the user explicitly sets regularize_kwargs["assume_centered"] = False an error is raised, as this will be overwritten to True. However, maybe a better approach is to set regularize_kwargs["assume_centered"] directly from apply_mean?

  3. At the moment sklearn_covariance provides some features that are represented as classes, and others as functions. At the moment the regularize kwargs can only accept the class ones and will crash for function ones. I just added a note to the docstring, but wanted to raise this in case it is important to support both.

Question on CustomRecording

@alejoe91 @samuelgarcia I wanted to make a quick recording object and set some custom data on it for testing purposes. The result is the small CustomRecording (top of test_whiten,py) which basically exposes a set_data function, and get_traces() just returns the data that was set directly without processing. Is this okay for you both? If so I will factor this out into a separate PR.

@JoeZiminski JoeZiminski force-pushed the extend_whitening_tests branch from 5e91f67 to 19105e2 Compare November 12, 2024 13:45
@JoeZiminski JoeZiminski force-pushed the extend_whitening_tests branch from 19105e2 to 602e6ce Compare November 12, 2024 14:16
@JoeZiminski JoeZiminski force-pushed the extend_whitening_tests branch from dda72da to 8841f3d Compare November 12, 2024 14:52
@JoeZiminski JoeZiminski force-pushed the extend_whitening_tests branch from 8841f3d to 14c83af Compare November 12, 2024 14:54
@@ -223,27 +198,88 @@ def compute_whitening_matrix(
# type and we estimate a more reasonable eps in the case
# where the data is on a scale less than 1.
if eps is None:
eps = 1e-8
if data.dtype.kind == "f":
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the random_chunk -> data is always float, I removed the int16 option, let me know if this makes sense.

Out of interest, what is the purpose of this scaling? will it have much effect if it is capped in the maximum direction by 1e-16?

@JoeZiminski JoeZiminski force-pushed the extend_whitening_tests branch from 0070301 to 1d84e5d Compare November 12, 2024 15:04
@JoeZiminski JoeZiminski requested a review from yger November 12, 2024 15:24
@alejoe91
Copy link
Member

@JoeZiminski for the CustomRecording, wouldn't a NumpyRecording work as well? What would be the added value of the CustomRecording?

@JoeZiminski
Copy link
Collaborator Author

Thanks @alejoe91, there is no added value! I just didn't know NumpyRecording existed 😄 thanks I'll use that

@h-mayorquin
Copy link
Collaborator

The NoiseGeneratorRecording already has the option of passing the covariance matrix:

class NoiseGeneratorRecording(BaseRecording):
"""
A lazy recording that generates white noise samples if and only if `get_traces` is called.
This done by tiling small noise chunk.
2 strategies to be reproducible across different start/end frame calls:
* "tile_pregenerated": pregenerate a small noise block and tile it depending the start_frame/end_frame
* "on_the_fly": generate on the fly small noise chunk and tile then. seed depend also on the noise block.
Parameters
----------
num_channels : int
The number of channels.
sampling_frequency : float
The sampling frequency of the recorder.
durations : list[float]
The durations of each segment in seconds. Note that the length of this list is the number of segments.
noise_levels : float | np.array, default: 1.0
Std of the white noise (if an array, defined by per channels)
cov_matrix : np.array | None, default: None
The covariance matrix of the noise
dtype : np.dtype | str | None, default: "float32"
The dtype of the recording. Note that only np.float32 and np.float64 are supported.
seed : int | None, default: None
The seed for np.random.default_rng.
strategy : "tile_pregenerated" | "on_the_fly", default: "tile_pregenerated"
The strategy of generating noise chunk:
* "tile_pregenerated": pregenerate a noise chunk of noise_block_size sample and repeat it
very fast and cusume only one noise block.
* "on_the_fly": generate on the fly a new noise block by combining seed + noise block index
no memory preallocation but a bit more computaion (random)
noise_block_size : int, default: 30000
Size in sample of noise block.
Notes

Would not that eliminate some of your scaffolding? Is there something that you are missing there?

Also, maybe the following would be useful:
https://docs.pytest.org/en/stable/reference/reference.html#pytest-importorskip

@JoeZiminski
Copy link
Collaborator Author

JoeZiminski commented Nov 13, 2024

Thanks @h-mayorquin! I did look at that but I think the one thing I need which is not there is the option to cast the data to int16, and to pass some custom means. The latter would be easy to extend, the former also I think actually, but would it be useful? If so I could extend the NoiseRecording to take int16 as a dtype and cast the MVN data using this strategy? (unless there is a better one).

Awesome thanks for that importdata skip will check it out! 🤤

@alejoe91 alejoe91 added the preprocessing Related to preprocessing module label Nov 13, 2024
@alejoe91
Copy link
Member

Thanks @h-mayorquin! I did look at that but I think the one thing I need which is not there is the option to cast the data to int16, and to pass some custom means. The latter would be easy to extend, the former also I think actually, but would it be useful? If so I could extend the NoiseRecording to take int16 as a dtype and cast the MVN data using this strategy? (unless there is a better one).

Awesome thanks for that importdata skip will check it out! 🤤

I think that since you need to craft specific sets of traces, you'd be better off using the NumpyRecording directly :)

@h-mayorquin
Copy link
Collaborator

Yeah, I think @alejoe91 suggestion is the simpler one. I would vote for that unless you need considerable memory heavy traces as it would simplify your scaffolding and would not require any other changes.

For the NoiseRecording You could `astype' preprocessing if you need the type of that data (the problem is that we use white noise under the hood and that' can be generated on the fly, I can give you more details on slack if you are curios). Extend NoiseRecording to accept a mean seems like a good thing though, maybe something to do but is outside of the scope of this PR.

@JoeZiminski
Copy link
Collaborator Author

Great thanks both! I went for NumpyRecording in the end as suggested, I think this is ready for review.

@samuelgarcia
Copy link
Member

Hi.
About the apply_mean I think this is important to keep it False.
Traces with spikes are biased because the distribution is not simetric (more neg values) remove the real ean is not the best very often because it is not the middle of the noise.
Also the filtering is already supposed to remove the mean.
Lets discuss this with a call

@JoeZiminski
Copy link
Collaborator Author

Hi @samuelgarcia this is a good point, I will think about this more and we should discuss on call. I will leave some thoughts here as notes.

For sure the mean is biased by the spikes when estimating correlated noise. But, the whole procedure is biased by the spikes, as the covariance is computed over data with the spikes included. To estimate covariance unbiased by the spikes we will need to remove the spikes explicitly at the start of the procedure. Otherwise, there is no way around the fact we are whitening the traces over a covariance that includes both noise and signal (spikes), and including vs. not including the mean does not get around that. The only way would be to cut the spikes first.

As an aside, I wonder whether in kilosort the first paper says the covariance is estimated from traces with spikes removed, but all later versions include the spikes in the covariance estimate. It may be that this was on purpose, so the whitening step not only removes correlated noise but also removes correlated signal i.e. forces spikes waveforms onto a subset of channels. Maybe in the end this is better for the sorter, and is an explicit aim of the whitening step.

The above is slightly philosophical, I think a more convincing point is that later preprocessing steps will re-introduce noise and so the means might no longer be close to zero, as they are after filtering. For example CMR (computed at separate timepoints) and drift correction (interpolate separate time chunks) will add noise to the data across the timeseries of a single channel. So the bias introduced here may outweigh the bias introduced by the spikes.

I think some benchmarking could be useful, mainly:

  1. how white is real-life data with apply_mean=True vs. apply_mean=False. I think simply from the mathematical definition it will be whiter with apply_mean=True. Then maybe the question is : how white then is this data but with spikes removed i.e. just looking at correlated noise? But, I think if that's what we are interested in, we should be removing the spikes at the start.

  2. What are the per-channel means after filter -> CMR -> drift correction?

  3. Even if the mean is bias, it is the least-bad estimate without explicitly removing the spikes. Not removing the mean will just result in an even worse bias such as that introduced by later preprocessing steps. Unless, we compute and remove our own estimation of central tendency. But, I guess this will interact badly with the rest of the covariance computation. We would have to completely reformulate the covariance estimation e.g. median covariance.

@JoeZiminski
Copy link
Collaborator Author

Hey @yger might you have some time to take a look at this? would be great to get your feedback

@yger
Copy link
Collaborator

yger commented Jan 14, 2025

Sorry, I'm only catching up late with this thread. Whitening is indeed a very sensitive issue, since it can improve the overall results, but at the same time easilly be illconditionned and worsen everything. This is why I introduced the possibility to use sklearn.covariance and all the estimators that are built-in to properly estimate regularized covariance matrix. The problem is that they can be fairly slow. As can bee seen here https://scikit-learn.org/1.5/auto_examples/covariance/plot_sparse_cov.html#sphx-glr-auto-examples-covariance-plot-sparse-cov-py however, most of these estimators are better than the naive one we are using, only relying on random snippets of data (with spikes). In KS (and also in spyking-circus), covariance has always been estimated on snippets with spikes. A while ago, we thought about concatenating only snippets with no peaks, assuming they were pure noise, and computing the covariance matrix on such concatenated recording, but it was rather complicated to build and not making so much difference (might depend on the recording). That beeing said, the level of benchmark back then was not the one we are expecting today, so this might be worth re-checking. Happy to discuss everything with a call whenever you want

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preprocessing Related to preprocessing module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants