
Incorrect RDF normalization #396

Closed
mphoward opened this issue Aug 22, 2019 · 20 comments · Fixed by #529
@mphoward (Collaborator)

Describe the bug
The radial distribution function g(r) that I accumulated for a trajectory does not have the correct limiting value. The system is dilute and so g -> 1 for large r, but freud's result is consistently smaller. I recomputed g using VMD and obtained the behavior I physically expected. I can rescale the freud result onto VMD's by multiplying by a factor of ~1.05, suggesting a possible normalization issue. (This factor could be specific to my system though: 20 particles, cubic box with L = 50, 1000 frames.)

To Reproduce
Minimum reproducer script (gr.py) and a sample trajectory (cm.gsd) are attached below. The script reads the trajectory and accumulates g(r) with rmax = 20 and dr = 0.2. The freud output (freud_gr.dat) should match the VMD output (vmd_gr.dat), and both should approach 1.

gr_repro.tar.gz
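For reference, a sketch of the accumulation loop (written against freud's current API for illustration; the attached gr.py targets freud 1.2.2, whose API differed, and gsd is assumed for reading the trajectory):

```python
# Sketch of the accumulation loop; the attached gr.py is the actual reproducer.
import freud
import gsd.hoomd
import numpy as np

r_max, dr = 20.0, 0.2
rdf = freud.density.RDF(bins=int(r_max / dr), r_max=r_max)

with gsd.hoomd.open("cm.gsd", "r") as traj:
    for frame in traj:
        box = freud.box.Box.from_box(frame.configuration.box)
        # reset=False accumulates the histogram over all frames
        rdf.compute(system=(box, frame.particles.position), reset=False)

np.savetxt("freud_gr.dat", np.column_stack([rdf.bin_centers, rdf.rdf]))
```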

Error output
[Attached plot: comparison of freud and VMD g(r) output]

System configuration

  • OS: Ubuntu 16
  • python 3.7.4
  • freud 1.2.2
    (I confirmed this on other system configurations too.)
@bdice (Member) commented Aug 22, 2019

@mphoward Thanks for reporting this issue. We'll investigate.

@mphoward (Collaborator, Author)

@bdice thanks! I think I had a similar issue myself when I wrote this plugin code:

https://github.com/mphoward/azplugins/blob/master/azplugins/RDFAnalyzer.cc

It was a while ago, but I think there was some problem with how the loop over pairs was done and with counting which pairs to remove from the normalization. I had some issue looping over 0 <= i < N and i < j < N that I couldn't figure out, so I gave up and did two full loops. That counting problem was most pronounced for small N.

Not sure if that is relevant to freud's implementation, but wanted to let you know in case it's helpful.

@mphoward (Collaborator, Author)

I suspect the difference between the two programs is here:

m_pcf_array.get()[i] = m_avg_counts.get()[i] / m_vol_array.get()[i] / ndens;

Since there are only N-1 pairs per particle (and not N) when ref_points and points are the same, ndens is computed differently from VMD here:

float ndens = float(m_n_p) / m_box.getVolume();

via here

m_n_p = n_p;

The rescaling factor I fit by eye is pretty much N/(N-1) = 1.053 for N = 20. (This would be negligible for N >> 100.)

This does make me wonder if this is just a difference in definition for small, finite systems. Technically, g(r) for an N-particle ideal gas is 1 - 1/N if it is defined from the two-body correlation function, so the freud result is correct (the large-r limit should be 0.95, not 1, for N = 20). But VMD and also LAMMPS use the normalization convention based on pairs, presumably because you are not counting the self pair for each particle and are interested in the thermodynamic limit of an infinitely replicated system (where it shouldn't matter that you throw out exactly N pairs).
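As a quick numeric check of the two conventions:

```python
# Numeric check of the two conventions for N = 20.
N = 20
print(1 - 1 / N)    # 0.95   -> large-r plateau under freud's N*N normalization
print(N / (N - 1))  # 1.0526 -> the ~1.05 rescaling factor fit by eye
```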

Thoughts on how this fits your philosophy?

@vyasr (Collaborator) commented Aug 22, 2019

I think you are correct that it appears to be a definition issue. In the thermodynamic limit the two results are identical, but for the finite systems we're calculating they differ by that factor of N/(N-1). To be honest, I'm somewhat agnostic about the choice we make, as long as it's properly documented (if we choose not to change the code, I definitely think we need to make clear in the docs that users should expect that scaling factor). I think it's probably a bit more precise not to change it, but users could certainly be surprised by the behavior for small N.

I'd like to get @joaander's take on this before making any final changes, since I believe he either wrote or oversaw the creation of this implementation. He's out of town until after Labor Day, unfortunately, but if this is holding you back, let us know and we can at least create a fix ASAP and hold off on a final decision until he returns.

@mphoward (Collaborator, Author)

Now that I've looked at the code and think I understand the discrepancy, I'm happy enough to multiply my results by N/(N-1) if necessary while y'all decide on what you think makes the most sense once Josh is back.

Either way, it would be great if you could add an equation to the docs showing how freud defines and normalizes the RDF, and possibly add a sentence or two of discussion on potential differences (or similarities) with other tools for doing the calculation. (Something like mdanalysis.analysis.rdf... which makes me wonder which result that equation returns, since it is normalized by N*N but also allows i = j in the sum.)

A nice solution might be to make this normalization an option when ref_points and points are the same (like ddof is for numpy.var). For backwards compatibility, or for different ref_points and points arrays, you could set the default so that it uses the current normalization.
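For concreteness, a hypothetical sketch (the keyword name here is illustrative only; see the normalize flag discussed later in the thread):

```python
# Hypothetical API sketch; the keyword name is illustrative only.
rdf = freud.density.RDF(bins=100, r_max=20.0, normalize=False)  # current N*N convention
rdf = freud.density.RDF(bins=100, r_max=20.0, normalize=True)   # rescale by N/(N-1)
```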

@vyasr (Collaborator) commented Aug 22, 2019

I like the idea of a normalization flag, what would you name it? I would definitely like to add an equation or two to the documentation, perhaps pointing at Wikipedia's RDF description as well for more explanation.

I would probably prefer to leave the default as is, to at least make users think about it before requesting the normalization (since for small N our definition is exactly what you get from the pair correlation function); would you agree that it makes sense?

@mphoward (Collaborator, Author)

I'm not quite sure what I would call it... I guess it should probably be a bool that toggles the modes, something like your exclude_ii, but that name doesn't make sense here.

I agree that it is cleanest to leave the current behavior as default.

@vyasr vyasr self-assigned this Oct 17, 2019
@vyasr vyasr added this to the v2.0 final milestone Oct 17, 2019
@vyasr vyasr added density enhancement New feature or request labels Oct 17, 2019
@vyasr vyasr mentioned this issue Oct 17, 2019
@vyasr (Collaborator) commented Oct 17, 2019

Hey @mphoward, I've opened PR #529 to fix this. Sorry it took us so long; we've been in the middle of a thorough rewrite and this issue fell by the wayside. I added you as a collaborator on freud so that you can review; feel free to propose any changes.

@vyasr (Collaborator) commented Jul 21, 2020

As part of glotzerlab/freud-examples#19 @jennyfothergill has expressed interest in resolving this issue for the general case of pairs from two arbitrary subsets of the system, i.e. points != query_points. We chose to hold off on this change in #529 due to a lack of interest, but now that we have a use case it's worth revisiting. The basic idea (as discussed in #529) would be that instead of multiplying by a constant prefactor of (N-1)/N to normalize the RDF, we maintain a count of the total number of pairs for a given selection and use that instead in the normalization.

@vyasr vyasr reopened this Jul 21, 2020
@jennyfothergill

Hi @vyasr, thank you for your help! I'm not sure exactly where to start... I'm looking at https://github.com/glotzerlab/freud/blob/master/cpp/density/RDF.cc, is that right?
I am not familiar with C++, so to me it looks like the normalization happens here

number_density *= static_cast<float>(m_n_query_points - 1) / (m_n_query_points);

We want to add a check for points != query_points and, in that case, change the normalization to N_points * N_query_points. (Equation 3 from http://www-s.ks.uiuc.edu/Research/gpu/files/rdfgpu/rdfgpu.pdf)
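Something like this sketch, maybe (illustrative numpy only, not freud code; the pair distances would come from a neighbor search between the two sets, for a single frame):

```python
# Sketch of Eq. 3: g_ab(r) = V * H_ab(r) / (N_a * N_b * V_shell(r)).
import numpy as np

def cross_rdf(pair_distances, n_a, n_b, volume, r_max=20.0, dr=0.2):
    edges = np.arange(0.0, r_max + dr, dr)
    hist, _ = np.histogram(pair_distances, bins=edges)
    shell_volumes = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    # Normalize the per-shell pair counts by the ideal-gas expectation.
    return volume * hist / (n_a * n_b * shell_volumes)
```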
Does this sound about right?

@vyasr vyasr assigned vyasr and jennyfothergill and unassigned vyasr Jul 23, 2020
@vyasr (Collaborator) commented Jul 23, 2020

Yes, that's where the normalization is currently happening. However, I'm not sure if your suggestion is exactly right. IIRC there's a prefactor that's being computed just above this number density (literally called prefactor in the code I think :)) and it encodes a factor of the number of points as well.

I'm not sure how familiar you are with the theory of the RDF, so here's some background just in case. Rigorously, the pair correlation function should include a V/(N*(N-1)) prefactor (N/V being the ideal gas density). Most people are accustomed to g(r) -> 1 in the r -> infinity limit. With a V/N^2 prefactor, this only holds for large N (the thermodynamic limit), where V/(N*(N-1)) is essentially equal to V/N^2; for very small N, g(r) instead approaches (N-1)/N. However, since this is not what people are typically accustomed to, the existing normalization flag basically shifts it back to 1.

I'm actually not quite sure what the most general case we should handle is. @joaander I don't think there's any meaningful way to do this in the context of a custom NeighborList, is there? It's one thing if you pass two distinct sets of points (points != query_points), because in that context you can analytically compute the total number of pairs available for normalization. However, if you use query arguments or a custom NeighborList, does normalizing even make sense? What would the normalization factor be? And zooming out, does it make sense scientifically to try to normalize something at that point if you're restricting to arbitrary pairs, in which case it may not really be an RDF? Basically, if you specify two selections, it's possible to compute the ideal gas limiting density. But if you only include specific (potentially arbitrary) subsets of the two selections (whether or not they're equal), what should we do?

@joaander (Member)

VMD can solve this in a general way because it computes the RDF between two arbitrary subsets of particles and thus can compute the proper normalization based on the intersection of those two sets. In freud's case (by design), we don't know the properties of those sets, or even whether the two sets of points came from the same set of particles (maybe they are reference positions on a grid). The user is going to need to supply that information in some way. We could either provide unnormalized output (or output with a given well-defined normalization) and ask the user to normalize it, or we could allow the user to set the prefactor as a property and document how to compute that prefactor for common cases such as those outlined in the VMD paper.

@mphoward (Collaborator, Author)

I don't think there's a need to set or apply a prefactor if the code is not able to do the exclusion between arbitrary sets of particles anyway. Right now, I think you are only able to exclude i == i pairs when points == query_points, and when points != query_points you are always going to get N_pts * N_query pairs, right? If so, everything works as is.

For the more general case, you would need to be able to give a set of labels to both points and query_points and exclude any pairs that have the same label during the RDF averaging. Then the normalization that makes g(r) go to 1 is defined using the pair density (N_pts * N_query - N_overlaps)/V. When the two sets are disjoint, you get back N_pts * N_query pairs, and when they are the same, you get N(N-1) pairs, which reduces to the normalization fix you are using. For more complex selections, you get something else based on how many overlaps there are.
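A sketch of that counting, assuming each particle carries a unique label (illustrative only, not freud API):

```python
# Count valid pairs given per-particle labels, assuming labels are unique IDs.
import numpy as np

def n_valid_pairs(labels_points, labels_query):
    n_total = len(labels_points) * len(labels_query)
    # Same-label pairs are the self overlaps excluded from the average.
    n_overlaps = np.intersect1d(labels_points, labels_query).size
    return n_total - n_overlaps

print(n_valid_pairs(np.arange(10), np.arange(10, 30)))  # disjoint sets: 200
print(n_valid_pairs(np.arange(20), np.arange(20)))      # identical sets: 380 = N*(N-1)
```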

@vyasr (Collaborator) commented Jul 27, 2020

I agree that there's an important distinction here. The concept of exclusion is simply implemented differently in freud. In VMD you have a complete system, then you subset this system to generate the two sets, whereas in freud you would generate the two subsets by subsetting your arrays appropriately before passing them into freud. If the user passes in two sets of particles that are not mutually exclusive (the use case @mphoward is mentioning), freud has no way of knowing that, and the user has to rescale themselves to get the RDF normalization to look as expected. The case I was considering was where you pass a custom NeighborList, which has similar problems because it could include arbitrary exclusions.

I think the only cases that freud can reasonably handle within its data model are the two simplest cases: two disjoint sets of points, and two identical sets of points (with or without exclude_ii). In all other cases, I think we simply have to document the appropriate normalization for the user and ask them to do it. If they pass a custom set of query arguments and they want a normalized calculation, they can simply construct the NeighborList manually, count the number of pairs, then pass it in as the neighbors argument to get the same RDF behavior while also having the necessary counts for normalization. In that case, they will also need to undo the default normalization, which we can also document (but I think we would want to leave that in as the default behavior since it's the most common usage). @joaander @mphoward does that sound good?
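A sketch of that recipe with the current locality API (the inputs here are hypothetical, and the final rescaling is only indicated):

```python
# Build the NeighborList explicitly so its pair count is available, then
# reuse it for the RDF. Inputs are hypothetical.
import freud
import numpy as np

box = freud.box.Box.cube(50.0)
rng = np.random.default_rng(0)
points = box.wrap(rng.uniform(-25.0, 25.0, size=(20, 3)))
query_points = points  # or any distinct selection

aq = freud.locality.AABBQuery(box, points)
nlist = aq.query(query_points, {"r_max": 20.0, "exclude_ii": True}).toNeighborList()
n_pairs = len(nlist.distances)  # pair count available for manual normalization

rdf = freud.density.RDF(bins=100, r_max=20.0)
rdf.compute(system=(box, points), query_points=query_points, neighbors=nlist)
# rdf.rdf can now be rescaled with n_pairs to undo/replace the default
# normalization, as described above.
```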

@mphoward (Collaborator, Author)

I think this is very reasonable. One thing that could perhaps be done as a convenience (though I'm not sure it's any easier for the user) is to optionally return the count of pairs whose distances are identically zero (or close to zero within some tolerance). That count can then be related to the factor the user will need to correct the normalization.

Or, you could do as Josh suggested, and allow the user to provide the number of overlaps themselves, and then apply the normalization. I think all of this sort of falls into an "advanced" usage category, so I'm on board with any solution that's well documented.
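Continuing the NeighborList sketch above, that count could be as simple as (the tolerance is arbitrary):

```python
# Count coincident pairs via near-zero distances in the NeighborList above.
import numpy as np

n_overlaps = int(np.count_nonzero(nlist.distances < 1e-6))
# The corrected pair count for the normalization is then n_pairs - n_overlaps.
```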

@vyasr (Collaborator) commented Jul 28, 2020

I think my preference would be to provide recipes for how a user could accomplish this rather than incorporating it into the API. There are enough ways we could do this, each handling different special cases, that I think it would be more maintainable in the long term to just provide unnormalized output (except perhaps in the two special cases discussed above) and maintain a list of examples showing how you could, for instance, construct a NeighborList and count the number of self-interactions prior to the calculation. We've been working to buff up the examples anyway, so adding short copy-pasteable recipes to our documentation on how to handle this seems like the easiest approach.

@bdice do you have any thoughts here?
@jennyfothergill does this discussion make sense to you?

@jennyfothergill

@vyasr hm, I think I'm following, but let's make sure: it sounds like for the example I'm working on (glotzerlab/freud-examples#19) I should show how to normalize an RDF that uses a custom NeighborList after its creation, as opposed to changing anything in the C++ code. This could be done by using the points and query_points attributes of the NeighborList to count the number of pairs, as @mphoward suggested. Then the normalization would be N_pairs/V.
Does that sound about right?

@vyasr (Collaborator) commented Jul 31, 2020

Yes, that's correct regarding your example. However, I think whatever you do there needs to be documented in the RDF docstring as well, so to close this issue I think we need a PR that updates the docstring to explain how normalize works, in what cases users need to normalize manually, and what the normalization factor needs to be. That PR should also update the existing normalization to work in the case that points != query_points (it can assume that the two sets of points are distinct).

@vyasr (Collaborator) commented May 1, 2021

We should aim to resolve this along with #635 for freud 3.0

@vyasr vyasr modified the milestones: v2.0 final, v3.0.0 May 1, 2021
@tommy-waltmann (Collaborator)

Closing this issue, as we now extensively document the normalization we use and the conditions under which that normalization is appropriate. Feel free to reopen if someone wants to add further documentation on the proper normalization when there are overlapping points in the sets of points and query_points.
