Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing fixes, optimization, and etc. for graphgen #91

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ShaunWeatherly
Copy link
Contributor

@ShaunWeatherly ShaunWeatherly commented Jan 20, 2025

A few quick fixes. Perhaps next time just send me a message on slack, @mcocdawc .

Incomplete Initialization

  • Frag_atom, along with all of the additional attributes generated by autogen and not generated by graphgen, are currently unused in the code. As per your insistence on removing "dead code", I excluded the initialization of these parameters, because it was, and is, unused. I don't see how this could be described as "incomplete initialization", but I'll just assume that you plan to make use of this attribute and I've added it back in at your request.

Fails at Molecule

  • This is an edge case, but it is a good catch. There was an issue with remove_nonnunique_frags that messed up the indexing when Natm was very large. This led to a number of smaller issues, along with the issue you encountered of some center sites not being appended to the superset fragments. I've fixed the issue and everything seems to be working just fine on my end now, but let me know if you encounter any other abnormalities.

Bad Scaling

  • Yup, that was expected. graphgen previously found the shortest paths from a center to all other atoms in the system. Obviousuly, finding the shortest path from all-to-all is non-trivial for graphs with 100s of nodes. The fix is straightforward: 1) exclude edges from adjacency_graph if they are below some threshold, and 2) only find the shortest path for neighboring atoms which have an edge weight above that threshold. I've implemented that fix and the threshold can be found as graphgen.cutoff.

Misleading name and interface

  • Okay, how about this. @mscho527 @lweisburn , if you agree that this function is terribly named, misleading, and must be changed, then I'll change it. If not, then it stays. I prefer this name and I don't find it misleading in the slightest, personally, and frankly I think this type of pedantry is a bit counter productive. But again, I'm open to admitting wrongness if others also find it misleading.

@ShaunWeatherly ShaunWeatherly linked an issue Jan 20, 2025 that may be closed by this pull request
4 tasks
@ShaunWeatherly ShaunWeatherly marked this pull request as ready for review January 20, 2025 17:52
@mcocdawc
Copy link
Contributor

mcocdawc commented Jan 20, 2025

graphgen previously found the shortest paths from a center to all other atoms in the system.

Ok, so it went from N^2 scaling to linear one?

As per your insistence on removing "dead code", I excluded the initialization of these parameters, because it was, and is, unused.

I softly suggested to remove commented out code, if you had wanted to keep it it would have been fine as well.
In general I would also remove locally unused code, but values that are returned via a public API cannot known to be unused. I did not and would not suggest to remove them.
Admittedly I missed in my review that there are values missing from the returned code and will pay closer attention to this case in the future.

if you agree that this function is terribly named

I never said it is terribly named. Asking it in this way is a bit of a rhetorical question.
Metric and norm are clearly defined and different mathematical terms.
Using one for the other is misleading.
In the same way as a variable hermitian_M for a matrix that is actually unitary, or a variable name i_AO_idx for an MO index is subtly wrong and dangerously misleading.
(just for reference, here is the previous discussion.

@ShaunWeatherly
Copy link
Contributor Author

ShaunWeatherly commented Jan 20, 2025

@mcocdawc

  1. That should be order constant scaling per fragment (or linear in the number of fragments), just like autogen I believe.
  2. Then I think you need to make your standards for this code more clear, because it's currently impossible to tell what you'll find issue with.
  3. As for the norm discussion...

Again, every norm induces a metric, but not every metric induces a norm. Norms have distinct properties, such as translational invariance. Metrics are more general, and the use of norm here agrees with the more strict properties of a norm. To say the only difference is that "a metric is a function of two vectors whereas a norm is a function of one" is an oversimplification, because you can always just define the origin to be one of the two coordinates, ie. ||x-y|| = d(x,y), such that d(u,0) is both a norm and a metric.

But semantics aside, the function eucidean_norm satisfies your own tests for defining a norm. To recall, you gave the following properties as those defining a metric:

assert euclidean_metric(x, x) == 0
assert euclidean_metric(x, y) == euclidean_metric(y, x)
assert euclidean_metric(x, z) <=  euclidean_metric(y, z) + euclidean_metric(y, z)

whereas a norm would satisfy:

assert norm(x + y) <= norm(x) + norm(y)
assert norm(a * x) <= abs(a) * norm(x) 
assert (norm(x) == 0) == (all(x == 0))

Naturally, euclidean_norm satisfies the properties of a metric, because a norm induces a metric and hence it would satisfy those properties regardless of whether it were a norm or a metric. What's meaningful is if you test euclidean_norm for the properties of a norm:


a = np.random.rand(1, )[0]
x1 = np.random.rand(3, 1)
x2 = np.random.rand(3, 1)
y1 = np.random.rand(3, 1)
y2 = np.random.rand(3, 1)

assert euclidean_norm(x1 + x2, y1 + y2) <= euclidean_norm(x1, y1) + euclidean_norm(x2, y2)
assert euclidean_norm(a * x1, a * y1) <= abs(a) * euclidean_norm(x1, y1) 
assert (euclidean_norm(x1, y1) == 0) == (all(x1 == 0))

You find that by your own tests, it satisfies the properties of a norm. I truly don't understand your insistence on this, ultimately trivial, point, but if you insist on brick walling then feel free to change the name of the function to whatever you like.

@mcocdawc
Copy link
Contributor

mcocdawc commented Jan 20, 2025

Then I think you need to make your standards for this code more clear, because it's currently impossible to tell what you'll find issue with.

I don't define the standards for the code alone.
There were only two problems where I explicitly required a change, which I indicated by required in the title.
The misleading naming of euclidean_norm and this here

All other comments were just comments of the kind "I would do it like this".
If you agree, great, if you don't agree it is also fine.


If I see a function call euclidean_norm(x, y), then I assume that it calculates sqrt(x**2 + y**2), i.e. the euclidean norm of a 2D vector with components x and y
If I see a function call euclidean_metric(x, y), then I assume that it calculates euclidean_norm(x - y), i.e. the metric/distance between the vectors x and y.

I think that others also would make the same assumption, hence would get confused by the current naming. So let's just wait for the comment of @mscho527 and @lweisburn , perhaps I am wrong then it stays as it is.

¹ Reinforced by the fact that initially the argument type was declared as scalar floats.

@mscho527
Copy link
Member

@ShaunWeatherly and @mcocdawc, it is a bit difficult for me to exactly follow everything since i) multiple issues are discussed at the same time and ii) from what I understand, is based off of what you discussed in-person between you two. Do you want to walk us through in-person? Perhaps during the subgroup tomorrow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

issues with graphgen
3 participants