Indexing fixes, optimization, and etc. for `graphgen` #91

ShaunWeatherly · 2025-01-20T16:53:10Z

A few quick fixes. Perhaps next time just send me a message on slack, @mcocdawc .

Incomplete Initialization

Frag_atom, along with all of the additional attributes generated by autogen and not generated by graphgen, are currently unused in the code. As per your insistence on removing "dead code", I excluded the initialization of these parameters, because it was, and is, unused. I don't see how this could be described as "incomplete initialization", but I'll just assume that you plan to make use of this attribute and I've added it back in at your request.

Fails at Molecule

This is an edge case, but it is a good catch. There was an issue with remove_nonnunique_frags that messed up the indexing when Natm was very large. This led to a number of smaller issues, along with the issue you encountered of some center sites not being appended to the superset fragments. I've fixed the issue and everything seems to be working just fine on my end now, but let me know if you encounter any other abnormalities.

Bad Scaling

Yup, that was expected. graphgen previously found the shortest paths from a center to all other atoms in the system. Obviousuly, finding the shortest path from all-to-all is non-trivial for graphs with 100s of nodes. The fix is straightforward: 1) exclude edges from adjacency_graph if they are below some threshold, and 2) only find the shortest path for neighboring atoms which have an edge weight above that threshold. I've implemented that fix and the threshold can be found as graphgen.cutoff.

Misleading name and interface

Okay, how about this. @mscho527 @lweisburn , if you agree that this function is terribly named, misleading, and must be changed, then I'll change it. If not, then it stays. I prefer this name and I don't find it misleading in the slightest, personally, and frankly I think this type of pedantry is a bit counter productive. But again, I'm open to admitting wrongness if others also find it misleading.

mcocdawc · 2025-01-20T18:10:06Z

graphgen previously found the shortest paths from a center to all other atoms in the system.

Ok, so it went from N^2 scaling to linear one?

As per your insistence on removing "dead code", I excluded the initialization of these parameters, because it was, and is, unused.

I softly suggested to remove commented out code, if you had wanted to keep it it would have been fine as well.
In general I would also remove locally unused code, but values that are returned via a public API cannot known to be unused. I did not and would not suggest to remove them.
Admittedly I missed in my review that there are values missing from the returned code and will pay closer attention to this case in the future.

if you agree that this function is terribly named

I never said it is terribly named. Asking it in this way is a bit of a rhetorical question.
Metric and norm are clearly defined and different mathematical terms.
Using one for the other is misleading.
In the same way as a variable hermitian_M for a matrix that is actually unitary, or a variable name i_AO_idx for an MO index is subtly wrong and dangerously misleading.
(just for reference, here is the previous discussion.

ShaunWeatherly · 2025-01-20T21:35:43Z

@mcocdawc

That should be order constant scaling per fragment (or linear in the number of fragments), just like autogen I believe.
Then I think you need to make your standards for this code more clear, because it's currently impossible to tell what you'll find issue with.
As for the norm discussion...

Again, every norm induces a metric, but not every metric induces a norm. Norms have distinct properties, such as translational invariance. Metrics are more general, and the use of norm here agrees with the more strict properties of a norm. To say the only difference is that "a metric is a function of two vectors whereas a norm is a function of one" is an oversimplification, because you can always just define the origin to be one of the two coordinates, ie. ||x-y|| = d(x,y), such that d(u,0) is both a norm and a metric.

But semantics aside, the function eucidean_norm satisfies your own tests for defining a norm. To recall, you gave the following properties as those defining a metric:

assert euclidean_metric(x, x) == 0
assert euclidean_metric(x, y) == euclidean_metric(y, x)
assert euclidean_metric(x, z) <=  euclidean_metric(y, z) + euclidean_metric(y, z)

whereas a norm would satisfy:

assert norm(x + y) <= norm(x) + norm(y)
assert norm(a * x) <= abs(a) * norm(x) 
assert (norm(x) == 0) == (all(x == 0))

Naturally, euclidean_norm satisfies the properties of a metric, because a norm induces a metric and hence it would satisfy those properties regardless of whether it were a norm or a metric. What's meaningful is if you test euclidean_norm for the properties of a norm:


a = np.random.rand(1, )[0]
x1 = np.random.rand(3, 1)
x2 = np.random.rand(3, 1)
y1 = np.random.rand(3, 1)
y2 = np.random.rand(3, 1)

assert euclidean_norm(x1 + x2, y1 + y2) <= euclidean_norm(x1, y1) + euclidean_norm(x2, y2)
assert euclidean_norm(a * x1, a * y1) <= abs(a) * euclidean_norm(x1, y1) 
assert (euclidean_norm(x1, y1) == 0) == (all(x1 == 0))

You find that by your own tests, it satisfies the properties of a norm. I truly don't understand your insistence on this, ultimately trivial, point, but if you insist on brick walling then feel free to change the name of the function to whatever you like.

mcocdawc · 2025-01-20T22:01:36Z

Then I think you need to make your standards for this code more clear, because it's currently impossible to tell what you'll find issue with.

I don't define the standards for the code alone.
There were only two problems where I explicitly required a change, which I indicated by required in the title.
The misleading naming of euclidean_norm and this here

All other comments were just comments of the kind "I would do it like this".
If you agree, great, if you don't agree it is also fine.

If I see a function call euclidean_norm(x, y), then I assume that it calculates sqrt(x**2 + y**2), i.e. the euclidean norm of a 2D vector with components x and y.¹
If I see a function call euclidean_metric(x, y), then I assume that it calculates euclidean_norm(x - y), i.e. the metric/distance between the vectors x and y.

I think that others also would make the same assumption, hence would get confused by the current naming. So let's just wait for the comment of @mscho527 and @lweisburn , perhaps I am wrong then it stays as it is.

¹ Reinforced by the fact that initially the argument type was declared as scalar floats.

mscho527 · 2025-01-20T23:47:40Z

@ShaunWeatherly and @mcocdawc, it is a bit difficult for me to exactly follow everything since i) multiple issues are discussed at the same time and ii) from what I understand, is based off of what you discussed in-person between you two. Do you want to walk us through in-person? Perhaps during the subgroup tomorrow?

Indexing fixes, optimization, and etc.

47a1c4e

ShaunWeatherly linked an issue Jan 20, 2025 that may be closed by this pull request

issues with graphgen #90

Open

4 tasks

ShaunWeatherly added 2 commits January 20, 2025 12:36

Modify docstrings.

4af68ae

Expose other keywords and update docstrings

f3042cd

ShaunWeatherly marked this pull request as ready for review January 20, 2025 17:52

ShaunWeatherly requested review from mcocdawc, mscho527 and lweisburn January 20, 2025 17:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indexing fixes, optimization, and etc. for `graphgen` #91

Indexing fixes, optimization, and etc. for `graphgen` #91

ShaunWeatherly commented Jan 20, 2025 •

edited

Loading

mcocdawc commented Jan 20, 2025 •

edited

Loading

ShaunWeatherly commented Jan 20, 2025 •

edited

Loading

mcocdawc commented Jan 20, 2025 •

edited

Loading

mscho527 commented Jan 20, 2025

Indexing fixes, optimization, and etc. for graphgen #91

Are you sure you want to change the base?

Indexing fixes, optimization, and etc. for graphgen #91

Conversation

ShaunWeatherly commented Jan 20, 2025 • edited Loading

Incomplete Initialization

Fails at Molecule

Bad Scaling

Misleading name and interface

mcocdawc commented Jan 20, 2025 • edited Loading

ShaunWeatherly commented Jan 20, 2025 • edited Loading

mcocdawc commented Jan 20, 2025 • edited Loading

mscho527 commented Jan 20, 2025

Indexing fixes, optimization, and etc. for `graphgen` #91

Indexing fixes, optimization, and etc. for `graphgen` #91

ShaunWeatherly commented Jan 20, 2025 •

edited

Loading

mcocdawc commented Jan 20, 2025 •

edited

Loading

ShaunWeatherly commented Jan 20, 2025 •

edited

Loading

mcocdawc commented Jan 20, 2025 •

edited

Loading