issues with `graphgen` #90

mcocdawc · 2025-01-19T21:34:49Z

While comparing the output of the different fragmentation codes, i.e. "autogen", "graphgen" and my upcoming "chemgen" (#85) for larger molecules I noticed a few problems with "graphgen".

Incomplete Initialization
Fails at molecule
Bad scaling
Misleading name in code

Incomplete Initialization

The graphgen fragmentation exposes its result via the fragpart class and looks as if it can be used as a replacement for autogen.
Unfortunately a few components are not initialized although the information should available.
graphgen should either initialize them or make it explicit that it is incompletely initialized compared to autogen.

Let's say the user wanted to visualize the fragments and loops over Frag_atom, then it gives unexpected results for graphgen, because it is still an empty list in the case of graphgen.

graph_frags = fragpart(mol=mol, frag_type='graphgen', be_type='be2')
auto_frags = fragpart(mol=mol, frag_type='autogen', be_type='be2')

for fragment in auto_frags.Frag_atom:
    # visualize fragment
 
# The following loop is empty, because `graph_frags.Frag_atom == []`
for fragment in graph_frags.Frag_atom:
    # visualize fragment

Since this is (?) relevant for the mixed basis code. I added @lweisburn and @mscho527 as well to this issue.

Fails at molecule

Testing graphgen at a large (342 atoms) molecule of the project of @lweisburn and @mscho527 fails with.
I cannot append the xyz file here, but will send it to you @ShaunWeatherly

File ~/code/quemb/src/quemb/molbe/autofrag.py:277, in graphgen(mol, be_type, frozen_core, remove_nonunique_frags, frag_prefix, connectivity, iao_valence_basis)
...
File ~/code/quemb/src/quemb/molbe/autofrag.py:277, in graphgen(mol, be_type, frozen_core, remove_nonunique_frags, frag_prefix, connectivity, iao_valence_basis)
    273 # Update relative center site indices (centerf_idx) and weights
    274 # for center site contributions to the energy (ebe_weights):
    275 for adx, center in enumerate(fragment_map.center):
    276     centerf_idx = tuple(
--> 277         set([fragment_map.fsites[adx].index(cdx) for cdx in center])
    278     )
    279     ebe_weight = (1.0, tuple(centerf_idx))
    280     fragment_map.centerf_idx.append(centerf_idx)

ValueError: tuple.index(x): x not in tuple

To reproduce just call on the xyz file

import pyscf

mol = pyscf.gto.M(path, basis="sto-3g", charge=1)
graph_frag_map = graphgen(mol, be_type = 'be2')

Bad scaling

The aforementioned error happens after 36 s, i.e. the fragmentation is not finished after 36 s.
On the same molecule both "autogen" and "chemgen" need less than a second.
Apparently graphgen has a far worse scaling behaviour than the other algorithms since it is equally fast for small molecules such as octane.
If it is documented as a more or less drop-in-replacement, it should preferably have a similar scaling behaviour.
Or it should be documented, that it is a considerably slower algorithm.

Misleading name and interface

Coming back to our discussion from the review.
The interface and name of euclidean_norm is misleading and should be changed.

The arguments are of type Vector and not just arbitrary tensors, (the np.asarray is fully redundant in both cases)
np.floating[Any] is unnecessary from my understanding; it can be just np.floating
But most importantly it is a metric/distance; calling it a norm is misleading. When one sees code such as euclidean_norm(x, y) they could assume it is the norm of one 2D vector with scalar components x, y while in reality it is the distance between vectors x and y.

All in all

def euclidean_norm(
    i_coord: np.ndarray,
    j_coord: np.ndarray,
) -> np.floating[Any]:
    return norm(np.asarray(i_coord - j_coord))

is better named and typed as

def euclidean_metric(i_coord: Vector], j_coord: Vector) -> np.floating:
    return norm(i_coord - j_coord)

The text was updated successfully, but these errors were encountered:

mscho527 · 2025-01-19T21:56:35Z

I haven't taken a closer look at either #85 or #86, so please correct me if I am wrong. Based on the excerpt of conversations I have seen on this issue and elsewhere, here are my 2cts:

There is a scenario where we agree to make it clear in the documentation and the paper that a) graphgen provides an alternative way of automatically generating fragments, b) it excels in some explicitly specified cases, and c) supporting other cases (like the ones @mcocdawc mentioned in this issue) with graphgen will happen later. Here, I think we can just mark this issue as 'will fix later' and move on for now.
Otherwise, if we want to brand it as a drop-in alternative, I think we should definitely fix the 'incomplete initialization' bit and the 'bad scaling' bit. I am okay with graphgen failing for mixed-basis, since this code lives in a separate repository anyway. @lweisburn and I can just make a note there that using autogen is a necessity.

mcocdawc added bug Something isn't working documentation Improvements or additions to documentation maintenance things that should make our life easier in the future labels Jan 19, 2025

mcocdawc added this to the v0.1 milestone Jan 19, 2025

mcocdawc assigned ShaunWeatherly Jan 19, 2025

ShaunWeatherly linked a pull request Jan 20, 2025 that will close this issue

Indexing fixes, optimization, and etc. for graphgen #91

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

issues with `graphgen` #90

issues with `graphgen` #90

mcocdawc commented Jan 19, 2025 •

edited

Loading

mscho527 commented Jan 19, 2025

issues with graphgen #90

issues with graphgen #90

Comments

mcocdawc commented Jan 19, 2025 • edited Loading

Incomplete Initialization

Fails at molecule

Bad scaling

Misleading name and interface

mscho527 commented Jan 19, 2025

issues with `graphgen` #90

issues with `graphgen` #90

mcocdawc commented Jan 19, 2025 •

edited

Loading