An issue regarding the use of LAMMPS with the deep potential trained using MACE #44

Open
wxwth opened this issue Dec 10, 2024 · 5 comments

Comments

wxwth commented Dec 10, 2024

Dear Dr. Zeng,

I am sorry to bother you with an issue about running LAMMPS (the build installed with DeePMD-kit 3.0.0) with a deep potential trained using MACE as implemented in DeePMD-kit 3.0.0.

Following the detailed guidance provided by Bohrium and the WeChat post by DeepModeling, I added the following command in my job submission script for LAMMPS:

export DP_PLUGIN_PATH=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/libdeepmd_gnn.so

The file libdeepmd_gnn.so does exist in the directory /gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/. Additionally, I successfully trained a deep potential using MACE in DeePMD-kit 3.0.0, which suggests that the DeePMD-kit installation itself is working.

However, an error occurs, stating that the libdeepmd_gnn.so file cannot be found:

ERROR on proc 28: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: DeePMD-kit Error: /gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/libdeepmd_gnn.so is not found! You can add the library directory to LD_LIBRARY_PATH (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1732355244818/work/source/lmp/pair_deepmd.cpp:539)
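One way to see the dynamic loader's own diagnostic, rather than the generic "is not found" text, is to try loading the file directly. The following is a minimal sketch, not part of DeePMD-kit: the helper name `try_dlopen` is hypothetical, and it relies only on the fact that `ctypes.CDLL` calls `dlopen()` on POSIX systems and raises `OSError` with the loader's message on failure. The result in a bare Python process may differ from what LAMMPS sees, since the two processes can have different libraries already loaded.

```python
import ctypes
from typing import Optional


def try_dlopen(path: str) -> Optional[str]:
    """Return None if `path` loads, otherwise the dynamic loader's error text."""
    try:
        # On POSIX, ctypes.CDLL wraps dlopen(); a failure surfaces as OSError
        # whose message is the loader's own diagnostic.
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError as exc:
        return str(exc)
    return None
```

Calling `try_dlopen` with the path stored in `DP_PLUGIN_PATH` (assuming it holds a single path, as in this report) would return either `None` or a message naming the file the loader actually failed to open, which may be a dependency rather than the plugin itself.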

I requested someone else to submit the job on another cluster, but they encountered the same error. My job submission script for LAMMPS is attached for reference. Could you kindly provide any expert suggestions to resolve this issue?

Thank you for your time, and I am looking forward to your reply.

Sincerely,
Xu

myjob.sh.txt

njzjz (Member) commented Dec 10, 2024

Setting the environment variable LD_DEBUG=libs would print more helpful information.
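As a rough sketch of how such a trace could be collected, the helper below runs a command with `LD_DEBUG=libs` and saves the loader output to a file. The function name `run_with_ld_debug` and the log file name are hypothetical; the sketch assumes a glibc-based Linux system, where the trace is written to stderr.

```python
import os
import subprocess
from typing import List


def run_with_ld_debug(cmd: List[str], log_path: str) -> int:
    """Run `cmd` with LD_DEBUG=libs and save the loader trace to `log_path`."""
    # Copy the current environment and add LD_DEBUG for the child process only.
    env = dict(os.environ, LD_DEBUG="libs")
    # glibc writes the LD_DEBUG trace to stderr, so only stderr is redirected;
    # stdout passes through untouched.
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, env=env, stderr=log)
    return result.returncode
```

For instance, `run_with_ld_debug(["lmp", "-in", "in.lammps"], "ld_debug.log")` would trace a LAMMPS run, where the command line is a placeholder for the one in the job script. Running on a single process keeps the traces of different ranks from interleaving.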

wxwth (Author) commented Dec 11, 2024

Dear Dr. Zeng,

Thank you for your reply. Enclosed is the output file generated with the environment variable LD_DEBUG=libs. I am sorry that I cannot find the solution to this issue. Could you kindly take a look? It should be noted that this simulation was performed on a single core to avoid redundant output. Thank you for your time!

Sincerely,
Xu

slurm-1937867.out.txt

njzjz (Member) commented Dec 11, 2024

It seems that libcuda.so.1 cannot be found: it is the last library the loader searches for before the error is thrown. The error message could be improved, though.

3463222:	find library=libcuda.so.1 [0]; searching
   3463222:	 search path=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/tls/x86_64/x86_64:/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/tls/x86_64:/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/tls/x86_64:/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/tls:/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/x86_64/x86_64:/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/x86_64:/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/x86_64:/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib		(RPATH from file /gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/libdeepmd_gnn.so)
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/tls/x86_64/x86_64/libcuda.so.1
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/tls/x86_64/libcuda.so.1
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/tls/x86_64/libcuda.so.1
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/tls/libcuda.so.1
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/x86_64/x86_64/libcuda.so.1
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/x86_64/libcuda.so.1
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/x86_64/libcuda.so.1
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/lib/python3.12/site-packages/deepmd_gnn/lib/libcuda.so.1
   3463222:	 search path=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/bin/../lib		(RPATH from file lmp)
   3463222:	  trying file=/gpfs/software/deepmd-kit-3.0.0-cuda12.6_gpu/bin/../lib/libcuda.so.1
   3463222:	 search path=/gpfs/softs/intel/oneapi/mpi/2021.2.0/libfabric/lib:/gpfs/softs/intel/oneapi/mpi/2021.2.0/lib/release:/gpfs/softs/intel/oneapi/mpi/2021.2.0/lib:/gpfs/softs/intel/oneapi/mkl/2021.2.0/lib/intel64:/gpfs/softs/intel/oneapi/tbb/2021.2.0/lib/intel64/gcc4.8:/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/lib:/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/lib/x64:/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/lib/emu:/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64_lin:/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64:/gpfs/softs/intel/oneapi/debugger/10.1.1/dep/lib:/gpfs/softs/intel/oneapi/debugger/10.1.1/gdb/intel64/lib		(LD_LIBRARY_PATH)
   3463222:	  trying file=/gpfs/softs/intel/oneapi/mpi/2021.2.0/libfabric/lib/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/mpi/2021.2.0/lib/release/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/mpi/2021.2.0/lib/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/mkl/2021.2.0/lib/intel64/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/tbb/2021.2.0/lib/intel64/gcc4.8/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/lib/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/lib/x64/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/lib/emu/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64_lin/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/compiler/2021.2.0/linux/compiler/lib/intel64/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/debugger/10.1.1/dep/lib/libcuda.so.1
   3463222:	  trying file=/gpfs/softs/intel/oneapi/debugger/10.1.1/gdb/intel64/lib/libcuda.so.1
   3463222:	 search cache=/etc/ld.so.cache
   3463222:	 search path=/lib64/tls:/lib64:/usr/lib64/tls:/usr/lib64		(system search path)
   3463222:	  trying file=/lib64/tls/libcuda.so.1
   3463222:	  trying file=/lib64/libcuda.so.1
   3463222:	  trying file=/usr/lib64/tls/libcuda.so.1
   3463222:	  trying file=/usr/lib64/libcuda.so.1
   3463222:
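Reading a long trace by eye is error-prone, so the search can be summarised programmatically. The sketch below is an illustration only: the function names are hypothetical, and it assumes the `find library=` / `trying file=` line format shown above and a trace from a single process. It groups the candidate paths by requested library and reports the libraries for which none of the tried paths exists on the machine where the script runs.

```python
import os
import re
from typing import Callable, Dict, List, Optional

FIND_RE = re.compile(r"find library=(\S+)")
TRY_RE = re.compile(r"trying file=(\S+)")


def candidates_by_library(trace: str) -> Dict[str, List[str]]:
    """Map each requested library to the candidate paths the loader tried."""
    tried: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in trace.splitlines():
        found = FIND_RE.search(line)
        if found:
            current = found.group(1)
            tried.setdefault(current, [])
            continue
        attempt = TRY_RE.search(line)
        if attempt and current is not None:
            tried[current].append(attempt.group(1))
    return tried


def unresolved(
    trace: str, exists: Callable[[str], bool] = os.path.exists
) -> List[str]:
    """List libraries for which none of the tried paths exists."""
    return [
        lib
        for lib, paths in candidates_by_library(trace).items()
        if paths and not any(exists(p) for p in paths)
    ]
```

The `exists` parameter defaults to `os.path.exists`, so the check is only meaningful on the node that produced the trace; passing a different callable makes the function usable on a log copied from elsewhere.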

wxwth (Author) commented Dec 11, 2024

Dear Dr. Zeng,

Thank you for your reply. I am confused about the requirement for the libcuda.so.1 file, as my job submission script indicates that I am running LAMMPS simulations using only the CPU, not the GPU.

Besides, I can only find the libcuda.so file in the /gpfs/softs/cuda/12.6.2/targets/x86_64-linux/lib/stubs/ directory of my cluster; the libcuda.so.1 file is not present anywhere in the corresponding CUDA-12.6.2 directory.

Could you kindly clarify if it is still possible to perform the LAMMPS simulations (with the deep potential trained using MACE implemented in DeePMD-kit 3.0.0) under these conditions? Thank you for your time!

Sincerely,
Xu

njzjz (Member) commented Dec 11, 2024

Does your PyTorch link to libcuda.so.1? libdeepmd_gnn.so does not explicitly link to CUDA.
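One way to answer this question is to inspect the dependency list that `ldd` prints for a shared object. The sketch below is only an illustration: the function names are hypothetical, and which file to inspect (the plugin, or one of the PyTorch shared libraries in the installation) is left to the caller. Note that `ldd` resolves dependencies transitively, so a "not found" entry may come from an indirect dependency rather than from the file itself.

```python
import subprocess
from typing import Dict, Optional


def parse_ldd(output: str) -> Dict[str, Optional[str]]:
    """Map each dependency name to its resolved path, or None if not found."""
    deps: Dict[str, Optional[str]] = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # skip entries such as the vDSO or the loader itself
        name, _, target = line.partition("=>")
        target = target.strip()
        if target.startswith("not found"):
            deps[name.strip()] = None
        else:
            # Drop the trailing load address, e.g. "(0x00007f...)".
            deps[name.strip()] = target.split(" (")[0]
    return deps


def ldd(path: str) -> Dict[str, Optional[str]]:
    """Run `ldd` on a shared object and parse its dependency list."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return parse_ldd(result.stdout)
```

If the returned mapping contains `"libcuda.so.1"` with value `None`, the loader cannot resolve the CUDA driver library for that file in the current environment.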

njzjz added a commit that referenced this issue Dec 11, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Dec 22, 2024
github-merge-queue bot pushed a commit to deepmodeling/deepmd-kit that referenced this issue Dec 22, 2024
xref: deepmodeling/deepmd-gnn#44

## Summary by CodeRabbit

- **New Features**
  - Enhanced error messages for library loading failures on non-Windows platforms.
  - Updated thread management environment variable checks for improved compatibility.
  - Added support for mixed types in tensor input handling, allowing for more flexible configurations.
- **Bug Fixes**
  - Improved error reporting for dynamic library loading issues.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Dec 22, 2024
(cherry picked from commit cfe17a3)
njzjz added a commit to deepmodeling/deepmd-kit that referenced this issue Dec 23, 2024
(cherry picked from commit cfe17a3)
iProzd added a commit to iProzd/deepmd-kit that referenced this issue Dec 24, 2024
* change property.npy to any name

* Init branch

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* change | to Union

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* change sub_var_name default to []

* Solve pre-commit

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* solve scanning github

* fix UT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete useless file

* Solve some UT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Solve precommit

* slove pre

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Solve dptest UT, dpatomicmodel UT, code scannisang

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete param  and

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Solve UT fail caused by task_dim and property_name

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix UT

* Fix UT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix UT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix permutation error

* Add property bias UT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* recover rcond doc

* recover blank

* Change code according  according to coderabbitai

* solve pre-commit

* Fix UT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* change apply_bias doc

* update the version compatibility

* feat (tf/pt): add atomic weights to tensor loss (deepmodeling#4466)

Interfaces are of particular interest in many studies. However, the
configurations in the training set to represent the interface normally
also include large parts of the bulk material. As a result, the final
model would prefer the bulk information while the interfacial
information is less learnt. It is difficult to simply improve the
proportion of interfaces in the configurations since the electronic
structures of the interface might only be reasonable with a certain
thickness of bulk materials. Therefore, I wonder whether it is possible
to define weights for atomic quantities in loss functions. This allows
us to add higher weights for the atomic information for the regions of
interest and probably makes the model "more focused" on the region of
interest.
In this PR, I add the keyword `enable_atomic_weight` to the loss
function of the tensor model. In principle, it could be generalised to
any atomic quantity, e.g., atomic forces.
I would like to know the developers' comments/suggestions about this
feature. I can add support for other loss functions and finish unit
tests once we agree on this feature.

Best. 




## Summary by CodeRabbit

- **New Features**
  - Introduced an optional parameter for atomic weights in loss calculations, enhancing flexibility in the `TensorLoss` class.
  - Added a suite of unit tests for the `TensorLoss` functionality, ensuring consistency between TensorFlow and PyTorch implementations.
- **Bug Fixes**
  - Updated logic for local loss calculations to ensure correct application of atomic weights based on user input.
- **Documentation**
  - Improved clarity of documentation for several function arguments, including the addition of a new argument related to atomic weights.

* delete sub_var_name

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* recover to property key

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix conflict

* Fix UT

* Add document of property fitting

* Delete checkpoint

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add get_property_name to DeepEvalBackend

* pd: fix learning rate setting when resume (deepmodeling#4480)

"When resuming training, there is no need to add `self.start_step` to
the step count because Paddle uses `lr_sche.last_epoch` as the input for
`step`, which already records the `start_step` steps."

The learning rate is correct after the fix:


![22AD6874B74E437E9B133D75ABCC02FE](https://github.com/user-attachments/assets/1ad0ce71-6e1c-4de5-87dc-0daca1f6f038)



## Summary by CodeRabbit

- **New Features**
  - Enhanced training process with improved optimizer configuration and learning rate adjustments.
  - Refined logging of training and validation results for clarity.
  - Improved model saving logic to preserve the latest state during interruptions.
  - Enhanced tensorboard logging for detailed tracking of training metrics.
- **Bug Fixes**
  - Corrected lambda function for learning rate scheduler to reference warmup steps accurately.
- **Chores**
  - Streamlined data loading and handling for efficient training across different tasks.

* docs: update deepmd-gnn URL (deepmodeling#4482)

## Summary by CodeRabbit

- **Documentation**
  - Updated guidelines for creating and integrating new models in the DeePMD-kit framework.
  - Added new sections on descriptors, fitting networks, and model requirements.
  - Enhanced unit testing section with instructions for regression tests.
  - Updated URL for the DeePMD-GNN plugin to reflect new repository location.

Signed-off-by: Jinzhe Zeng <[email protected]>

* docs: update DPA-2 citation (deepmodeling#4483)

## Summary by CodeRabbit

- **New Features**
  - Updated references in the bibliography for the DPA-2 model to include a new article entry for 2024.
  - Added a new reference for an attention-based descriptor.
- **Bug Fixes**
  - Corrected reference links in documentation to point to updated DOI links instead of arXiv.
- **Documentation**
  - Revised entries in the credits and model documentation to reflect the latest citations and details.
  - Enhanced clarity and detail in fine-tuning documentation for TensorFlow and PyTorch implementations.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>

* docs: fix a minor typo on the title of `install-from-c-library.md` (deepmodeling#4484)

## Summary by CodeRabbit

- **Documentation**
  - Updated formatting of the installation guide for the pre-compiled C library.
  - Icons for TensorFlow and JAX are now displayed together in the header.
  - Retained all installation instructions and compatibility notes.

Signed-off-by: Jinzhe Zeng <[email protected]>

* fix: print dlerror if dlopen fails (deepmodeling#4485)

xref: deepmodeling/deepmd-gnn#44

## Summary by CodeRabbit

- **New Features**
  - Enhanced error messages for library loading failures on non-Windows platforms.
  - Updated thread management environment variable checks for improved compatibility.
  - Added support for mixed types in tensor input handling, allowing for more flexible configurations.
- **Bug Fixes**
  - Improved error reporting for dynamic library loading issues.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* change doc to py

* Add out_bias out_std doc

* change bias method to compute_stats_do_not_distinguish_types

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* change var_name to property_name

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* change logic of extensive bias

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for neww added parameter

* change doc for compute_stats_do_not_distinguish_types

* try to fix dptest

* change all property to property_name

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix UT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Delete key 'property' completely

* Fix UT

* Fix dptest UT

* pd: fix oom error (deepmodeling#4493)

Paddle raises `MemoryError` rather than the `RuntimeError` used in PyTorch. With this fix,
I can test DPA-1 and DPA-2 on a 16 GB V100.

![image](https://github.com/user-attachments/assets/42ead773-bf26-4195-8f67-404b151371de)

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved detection of out-of-memory (OOM) errors to enhance application stability.
  - Ensured cached memory is cleared upon OOM errors, preventing potential memory leaks.

* pd: add missing `dp.eval()` in pd backend (deepmodeling#4488)

Switch to eval mode when evaluating the model; otherwise `self.training`
stays `True`, a backward graph is created, and this causes OOM.

## Summary by CodeRabbit

- **New Features**
  - Enhanced model evaluation state management to ensure correct behavior during evaluation.
- **Bug Fixes**
  - Improved type consistency in the `normalize_coord` function for better computational accuracy.

* [pre-commit.ci] pre-commit autoupdate (deepmodeling#4497)

<!--pre-commit.ci start-->
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 →
v0.8.4](astral-sh/ruff-pre-commit@v0.8.3...v0.8.4)
<!--pre-commit.ci end-->

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Delete attribute

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Solve comment

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Solve error

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete property_name in serialize

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Chenqqian Zhang <[email protected]>
Co-authored-by: Jia-Xin Zhu <[email protected]>
Co-authored-by: HydrogenSulfate <[email protected]>
Co-authored-by: Jinzhe Zeng <[email protected]>
iProzd added a commit to iProzd/deepmd-kit that referenced this issue Jan 4, 2025
* Refactor property (#37)


* add multig1 mess

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
Signed-off-by: Duo <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Chenqqian Zhang <[email protected]>
Co-authored-by: Jia-Xin Zhu <[email protected]>
Co-authored-by: HydrogenSulfate <[email protected]>
Co-authored-by: Jinzhe Zeng <[email protected]>