Debug flags #169

Merged
merged 40 commits into from
Jan 9, 2025

Conversation

scemama commented Nov 12, 2024

This PR adds --enable-debug and --enable-sanitizer to configure.ac so that the GitHub Actions can run many checks on the library.
It depends on PR #168
There are still some issues to fix before it can be merged, but I created the draft so that you know that I am working on it.

@scemama scemama marked this pull request as ready for review December 5, 2024 12:10

scemama commented Dec 5, 2024

All good now! Aggressive compiler checks no longer give any warnings, and we can now easily use the sanitizer to detect many issues at runtime.


q-posev commented Dec 5, 2024

@scemama there is a bug at the compilation step:

src/trexio.c: In function ‘trexio_string_of_error_f’:
src/trexio.c:184:16: error: ‘MAX_STRING_LENGTH’ undeclared (first use in this function)
  184 |   if (sizeCp > MAX_STRING_LENGTH) sizeCp = MAX_STRING_LENGTH;
      |                ^~~~~~~~~~~~~~~~~
src/trexio.c:184:16: note: each undeclared identifier is reported only once for each function it appears in

@q-posev q-posev self-requested a review December 5, 2024 15:24

scemama commented Dec 5, 2024

Sorry... it compiles now :-)


q-posev commented Dec 6, 2024

@scemama thanks! I ran a few tests through valgrind. The C tests look fine, though one pthread-related error is reported when they are compiled with all debug and sanitizer flags on. No error is reported in a conventional ./configure build.

However, on the Fortran test I get the error shown below. I guess it's an artifact of the Fortran test modifications from PR #168.
(My valgrind-libtool command is libtool --mode=execute valgrind, which is the recommended way to run valgrind for memory leaks.)

~/trexio $ valgrind-libtool ./tests/test_f
==27762== Memcheck, a memory error detector
==27762== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27762== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==27762== Command: /home/q-posev/trexio/tests/.libs/test_f
==27762== 
============================================
         TREXIO VERSION STRING : 2.5.1       
         TREXIO MAJOR VERSION  :   2
         TREXIO MINOR VERSION  :   5
============================================
TREXIO_PACKAGE_VERSION : 2.5.1
TREXIO_GIT_HASH        : cd369bd1875a46e7da73457a24a8fad09d915a1e
HAVE_HDF5              : true
HDF5 library version: 1.10.7
 call test_write
 SUCCESS HAS NOT 1
 SUCCESS HAS NOT 2
 SUCCESS HAS NOT 2.1
 SUCCESS HAS NOT 2.2
 SUCCESS HAS NOT 3
 SUCCESS HAS NOT 4
 SUCCESS HAS NOT 5
 SUCCESS HAS NOT 6
 SUCCESS WRITE NUM
 SUCCESS WRITE CHARGE
 SUCCESS WRITE COORD
 SUCCESS WRITE LABEL
 SUCCESS WRITE POINT GROUP
 SUCCESS WRITE BASIS NUM
 SUCCESS WRITE INDEX
 SUCCESS WRITE INDEX TYPE
 SUCCESS WRITE AO NUM
 SUCCESS WRITE MO NUM
 SUCCESS WRITE ENERGY
 SUCCESS WRITE SPIN
 SUCCESS WRITE SPARSE
 SUCCESS WRITE SPARSE
 SUCCESS WRITE SPARSE
 SUCCESS WRITE SPARSE
 SUCCESS WRITE SPARSE
 SUCCESS WRITE DET LIST
 SUCCESS WRITE DET LIST
 SUCCESS WRITE DET LIST
 SUCCESS WRITE DET LIST
 SUCCESS WRITE DET LIST
 SUCCESS HAS 1
 SUCCESS HAS 2
 SUCCESS HAS 3
 SUCCESS HAS 4
 SUCCESS HAS 5
 SUCCESS HAS 6
 SUCCESS CLOSE
 call test_read
 SUCCESS READ NUM
 SUCCESS READ CHARGE
 SUCCESS READ COORD
 SUCCESS READ LABEL
 SUCCESS READ INDEX
 SUCCESS READ INDEX TYPE
 SUCCESS READ POINT GROUP
 SUCCESS READ SPARSE DATA
 SUCCESS READ SPARSE DATA EOF
 SUCCESS READ SPARSE SIZE
 SUCCESS GET INT64_NUM
 SUCCESS READ DET LIST
 SUCCESS READ DET NUM
 SUCCESS CONVERT DET LIST
 SUCCESS CONVERT ORB LIST
 call test_read_void
==27762== Conditional jump or move depends on uninitialised value(s)
==27762==    at 0x4C4A981: _gfortran_string_len_trim (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x10D1CE: test_read_void_ (test_f.f90:675)
==27762==    by 0x10FE4B: MAIN__ (test_f.f90:51)
==27762==    by 0x10B72E: main (test_f.f90:2)
==27762== 
==27762== Conditional jump or move depends on uninitialised value(s)
==27762==    at 0x4C4A8FD: _gfortran_string_len_trim (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x10D1CE: test_read_void_ (test_f.f90:675)
==27762==    by 0x10FE4B: MAIN__ (test_f.f90:51)
==27762==    by 0x10B72E: main (test_f.f90:2)
==27762== 
==27762== Syscall param write(buf) points to uninitialised byte(s)
==27762==    at 0x4DAD887: write (write.c:26)
==27762==    by 0x4C3BED8: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x4C445B1: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x4C36E14: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x4C39E41: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x4C3A323: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x10D1EA: test_read_void_ (test_f.f90:675)
==27762==    by 0x10FE4B: MAIN__ (test_f.f90:51)
==27762==    by 0x10B72E: main (test_f.f90:2)
==27762==  Address 0x639c798 is 40 bytes inside a block of size 512 alloc'd
==27762==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==27762==    by 0x49E0D88: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x4C44465: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x4C3B2A1: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x49DF3D1: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x400647D: call_init.part.0 (dl-init.c:70)
==27762==    by 0x4006567: call_init (dl-init.c:33)
==27762==    by 0x4006567: _dl_init (dl-init.c:117)
==27762==    by 0x40202C9: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
==27762== 
 Test error message: Error opening file�������G���������
 call test_write('test_write_f.h5', TREXIO_HDF5)
 SUCCESS HAS NOT 1
 SUCCESS HAS NOT 2
 SUCCESS HAS NOT 2.1
 SUCCESS HAS NOT 2.2
 SUCCESS HAS NOT 3
 SUCCESS HAS NOT 4
 SUCCESS HAS NOT 5
 SUCCESS HAS NOT 6
 SUCCESS WRITE NUM
 SUCCESS WRITE CHARGE
 SUCCESS WRITE COORD
 SUCCESS WRITE LABEL
 SUCCESS WRITE POINT GROUP
 SUCCESS WRITE BASIS NUM
 SUCCESS WRITE INDEX
 SUCCESS WRITE INDEX TYPE
 SUCCESS WRITE AO NUM
 SUCCESS WRITE MO NUM
 SUCCESS WRITE ENERGY
 SUCCESS WRITE SPIN
 SUCCESS WRITE SPARSE
 SUCCESS WRITE SPARSE
 SUCCESS WRITE SPARSE
 SUCCESS WRITE SPARSE
 SUCCESS WRITE SPARSE
 SUCCESS WRITE DET LIST
 SUCCESS WRITE DET LIST
 SUCCESS WRITE DET LIST
 SUCCESS WRITE DET LIST
 SUCCESS WRITE DET LIST
 SUCCESS HAS 1
 SUCCESS HAS 2
 SUCCESS HAS 3
 SUCCESS HAS 4
 SUCCESS HAS 5
 SUCCESS HAS 6
 SUCCESS CLOSE
 call test_read('test_write_f2.h5', TREXIO_HDF5)
 SUCCESS READ NUM
 SUCCESS READ CHARGE
 SUCCESS READ COORD
 SUCCESS READ LABEL
 SUCCESS READ INDEX
 SUCCESS READ INDEX TYPE
 SUCCESS READ POINT GROUP
 SUCCESS READ SPARSE DATA
 SUCCESS READ SPARSE DATA EOF
 SUCCESS READ SPARSE SIZE
 SUCCESS GET INT64_NUM
 SUCCESS READ DET LIST
 SUCCESS READ DET NUM
 SUCCESS CONVERT DET LIST
 SUCCESS CONVERT ORB LIST
==27762== Conditional jump or move depends on uninitialised value(s)
==27762==    at 0x4C4A981: _gfortran_string_len_trim (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x10D1CE: test_read_void_ (test_f.f90:675)
==27762==    by 0x10B72E: main (test_f.f90:2)
==27762== 
==27762== Conditional jump or move depends on uninitialised value(s)
==27762==    at 0x4C4A8FD: _gfortran_string_len_trim (in /usr/lib/x86_64-linux-gnu/libgfortran.so.5.0.0)
==27762==    by 0x10D1CE: test_read_void_ (test_f.f90:675)
==27762==    by 0x10B72E: main (test_f.f90:2)
==27762== 
 Test error message: Error opening file             test_write_f2.dir                                               ��>
==27762== 
==27762== HEAP SUMMARY:
==27762==     in use at exit: 1,864 bytes in 3 blocks
==27762==   total heap usage: 7,383 allocs, 7,380 frees, 5,647,684 bytes allocated
==27762== 
==27762== LEAK SUMMARY:
==27762==    definitely lost: 0 bytes in 0 blocks
==27762==    indirectly lost: 0 bytes in 0 blocks
==27762==      possibly lost: 0 bytes in 0 blocks
==27762==    still reachable: 1,864 bytes in 3 blocks
==27762==         suppressed: 0 bytes in 0 blocks
==27762== Rerun with --leak-check=full to see details of leaked memory
==27762== 
==27762== Use --track-origins=yes to see where uninitialised values come from
==27762== For lists of detected and suppressed errors, rerun with: -s
==27762== ERROR SUMMARY: 6 errors from 5 contexts (suppressed: 0 from 0)


scemama commented Dec 6, 2024

In this statement:

  call trexio_string_of_error(rc, str)
  print *, 'Test error message: ', trim(str)

the string returned by trexio_string_of_error was shorter than the maximum size of str. The trim function scans str completely, so because str had not been initialized (for example with str = ''), trim read uninitialized values.

I fixed it in the tests by initializing str to ''.
But now I am fixing it in the library so that it no longer expects an initialized string. Wait a minute before merging.
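
A library-side fix of this kind can be sketched as follows. This is only an illustration under assumptions, not the code that was merged: the helper name copy_blank_padded and its signature are hypothetical. The idea is that the C side fills the whole fixed-length Fortran buffer, copying the message and padding the remainder with blanks, so that trim never reads bytes the library did not write.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the TREXIO API): copy a NUL-terminated C message
 * into a fixed-length Fortran character buffer of dest_len bytes.
 * The message is truncated if too long, and the rest of the buffer is
 * filled with blanks, so every byte of dest is initialized on return. */
void copy_blank_padded(char *dest, size_t dest_len, const char *src)
{
    assert(dest != NULL && src != NULL);

    size_t n = strlen(src);
    if (n > dest_len) {
        n = dest_len;                      /* truncate to the buffer size */
    }
    memcpy(dest, src, n);                  /* message text, no NUL terminator */
    memset(dest + n, ' ', dest_len - n);   /* Fortran-style blank padding */
}
```

With such a helper, the caller's buffer does not need to be initialized beforehand, which matches the goal stated above.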


scemama commented Dec 6, 2024

Fixed! You can merge now :-)


q-posev commented Dec 6, 2024

Thank you @scemama ! It's interesting that I haven't seen this bug before, I used to run valgrind on the Fortran test too and it was clean. Perhaps I need to add valgrind calls to the CI.


q-posev commented Dec 6, 2024

If that's OK with you, I prefer to fix the Python determinant tests first in PR #168 and then merge this PR. Nothing to do on your side, I will update this branch when it's done.
I hope to get some time over Christmas to fix the Python tests.


scemama commented Dec 6, 2024

Perfect!


scemama commented Dec 30, 2024

While I was improving the Rust binding, the tests I wrote for determinants did not work with the text backend on the master branch. This was because the representation of integers in the files was too small, which is fixed in this PR. I think that this fix is an important one, but it breaks backward compatibility of the text backend. When we merge, we may need to set the version to 3.0.0.


q-posev commented Dec 30, 2024

Are you talking about the fix from PR #168? If so, it does not require an update of the major version, as the TREXIO API remains unchanged. It is an important bug fix for the determinant I/O in the text backend, which can be reflected in a minor version bump, but I am not convinced that API compatibility is violated.


scemama commented Dec 30, 2024

No, I am talking about the current PR. This particular commit: 0867434
where we have things like this:

-  uint64_t line_length = dims[1]*11UL + 1UL; // 10 digits per int64_t bitfield + 1 space = 11 spots + 1 newline char
+  uint64_t line_length = dims[1]*21UL + 1UL; // 20 digits per int64_t bitfield + 1 space = 11 spots + 1 newline char

You are right that the API is unchanged, maybe we should not change the major version.
But the old text files will no longer be readable. TREXIO will produce an error, but the files produced before the fix are very likely to be wrong anyway.
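
To make the width issue concrete, here is a small sketch. The helper names are hypothetical, and it assumes each value is written in a fixed-width decimal field followed by one space, as the comment in the diff describes. A 64-bit signed integer can need up to 20 characters in decimal (19 digits plus a sign), so a 10-character field overflows as soon as a value reaches 11 digits.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: number of characters needed to print v in decimal,
 * including a leading minus sign for negative values. */
int decimal_width(int64_t v)
{
    char buf[32];
    return snprintf(buf, sizeof buf, "%" PRId64, v);
}

/* Line length following the formula in the diff above: one 20-character
 * field plus one space per value, plus one newline character per line. */
uint64_t fixed_line_length(uint64_t n_int)
{
    return n_int * 21UL + 1UL;
}
```

Under these assumptions, any value wider than its field makes the line longer than the computed line_length, which is consistent with the misreads described above.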

I think it is a bit urgent to fix this particular bug in the master branch. Maybe we can create another PR with only this commit to merge it quickly. What do you think?

I think that the current PR is also important: when I tried to run the Rust interface with the TREXIO of this particular branch, many errors were detected at runtime that were silent before (some safe functions were not really safe...). It helped me fix some silent bugs in the Rust interface! :-)


q-posev commented Jan 2, 2025

The commit you mentioned was introduced in PR #168 and then appeared here after you forked the branch. If merging this bug fix is urgent for you, I can merge PR #168 as it is, but the Python tests will remain broken until I find some time to fix them. Will it work for you?
This PR #169 introduced a lot of changes unrelated to the determinant I/O, and I am not yet convinced that we need all of them (I know that these changes make the compiler happy). I prefer to have a detailed look at these changes before merging them, if that's OK with you. But this might take time, given my current workload.
The safe functions were originally introduced as a dummy proxy for the Python SWIG interface. I don't know anyone who uses them directly. On the C side there is no guarantee of safety anyway, because one might accidentally pass a pointer to a shifted memory address (e.g. following some pointer arithmetic), and the size-max argument is completely disconnected from that passed pointer. But I am absolutely happy to see these improvements, especially if they reinforce the Rust interface! :-)


scemama commented Jan 2, 2025

> The commit you mentioned was introduced in PR #168 and then appeared here after you forked the branch. If merging this bug fix is urgent for you, I can merge PR #168 as it is, but the Python tests will remain broken until I find some time to fix them. Will it work for you?

This is a good idea! Can you comment out the broken Python tests so that we can get a green CI?

> This PR #169 introduced a lot of changes unrelated to the determinant I/O, and I am not yet convinced that we need all of them (I know that these changes make the compiler happy).

It is not only that they make the compiler happy: they also make it possible to use the address sanitizer and some more aggressive checking in the CI. So it will help keep the code clean in the long term.
I understand that this PR is big. Take the time you need to look at it carefully instead of merging in a rush ;-)

> The safe functions were originally introduced as a dummy proxy for the Python SWIG interface. I don't know anyone who uses them directly.

In the foreign interfaces, I always use the safe functions. Also, it is possible that I use them in some QP plugins. I agree with you that they are not 100% safe, but they are as safe as the safe variants of the dangerous C functions (like strnlen, etc.).
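
The trade-off discussed here can be illustrated with a minimal sketch. It is not TREXIO code: the function name, argument names, and return codes are all hypothetical. A size-checked reader refuses to copy when the caller-declared capacity is smaller than the stored data, but it has no way to verify that the declared capacity matches the memory the pointer really refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical return codes for this sketch only. */
enum { SKETCH_OK = 0, SKETCH_INVALID_ARG = 1 };

/* Hypothetical size-checked reader: copies n_stored values into out only if
 * the caller declares room for at least that many (dim_out). The check
 * trusts dim_out; it cannot detect a pointer that was shifted by pointer
 * arithmetic or a capacity that the caller overstated. */
int read_values_safe(const double *stored, int64_t n_stored,
                     double *out, int64_t dim_out)
{
    if (stored == NULL || out == NULL) {
        return SKETCH_INVALID_ARG;
    }
    if (n_stored < 0 || dim_out < n_stored) {
        return SKETCH_INVALID_ARG;   /* declared buffer is too small */
    }
    memcpy(out, stored, (size_t) n_stored * sizeof *stored);
    return SKETCH_OK;
}
```

This is the same kind of guarantee that strnlen gives: protection against an honest mistake in the declared size, not against an incorrect pointer.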

@q-posev q-posev left a comment

@scemama I am done with my review. I fixed the data corruption issue reported by @sheepforce, cleaned up some tests and added the valgrind checks to the CI.

We should fix the TREXIO exit codes decoding function before merging this PR.

Review threads (resolved):
src/templates_front/templator_front.org
src/templates_text/templator_text.org (outdated)
src/templates_front/templator_front.org (outdated)

scemama commented Jan 8, 2025

> We should fix the TREXIO exit codes decoding function before merging this PR.

Done!

@q-posev q-posev self-requested a review January 8, 2025 17:12

@q-posev q-posev left a comment

Thanks @scemama ! Ready to merge?

@scemama scemama merged commit 173fc3b into master Jan 9, 2025
4 checks passed

scemama commented Jan 9, 2025

@q-posev Thanks for the review!

@q-posev q-posev deleted the debug_flags branch January 9, 2025 12:34