Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing Tool: Code Revision and Cleanup #38

Merged
merged 25 commits into from
Nov 26, 2024
Merged

Indexing Tool: Code Revision and Cleanup #38

merged 25 commits into from
Nov 26, 2024

Conversation

esimpsons3ti
Copy link
Collaborator

@esimpsons3ti esimpsons3ti commented Aug 19, 2024

The indexing tool pds4_create_xml_index.py contained code that was redundant/nonfunctional. This pull request contains a revision of a large potion of the indexing tool, along with redone supplemental files and 100% unit test coverage.

  1. Changes within the pds4_create_xml_index tool:
  • Some pieces of code were turned into functions, such as get_true_type and sort_dataframe.
  • If multiple glob patterns catch the same label file, that label file will no longer appear with duplicate entries in the final index file.
  • Added additional error catching throughout the program to catch additional problems, such as an incorrectly formatted LID.
  • Revised the variable label_results to have a simplified structure, allowing for easier data extraction and header modification.
  1. Changes within the unit tests:
  • Added unit tests for label generation
  • Added unit tests for testing “no input” queries (just the directory path and the glob patterns).
  • Restructured unit test suite to better accommodate different output results
  • Added more unit tests to failures.
  • The label template now allows for a user-defined creation_date_time value, under the File class in File_Area_Ancillary/File_Area_Metadata
  • An issue where md5_checksum was different between operating systems was fixed.
  • Some variables within the unit tests were capitalized to mark their status as global variables.
  • Added additional unit tests get unit test coverage to 100%

Fixes #37

Summary by Bito

This PR enhances the PDS4 XML index creation tool with expanded test coverage, improved error handling, and optimized processing. Key updates include refactoring the 'get_true_type' function, enhancing XML template handling, and optimizing XPath processing. New configuration files and test cases were added to validate these changes.

Code change type: Refactoring, Testing, Optimization, Documentation

Unit tests added: True

Estimated effort to review (1-5, lower is better): 5

Copy link

codecov bot commented Aug 19, 2024

Codecov Report

Attention: Patch coverage is 86.25000% with 22 lines in your changes missing coverage. Please review.

Project coverage is 96.34%. Comparing base (a51e030) to head (a299746).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pds4indextools/pds4_create_xml_index.py 86.25% 17 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #38       +/-   ##
===========================================
+ Coverage   72.01%   96.34%   +24.33%     
===========================================
  Files           1        1               
  Lines         611      602        -9     
  Branches      142      124       -18     
===========================================
+ Hits          440      580      +140     
+ Misses        141       17      -124     
+ Partials       30        5       -25     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rfrenchseti
Copy link
Collaborator

/review

@pds-admin
Copy link
Collaborator

pds-admin commented Aug 26, 2024

Code Review Agent Run #91ec65

  • AI Based Review: ✔️ Successful
  • Static Analysis: ✔️ Successful

High-level Feedback

Ensure all XPath expressions follow PDS4 standards by removing non-standard suffixes like '<1>'. Update special constant values in configuration files to align with PDS4 specifications. Add comprehensive assertions to test cases to verify expected outcomes, including file content checks and error handling. Improve error messages and logging for better debugging. Consider refactoring complex functions for better maintainability. Ensure all datetime formats comply with PDS4 standards. Review and update documentation to reflect changes in functionality and configuration options.

Actionable Issues

📄 test_files/expected/tester_config.yaml
Issues: Total - 1, High importance - 1
Line 22-23 🔴 High importance - 1   
📄 test_files/samples/element_extra_file_info.txt
Issues: Total - 1, High importance - 1
Line 4-4 🔴 High importance - 1   
📄 tests/test_pds4_create_xml_index_blackbox.py
Issues: Total - 1, High importance - 1
Line 571-618 🔴 High importance - 1   
📄 test_files/labels/bad_lid_label.xml
Issues: Total - 1, High importance - 1
Line 20-20 🔴 High importance - 1   
📄 tests/test_pds4_create_xml_index_whitebox.py
Issues: Total - 1, High importance - 0
Line 572-583 🟠 Medium importance - 1   
📄 test_files/expected/simplify_xpaths_success_1.txt
Issues: Total - 1, High importance - 1
Line 1-4 🔴 High importance - 1   
📄 test_files/expected/tester_config_nillable.yaml
Issues: Total - 1, High importance - 1
Line 1-18 🔴 High importance - 1   

AI Code Review powered by Bito Logo

@rfrenchseti
Copy link
Collaborator

The Internal_Reference, External_Reference, Source_Product_Internal, and Source_Product_External label entries are missing a real implementation.

    <Reference_List>
        $IF(Internal_Reference)
        $FOR(field, k=Internal_Reference)
        <Internal_Reference>
            <lid_reference></lid_reference>
            <reference_type></reference_type>
            <comment></comment>
        </Internal_Reference>
        $END_FOR
        $END_IF
        $IF(External_Reference)
        $FOR(field, k=External_Reference)
        <External_Reference>
            <doi></doi>
            <reference_text></reference_text>
            <description></description>
        </External_Reference>
        $END_FOR
        $END_IF
        $IF(Source_Product_Internal)
        $FOR(field, k=Source_Product_Internal)
        <Source_Product_Internal>
            <lidvid_reference></lidvid_reference>
            <reference_type></reference_type>
            <comment></comment>
        </Source_Product_Internal>
        $END_FOR
        $END_IF
        $IF(Source_Product_External)
        $FOR(field, k=Source_Product_External)
        <Source_Product_External>
            <external_source_product_identifier></external_source_product_identifier>
            <reference_type></reference_type>
            <doi></doi>
            <curating_facility></curating_facility>
            <description></description>
        </Source_Product_External>
        $END_FOR
        $END_IF
    </Reference_List>

@rfrenchseti
Copy link
Collaborator

I think I mentioned this before, but it would be nice if TEMPFILE in the label template were named something else, and agreed with the other variables in case. Maybe this is just index_file_name?

@rfrenchseti
Copy link
Collaborator

--simplify-xpaths does not appear to be working reliably. For example, here are two label files. When I run with:

pds4_create_xml_index . rf_tester_label_1.xml rf_tester_label_2.xml --output-headers headers.txt --simplify-xpaths

I get this result (I sorted the lines so it's easier to see relationships):

pds:author_list<1>
pds:information_model_version<1>
pds:lid_reference<1>
pds:logical_identifier<1>
pds:Product_Observational/pds:Identification_Area<1>/pds:Citation_Information<1>/pds:keyword<1>
pds:Product_Observational/pds:Identification_Area<1>/pds:Citation_Information<1>/pds:keyword<2>
pds:Product_Observational/pds:Identification_Area<1>/pds:Citation_Information<1>/pds:keyword<3>
pds:Product_Observational/pds:Observation_Area<1>/pds:Discipline_Area<1>/geom:Geometry<1>/geom:SPICE_Kernel_Files<1>/geom:comment<1>
pds:Product_Observational/pds:Observation_Area<1>/pds:Discipline_Area<1>/geom:Geometry<1>/geom:SPICE_Kernel_Files<1>/geom:SPICE_Kernel_Identification<1>/geom:kernel_type<1>
pds:Product_Observational/pds:Observation_Area<1>/pds:Discipline_Area<1>/geom:Geometry<1>/geom:SPICE_Kernel_Files<1>/geom:SPICE_Kernel_Identification<1>/geom:spice_kernel_file_name<1>
pds:Product_Observational/pds:Observation_Area<1>/pds:Discipline_Area<1>/geom:Geometry<1>/geom:SPICE_Kernel_Files<1>/geom:SPICE_Kernel_Identification<2>/geom:kernel_type<1>
pds:Product_Observational/pds:Observation_Area<1>/pds:Discipline_Area<1>/geom:Geometry<1>/geom:SPICE_Kernel_Files<1>/geom:SPICE_Kernel_Identification<2>/geom:spice_kernel_file_name<1>
pds:Product_Observational/pds:Observation_Area<1>/pds:Discipline_Area<1>/geom:Geometry<1>/geom:SPICE_Kernel_Files<2>/geom:comment<1>
pds:Product_Observational/pds:Observing_System<1>/pds:name<1>
pds:Product_Observational/pds:Observing_System<1>/pds:Observing_System_Component<1>/pds:name<1>
pds:Product_Observational/pds:Observing_System<1>/pds:Observing_System_Component<1>/pds:type<1>
pds:Product_Observational/pds:Observing_System<2>/pds:name<1>
pds:Product_Observational/pds:Observing_System<2>/pds:Observing_System_Component<1>/pds:name<1>
pds:Product_Observational/pds:Observing_System<2>/pds:Observing_System_Component<1>/pds:type<1>
pds:publication_year<1>
pds:reference_type<1>
pds:title<1>
pds:version_id<1>

The xpath pds:keyword should be simplified.

Also, I don't see how the xpath:

pds:Product_Observational/pds:Observation_Area<1>/pds:Discipline_Area<1>/geom:Geometry<1>/geom:SPICE_Kernel_Files<1>/geom:SPICE_Kernel_Identification<2>/geom:kernel_type<1>

can even exist. There is no place in these labels where there are two geom:SPICE_Kernel_Identification under a single geom:SPICE_Kernel_Files. To the contrary, there are two geom:SPICE_Kernel_Files, each with a single geom:SPICE_Kernel_Identification under it.

Here is rf_tester_label_1.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1B00.sch"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="https://pds.nasa.gov/pds4/disp/v1/PDS4_DISP_1B00.sch"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="https://pds.nasa.gov/pds4/mission/cassini/v1/PDS4_CASSINI_1B00_1300.sch"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:geom="http://pds.nasa.gov/pds4/geom/v1"
 xmlns:rings="http://pds.nasa.gov/pds4/rings/v1"
 xsi:schemaLocation="http://pds.nasa.gov/pds4/pds/v1 https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1E00.xsd
                     http://pds.nasa.gov/pds4/geom/v1 https://pds.nasa.gov/pds4/geom/v1/PDS4_GEOM_1B10_1700.xsd
                     http://pds.nasa.gov/pds4/rings/v1 https://pds.nasa.gov/pds4/rings/v1/PDS4_RINGS_1E00_1A00.xsd">
    <Identification_Area>
        <logical_identifier>urn:nasa:pds:cassini_iss_saturn:data_raw:1455200455n</logical_identifier>
        <version_id>1.0</version_id>
        <title>Cassini ISS Image 1455200455n.img</title>
        <information_model_version>1.11.0.0</information_model_version>
        <Citation_Information>
            <author_list>French, Richard G.</author_list>
            <publication_year nilReason="unknown" xsi:nil="true"/>
            <keyword>kw1</keyword>
        </Citation_Information>
    </Identification_Area>
    <Observation_Area>
        <Discipline_Area>
            <geom:Geometry>
                <geom:SPICE_Kernel_Files>
                    <geom:SPICE_Kernel_Identification>
                        <geom:kernel_type>SPK</geom:kernel_type>
                        <geom:spice_kernel_file_name>ura111.bsp</geom:spice_kernel_file_name>
                    </geom:SPICE_Kernel_Identification>
                    <geom:comment>These kernel files were used in the generation of the products in the parent bundle. Some or all of them may not have been used directly in the generation of this product.</geom:comment>
                </geom:SPICE_Kernel_Files>
            </geom:Geometry>
        </Discipline_Area>
    </Observation_Area>
    <Observing_System>
        <name>Cassini Orbiter Imaging Science Subsystem</name>
        <Observing_System_Component>
            <name>Cassini Orbiter</name>
            <type>Spacecraft</type>
            <Internal_Reference>
                <lid_reference>urn:nasa:pds:context:instrument_host:spacecraft.co</lid_reference>
                <reference_type>is_instrument_host</reference_type>
            </Internal_Reference>
        </Observing_System_Component>
    </Observing_System>
</Product_Observational>

Here is rf_tester_label_2.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1B00.sch"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="https://pds.nasa.gov/pds4/disp/v1/PDS4_DISP_1B00.sch"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="https://pds.nasa.gov/pds4/mission/cassini/v1/PDS4_CASSINI_1B00_1300.sch"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:geom="http://pds.nasa.gov/pds4/geom/v1"
 xmlns:rings="http://pds.nasa.gov/pds4/rings/v1"
 xsi:schemaLocation="http://pds.nasa.gov/pds4/pds/v1 https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1E00.xsd
                     http://pds.nasa.gov/pds4/geom/v1 https://pds.nasa.gov/pds4/geom/v1/PDS4_GEOM_1B10_1700.xsd
                     http://pds.nasa.gov/pds4/rings/v1 https://pds.nasa.gov/pds4/rings/v1/PDS4_RINGS_1E00_1A00.xsd">
    <Identification_Area>
        <logical_identifier>urn:nasa:pds:cassini_iss_saturn:data_raw:1455200455n</logical_identifier>
        <version_id>1.0</version_id>
        <title>Cassini ISS Image 1455200455n.img</title>
        <information_model_version>1.11.0.0</information_model_version>
        <Citation_Information>
            <author_list>French, Richard G.</author_list>
            <publication_year nilReason="unknown" xsi:nil="true"/>
            <keyword>kw1</keyword>
            <keyword>kw2</keyword>
            <keyword>kw3</keyword>
        </Citation_Information>
    </Identification_Area>
    <Observation_Area>
        <Discipline_Area>
            <geom:Geometry>
                <geom:SPICE_Kernel_Files>
                    <geom:SPICE_Kernel_Identification>
                        <geom:kernel_type>SPK</geom:kernel_type>
                        <geom:spice_kernel_file_name>ura111.bsp</geom:spice_kernel_file_name>
                    </geom:SPICE_Kernel_Identification>
                    <geom:comment>These kernel files were used in the generation of the products in the parent bundle. Some or all of them may not have been used directly in the generation of this product.</geom:comment>
                </geom:SPICE_Kernel_Files>
                <geom:SPICE_Kernel_Files>
                    <geom:SPICE_Kernel_Identification>
                        <geom:kernel_type>SPK</geom:kernel_type>
                        <geom:spice_kernel_file_name>earthstns_itrf93_040916.bsp</geom:spice_kernel_file_name>
                    </geom:SPICE_Kernel_Identification>
                    <geom:comment>These kernel files were used in the generation of the products in the parent bundle. Some or all of them may not have been used directly in the generation of this product.</geom:comment>
                </geom:SPICE_Kernel_Files>
            </geom:Geometry>
        </Discipline_Area>
    </Observation_Area>
    <Observing_System>
        <name>Cassini Orbiter Imaging Science Subsystem</name>
        <Observing_System_Component>
            <name>Cassini Orbiter</name>
            <type>Spacecraft</type>
        </Observing_System_Component>
    </Observing_System>
    <Observing_System>
        <name>Another thing</name>
        <Observing_System_Component>
            <name>Another thing</name>
            <type>Spacecraft</type>
            <Internal_Reference>
                <lid_reference>urn:nasa:pds:context:instrument_host:spacecraft.co</lid_reference>
                <reference_type>is_instrument_host</reference_type>
            </Internal_Reference>
        </Observing_System_Component>
    </Observing_System>
</Product_Observational>

@rfrenchseti
Copy link
Collaborator

When generating a fixed-width table, the label has the wrong positions for the fields:

                    <Field_Character>
                        <name>pds:logical_identifier&lt;1&gt;</name>
                        <field_number>1</field_number>
                        <field_location unit="byte">1</field_location>
                        <data_type>ASCII_LID</data_type>
                        <field_length unit="byte">52</field_length>
                    </Field_Character>
                    <Field_Character>
                        <name>pds:version_id&lt;1&gt;</name>
                        <field_number>2</field_number>
                        <field_location unit="byte">26</field_location>
                        <data_type>ASCII_Short_String_Collapsed</data_type>
                        <field_length unit="byte">17</field_length>
                    </Field_Character>

Note the math does not add up (a 52 character field and then the next one starts at 26).

@rfrenchseti
Copy link
Collaborator

In the documentation, it says:

Below is the label-contents section of the default configuration file:

label-contents:
  version_id: 1.0
  title: Index File
  Citation_Information:
  Modification_Detail:
  Internal_Reference:
  External_Reference:
  Source_Product_Internal:
  Source_Product_External:

but this is missing the last two lines:

  File_Area_Ancillary:
  File_Area_Metadata:

@rfrenchseti
Copy link
Collaborator

In the documentation under "For reference, provided below are the full contents of the optional label classes:" I think every line should end with a colon. Right now if you copy this list into your yaml config file, it won't parse.

@rfrenchseti
Copy link
Collaborator

If there are no references, the label contains:

    <Reference_List>
    </Reference_List>

It seems like in this case the section should be left out entirely.

@rfrenchseti
Copy link
Collaborator

I'm not sure why, but this part of the label template doesn't work:

        $FOR(Citation_Information['keyword'])
            <keyword>$VALUE$</keyword>
        $END_FOR

If a Citation_Information section is provided in the user's config file, even if every field is empty, the label generation stops (without error!) at the beginning of the citation information section. For example this config file:

label-contents:
  version_id: '2.5'
  title: "This is a long string, with commas, & symbols, : <> stuff, etc."
  Citation_Information:
    author_list: me
    editor_list:
    publication_year:
    doi:
    keyword:
    description:
    Funding_Acknowledgement:
      funding_source:
      funding_year:
      funding_award:
      funding_acknowledgement_text:

If I remove that loop from the label template, it works.

Is this a bug in PdsTemplate?

@rfrenchseti
Copy link
Collaborator

The documentation for the label config is a little confusing. For example, author_list probably isn't actually expecting a list. If you give it:

  Citation_Information:
    author_list:
      me
      you

then it just separates them with spaces:

            <author_list>me you</author_list>

On the other hand, keyword really is expecting a list, and if you just give it a single keyword like the name implies, you definitely don't get what you expect:

    keyword: fred

yields

            <keyword>f</keyword>
            <keyword>r</keyword>
            <keyword>e</keyword>
            <keyword>d</keyword>

@rfrenchseti
Copy link
Collaborator

Likewise, let's say you put this in your config file:

  Modification_Detail:
    modification_date: "2020-01-01"
    version_id: 15.5
    description: This is a description

You get this:

        <Modification_History>
            <Modification_Detail>
                <modification_date>[[[TypeError(string indices must be integers, not 'str') at line 45]]]</modification_date>
                <version_id>[[[TypeError(string indices must be integers, not 'str') at line 46]]]</version_id>
                <description>[[[TypeError(string indices must be integers, not 'str') at line 47]]]</description>
            </Modification_Detail>
            <Modification_Detail>
                <modification_date>[[[TypeError(string indices must be integers, not 'str') at line 45]]]</modification_date>
                <version_id>[[[TypeError(string indices must be integers, not 'str') at line 46]]]</version_id>
                <description>[[[TypeError(string indices must be integers, not 'str') at line 47]]]</description>
            </Modification_Detail>
            <Modification_Detail>
                <modification_date>[[[TypeError(string indices must be integers, not 'str') at line 45]]]</modification_date>
                <version_id>[[[TypeError(string indices must be integers, not 'str') at line 46]]]</version_id>
                <description>[[[TypeError(string indices must be integers, not 'str') at line 47]]]</description>
            </Modification_Detail>
        </Modification_History>

@rfrenchseti
Copy link
Collaborator

--clean-header crashes sometimes. Using the test labels in earlier comments:

pds4_create_xml_index . rf_tester_label_1.xml rf_tester_label_2.xml --output-headers headers.txt --generate-label ancillary --output-index-file out.csv  --clean-header

gives:

Non-nillable element in rf_tester_label_1.xml has no associated text: publication_year
Non-nillable element in rf_tester_label_2.xml has no associated text: publication_year
Index file generated at out.csv
XPath headers file generated at headers.txt.
Traceback (most recent call last):
  File "/seti/all_repos/rms-pds4indextools/venv/bin/pds4_create_xml_index", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/seti/all_repos/rms-pds4indextools/pds4indextools/pds4_create_xml_index.py", line 1566, in main
    true_type = true_type.split(':')[-1]
                ^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'

@rfrenchseti
Copy link
Collaborator

rfrenchseti commented Sep 24, 2024

--clean-headers and --sort-by do not play well together. --sort-by apparently wants to use the header names before cleaning. But the error message says this:

❯ pds4_create_xml_index . rf_tester_label_1.xml rf_tester_label_2.xml --output-index-file out.csv --output-headers headers.txt --clean-header --sort-by pds_Product_Observational__pds_Identification_Area_1__pds_logical_identifier_1
Unknown sort key 'pds_Product_Observational__pds_Identification_Area_1__pds_logical_identifier_1'. For a list of available sort keys, use the --output-headers-file option.

If you follow those instructions and look at the output headers file, you will see the cleaned names, which then don't actually work with the sort option. Perhaps the message should say "For a list of available sort keys, use the --output-headers-file option without --clean-header-field-names".

Alternatively, you could do the sort on the cleaned names instead of the original names, which might make more sense given that those are the actual header names in the index file.

@rfrenchseti
Copy link
Collaborator

For this option:

  --limit-xpaths-file XPATHS_FILEPATH
                        Optional text file specifying which XPaths to scrape. If not specified, all XPaths found in the label files are included.

I suggest adding a note that the XPaths in the file must be the full, not simplified, version. This should be put in the main documentation, too.

@rfrenchseti
Copy link
Collaborator

The column byte positions in the label are wrong for fixed-width index files. Consider:

python ../pds4indextools/pds4_create_xml_index.py . "rf_tester_label*.xml" --generate-label metadata --fixed

The first three columns actually start at 1, 80, and 151, but the label says they start at 1, 79, and 150.

Also, any column that has a string in quotes, like pds:author_list, doesn't line up with the headers because the padding doesn't take into account the two characters of the quotes. This also makes the label wrong by an additional 2 bytes.

@rfrenchseti
Copy link
Collaborator

In the Citation_Information section, are some of those fields optional? I think Funding Acknowledgment is at least. The way things are written right now, there doesn't seem to be a way to omit these fields from the label. If you don't supply values, you just get:

            <Funding_Acknowledgement>
                <funding_source>None</funding_source>
                <funding_year>None</funding_year>
                <funding_award>None</funding_award>
                <funding_acknowledgement_text>None</funding_acknowledgement_text>
            </Funding_Acknowledgement>

I assume this general issue could be true for the other optional label fields as well. Perhaps for the optional fields there should be IF statements in the template to allow the individual fields to be omitted?

@rfrenchseti rfrenchseti merged commit d9b7761 into main Nov 26, 2024
15 checks passed
@rfrenchseti rfrenchseti deleted the es-overhaul branch November 26, 2024 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pds4_create_xml_index : Legacy, Non-Functional, and Overly Complex Code
3 participants