
Allow to add metadata when uploading a resource #12862

Open
ridoo opened this issue Jan 29, 2025 · 19 comments · May be fixed by #12874
Labels
feature A new feature to be added to the codebase

Comments

@ridoo
Contributor

ridoo commented Jan 29, 2025

Is your feature request related to a problem? Please describe.
Currently, a dataset upload via the importer does not serialize any metadata payload that could be set during the import process. This leaves two options:

  • wait for new metadata API in GeoNode v5
  • use geonode-mapstore-client metadata editor

However, both approaches are either cumbersome (metadata editor) or require two requests, which may leave the resource in an inconsistent state if the network fails between them.

Describe the solution you'd like
The importer already accepts a data payload that offers some control over the import process (overwrite, skip existing layers, etc.). I can imagine a data field that would take (if necessary, a reduced set of) attribute values to be set directly when creating the resource. For example:

response = requests.request(
    "POST",
    f"{url}?{token_param}",
    headers=headers,
    files=files,
    json={
        "overwrite_existing_layer": "false",
        "data": {"group": {"id": group}}
    }
)
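A practical caveat for the sketch above: the requests library silently drops the json= body when files= is passed (the multipart encoding wins), so the payload would have to travel as form fields with JSON-encoded values, as the later examples in this thread do. A minimal sketch (the helper name is mine, not GeoNode API):

```python
import json

def build_upload_payload(group: int) -> dict:
    # Form fields for a multipart upload; the nested value is JSON-encoded
    # by hand because requests ignores json= when files= is present.
    return {
        "overwrite_existing_layer": "false",
        "data": json.dumps({"group": {"id": group}}),
    }
```

This dict would then be sent via requests.post(..., files=files, data=build_upload_payload(group)).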

On import, the handler would pass the data as **kwargs:

saved_dataset = resource_manager.create(
    None,
    resource_type=resource_type,
    defaults=dict(
        name=alternate,
        workspace=workspace,
        subtype="raster",
        alternate=f"{workspace}:{alternate}",
        dirty_state=True,
        title=layer_name,
        owner=_exec.user,
        asset=asset,
        **kwargs  # or use _exec.input_params.get("data", {})
    ),
)

Describe alternatives you've considered

As said before, a two-step workflow (upload, then patch) may work too, but

  • GeoNode v5 metadata API is not available yet
  • PATCHing resource base metadata is not possible via API

Additional context
It seems that upload/models.py#Upload (from v4.4) has been refactored to use the ResourceBase model (as part of the metadata API refactoring, I guess). However, it also lacks the payload serialization that would be very convenient during API upload.

For the time being I am still on v4.4, but I tend to upgrade to v5 if you think this should be part of the next version. However, I would also be happy to find a quick solution for v4.4. For example, the metadata field in the old Upload model looked promising to me at first glance.

Any hints welcome :)

@ridoo ridoo added the feature A new feature to be added to the codebase label Jan 29, 2025
@mattiagiupponi
Contributor

mattiagiupponi commented Jan 30, 2025

Hi @ridoo
During the upload, you can always send an XML file with the metadata you need, and when the resource is created the XML file is applied. This is done here (I'm taking the vector handler as an example; it is the same for the SLD file):

self.handle_xml_file(saved_dataset, _exec)

def handle_xml_file(self, saved_dataset: Dataset, _exec: ExecutionRequest):
    _path = _exec.input_params.get("files", {}).get("xml_file", "")
    resource_manager.update(
        None,
        instance=saved_dataset,
        xml_file=_path,
        metadata_uploaded=True if _path else False,
        vals={"dirty_state": True},
    )

As you can see, these are fields handled by the serializer:

xml_file = serializers.FileField(required=False)
sld_file = serializers.FileField(required=False)

If XML (ISO metadata or Dublin Core) is not the way you want to upload the data, you can define your custom METADATA_STORER in the settings file:

geonode/geonode/settings.py

Lines 2219 to 2224 in 306a958

"""
List of modules that implement custom metadata storers that will be called when the metadata of a resource is saved
"""
METADATA_STORERS = [
# 'geonode.resource.regions_storer.spatial_predicate_region_assignor',
]
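For context, a storer is just a callable listed in METADATA_STORERS. A minimal sketch of one that copies values from a custom payload onto the resource could look like this; the (instance, custom) signature and the attribute names are assumptions on my part, so check spatial_predicate_region_assignor for the real contract:

```python
def custom_payload_storer(instance, custom):
    # Hypothetical storer: copy a whitelisted subset of the request's
    # custom payload onto the saved resource instance.
    allowed = {"abstract", "purpose"}
    for key, value in (custom or {}).items():
        if key in allowed:
            setattr(instance, key, value)
    return instance
```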

This existing mechanism works for both GN5 and GN4.4.x.

@ridoo
Contributor Author

ridoo commented Jan 30, 2025

@mattiagiupponi indeed, XML would be too verbose and cumbersome for just one property. But METADATA_STORERS
looks promising! Thanks for the hint; I didn't know about it. Starting to 🔧 ... I will let you know if it works out.

@ridoo
Contributor Author

ridoo commented Jan 30, 2025

@mattiagiupponi Looking into it, it seems that METADATA_STORER is meant to calculate things after an instance has been stored (see resource.regions_storer.spatial_predicate_region_assignor).

However, resource.manager.py#create() does not accept any kwargs. In create() there is the custom parameter, which is passed to the configured metadata storer, but those parameters are never populated at this location, as the importer only sets the defaults dict.

As described above, wouldn't it be enough to let the importer read and pass those defaults?

@ridoo
Contributor Author

ridoo commented Jan 31, 2025

@mattiagiupponi I see these changes as necessary to make this work:

  • importer:
    • add custom = serializers.JSONField(required=False, default={}) to the serializer
    • pop the custom parameter from _data into extracted_params
    • pass custom=_exec.input_params.get("custom") to each handler (each, since otherwise some refactoring is needed)
  • geonode:
    • resource.manager.py#create() needs to accept *args and **kwargs
    • pass them on to resourcebase_post_save(_resource.get_real_instance(), **kwargs)
  • then do the metadata storer handling:
    • (not sure why this should not be useful in general, but I see the post_save was necessary for relationships to be set properly)
    • handle the post save via
      if custom:
          instance = resource_manager.update(
              instance.uuid, instance=instance, vals=custom
          )
      return instance
      

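The importer-side step of popping custom out of the request data into the extracted parameters boils down to plain dict handling; a sketch (the function name is mine) that accepts both a JSON string from a multipart form field and an already-parsed dict:

```python
import json

def extract_custom(_data, extracted_params):
    # Pop "custom" out of the incoming request data so it travels
    # alongside the other importer parameters.
    custom = _data.pop("custom", None)
    if isinstance(custom, str):
        custom = json.loads(custom)
    extracted_params["custom"] = custom or {}
    return extracted_params
```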
Sending the request seems to work (the title is not set, but abstract and group are):

files = [
    ("base_file", (filename, open(file_path, "rb"), "application/octet-stream")),
]

upload_base_url = self.geonode_base_url.geturl()
url = f"{upload_base_url}/api/v2/uploads/upload"
headers, token_param = self._prepare_auth()

response = requests.request(
    "POST",
    f"{url}?{token_param}",
    headers=headers,
    files=files,
    data={
        "overwrite_existing_layer": "false",
        "custom": json.dumps({
            "group": group,
            "title": filename,
            "abstract": "This data has been set while uploading file."
        })
    }
)

Do you see any problems with this? I could provide a PR against master, as the importer is already migrated and everything is in place. However, what do you think about setting metadata values directly by passing JSON data in general? Perhaps I am not seeing problems this may have.

@mattiagiupponi
Contributor

Hi @ridoo
Sorry for the late reply.

Do you see any problems with this?

I don't see any problem with this. It will for sure speed up resource creation via the importer. The only thing I'm worried about is the foreign keys on the resource, such as keywords, thesauri, categories, etc.

I could provide a PR on master as the importer is already migrated and everything is in place.

This would be amazing!

pass custom=_exec.input_params.get("custom") to each handler (each, since otherwise some refactoring is needed)

How big would the refactoring be, in your opinion? If we can improve something, I'm always happy to do it.

The idea of how to pass it looks good to me.

@ridoo
Contributor Author

ridoo commented Feb 3, 2025

The only thing I'm worried about is the foreign keys on the resource, such as keywords, thesauri, categories, etc...

Hm, are you thinking of something specific here (permissions, or similar)? I can come up with something (this week) and we can test and see what dragons we have to deal with.

@mattiagiupponi
Contributor

The only thing I'm worried about is the foreign keys on the resource, such as keywords, thesauri, categories, etc...

Hm, are you thinking of something specific here (permissions, or similar)? I can come up with something (this week) and we can test and see what dragons we have to deal with.

I'll take keywords as an example. There is a handler for managing keywords:

class KeywordHandler:

The handler already tries to decide whether it is a normal keyword or a thesaurus one, but that is based on how the metadata parsers return them.

Let's assume the following scenario:

  • Upload of a resource
  • I add the custom payload {"group": "group1", "keyword": [1,2,3]}
  • The group is managed via custom, so it is not an issue.

Notes:

  • Keywords are not passed via custom but are a separate field in the .update function (we have to consider it, but a get on the payload should do the trick)
  • We cannot manage thesauri for now; they are complicated, so I'll keep them via the XML
  • It must be re-checked whether, when passing keywords via the payload AND via XML, we add all the keywords combined
  • We have to re-check how extra_metadata is handled in this workflow
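If the combined-keywords question in the notes above is answered with "yes, merge them", the merge itself is a de-duplicating union; a sketch (the helper name is mine):

```python
def combine_keywords(payload_keywords, xml_keywords):
    # Union of keywords from the custom payload and the XML metadata,
    # preserving first-seen order and dropping duplicates.
    seen, combined = set(), []
    for kw in list(payload_keywords or []) + list(xml_keywords or []):
        if kw not in seen:
            seen.add(kw)
            combined.append(kw)
    return combined
```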

@ridoo
Contributor Author

ridoo commented Feb 3, 2025

@mattiagiupponi Using a custom metadata storer, we would be totally free to store whatever is needed. Do you have something more general in mind?

If you are thinking of a general storer that is enabled by default and works in combination with XML metadata:

  • We should simplify such a feature by documenting either "XML overrides JSON payload" or "JSON payload overrides XML"
  • We should exclude extra_metadata for now. As far as I can see, this would have to be re-worked in the near future anyway when introducing metadata handlers, right?
  • I would have to investigate the keyword example to decide whether we should handle it in this PR

I would start simple and document limitations as we go. Then we can decide what to include or postpone with regard to the PR.

@mattiagiupponi
Contributor

  • We should simplify such a feature by documenting either "XML overrides JSON payload" or "JSON payload overrides XML"

Sounds good for now; I would go with "JSON payload overrides XML"

  • We should exclude extra_metadata for now. As far as I can see, this would have to be re-worked in the near future anyway when introducing metadata handlers, right?

You are right

  • I would have to investigate the keyword example to decide whether we should handle it in this PR

great

I would start simple and document limitations as we go. Then we can decide what to include or postpone with regard to the PR.

Agreed, let's keep it simple for now.

I'm waiting for your PR :)
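The agreed precedence ("JSON payload overrides XML") amounts to a dict merge in which explicit JSON keys win; a sketch, assuming both sides have been flattened to dicts of resource fields:

```python
def merge_metadata(xml_values, json_payload):
    # "JSON payload overrides XML": start from the XML-derived values
    # and let keys supplied in the JSON payload take precedence.
    merged = dict(xml_values or {})
    merged.update(json_payload or {})
    return merged
```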

@ridoo
Contributor Author

ridoo commented Feb 4, 2025

@mattiagiupponi there were changes to the API which now require the actions attribute (previously source) in the request. The source attribute was not required, and a default was set if it was not present. Why did you choose to require that attribute now? Is the API not working as intended if no actions is provided? I cannot see any reason to force the caller to provide that parameter.

Edit: Other observations:

  • The update_resource() confuses me. There are a lot of if/else statements deciding when to add values (from defaults) to to_update, but suddenly there's to_update.update(defaults), which makes the previous if/else obsolete, doesn't it? Am I overlooking something here?
  • Did you ever get this when uploading a file? (this SO answer seems to be related)
    [2025-02-04 08:26:33,152: ERROR/ForkPoolWorker-2] Detected path traversal attempt in '/mnt/volumes/statics/uploaded/thumbs/dataset-14cc90f8-189b-4ddd-85bd-c15f20075dcf-thumb-492499e9-1a15-4996-b211-3b55062dde64.jpg'
    Traceback (most recent call last):
      File "/usr/src/geonode/geonode/base/models.py", line 1564, in save_thumbnail
        storage_manager.save(storage_manager.path(upload_path), img)
      File "/usr/src/geonode/geonode/storage/manager.py", line 161, in save
        return self._concrete_storage_manager.save(name, content, max_length=max_length)
      File "/usr/src/geonode/geonode/storage/manager.py", line 300, in save
        return self._fsm.save(name, content, max_length=max_length)
      File "/usr/local/lib/python3.10/dist-packages/django/core/files/storage/base.py", line 41, in save
        validate_file_name(name, allow_relative_path=True)
      File "/usr/local/lib/python3.10/dist-packages/django/core/files/utils.py", line 17, in validate_file_name
        raise SuspiciousFileOperation(
    django.core.exceptions.SuspiciousFileOperation: Detected path traversal attempt in '/mnt/volumes/statics/uploaded/thumbs/dataset-14cc90f8-189b-4ddd-85bd-c15f20075dcf-thumb-492499e9-1a15-4996-b211-3b55062dde64.jpg'
    
  • During my tests, I set group: 1 by accident, which set the contributors group (of type Group) instead of the GroupProfile I created. Is this intended? This should be GroupProfile, right? The metadata editor at least autocompletes group profiles only.

@ridoo ridoo linked a pull request Feb 4, 2025 that will close this issue
@ridoo
Contributor Author

ridoo commented Feb 5, 2025

Commenting on my observations to document the decisions I made.

there were changes to the API which now require the actions attribute (previously source) in the request. The source attribute was not required, and a default was set if it was not present. Why did you choose to require that attribute now? Is the API not working as intended if no actions is provided? I cannot see any reason to force the caller to provide that parameter.

For now, I will assume the upload action when it is not present in _data.
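That fallback is a one-liner; a sketch (the field name and its exact spelling, action vs. actions, follow the discussion above and should be treated as an assumption):

```python
DEFAULT_ACTION = "upload"

def resolve_action(_data):
    # Fall back to "upload" when the caller does not provide an action,
    # restoring the pre-change default behaviour.
    return _data.get("action") or DEFAULT_ACTION
```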

The update_resource() confuses me. There are a lot of if/else statements deciding when to add values (from defaults) to to_update, but suddenly there's to_update.update(defaults), which makes the previous if/else obsolete, doesn't it? Am I overlooking something here?

TBH, this is weird to read and makes the code a bit hard to follow; however, it will not block this feature.

Did you ever get this when uploading a file? (this SO answer seems to be related [...]

This observation does not block working on this feature. However, these errors also appear in the master test suite.

During my tests, I set group: 1 by accident, which set the contributors group (of type Group) instead of the GroupProfile I created. Is this intended? This should be GroupProfile, right? The metadata editor at least autocompletes group profiles only.

This one seems to be the only issue I need some feedback on. It seems ResourceBase allows us to set any Group (not only GroupProfile groups). This will be an issue for the metadata API, I guess, which lets you set Group relations rather than enforcing GroupProfile relations. Setting a plain Group (one without a GroupProfile), however, will raise this exception when opening the metadata editor:

ERROR Internal Server Error: /api/v2/metadata/instance/22/
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/django/core/handlers/exception.py", line 55, in inner
  response = get_response(request)
File "/usr/local/lib/python3.10/dist-packages/django/core/handlers/base.py", line 197, in _get_response
  response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.10/dist-packages/django/views/decorators/csrf.py", line 56, in wrapper_view
  return view_func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/rest_framework/viewsets.py", line 125, in view
  return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/rest_framework/views.py", line 509, in dispatch
  response = self.handle_exception(exc)
File "/usr/local/lib/python3.10/dist-packages/rest_framework/views.py", line 469, in handle_exception
  self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.10/dist-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
  raise exc
File "/usr/local/lib/python3.10/dist-packages/rest_framework/views.py", line 506, in dispatch
  response = handler(request, *args, **kwargs)
File "/usr/src/geonode/geonode/metadata/api/views.py", line 102, in schema_instance
  schema_instance = metadata_manager.build_schema_instance(resource, lang)
File "/usr/src/geonode/geonode/metadata/manager.py", line 116, in build_schema_instance
  content = handler.get_jsonschema_instance(resource, fieldname, context, errors, lang)
File "/usr/src/geonode/geonode/metadata/handlers/group.py", line 63, in get_jsonschema_instance
  {"id": str(resource.group.groupprofile.pk), "label": resource.group.groupprofile.title}
File "/usr/local/lib/python3.10/dist-packages/django/db/models/fields/related_descriptors.py", line 492, in __get__
  raise self.RelatedObjectDoesNotExist(
django.contrib.auth.models.Group.groupprofile.RelatedObjectDoesNotExist: Group has no groupprofile.

@mattiagiupponi what do you think? Is this worth an extra issue?
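Whatever the outcome of the discussion, a defensive check could reject plain auth Groups that lack a GroupProfile before they are assigned; a purely hypothetical helper, not GeoNode API:

```python
def validate_group_id(group_id, groupprofile_ids):
    # Accept only ids belonging to groups that have a GroupProfile,
    # mirroring what the metadata editor autocompletes.
    if group_id not in groupprofile_ids:
        raise ValueError(f"Group {group_id} has no GroupProfile")
    return group_id
```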

@mattiagiupponi
Contributor

Hi @ridoo
sorry, these are messy days for me.

@mattiagiupponi there were changes to the API which now require the actions attribute (previously source) in the request. The source attribute was not required, and a default was set if it was not present. Why did you choose to require that attribute now? Is the API not working as intended if no actions is provided? I cannot see any reason to force the caller to provide that parameter.

We decided to enforce it so that the importer picks the list of tasks to execute from its internal list based on the action the caller provides, so we no longer default to "upload" as the action.

I guess this is due to improper cleanup across a few commits. But in the end, no, it is not going to invalidate the if/else, because those branches convert info into objects; if you notice, to_update contains only a name key, so the payload is merged and handed to the resource.

  • Did you ever get this when uploading a file? (this SO answer seems to be related)
    [2025-02-04 08:26:33,152: ERROR/ForkPoolWorker-2] Detected path traversal attempt in '/mnt/volumes/statics/uploaded/thumbs/dataset-14cc90f8-189b-4ddd-85bd-c15f20075dcf-thumb-492499e9-1a15-4996-b211-3b55062dde64.jpg'
    Traceback (most recent call last):
      File "/usr/src/geonode/geonode/base/models.py", line 1564, in save_thumbnail
        storage_manager.save(storage_manager.path(upload_path), img)
      File "/usr/src/geonode/geonode/storage/manager.py", line 161, in save
        return self._concrete_storage_manager.save(name, content, max_length=max_length)
      File "/usr/src/geonode/geonode/storage/manager.py", line 300, in save
        return self._fsm.save(name, content, max_length=max_length)
      File "/usr/local/lib/python3.10/dist-packages/django/core/files/storage/base.py", line 41, in save
        validate_file_name(name, allow_relative_path=True)
      File "/usr/local/lib/python3.10/dist-packages/django/core/files/utils.py", line 17, in validate_file_name
        raise SuspiciousFileOperation(
    django.core.exceptions.SuspiciousFileOperation: Detected path traversal attempt in '/mnt/volumes/statics/uploaded/thumbs/dataset-14cc90f8-189b-4ddd-85bd-c15f20075dcf-thumb-492499e9-1a15-4996-b211-3b55062dde64.jpg'
    

Actually no, this needs to be investigated. Does it happen in Docker or locally?

  • During my tests, I set group: 1 by accident, which set the contributors group (of type Group) instead of the GroupProfile I created .. Is this intended? This should be GroupProfile, right? The metadata editor at least autocompletes groupprofiles only.

Yes, it should be the GroupProfile

@mattiagiupponi what do you think? Is this worth an extra issue?

Yes, better to open another issue

@ridoo
Contributor Author

ridoo commented Feb 5, 2025

We decided to enforce it so that the importer picks the list of tasks to execute from its internal list based on the action the caller provides, so we no longer default to "upload" as the action.

So, what is the code supposed to expect? Do we want the code to raise an exception when the action is not present? Currently, I use upload if it is not present ... see the PR.

Actually no, this needs to be investigated. It happen in docker or locally?

It happens in Docker (see the geonode test case)

@mattiagiupponi what do you think? Is this worth an extra issue?

Yes, better to open another issue

I created the issue here: #12875

@ridoo ridoo linked a pull request Feb 5, 2025 that will close this issue
@etj
Contributor

etj commented Feb 11, 2025

Dropping in here late in the discussion, I'm afraid...
In 4.x we had parsers and storers, which were designed to make metadata handling more modular and easily overridable.
The whole new metadata structure has a finer granularity on the metadata storing side, while it's lacking a bit on the parsing side.
I guess that storers should be rewritten to map keys of the parsed metadata to JSON schema properties, leaving to the new metadata handlers the task of persisting them. I'll open an issue about that.

@etj
Contributor

etj commented Feb 11, 2025

#12902

@etj
Contributor

etj commented Feb 11, 2025

@ridoo what about adding a jsonschema instance payload in the resource manager create(), and slightly modifying the logic to also call metadata_manager.update_schema_instance()?

https://github.com/GeoNode/geonode/blob/master/geonode/metadata/manager.py#L131

I guess it would be a lot simpler.

@mattiagiupponi
Contributor

mattiagiupponi commented Feb 11, 2025

Hi @ridoo
Just to be more technical, the ideal would be the following:

  1. let the importer accept a new input parameter named "metadata" instead of "custom" (which will make more sense)
  2. modify the create to handle the new parameter (as you did; renaming the field is enough)
  3. add the call to the metadata_manager, passing the metadata field
  4. remove the storer, since the metadata will be updated correctly during the creation workflow

This will be a huge step forward toward letting the metadata manager handle all metadata in GeoNode, and this can be the first step.

The metadata manager is expecting a payload like exposed in this test:
https://github.com/GeoNode/geonode/blob/master/geonode/metadata/tests/tests.py#L884

What do you think? Is this something that will also solve your requirement?

@mattiagiupponi locally my tests run fine 🤔 Do you have an idea?

Of course I'll assist with the test; I'll not leave you alone 😄

@ridoo
Contributor Author

ridoo commented Feb 11, 2025

@etj @mattiagiupponi my concrete use case is actually on v4.4.x :). As mentioned above, what I considered was

  • wait for new metadata API in GeoNode v5
  • use geonode-mapstore-client metadata editor

However, neither was suitable for my use case, so I decided to add this feature via the PR, hoping it would be backported. I see that using the metadata editor makes more sense here. Unfortunately, that takes me a bit farther from serving my actual use case.

Anyway, I will have a look tomorrow.

@etj do you think we can move my current approach to v4.4.x (probably in combination with #12883)?

@etj
Contributor

etj commented Feb 12, 2025

@ridoo

my concrete use case is at v4.4.x actually :)

ouch, ok. I saw that #12874 was aimed at master, so we were not aligned on that.

I guess you may want to retarget your PR against 4.4.x; doing so, it will diverge from master, where we will be solving your problem by working on #12902.
