Create a fetchresourcelist plugin that queries Drupal for media to check #14

Closed
mjordan opened this issue Oct 28, 2018 · 9 comments

mjordan commented Oct 28, 2018

Related to #6 and Islandora/documentation#945.

We should have a fetchresourcelist plugin that queries Drupal for resources to check. The code below is a working proof of concept. It requires that the Drupal JSON API contrib module is enabled.

<?php
// The content type will need to be a Riprap admin option, as will the limit
// (apparently the JSON API's max is 50 items per page). In this example, we
// request 3 nodes starting at offset 2 (page[offset] counts items, not pages).
$page_url = "http://localhost:8000/jsonapi/node/islandora_object?page[offset]=2&page[limit]=3";

$page_output = file_get_contents($page_url);
$page_output = json_decode($page_output, true);

// Taxonomy terms to check will need to be a Riprap admin option.
$taxonomy_terms_to_check = array('/taxonomy/term/2'); // "Preservation master"

// At this point, we have a list of 3 nodes.
foreach ($page_output['data'] as $node) {
  $nid = $node['attributes']['nid']; 
  // Get the media associated with this node using the Islandora-supplied Manage Media View.
  $media_url = "http://admin:islandora@localhost:8000/node/" . $nid . "/media?_format=json";
  $media_data = file_get_contents($media_url);
  $media_data = json_decode($media_data);
  // Loop through all the media and pick the ones that
  // are tagged with terms in $taxonomy_terms_to_check.
  foreach ($media_data as $media) {
    if (count($media->field_tags)) {
      foreach ($media->field_tags as $term) {
        if (in_array($term->url, $taxonomy_terms_to_check)) {
          // @todo: Convert to the equivalent Fedora URL and add to the plugin's output.
          // @todo: Add option to not convert to Fedora URL if the site doesn't use Fedora.
          // In that case, we need to figure out how to get Drupal's checksum for the file over HTTP.
          var_dump($media->field_media_image[0]->url);
        }
      }
    }
  }
}

We will also need to persist the page number to request during the next scheduled job. This should probably go into a db table.
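
A minimal sketch of that persistence, assuming PDO and a single small table. The function names, the fetch_state table and its columns are hypothetical and not part of Riprap; the reset-to-zero rule at the end of a cycle is also an assumption.

```php
<?php
// Sketch only: nextOffset(), loadOffset(), saveOffset() and the fetch_state
// table are hypothetical names, not Riprap's actual schema or API.

// Offset to request on the next scheduled run. Getting back fewer items
// than the limit is taken to mean the end of the collection was reached,
// so the cycle starts again from 0.
function nextOffset(int $currentOffset, int $limit, int $returnedCount): int
{
    if ($returnedCount < $limit) {
        return 0;
    }
    return $currentOffset + $limit;
}

// Create the state table on first use.
function ensureStateTable(PDO $db): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS fetch_state ('
        . 'plugin VARCHAR(255) PRIMARY KEY, page_offset INTEGER NOT NULL)'
    );
}

// Read the stored offset for a plugin; 0 if nothing has been stored yet.
function loadOffset(PDO $db, string $plugin): int
{
    ensureStateTable($db);
    $stmt = $db->prepare('SELECT page_offset FROM fetch_state WHERE plugin = ?');
    $stmt->execute([$plugin]);
    $value = $stmt->fetchColumn();
    return $value === false ? 0 : (int) $value;
}

// Store the offset for a plugin. REPLACE INTO is understood by SQLite and
// MySQL; other databases would need their own upsert statement.
function saveOffset(PDO $db, string $plugin, int $offset): void
{
    ensureStateTable($db);
    $stmt = $db->prepare('REPLACE INTO fetch_state (plugin, page_offset) VALUES (?, ?)');
    $stmt->execute([$plugin, $offset]);
}
```

Each run would call loadOffset() before building the request and saveOffset() with the result of nextOffset() once the page has been processed.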

@mjordan mjordan self-assigned this Oct 28, 2018

mjordan commented Oct 29, 2018

Once Islandora-Devops/migrate_7x_claw#9 gets merged, the above code should look like:

<?php
// The content type will need to be a Riprap admin option, as will the limit
// (apparently the JSON API's max is 50 items per page). In this example, we
// request 3 nodes starting at offset 2 (page[offset] counts items, not pages).
$page_url = "http://localhost:8000/jsonapi/node/islandora_object?page[offset]=2&page[limit]=3";

$page_output = file_get_contents($page_url);
$page_output = json_decode($page_output, true);

// Taxonomy terms to check will need to be a Riprap admin option.
// "Original File" and "Preservation Master File"
$taxonomy_terms_to_check = array('/taxonomy/term/15', '/taxonomy/term/16');

// At this point, we have a list of 3 nodes.
foreach ($page_output['data'] as $node) {
  $nid = $node['attributes']['nid']; 
  // Get the media associated with this node using the Islandora-supplied Manage Media View.
  $media_url = "http://admin:islandora@localhost:8000/node/" . $nid . "/media?_format=json";
  $media_data = file_get_contents($media_url);
  $media_data = json_decode($media_data);
  // Loop through all the media and pick the ones that
  // are tagged with terms in $taxonomy_terms_to_check.
  foreach ($media_data as $media) {
    if (count($media->field_media_use)) {
      foreach ($media->field_media_use as $term) {
        if (in_array($term->url, $taxonomy_terms_to_check)) {
          // @todo: Convert to the equivalent Fedora URL by querying Gemini
          // using the value of $media->field_media_image[0]->target_uuid to get this type of response:
          // {
          //  "drupal":"http:\/\/localhost:8000\/_flysystem\/fedora\/masters\/testing_12_OBJ.jpg",
          //  "fedora":"http:\/\/localhost:8080\/fcrepo\/rest\/masters\/testing_12_OBJ.jpg"
          // }
          // The Fedora URL is the one Riprap needs to validate the fixity of.
          // @todo: Add option to not convert to Fedora URL if the site doesn't use Fedora.
          // In that case, we need to figure out how to get Drupal's checksum for the file over HTTP.
        }
      }
    }
  }
}

mjordan commented Dec 10, 2018

According to https://www.drupal.org/docs/8/modules/jsonapi/sorting, we can:

  • use a collection based on content type to limit nodes
  • sort. We probably want to sort by creation date descending, so newly added nodes get checked as soon as possible (if we sorted ascending, the default, newly added nodes wouldn't get checked until the end of a cycle). I can't think of a use case for sorting ascending, but we could offer this as a Riprap config option if needed.
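
The two points above can be sketched as a small URL builder. This is an illustration under assumptions: buildCollectionUrl() is a hypothetical helper, and "created" is assumed to be the sortable creation-date field exposed for the content type.

```php
<?php
// Sketch only: buildCollectionUrl() is a hypothetical helper, not Riprap code.
// A leading "-" on the sort field asks JSON API for descending order.
function buildCollectionUrl(
    string $baseUrl,
    string $contentType,
    int $offset,
    int $limit,
    string $sort = '-created'
): string {
    // http_build_query() percent-encodes the square brackets in
    // page[offset] and page[limit].
    $query = http_build_query(
        [
            'page' => ['offset' => $offset, 'limit' => $limit],
            'sort' => $sort,
        ],
        '',
        '&'
    );
    return rtrim($baseUrl, '/') . '/jsonapi/node/' . rawurlencode($contentType) . '?' . $query;
}

// Example: 5 nodes starting at offset 10, newest first.
$url = buildCollectionUrl('http://localhost:8000', 'islandora_object', 10, 5);
```

Passing 'created' instead of '-created' would give the ascending order, which is how a config option for sort direction could be wired in.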

mjordan commented Dec 10, 2018

We'll also need to include Basic auth credentials in Riprap for the JSON API and Views REST.
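
A minimal sketch of sending Basic auth as a header instead of embedding the credentials in the URL, assuming requests are made with file_get_contents() as in the proof of concept. The helper names are hypothetical, and the credentials would come from Riprap's configuration.

```php
<?php
// Sketch only: basicAuthHeader() and basicAuthContext() are hypothetical
// helpers, not Riprap code.

// Build the HTTP Basic Authorization header line for a user and password.
function basicAuthHeader(string $user, string $password): string
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
}

// Build a stream context that sends the header with every HTTP request
// made through it.
function basicAuthContext(string $user, string $password)
{
    return stream_context_create([
        'http' => ['header' => basicAuthHeader($user, $password)],
    ]);
}

// Usage (not executed here because it needs a running Drupal site):
// $context = basicAuthContext('admin', 'islandora');
// $json = file_get_contents('http://localhost:8000/node/1/media?_format=json', false, $context);
```

The same header works for both the JSON API collection request and the Views REST request for a node's media.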

mjordan added a commit that referenced this issue Dec 10, 2018
mjordan added a commit that referenced this issue Dec 11, 2018
@mjordan
Copy link
Owner Author

mjordan commented Dec 11, 2018

Work in the issue-14 branch can now parse out the Drupal URLs of images attached to nodes:

php bin/console app:riprap:check_fixity
string(57) "http://localhost:8000/_flysystem/fedora/testing_8_OBJ.jpg"
string(57) "http://localhost:8000/_flysystem/fedora/testing_7_OBJ.jpg"
string(57) "http://localhost:8000/_flysystem/fedora/testing_6_OBJ.jpg"

This comes from each media entity's field_media_image field. We need to make sure that non-image files are also detected (i.e., what field do we use for non-image files?).

mjordan commented Dec 11, 2018

Non-image files are in field_media_file.
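
A sketch of how both fields could be checked for each media entity, assuming the entity is decoded with json_decode() into objects as in the proof of concept. mediaFileUrl() is a hypothetical helper name.

```php
<?php
// Sketch only: mediaFileUrl() is a hypothetical helper, not Riprap code.
// Return the Drupal URL of the file attached to a media entity, looking at
// the image field first and then the generic file field.
function mediaFileUrl(object $media): ?string
{
    foreach (['field_media_image', 'field_media_file'] as $field) {
        $items = $media->{$field} ?? [];
        if (!empty($items[0]->url)) {
            return $items[0]->url;
        }
    }
    // Neither field is populated on this media entity.
    return null;
}

// Example with a media entity that only has field_media_file.
$media = json_decode('{"field_media_file":[{"url":"http://localhost:8000/_flysystem/fedora/example.pdf"}]}');
$fileUrl = mediaFileUrl($media);
```

Other media types may use other field names, so the list of fields to inspect is a candidate for a config option.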

mjordan added a commit that referenced this issue Dec 11, 2018
mjordan added a commit that referenced this issue Dec 11, 2018

mjordan commented Dec 11, 2018

The only thing not working is authenticating against Gemini using a JWT.

mjordan added a commit that referenced this issue Dec 18, 2018
mjordan added a commit that referenced this issue Dec 29, 2018

mjordan commented Dec 29, 2018

app:riprap:plugin:fetchresourcelist:from:drupal plugin is complete, but I'm getting some strange behavior. When riprap hits the last page of a JSON:API request, it throws a curl error:

In CurlFactory.php line 186:
                                                                                         
  cURL error 3: <url> malformed (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)  

However, the URL triggering this error works as expected (200 response code) when requested using curl on the command line, e.g., curl -v -uadmin:islandora "http://localhost:8000/jsonapi/node/islandora_object?page%5Blimit%5D=5&page%5Boffset%5D=10&sort=-changed".

mjordan added a commit that referenced this issue Dec 29, 2018

mjordan commented Dec 29, 2018

OK, have tracked this down to an empty $media_list on a node.
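
One way to guard against this is to treat a missing or empty media list as "nothing to check" before building any follow-up request. The sketch below is not the fix that was committed; decodeMediaList() and mediaTaggedWith() are hypothetical helpers based on the loop in the proof of concept.

```php
<?php
// Sketch only: hypothetical helpers, not the fix applied in Riprap.

// Decode the Manage Media View response. Anything that is not a JSON list
// (failed request, empty body, error object) becomes an empty array.
function decodeMediaList($json): array
{
    if (!is_string($json)) {
        return [];
    }
    $decoded = json_decode($json);
    return is_array($decoded) ? $decoded : [];
}

// Keep only media tagged with one of the taxonomy terms to check.
function mediaTaggedWith(array $mediaList, array $termsToCheck): array
{
    $matches = [];
    foreach ($mediaList as $media) {
        foreach ($media->field_media_use ?? [] as $term) {
            if (isset($term->url) && in_array($term->url, $termsToCheck, true)) {
                $matches[] = $media;
                break; // One matching term is enough for this media entity.
            }
        }
    }
    return $matches;
}

// A node with no media yields an empty list, so the caller can skip it.
$mediaList = decodeMediaList('[]');
$toCheck = mediaTaggedWith($mediaList, ['/taxonomy/term/15', '/taxonomy/term/16']);
```

With this shape, the caller only issues further requests when $toCheck is non-empty.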

mjordan added a commit that referenced this issue Dec 30, 2018
mjordan added a commit that referenced this issue Dec 30, 2018

mjordan commented Dec 30, 2018

Closed with 342ba23.

@mjordan mjordan closed this as completed Dec 30, 2018