U/jrbogart/gcr tutorial #166
Merged
326 changes: 326 additions & 0 deletions
docs/source/tutorial_notebooks/query_gcr_datasets.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,326 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "302fd57e-ae4f-4a22-95ad-a69573212a98", | ||
"metadata": {}, | ||
"source": [ | ||
"<div style=\"overflow: hidden;\">\n", | ||
" <img src=\"images/DREGS_logo_v2.png\" width=\"300\" style=\"float: left; margin-right: 10px;\">\n", | ||
"</div>\n", | ||
"\n", | ||
"# Using dataregistry with GCRCatalogs\n", | ||
"\n", | ||
"Here we show how to access catalogs belonging to the `GCRCatalogs` package via the information stored in the dataregistry.\n", | ||
"\n", | ||
"### What we cover in this tutorial\n", | ||
"\n", | ||
"In this tutorial we will learn how to:\n", | ||
"\n", | ||
"1) Find and read the catalogs using the standard GCRCatalogs interface\n", | ||
"2) Query catalog metadata directly using the data registry, then use that metadata to find and read catalogs\n", | ||
"\n", | ||
"### Before we begin\n", | ||
"\n", | ||
"Currently (November, 2024) the required versions of gcr-catalogs and dataregistry are only available in the `desc-python-bleed` kernel. Make sure you have selected that kernel while running this tutorial.\n", | ||
"\n", | ||
"If you haven't done so already, check out the [getting setup](https://lsstdesc.org/dataregistry/tutorial_setup.html) page from the documentation if you want to run this tutorial interactively." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "40bc9de8-4451-412b-b5dd-025cd5254dc1", | ||
"metadata": {}, | ||
"source": [ | ||
"## 1) Using the usual GCRCatalogs interface\n", | ||
"\n", | ||
"Note that, using this method, we will not be calling any data registry services directly, but the data registry database still must be accessible. That means you must have gone through at least part of the tutorial setup referred to above, in particular the steps for creating a couple small files needed for authentication. See details [here](http://lsstdesc.org/dataregistry/installation.html#one-time-setup)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "29c7fb42-23c0-4026-a008-c9274a26ad0f", | ||
"metadata": {}, | ||
"source": [ | ||
"### Configuring GCRCatalogs \n", | ||
"\n", | ||
"A quick way to check everything is set up correctly is to run the first cell below, which should load the GCRCatalogs package, and print the package version. It should be at least 1.9.0." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "9231e5d8-a8f5-4eba-a020-0a7933b9a24a", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import GCRCatalogs\n", | ||
"print(f\"Working with GCRCatalogs version: {GCRCatalogs.__version__}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c9b03081-ba94-46ed-9ef5-a6f0c48d6c4c", | ||
"metadata": {}, | ||
"source": [ | ||
"We need to tell `GCRCatalogs` whether to use the old-style metadata access method (reading config files) or to fetch metadata from the data registry. There are two ways to do this:\n", | ||
"\n", | ||
"1. Before running, set the environment variable `GCR_CONFIG_SOURCE`to one of the two allowed values: \"files\" or \"dataregistry\"\n", | ||
"2. Invoke the GCRCatalogs routine `ConfigSource.set_config_source`\n", | ||
"\n", | ||
"Here we use the second method.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "f9901d89-b1d7-48c9-8110-ce16ecba3a7e", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Tell GCRCatalogs to use the data registry\n", | ||
"GCRCatalogs.ConfigSource.set_config_source(dr=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "17120e54-30d7-44a5-8875-d767ea7800d2", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we can use any of the standard GCRCatalogs query routines." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "00c6d355-dca0-42a1-ae82-7fdbd1a46afa", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Find catalogs whose name starts with \"cosmo\"\n", | ||
"cosmos = GCRCatalogs.get_available_catalog_names(name_startswith=\"cosmo\")\n", | ||
"cosmos" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "6adb2f98-6103-41ef-a183-75cea430358a", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Load a catalog; find out something about it\n", | ||
"cat = GCRCatalogs.load_catalog(\"cosmoDC2_v1.1.4_small\")\n", | ||
"cat.native_filter_quantities" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8a1a2f80-2909-416e-86d3-4a9b71a1a9ef", | ||
"metadata": {}, | ||
"source": [ | ||
"## 2) Using the data registry directly\n", | ||
"\n", | ||
"We learned how to connect to the DESC data registry in other tutorials using the `DataRegistry` class. Let's connect again using the defaults _except_ for the schema. Since the catalogs maintained by GCRCatalogs are stored in the DESC production shared area, their database entries are in the production schema, not in the (default) working schema.\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "bfad9170-a524-4aef-9217-38470373a6b8", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from dataregistry import DataRegistry\n", | ||
"from dataregistry.schema import DEFAULT_SCHEMA_PRODUCTION\n", | ||
"\n", | ||
"# Establish connection to the production schema\n", | ||
"datareg = DataRegistry(schema=DEFAULT_SCHEMA_PRODUCTION)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "3d3209c8-8132-421e-9ea2-e7e17bca9f7a", | ||
"metadata": {}, | ||
"source": [ | ||
"### Dataset properties\n", | ||
"\n", | ||
"Recall that a `DataRegistry` instance has a member `Query` which provides all the query services, the principal one being the ability to ask for values of attributes of datasets, subject to one or more filters. If you haven't already, we recommend you take a look at the tutorial \"Getting started: Part 3 - Simple queries\" before proceeding further.\n", | ||
Review comment: The query "Getting started" tutorial is now "Part 2". | ||
"\n", | ||
"You can find out what the dataset properties (\"columns\" in database parlance) are with another of the `Query` services: " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "54a52029-2908-4056-bc68-4a87f6c3e6df", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"all_columns = datareg.Query.get_all_columns()\n", | ||
"print(all_columns)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "fa586592-2c2e-428b-b443-33ca26038add", | ||
"metadata": {}, | ||
"source": [ | ||
"That is a list of __all__ columns from __all__ tables, maybe more than we bargained for. Let's restrict it to columns in the `dataset` table." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "04082abb-585e-4d68-a4c4-874d22be70be", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset_columns = [col for col in all_columns if col.startswith('dataset.')]\n", | ||
"print(dataset_columns)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ad32a278-694a-4364-8dcd-39cdc702039c", | ||
"metadata": {}, | ||
"source": [ | ||
"Among the more interesting for our purposes are `name`, `relative_path`, `access_api`, `access_api_configuration` and `location_type`. In the case of catalogs registered with GCRCatalogs, `name` in the data registry is the same name GCRCatalogs uses to refer to it: the basename of the corresponding config file, not including the suffix `.yaml`. But keep in mind that, unlike GCRCatalog, the dataregistry always respects case in names\n", | ||
"\n", | ||
"Let's look at those properties for the dataset `cosmoDC2_v1.1.4`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "bdbe8537-6195-4239-bbb8-976daacdfab7", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Define a filter on the \"name\" property and make query\n", | ||
"from dataregistry.query import Filter\n", | ||
"catname = 'cosmoDC2_v1.1.4'\n", | ||
"filters = [Filter('dataset.name', '==', catname)]\n", | ||
"property_names = ['dataset.name', 'dataset.relative_path', 'dataset.access_api', \n", | ||
" 'dataset.access_api_configuration', 'dataset.location_type']\n", | ||
"result = datareg.Query.find_datasets(property_names=property_names,\n", | ||
" filters=filters)\n", | ||
"# By default the return type is a dict\n", | ||
"for k, v in result.items():\n", | ||
" print(f'Key {k} has value \\n{v[0]}\\n')\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "90f5fa67-60c8-4567-a959-94d3ccbc764d", | ||
"metadata": {}, | ||
"source": [ | ||
"At NERSC (currently the only place this code can be run) the value for `relative_path` is relative to the DESC NERSC production shared area, `/global/cfs/cdirs/lsst/shared`, just like the path names used in GCRCatalogs. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "63a7294a-4872-4e42-a956-c603e218c849", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"abs_path = os.path.join('/global/cfs/cdirs/lsst/shared',\n", | ||
" result['dataset.relative_path'][0])\n", | ||
"abs_path" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "22e96998-4862-466d-9b82-1e97a9f2777d", | ||
"metadata": {}, | ||
"source": [ | ||
"The value \"GCRCatalogs\" for the property `dataset.access_api` is a clue that this\n", | ||
"dataset may be read and interpreted using GCRCatalogs.\n", | ||
"\n", | ||
"The value for `dataset.access_api_configuration` should look familiar. It's just the contents of this catalogs's config file. And the value for the location type, \"dataregistry\", just tells us this is a normal catalog whose data files are kept in the area managed by the data registry." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "de9f1a61-22d4-436b-a428-bfda5bade37f", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's try this for another catalog. We'll just change the name and make the same query." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "34362bb2-83b5-403c-8a82-2de367affafa", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"catname = 'cosmoDC2_v1.1.4_small'\n", | ||
"filters = [Filter('dataset.name', '==', catname)]\n", | ||
"property_names = ['dataset.name', 'dataset.relative_path', 'dataset.access_api', \n", | ||
" 'dataset.access_api_configuration', 'dataset.location_type']\n", | ||
"result = datareg.Query.find_datasets(property_names=property_names,\n", | ||
" filters=filters)\n", | ||
"# By default the return type is a dict\n", | ||
"for k, v in result.items():\n", | ||
" print(f'Key {k} has value \\n{v[0]}\\n')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d4f05dbc-b6cf-49fa-beb8-34203f3e90fc", | ||
"metadata": {}, | ||
"source": [ | ||
"It all looks pretty much as you would expect, except what happened to the value of `dataset.relative_path`? That doesn't look like a path. You can see the reason in the catalog's configuration: it's based on another catalog. Or you can see it in the value for `dataset.location_type`. \"meta_only\" means that the data registry is only storing metadata for the catalog; it is not keeping track of the (indirectly) associated files. The same thing would happen for a composite catalog: the data registry just stores the catalog's configuration. It doesn't know how to parse it as GCRCatalogs would." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "5721858e-8e42-4285-9ef0-ead3d780e918", | ||
"metadata": {}, | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "desc-python-bleed", | ||
"language": "python", | ||
"name": "desc-python-bleed" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.7" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
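The notebook's last cells distinguish datasets whose files live in the registry-managed area (`location_type` of "dataregistry") from entries that only carry metadata ("meta_only"). A minimal sketch of that distinction follows, working on a plain dict shaped like the `find_datasets` result shown in the notebook. The helper name `absolute_paths`, the catalog names, the relative path, and the use of `None` for the metadata-only row are all assumptions made for illustration; no registry connection is made.

```python
import posixpath

# Root of the DESC production shared area at NERSC, as quoted in the notebook.
SHARED_ROOT = "/global/cfs/cdirs/lsst/shared"


def absolute_paths(result, root=SHARED_ROOT):
    """Map each dataset name in a query result to an absolute path, or None.

    Only rows whose location_type is "dataregistry" get a path. For any
    other location type (for example "meta_only") the relative_path value
    is not treated as a path and None is returned instead.
    """
    names = result["dataset.name"]
    rel_paths = result["dataset.relative_path"]
    loc_types = result["dataset.location_type"]
    resolved = {}
    for name, rel, loc in zip(names, rel_paths, loc_types):
        if loc == "dataregistry" and rel:
            resolved[name] = posixpath.join(root, rel)
        else:
            resolved[name] = None
    return resolved


# Hypothetical stand-in for the dict returned by find_datasets.
# Names and values are invented; they are not real registry entries.
sample_result = {
    "dataset.name": ["catalog_a", "catalog_b"],
    "dataset.relative_path": ["some/relative/dir", None],
    "dataset.location_type": ["dataregistry", "meta_only"],
}

paths = absolute_paths(sample_result)
for name, path in paths.items():
    print(f"{name}: {path}")
```

Checking `location_type` before joining avoids building a bogus path for a catalog that is defined in terms of another catalog, which is the situation the notebook describes for `cosmoDC2_v1.1.4_small`.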
Review comment:
There is no `DEFAULT_SCHEMA_PRODUCTION` anymore. Options would be:
- use `DEFAULT_NAMESPACE` and connect to `schema=f"{DEFAULT_NAMESPACE}_production"`
- use `DEFAULT_NAMESPACE` (but no need), and choose `query_mode="production"` to limit queries to the production schema
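The two options in this comment can be sketched as keyword arguments without importing dataregistry. The keyword names `schema` and `query_mode` and the `_production` suffix come from the comment itself; whether the installed `DataRegistry` constructor accepts them should be checked against the dataregistry documentation. The helper name and the value `example_namespace` are placeholders standing in for `DEFAULT_NAMESPACE`.

```python
def production_connection_kwargs(namespace):
    """Build the two alternative keyword-argument sets named in the review.

    Option 1 names the production schema explicitly.
    Option 2 keeps the default connection and restricts queries instead.
    """
    schema_option = {"schema": f"{namespace}_production"}
    query_mode_option = {"query_mode": "production"}
    return schema_option, query_mode_option


# Placeholder value; in real use this would be dataregistry's DEFAULT_NAMESPACE.
namespace = "example_namespace"

by_schema, by_query_mode = production_connection_kwargs(namespace)
print(by_schema)
print(by_query_mode)
```

Either dict would then be passed to the constructor with `**`, for example `DataRegistry(**by_query_mode)`, assuming the constructor supports that keyword.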