The Metadata Standards Catalog is written in Python 3.8+, so as a first step this will need to be installed on your machine. (It should work with Python 3.7 as well, but this has not been tested recently. It will not work on earlier versions.)
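If you are unsure whether the interpreter on your machine is suitable, a quick check along the following lines may help. This is only a sketch: the version thresholds are taken from the paragraph above, and `support_level` is a hypothetical helper, not part of the Catalog.

```python
import sys

# Thresholds from the paragraph above: written for 3.8+, expected to
# work on 3.7 (not recently tested), and not on anything earlier.
TESTED_MINIMUM = (3, 8)
ABSOLUTE_MINIMUM = (3, 7)


def support_level(version=None):
    """Classify a (major, minor) Python version (hypothetical helper)."""
    if version is None:
        version = sys.version_info[:2]
    version = tuple(version[:2])
    if version >= TESTED_MINIMUM:
        return "supported"
    if version >= ABSOLUTE_MINIMUM:
        return "untested"
    return "unsupported"


if __name__ == "__main__":
    # Report the running interpreter's version and its classification.
    print(sys.version.split()[0], support_level())
```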
You will also need quite a few non-standard packages; the instructions below will install these for you in an isolated virtual environment, but here they are if you want to look up the documentation:
- Flask, Flask-WTF (and hence WTForms), and Flask-Login are needed for the actual rendering of the pages.
- Email validator is used for email address validation in forms.
- Flask-OpenID provides Open ID v2.x login support.
- RAuth (which depends on Requests), and Google's oauth2client are used for Open ID Connect (OAuth) support.
- Flask-HTTPAuth and PassLib are used for API authentication.
- The database is implemented using TinyDB v4+ and tinyrecord.
- The subject thesaurus is converted from RDF to JSON via RDFLib.
- Dulwich is used to apply version control to the database.
- GitHub-Webhook allows the Catalog to update itself.
- Flask-CORS is used to allow requests from JavaScript.
A YAML configuration is provided to automate the process of initializing the Catalog in a Linux container. This is especially useful if you want to test how the code runs under a version of Python different from the one you have installed. These instructions use LXD but the equivalent steps should work with Incus.
- Create a new container using an image that is Debian-based and supports cloud-init. (If you are choosing from the linuxcontainers public image server, the ‘cloud’ variants support cloud-init.) An Ubuntu one is given here as an example:
  lxc image list ubuntu:24.04
- Configure the container using the YAML configuration:
  cat rdamsc-init.yaml | lxc config set rdamsc user.user-data -
- Start the container, then log into it:
  lxc start rdamsc
  lxc shell rdamsc
- In the container's shell, check that the setup completed successfully:
  cloud-init status --wait # Should end up saying "status: done"
  The automated installation simulates a production instance, including the steps given below for running in production and implementing maintenance mode.
- The testing apparatus is not automatically installed. To install it, you will need to do a few final steps in the container's shell:
  cd ~rdamsc
  sudo -su rdamsc
  . venv/bin/activate
  pip install -e ".[dev]"
  # This would be a good point to run the test suite. When finished...
  deactivate
  exit
- To remove the container, run these commands outside the container:
  lxc stop rdamsc
  lxc delete rdamsc
Use git clone as normal to get a copy of this code folder where you want it on your file system, then enter the folder on the command line.
Set up a virtual environment (you might need an additional package for this on a *nix system):
# *nix
python3 -m venv venv
# Windows
py -3 -m venv venv
Activate it:
# *nix
. venv/bin/activate
# Windows
venv\Scripts\activate
Optionally, upgrade your sandboxed copy of pip and install wheel:
pip install --upgrade pip
pip install wheel
Install the Catalog and its dependencies to your virtual environment. In a development context:
pip install -e ".[dev]"
In production, you don't need the unit testing apparatus:
pip install -e .
See the Guide for Contributors for how to run the unit tests.
Run the application like this to get development mode:
# *nix
export FLASK_APP=rdamsc; flask --debug run
# Windows
set FLASK_APP=rdamsc
flask --debug run
# Windows Powershell
$env:FLASK_APP = "rdamsc"
flask --debug run
(You may need to give the path to the executable in the virtual environment, e.g. venv/bin/flask.)
You will get feedback on the command line about what URL to use to access the application.
These instructions are one way to go about using the Catalog in production. For other options, please refer to the deployment options documented by the Flask developers.
On the Web server, let's assume for example that you have installed the application using the above instructions in /opt/rdamsc.
These instructions are for mod_wsgi on Apache, so these need to be installed. On Debian or a derivative like Ubuntu, you'd do this:
sudo apt install apache2 libapache2-mod-wsgi-py3
If you need to use an upgraded Python alongside the system version and a pre-compiled mod_wsgi is not available for it (as is the case on Ubuntu 18.04), you will need something like this instead, assuming you already have (say) python3.8 and python3.8-venv:
sudo apt install apache2 apache2-dev python3.8-dev
It is recommended that you set up a non-privileged system user to run the Catalog (say, rdamsc) and that this user and the Apache user (www-data on Debian-based Linux distros) are in each other's groups. Be sure to assign ownership of the source code directory to this user. Example:
sudo adduser --system --group rdamsc
sudo usermod -aG www-data rdamsc
sudo usermod -aG rdamsc www-data
sudo chown -R rdamsc:www-data /opt/rdamsc
You should create an instance folder where the Catalog can keep its data. The canonical location would be /var/opt/rdamsc but you can choose another one:
sudo mkdir /var/opt/rdamsc
sudo chown rdamsc:www-data /var/opt/rdamsc
You can now switch to the rdamsc user:
sudo -Hsu rdamsc
Configure the Catalog to use this folder explicitly by changing the app assignment line in rdamsc/__init__.py to include the information:
# Create the app:
app = Flask(__name__, instance_relative_config=True, instance_path='/var/opt/rdamsc')
Commit this change so Git can reapply it over any other code changes. Doing this as your newly created user, you will need to configure Git at the same time:
git config --global user.name "RDA MSCWG"
git config --global user.email "rdamsc@localhost"
git add rdamsc/__init__.py
git commit -m "Update production instance path"
If a pre-compiled WSGI mod is not available for the Python you used in the virtual environment, then you can compile a matching version in your virtual environment at this point:
. venv/bin/activate
pip install mod_wsgi
deactivate
Now you need to create the WSGI file that will run the application for you. Let's say you want to run your website content from /srv/ and have set this up in your Apache configuration (/etc/apache2/apache2.conf). Create the site directory:
exit
sudo mkdir /srv/rdamsc
sudo chown rdamsc:www-data /srv/rdamsc
The srv folder in this repository has a ready-made rdamsc.wsgi file you can copy to the directory you just created. Ensure that it is writeable by the rdamsc user. Alternatively, as the rdamsc user, create the file /srv/rdamsc/rdamsc.wsgi with this content:
from rdamsc import create_app
application = create_app()
If you are behind an HTTP proxy, you may need to add (or uncomment) these lines as well, remembering to provide the actual proxy URLs:
import os
os.environ['http_proxy'] = 'http://proxyURL'
os.environ['https_proxy'] = 'https://proxyURL'
Now create an Apache site (e.g. /etc/apache2/sites-available/rdamsc.conf) that points to this file. If your system Python and WSGI Apache plugin can run the application, use something like this (check the path for your virtual environment):
WSGIPassAuthorization On
<VirtualHost *:80>
    ServerName rdamsc.example.com
    WSGIDaemonProcess rdamsc user=rdamsc group=rdamsc threads=5 python-home=/opt/rdamsc/venv
    WSGIScriptAlias / /srv/rdamsc/rdamsc.wsgi
    AllowEncodedSlashes NoDecode
    <Directory /srv/rdamsc>
        WSGIProcessGroup rdamsc
        WSGIApplicationGroup %{GLOBAL}
        Require all granted
    </Directory>
</VirtualHost>
If you compiled a custom WSGI module in your virtual environment, you will need an extra line at the top of this file to use it:
LoadModule wsgi_module "/opt/rdamsc/venv/path/to/mod_wsgi...so"
You may also want extra lines to configure logging, SSL or proxies.
You should then configure the Catalog (see below, it deserves its own section) before activating the site:
sudo a2ensite rdamsc
sudo a2dissite 000-default
sudo service apache2 graceful
The Catalog will look in the following places for configuration options, in the following order:
- the default dictionary hard-coded into the create_app function;
- the file instance/config.py, unless a configuration dictionary is passed to the create_app function, in which case that is used instead (the instance folder may be overridden as well, as in the instructions for production use above);
- a file specified by the environment variable MSC_SETTINGS.
Settings are applied in the order they are discovered, so later ones override earlier ones.
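The override order described above can be sketched with plain dictionaries. This is an illustration of the rule only, not the Catalog's own loading code: `resolve_settings` and its parameter names are hypothetical, and each argument stands in for the settings found at one of the three locations.

```python
def resolve_settings(defaults, instance_file=None, passed=None, env_file=None):
    """Layer settings so that later sources override earlier ones.

    defaults      -- stands in for the dictionary hard-coded into create_app
    instance_file -- stands in for the contents of instance/config.py
    passed        -- stands in for a dictionary passed to create_app; when
                     given, it is used instead of instance_file
    env_file      -- stands in for the file named by MSC_SETTINGS
    """
    settings = dict(defaults)  # copy, so the defaults are left untouched
    second_layer = passed if passed is not None else instance_file
    for layer in (second_layer, env_file):
        if layer:
            settings.update(layer)
    return settings


if __name__ == "__main__":
    defaults = {"SECRET_KEY": "default", "DEBUG": False}
    print(resolve_settings(defaults, {"SECRET_KEY": "from instance"}))
```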
To set an environment variable on UNIX-like systems, you will need to include the following line in your shell profile (or issue the command from the command line):
export MSC_SETTINGS=/path/to/config.py
On Windows, you can run the following from the command prompt:
set MSC_SETTINGS=\path\to\config.py
To secure the installation, you must choose your own secret key and add it to one of your configuration files, overriding the default one:
SECRET_KEY = 'secret string'
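One way to come up with a suitably unpredictable value is Python's standard `secrets` module. The sketch below prints a line you could paste into a configuration file; `make_secret_key` is a hypothetical helper and the 32-byte length is an assumption rather than a requirement of the Catalog.

```python
import secrets


def make_secret_key(nbytes=32):
    """Return a random hexadecimal string, two characters per byte."""
    return secrets.token_hex(nbytes)


if __name__ == "__main__":
    # Prints a ready-to-paste configuration line with a fresh key.
    print(f"SECRET_KEY = {make_secret_key()!r}")
```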
To enable automatic updating from GitHub, you will also need to set up a webhook key and record the path to the WSGI file (so it can be reloaded):
WSGI_PATH = '/srv/rdamsc/rdamsc.wsgi'
WEBHOOK_SECRET = 'another secret string'
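For background, the webhook secret is what lets a receiver confirm that a delivery really came from GitHub. GitHub documents an `X-Hub-Signature-256` header carrying an HMAC-SHA256 digest of the request body, keyed with the secret. The following is a sketch of that general scheme only; the GitHub-Webhook package used by the Catalog does its own checking and may rely on a different header or digest, and both function names here are hypothetical.

```python
import hashlib
import hmac


def sign_payload(secret, payload):
    """Return the header value GitHub would send for this body and secret."""
    digest = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def signature_is_valid(secret, payload, header_value):
    """Compare the expected and received signatures in constant time."""
    return hmac.compare_digest(sign_payload(secret, payload), header_value)
```

A delivery whose body or secret differs from what was signed fails the comparison, which is why the value in WEBHOOK_SECRET has to match the one entered in the GitHub webhook settings.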
To be able to use Open ID Connect (OAuth), you will need to include IDs and secret codes from the Open ID providers in your configuration like this:
OAUTH_CREDENTIALS = {
    'google': {
        'id': 'id string',
        'secret': 'secret string'},
    'linkedin': {
        'id': 'id string',
        'secret': 'secret string'},
    'twitter': {
        'id': 'id string',
        'secret': 'secret string'}}
I have registered a set of these for use in the official instance at https://rdamsc.bath.ac.uk. If you want to be able to log into an instance hosted elsewhere, you will have to register that instance separately with one of the supported providers. You should be able to register instances running on localhost (127.0.0.1) for testing purposes.
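Before deploying, it can be worth checking that every provider entry has both values filled in. The helper below is a hypothetical sketch that assumes only the dictionary shape shown above (a provider name mapped to 'id' and 'secret' strings); it is not something the Catalog provides.

```python
def incomplete_providers(credentials):
    """Return the names of providers missing an 'id' or a 'secret'."""
    return sorted(
        name
        for name, entry in credentials.items()
        if not entry.get("id") or not entry.get("secret")
    )


if __name__ == "__main__":
    example = {
        "google": {"id": "id string", "secret": "secret string"},
        "linkedin": {"id": "id string", "secret": ""},
    }
    print(incomplete_providers(example))
```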
The Catalog uses multiple NoSQL databases, which are saved to disk in the form of JSON files. You can either supply pre-populated versions of these files (for example, using the occasional backups of the live data) or let the Catalog create them for you:
- Main database contains the tables for the schemes, tools, organisations, mappings, endorsements and the relationships between them.
  Configuration key: MAIN_DATABASE_PATH
  Default location: instance/data/db.json
- Thesaurus database contains the controlled vocabulary used for subject areas.
  Configuration key: VOCAB_DATABASE_PATH
  Default location: instance/data/vocab.json
- Terms database contains the controlled vocabulary used for data types, URL location types, ID schemes, and organisation and tool types.
  Configuration key: TERM_DATABASE_PATH
  Default location: instance/data/terms.json
- User database contains the users registered with the application.
  Configuration key: USER_DATABASE_PATH
  Default location: instance/users/db.json
- Open ID Connect database contains cached details for Open ID Connect providers.
  Configuration key: OAUTH_DATABASE_PATH
  Default location: instance/oauth/db.json
- Open ID v2 folder contains cached files for Open ID v2 authentication.
  Configuration key: OPENID_FS_STORE_PATH
  Default location: instance/open-id/
You can configure the names and locations of these files and the folder by putting the respective paths in one of your configuration files:
import os
MAIN_DATABASE_PATH = os.path.join('path', 'to', 'file.json')
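If you want to move all six stores under a single folder while keeping the default layout, the paths could be generated rather than typed out. The sketch below reuses the configuration keys and default relative locations listed above; `storage_paths` is a hypothetical helper, and the folder passed to it is whatever location you choose.

```python
import os

# Default locations relative to the instance folder, as listed above.
DEFAULT_RELATIVE_PATHS = {
    "MAIN_DATABASE_PATH": ("data", "db.json"),
    "VOCAB_DATABASE_PATH": ("data", "vocab.json"),
    "TERM_DATABASE_PATH": ("data", "terms.json"),
    "USER_DATABASE_PATH": ("users", "db.json"),
    "OAUTH_DATABASE_PATH": ("oauth", "db.json"),
    "OPENID_FS_STORE_PATH": ("open-id",),
}


def storage_paths(base_folder):
    """Map each configuration key to a path under base_folder."""
    return {
        key: os.path.join(base_folder, *parts)
        for key, parts in DEFAULT_RELATIVE_PATHS.items()
    }


if __name__ == "__main__":
    # Print lines that could be pasted into a configuration file.
    for key, path in storage_paths("/var/opt/rdamsc").items():
        print(f"{key} = {path!r}")
```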
The Dulwich library for working with Git is quite sensitive, and will not stage any commits if a gitignore file (local or global) contains lines consisting solely of space characters.
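To find out whether one of your gitignore files is affected, you could scan it for such lines. The following is a minimal sketch; `space_only_lines` is a hypothetical helper, and it flags only lines made up entirely of space characters, as described above (empty lines and lines containing tabs are not reported).

```python
import sys


def space_only_lines(text):
    """Return the 1-based numbers of lines made up solely of spaces."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line and line.strip(" ") == ""
    ]


if __name__ == "__main__":
    # Pass one or more gitignore file paths on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            print(path, space_only_lines(handle.read()))
```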
If you have problems with authenticating through a proxy, you may need to install the pycurl library as well.
Place a standalone holding HTML page at, say, /srv/rdamsc/maintenance.html.
Add a file at, say, /srv/rdamsc/exceptions.map with a list of IP addresses that should be allowed to see (i.e. test) the site during maintenance, putting each one on its own line followed by OK:
192.168.0.1 OK
Examples of both these files are provided in the srv folder in this repository.
In /etc/apache2/envvars, add a definition to APACHE_ARGUMENTS. This is normally commented out. If you are already using this for something, you'll probably want two lines (one with the definition and one without) and toggle between them:
#export APACHE_ARGUMENTS='-D Maintenance'
Amend your site configuration to include the Alias line for the maintenance page (so it bypasses WSGI) and the IfDefine blocks:
WSGIPassAuthorization On
<VirtualHost *:80>
    ServerName rdamsc.example.com
    Alias /maintenance.html /srv/rdamsc/maintenance.html
    WSGIDaemonProcess rdamsc user=rdamsc group=rdamsc threads=5 python-home=/opt/rdamsc/venv
    WSGIScriptAlias / /srv/rdamsc/rdamsc.wsgi
    AllowEncodedSlashes NoDecode
    <Directory /srv/rdamsc>
        WSGIProcessGroup rdamsc
        WSGIApplicationGroup %{GLOBAL}
        Require all granted
    </Directory>
    <IfDefine Maintenance>
        ErrorDocument 503 /maintenance.html
        # Set Retry-After on error pages:
        Header always set Retry-After 7200
        Header onsuccess unset Retry-After
        RewriteEngine on
        RewriteMap exceptions txt:/srv/rdamsc/exceptions.map
        # Allow individual IP addresses through:
        RewriteCond ${exceptions:%{REMOTE_ADDR}} =OK
        RewriteRule ^ - [L]
        # Otherwise redirect all traffic to the maintenance page:
        RewriteCond %{REQUEST_URI} !=/maintenance.html
        RewriteRule ^ - [R=503,L]
    </IfDefine>
    <IfDefine !Maintenance>
        # Redirect requests for maintenance page to home page:
        RewriteEngine on
        RewriteRule ^/maintenance.html$ / [R,L]
    </IfDefine>
</VirtualHost>
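To check an exceptions.map file before relying on it, the lookup that the RewriteMap and RewriteCond lines above perform can be imitated in a few lines of Python. This is a sketch for testing your map file, not a replacement for Apache's own parsing; both function names are hypothetical, and it assumes the plain-text map format of one whitespace-separated key and value per line, with # starting a comment.

```python
def parse_exceptions_map(text):
    """Read 'key value' pairs, skipping blank lines and # comments."""
    allowed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            allowed[fields[0]] = fields[1]
    return allowed


def may_bypass_maintenance(remote_addr, allowed):
    """Mimic the '=OK' comparison made against the map lookup result."""
    return allowed.get(remote_addr) == "OK"


if __name__ == "__main__":
    entries = parse_exceptions_map("192.168.0.1 OK\n")
    print(may_bypass_maintenance("192.168.0.1", entries))
```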
To add a warning message about an upcoming period of downtime, add the start date and time of the downtime to your instance configuration file:
MAINTENANCE_START = "2020-02-02T20:20:20"
The value must be a valid ISO 8601 date that can be parsed by Python's datetime.fromisoformat() method; note that more patterns are supported in Python 3.11+ than in 3.7–3.10, so check the documentation for the version of Python you are using. The date will be converted to UTC, so it is a good idea to specify the timezone to make sure the conversion gives the time you intend.
It is a good idea to set an end date and time as well:
MAINTENANCE_END = "2020-02-02T22:20:20"
This will add some detail to the warning message and remove it once the specified time has passed.
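To see how a given value will be interpreted, you can try it in Python before putting it in your configuration. The sketch below shows the standard-library behaviour only; `to_utc` and `warning_is_current` are hypothetical helpers, and the way the Catalog itself decides when to show or drop the message is an assumption here.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Parse an ISO 8601 string and convert it to UTC.

    A value without an offset is treated by astimezone() as local time on
    the machine running the code, which is why an explicit offset is safer.
    A trailing "Z" is only accepted by fromisoformat() from Python 3.11.
    """
    return datetime.fromisoformat(value).astimezone(timezone.utc)


def warning_is_current(end, now=None):
    """Return True while the end of the maintenance window lies ahead."""
    now = now or datetime.now(timezone.utc)
    return now < to_utc(end)


if __name__ == "__main__":
    print(to_utc("2020-02-02T20:20:20+01:00").isoformat())
```

With an explicit offset such as +01:00, the result does not depend on the server's own timezone setting.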
Force a reload of the application for these settings to take effect by doing one of the following:
sudo systemctl restart apache2
sudo apachectl restart
sudo -u rdamsc touch /srv/rdamsc/rdamsc.wsgi
These settings do not currently affect the behaviour of the Catalog beyond the warning message, but might do in future.
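The touch command given among the reload options above works by updating the modification time of the WSGI file. If you would rather trigger this from a Python maintenance script, a rough equivalent is sketched below; `touch_wsgi` is a hypothetical helper, and it needs to run as a user with write access to the file (such as rdamsc).

```python
from pathlib import Path


def touch_wsgi(path):
    """Update the file's modification time and return the new value."""
    target = Path(path)
    if not target.is_file():
        # Avoid silently creating an empty WSGI file in the wrong place.
        raise FileNotFoundError(path)
    target.touch()
    return target.stat().st_mtime
```

For example, `touch_wsgi('/srv/rdamsc/rdamsc.wsgi')` would correspond to the command shown above.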
- Edit the file /etc/apache2/envvars so that the Maintenance line is active:
  export APACHE_ARGUMENTS='-D Maintenance'
- Stop and start the server:
  sudo systemctl restart apache2
  or if not running systemd:
  sudo apachectl graceful-stop
  sudo apachectl start
  (The reason for doing a stop and start here is that an apachectl restart does not shut down the parent apache process. A fresh start is needed to pick up the change in arguments.)
- Edit the file /etc/apache2/envvars so that the Maintenance line is commented out:
  #export APACHE_ARGUMENTS='-D Maintenance'
- Stop and start the server:
  sudo systemctl restart apache2
  or if not running systemd:
  sudo apachectl graceful-stop
  sudo apachectl start