Commit: 1 error loading editions (#10)
* Reworking documentation

* Readme work

* Starting database scripts

* Working on database tables

* Refining table scripts

* Table creation scripts

* Bulk of database loading scripts

* Updating readme

* Updating ISBN table population

* Refining order of creating indexes

* Communicative load process (#7)

* Communicative load process (#9)

* add command to make sure openlibrary database is being used when running the script (#6)

* create a sql file to load a temp table with file information

* create data loader that can take database files in chunks

* add new step to load process

* remove temp table after script finishes

* add file loader and make it create a loadable file for temp database table

* add command to auto load filenames into temp table

* clean up add load script

* add load scripts

* rename files to remove the 2 I added

* add python file to split up data into smaller chunks

* add some sample files to demonstrate this version quickly

* reduced the size of sample files

* add section for moving the files to the unprocessed folder

* incorporate the changes I made into the readme

* update notes

* update one of the code examples

* make loader mark files that have been loaded instead of deleting them from the loader

* rephrased message

* add time stamps and notices

* refine chunk notes

* Adjusting bulk loader to use \copy command

* Minor readme update

---------

Co-authored-by: Chloe-Meinshausen <[email protected]>
DaveBathnes and Chloe-Meinshausen authored Feb 15, 2023
1 parent 2426fc3 commit 3ad8df5
Showing 28 changed files with 18,388 additions and 569 deletions.
9 changes: 8 additions & 1 deletion .gitignore
@@ -1,4 +1,11 @@
*.csv
*.txt
*.xlsx

data/unprocessed/ol_dump_works_*.txt.gz
data/unprocessed/ol_dump_authors_*.txt.gz
data/unprocessed/ol_dump_editions_*.txt.gz
.vscode/settings.json
create_db.bat
copy_commands.sql
226 changes: 71 additions & 155 deletions README.md
@@ -1,170 +1,118 @@
# Open Library database

Open Library is an online library of bibliographic data and includes [full data dumps](https://openlibrary.org/developers/dumps) of all its data.

This project provides instructions and scripts for importing this data into a PostgreSQL database, along with some sample queries to test the database.

### Getting started

The following steps should get you up and running with a working database.

1. Install the [required prerequisites](#prerequisites) so that you have the required software and a running database server.
2. [Download the data](#downloading-the-data) from Open Library.
3. Run the [processing the data](#processing-the-data) scripts to clean up the data and make it easier to import.
4. [Import the data](#import-into-database) into the database.

### Prerequisites

- Python 3 - Tested with 3.10
- PostgreSQL - Version 15 is tested but most recent versions should work

### Downloading the data

Open Library offer bulk downloads on their website, available from the [Data Dumps page](https://openlibrary.org/developers/dumps). (The API, by contrast, is a nightmare to use in bulk and Open Library discourage it, even with measures in place to limit requests, so this project works from the bulk downloads.)

These are updated every month. The downloads available include:

- Editions (~9GB)
- Works (~2.5GB)
- Authors (~0.5GB)
- All types (~10GB)

For this project, I downloaded the Editions, Works, and Authors data. The latest version of each can be downloaded using the following commands in a terminal:

```console
wget https://openlibrary.org/data/ol_dump_editions_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_works_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_authors_latest.txt.gz -P ~/downloads
```

To move the data from your downloads folder, use the following commands in a terminal:

```console
mv ~/downloads/ol_dump_authors_*txt.gz ./data/unprocessed/ol_dump_authors_.txt.gz
mv ~/downloads/ol_dump_works_*txt.gz ./data/unprocessed/ol_dump_works_.txt.gz
mv ~/downloads/ol_dump_editions_*txt.gz ./data/unprocessed/ol_dump_editions_.txt.gz
```


To uncompress this data, I used the following commands in a terminal:

```console
gzip -d -c data/unprocessed/ol_dump_editions_*.txt.gz > data/unprocessed/ol_dump_editions.txt
gzip -d -c data/unprocessed/ol_dump_works_*.txt.gz > data/unprocessed/ol_dump_works.txt
gzip -d -c data/unprocessed/ol_dump_authors_*.txt.gz > data/unprocessed/ol_dump_authors.txt
```

### Processing the data

Unfortunately the downloads provided seem to be a bit messy, or at least don't play nicely with direct importing. The Open Library file errors on import as the number of columns provided seems to vary. Cleaning it up is difficult as the text file for editions alone is 25GB. _Note: Check if this is still the case and, if so, there could be some Linux tools to do this - maybe try `sed` and `awk`_
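
For example, a quick (untested) filter along the lines the note suggests, keeping only rows that have exactly 5 tab-separated columns (file names here are just for illustration):

```console
awk -F '\t' 'NF == 5' data/unprocessed/ol_dump_editions.txt > data/processed/ol_dump_editions_processed.txt
```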

That means another Python script is needed to clean up the data. The file [openlibrary-data-process.py](openlibrary-data-process.py) simply reads in the CSV (Python is a little more forgiving about dodgy data) and writes it out again row by row, but only where there are 5 columns.

```console
python openlibrary-data-process.py
```
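
For reference, the core of that cleanup step amounts to something like the sketch below. This is a simplified illustration rather than the actual contents of `openlibrary-data-process.py`, and the file names are hypothetical:

```python
# A minimal sketch of the cleanup idea: keep only rows with exactly 5 tab-separated columns.
# Hypothetical file names, for illustration only.
IN_PATH = "data/unprocessed/ol_dump_editions.txt"
OUT_PATH = "data/processed/ol_dump_editions_processed.txt"

with open(IN_PATH, encoding="utf-8") as infile, \
        open(OUT_PATH, "w", encoding="utf-8") as outfile:
    for line in infile:
        # the dump is tab-separated, with the record's JSON in the final column
        if len(line.rstrip("\n").split("\t")) == 5:
            outfile.write(line)
```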

Because the download files are so huge and are only going to grow (the editions file is now 45GB+), you can use the alternative `openlibrary-data-chunk-process.py` script to split the data into smaller files that are loaded sequentially. You can change the number of lines in each chunk here; 1-3 million per chunk is recommended.

Once the files are split you should delete the 3 uncompressed .txt files in the `data/unprocessed` folder, because you will need around 230 GB of free space to load all 3 files into the database without encountering lack-of-space errors.
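
For example, assuming the uncompressed dumps were written to `data/unprocessed` as above:

```console
rm data/unprocessed/ol_dump_editions.txt data/unprocessed/ol_dump_works.txt data/unprocessed/ol_dump_authors.txt
```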



The chunk size is controlled by the `lines_per_file` variable in the script:

```
lines_per_file = 5000
```

```console
python3 openlibrary-data-chunk-process.py
```

This generates multiple files into the `data/processed` directory.
One of those files will be used to access the rest of them when loading the data.
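
Under the hood, chunking a large dump is essentially just starting a new numbered file every `lines_per_file` lines. A minimal sketch of the idea (not the actual `openlibrary-data-chunk-process.py`, and with hypothetical file names) looks like this:

```python
lines_per_file = 5000  # the note above recommends 1-3 million per chunk for the real dumps

in_path = "data/unprocessed/ol_dump_editions.txt"              # hypothetical input name
chunk_template = "data/processed/ol_dump_editions_{:04d}.txt"  # hypothetical output names

chunk_index = 0
outfile = None
with open(in_path, encoding="utf-8") as infile:
    for line_number, line in enumerate(infile):
        # start a new chunk file every lines_per_file lines
        if line_number % lines_per_file == 0:
            if outfile:
                outfile.close()
            outfile = open(chunk_template.format(chunk_index), "w", encoding="utf-8")
            chunk_index += 1
        outfile.write(line)
if outfile:
    outfile.close()
```

The real script also keeps track of the generated file names so that the loader can work through them, and mark them as loaded, one chunk at a time.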

### Import into database

It is then possible to import the data directly into PostgreSQL tables and do complex searches with SQL.

There is a series of database scripts which will create the database and tables, and then import the data. These are in the [database](database) folder. The data files (created in the previous process) need to already be within the `data/processed` folder for this to work.

The command line tool `psql` is used to run the scripts. The following command will create the database and tables:

```console
psql --set=sslmode=require -f openlibrary-db.sql -h localhost -p 5432 -U username postgres
```


### Database table details

The database is split into 5 main tables:

| Data | Description |
| :------------ | :-------------------------------------------------------------- |
| Authors | Authors are the individuals who write the works |
| Works         | The works as created by the authors, with titles and subtitles  |
| Author Works  | A table linking the works with authors                          |
| Editions | The particular editions of the works, including ISBNs |
| Edition_ISBNS | The ISBNs for the editions |
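
The majority of the data for each author, work, and edition is raw JSON held in the `data` column; a few fields that are needed for joins (such as each edition's work key and its ISBNs) are extracted into their own indexed columns and link tables.

The relationship between works and authors is **many-to-many**: one work can be written by multiple authors, and an author can have multiple works to their name. The typical way to represent this in a relational database is with a separate link table (Author Works above), with a row for each instance of author and work. For example:

| author | work |
|:---|:---|
| JK Rowling | Harry Potter and the Prisoner of Azkaban |
| JK Rowling | Harry Potter and the Cursed Child |
| Jack Thorne | Harry Potter and the Cursed Child |
| Jack Thorne | Something that isn't Harry Potter |

(Of course the IDs of works and authors are used rather than the names themselves.) All of the data needed to populate this link table is held in the **works** table, which has arrays of author IDs embedded within the JSON data.

If you ever need to rebuild the derived columns and link tables by hand, the SQL looks roughly like the sketch below. Treat it as illustrative: the explicit column lists, the `idx_editions_work_key` index name, and the assumption that `isbn_13` is a plain array of strings are not taken from this repository's scripts.

```sql
-- GIN indexes speed up containment queries against the raw JSON columns
create index idx_works_ginp on works using gin (data jsonb_path_ops);
create index idx_authors_ginp on authors using gin (data jsonb_path_ops);
create index idx_editions_ginp on editions using gin (data jsonb_path_ops);

-- pull each edition's work key out of the JSON and index it for joins
update editions set work_key = data->'works'->0->>'key';
create index idx_editions_work_key on editions (work_key); -- index name is an assumption

-- populate the author/work link table from the author arrays embedded in the works JSON
insert into author_works (author_key, work_key)
select distinct jsonb_array_elements(data->'authors')->'author'->>'key', key
from works
where key is not null
and data->'authors'->0->'author' is not null;

-- populate the edition/ISBN link table (assumes isbn_13 is a plain array of strings)
insert into edition_isbns (edition_key, isbn)
select distinct key, jsonb_array_elements_text(data->'isbn_13')
from editions
where key is not null
and data->'isbn_13'->0 is not null;
```

## Vacuum up the mess

After a bulk load of this size it is worth letting PostgreSQL reclaim space and refresh its planner statistics:

```sql
vacuum full analyze verbose
```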

## Query the data

That's the database set up - it can now be queried using relatively straightforward SQL.

Get details for a single item using the ISBN13 9781551922461 (Harry Potter and the Prisoner of Azkaban).

```sql
select
e.data->>'title' "EditionTitle",
w.data->>'title' "WorkTitle",
a.data->>'name' "Name",
e.data->>'subtitle' "EditionSubtitle",
w.data->>'subtitle' "WorkSubtitle",
e.data->>'subjects' "Subjects",
@@ -173,45 +121,13 @@ select
e.data->'notes'->>'value' "EditionNotes",
w.data->'notes'->>'value' "WorkNotes"
from editions e
join edition_isbns ei
on ei.edition_key = e.key
join works w
on w.key = e.work_key
join author_works a_w
on a_w.work_key = w.key
join authors a
on a_w.author_key = a.key
where ei.isbn = '9781551922461'
```
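
As a further example (a sketch that assumes the same schema; the exact author name string will depend on the data), list every edition and ISBN for works by a given author:

```sql
select
w.data->>'title' "WorkTitle",
e.data->>'title' "EditionTitle",
ei.isbn "ISBN"
from authors a
join author_works a_w
on a_w.author_key = a.key
join works w
on w.key = a_w.work_key
join editions e
on e.work_key = w.key
join edition_isbns ei
on ei.edition_key = e.key
where a.data->>'name' = 'J. K. Rowling'
order by w.data->>'title', e.data->>'title'
```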


A bulk export of details for a pre-defined set of ISBNs, filtered against a list of keywords, can be produced with a query along these lines. It assumes you have loaded your own `isbn13s` and `keywords` lookup tables (they are not among the tables listed above):

```sql
copy (
select distinct
e.data->>'title' "EditionTitle",
w.data->>'title' "WorkTitle",
e.data->>'subtitle' "EditionSubtitle",
w.data->>'subtitle' "WorkSubtitle",
e.data->>'subjects' "EditionSubjects",
w.data->>'subjects' "WorkSubjects",
e.data->'description'->>'value' "EditionDescription",
w.data->'description'->>'value' "WorkDescription",
e.data->'notes'->>'value' "EditionNotes",
w.data->'notes'->>'value' "WorkNotes"
from editions e
join works w
on w.key = e.work_key
join edition_isbns ei
on ei.edition_key = e.key
where ei.isbn IN (select isbn13 from isbn13s)
and (
lower(e.data->>'title') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->>'title') like any (select '%' || keyword || '%' from keywords) OR
lower(e.data->>'subtitle') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->>'subtitle') like any (select '%' || keyword || '%' from keywords) OR
lower(e.data->>'subjects') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->>'subjects') like any (select '%' || keyword || '%' from keywords) OR
lower(e.data->'description'->>'value') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->'description'->>'value') like any (select '%' || keyword || '%' from keywords) OR
lower(e.data->'notes'->>'value') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->'notes'->>'value') like any (select '%' || keyword || '%' from keywords)
)
) to '\data\open_library_export.csv' With CSV DELIMITER E'\t';
```

1 change: 1 addition & 0 deletions copy_commands.sql
@@ -0,0 +1 @@
\copy editions from './data/processed/ol_dump_editions.txt' delimiter E'\t' quote '|' csv;
1 change: 1 addition & 0 deletions create_db.bat.sample
@@ -0,0 +1 @@
psql --set=sslmode=require -f openlibrary-db.sql -h localhost -p 5432 -U username postgres
Empty file added data/processed/.gitkeep
Empty file.
Empty file added data/unprocessed/.gitkeep
Empty file.