Commit: 1 error loading editions (#10)
* Reworking documentation

* Readme work

* Starting database scripts

* Working on database tables

* Refining table scripts

* Table creation scripts

* Bulk of database loading scripts

* Updating readme

* Updating ISBN table population

* Refining order of creating indexes

* Communicative load process (#7)

* Communicative load process (#9)

* add command to make sure openlibrary database is being used when running the script (#6)

* create a sql file to load a temp table with file information

* create data loader that can take database files in chunks

* add new step to load process

* remove temp table after script finishes

* add file loader and make it create a loadable file for temp database table

* add command to auto load filenames into temp table

* clean up add load script

* add load scripts

* rename files to remove the 2 I added

* add python file to split up data into smaller chunks

* add some sample files to demonstrate this version quickly

* reduced the size of sample files

* add section for moving the files to the unprocessed folder

* incorporate the changes I made into the readme

* update notes

* update one of the code examples

* make loader mark files that have been loaded instead of deleting them from the loader

* rephrased message

* add time stamps and notices

* refine chunk notes

* Adjusting bulk loader to use \copy command

* Minor readme update

---------

Co-authored-by: Chloe-Meinshausen <[email protected]>
DaveBathnes and Chloe-Meinshausen authored Feb 15, 2023
1 parent 2426fc3 commit 3ad8df5
Showing 28 changed files with 18,388 additions and 569 deletions.
9 changes: 8 additions & 1 deletion .gitignore
@@ -1,4 +1,11 @@
*.csv
*.txt
*.xlsx

data/unprocessed/ol_dump_works_*.txt.gz
data/unprocessed/ol_dump_authors_*.txt.gz
data/unprocessed/ol_dump_editions_*.txt.gz
.vscode/settings.json
create_db.bat
copy_commands.sql
226 changes: 71 additions & 155 deletions README.md
@@ -1,170 +1,118 @@
# Open Library database

Open Library is an online library of bibliographic data and includes [full data dumps](https://openlibrary.org/developers/dumps) of all its data.

This project provides instructions and scripts for importing this data into a PostgreSQL database, along with some sample queries to test the database.

### Getting started

The following steps should get you up and running with a working database.

1. Install the [required prerequisites](#prerequisites) so that you have the required software and a running database server.
2. [Download the data](#downloading-the-data) from Open Library.
3. Run the [processing the data](#processing-the-data) scripts to clean up the data and make it easier to import.
4. [Import the data](#import-into-database) into the database.

### Prerequisites

- Python 3 - Tested with 3.10
- PostgreSQL - Version 15 is tested but most recent versions should work

### Downloading the data

Open Library offer bulk downloads on their website, available from the [Data Dumps page](https://openlibrary.org/developers/dumps). (The API, by contrast, is a nightmare to use in bulk and Open Library discourage it, even with measures in place to limit requests, so this project works from the bulk downloads.)

These are updated every month. The downloads available include:

- Editions (~9GB)
- Works (~2.5GB)
- Authors (~0.5GB)
- All types (~10GB)

For this project, I downloaded the Editions, Works, and Authors data. The latest version of each can be downloaded using the following commands in a terminal:

```console
wget https://openlibrary.org/data/ol_dump_editions_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_works_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_authors_latest.txt.gz -P ~/downloads
```

To move the data from your downloads folder, use the following commands in a terminal:

```console
mv ~/downloads/ol_dump_authors_*txt.gz ./data/unprocessed/ol_dump_authors_.txt.gz
mv ~/downloads/ol_dump_works_*txt.gz ./data/unprocessed/ol_dump_works_.txt.gz
mv ~/downloads/ol_dump_editions_*txt.gz ./data/unprocessed/ol_dump_editions_.txt.gz
```


To uncompress this data, I used the following commands in a terminal:

```console
gzip -d -c data/unprocessed/ol_dump_editions_*.txt.gz > data/unprocessed/ol_dump_editions.txt
gzip -d -c data/unprocessed/ol_dump_works_*.txt.gz > data/unprocessed/ol_dump_works.txt
gzip -d -c data/unprocessed/ol_dump_authors_*.txt.gz > data/unprocessed/ol_dump_authors.txt
```

### Processing the data

Unfortunately the downloads provided seem to be a bit messy, or at least don't play nicely with direct importing. The Open Library file errors on import as the number of columns provided seems to vary. Cleaning it up is difficult as the text file for editions alone is 25GB. _Note: Check if this is still the case and, if so, there could be some Linux tools to do this - maybe try `sed` and `awk`_
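
For example, a quick (untested) filter along the lines the note suggests, keeping only rows that have exactly 5 tab-separated columns (file names here are just for illustration):

```console
awk -F '\t' 'NF == 5' data/unprocessed/ol_dump_editions.txt > data/processed/ol_dump_editions_processed.txt
```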

That means another Python script is needed to clean up the data. The file [openlibrary-data-process.py](openlibrary-data-process.py) simply reads in the CSV (Python is a little more forgiving about dodgy data) and writes it out again row by row, but only where there are 5 columns.

```console
python openlibrary-data-process.py
```
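
For reference, the core of that cleanup step amounts to something like the sketch below. This is a simplified illustration rather than the actual contents of `openlibrary-data-process.py`, and the file names are hypothetical:

```python
# A minimal sketch of the cleanup idea: keep only rows with exactly 5 tab-separated columns.
# Hypothetical file names, for illustration only.
IN_PATH = "data/unprocessed/ol_dump_editions.txt"
OUT_PATH = "data/processed/ol_dump_editions_processed.txt"

with open(IN_PATH, encoding="utf-8") as infile, \
        open(OUT_PATH, "w", encoding="utf-8") as outfile:
    for line in infile:
        # the dump is tab-separated, with the record's JSON in the final column
        if len(line.rstrip("\n").split("\t")) == 5:
            outfile.write(line)
```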

Because the download files are so huge and are only going to grow (the editions file is now 45GB+), you can use the alternative `openlibrary-data-chunk-process.py` script to split the data into smaller files that are loaded sequentially. You can change the number of lines in each chunk here; 1-3 million per chunk is recommended.

Once the files are split you should delete the 3 uncompressed .txt files in the `data/unprocessed` folder, because you will need around 230 GB of free space to load all 3 files into the database without encountering lack-of-space errors.
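
For example, assuming the uncompressed dumps were written to `data/unprocessed` as above:

```console
rm data/unprocessed/ol_dump_editions.txt data/unprocessed/ol_dump_works.txt data/unprocessed/ol_dump_authors.txt
```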



The chunk size is controlled by the `lines_per_file` variable in the script:

```
lines_per_file = 5000
```

```console
python3 openlibrary-data-chunk-process.py
```

This generates multiple files into the `data/processed` directory.
One of those files will be used to access the rest of them when loading the data.
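
Under the hood, chunking a large dump is essentially just starting a new numbered file every `lines_per_file` lines. A minimal sketch of the idea (not the actual `openlibrary-data-chunk-process.py`, and with hypothetical file names) looks like this:

```python
lines_per_file = 5000  # the note above recommends 1-3 million per chunk for the real dumps

in_path = "data/unprocessed/ol_dump_editions.txt"              # hypothetical input name
chunk_template = "data/processed/ol_dump_editions_{:04d}.txt"  # hypothetical output names

chunk_index = 0
outfile = None
with open(in_path, encoding="utf-8") as infile:
    for line_number, line in enumerate(infile):
        # start a new chunk file every lines_per_file lines
        if line_number % lines_per_file == 0:
            if outfile:
                outfile.close()
            outfile = open(chunk_template.format(chunk_index), "w", encoding="utf-8")
            chunk_index += 1
        outfile.write(line)
if outfile:
    outfile.close()
```

The real script also keeps track of the generated file names so that the loader can work through them, and mark them as loaded, one chunk at a time.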

### Import into database

It is then possible to import the data directly into PostgreSQL tables and do complex searches with SQL.

There is a series of database scripts which will create the database and tables, and then import the data. These are in the [database](database) folder. The data files (created in the previous process) need to already be within the `data/processed` folder for this to work.

The command line tool `psql` is used to run the scripts. The following command will create the database and tables:

```console
psql --set=sslmode=require -f openlibrary-db.sql -h localhost -p 5432 -U username postgres
```


### Database table details

The database is split into 5 main tables:

| Data | Description |
| :------------ | :-------------------------------------------------------------- |
| Authors | Authors are the individuals who write the works |
| Works         | The works as created by the authors, with titles and subtitles  |
| Author Works  | A table linking the works with authors                          |
| Editions | The particular editions of the works, including ISBNs |
| Edition_ISBNS | The ISBNs for the editions |
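
The majority of the data for each author, work, and edition is raw JSON held in the `data` column; a few fields that are needed for joins (such as each edition's work key and its ISBNs) are extracted into their own indexed columns and link tables.

The relationship between works and authors is **many-to-many**: one work can be written by multiple authors, and an author can have multiple works to their name. The typical way to represent this in a relational database is with a separate link table (Author Works above), with a row for each instance of author and work. For example:

| author | work |
|:---|:---|
| JK Rowling | Harry Potter and the Prisoner of Azkaban |
| JK Rowling | Harry Potter and the Cursed Child |
| Jack Thorne | Harry Potter and the Cursed Child |
| Jack Thorne | Something that isn't Harry Potter |

(Of course the IDs of works and authors are used rather than the names themselves.) All of the data needed to populate this link table is held in the **works** table, which has arrays of author IDs embedded within the JSON data.

If you ever need to rebuild the derived columns and link tables by hand, the SQL looks roughly like the sketch below. Treat it as illustrative: the explicit column lists, the `idx_editions_work_key` index name, and the assumption that `isbn_13` is a plain array of strings are not taken from this repository's scripts.

```sql
-- GIN indexes speed up containment queries against the raw JSON columns
create index idx_works_ginp on works using gin (data jsonb_path_ops);
create index idx_authors_ginp on authors using gin (data jsonb_path_ops);
create index idx_editions_ginp on editions using gin (data jsonb_path_ops);

-- pull each edition's work key out of the JSON and index it for joins
update editions set work_key = data->'works'->0->>'key';
create index idx_editions_work_key on editions (work_key); -- index name is an assumption

-- populate the author/work link table from the author arrays embedded in the works JSON
insert into author_works (author_key, work_key)
select distinct jsonb_array_elements(data->'authors')->'author'->>'key', key
from works
where key is not null
and data->'authors'->0->'author' is not null;

-- populate the edition/ISBN link table (assumes isbn_13 is a plain array of strings)
insert into edition_isbns (edition_key, isbn)
select distinct key, jsonb_array_elements_text(data->'isbn_13')
from editions
where key is not null
and data->'isbn_13'->0 is not null;
```

## Vacuum up the mess

After a bulk load of this size it is worth letting PostgreSQL reclaim space and refresh its planner statistics:

```sql
vacuum full analyze verbose
```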

## Query the data

That's the database set up - it can now be queried using relatively straightforward SQL.

Get details for a single item using the ISBN13 9781551922461 (Harry Potter and the Prisoner of Azkaban).

```sql
select
e.data->>'title' "EditionTitle",
w.data->>'title' "WorkTitle",
a.data->>'name' "Name",
e.data->>'subtitle' "EditionSubtitle",
w.data->>'subtitle' "WorkSubtitle",
e.data->>'subjects' "Subjects",
@@ -173,45 +121,13 @@ select
e.data->'notes'->>'value' "EditionNotes",
w.data->'notes'->>'value' "WorkNotes"
from editions e
join edition_isbns ei
on ei.edition_key = e.key
join works w
on w.key = e.work_key
join author_works a_w
on a_w.work_key = w.key
join authors a
on a_w.author_key = a.key
where ei.isbn = '9781551922461'
```
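
As a further example (a sketch that assumes the same schema; the exact author name string will depend on the data), list every edition and ISBN for works by a given author:

```sql
select
w.data->>'title' "WorkTitle",
e.data->>'title' "EditionTitle",
ei.isbn "ISBN"
from authors a
join author_works a_w
on a_w.author_key = a.key
join works w
on w.key = a_w.work_key
join editions e
on e.work_key = w.key
join edition_isbns ei
on ei.edition_key = e.key
where a.data->>'name' = 'J. K. Rowling'
order by w.data->>'title', e.data->>'title'
```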


A bulk export of details for a pre-defined set of ISBNs, filtered against a list of keywords, can be produced with a query along these lines. It assumes you have loaded your own `isbn13s` and `keywords` lookup tables (they are not among the tables listed above):

```sql
copy (
select distinct
e.data->>'title' "EditionTitle",
w.data->>'title' "WorkTitle",
e.data->>'subtitle' "EditionSubtitle",
w.data->>'subtitle' "WorkSubtitle",
e.data->>'subjects' "EditionSubjects",
w.data->>'subjects' "WorkSubjects",
e.data->'description'->>'value' "EditionDescription",
w.data->'description'->>'value' "WorkDescription",
e.data->'notes'->>'value' "EditionNotes",
w.data->'notes'->>'value' "WorkNotes"
from editions e
join works w
on w.key = e.work_key
join edition_isbns ei
on ei.edition_key = e.key
where ei.isbn IN (select isbn13 from isbn13s)
and (
lower(e.data->>'title') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->>'title') like any (select '%' || keyword || '%' from keywords) OR
lower(e.data->>'subtitle') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->>'subtitle') like any (select '%' || keyword || '%' from keywords) OR
lower(e.data->>'subjects') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->>'subjects') like any (select '%' || keyword || '%' from keywords) OR
lower(e.data->'description'->>'value') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->'description'->>'value') like any (select '%' || keyword || '%' from keywords) OR
lower(e.data->'notes'->>'value') like any (select '%' || keyword || '%' from keywords) OR
lower(w.data->'notes'->>'value') like any (select '%' || keyword || '%' from keywords)
)
) to '\data\open_library_export.csv' With CSV DELIMITER E'\t';
```

1 change: 1 addition & 0 deletions copy_commands.sql
@@ -0,0 +1 @@
\copy editions from './data/processed/ol_dump_editions.txt' delimiter E'\t' quote '|' csv;
1 change: 1 addition & 0 deletions create_db.bat.sample
@@ -0,0 +1 @@
psql --set=sslmode=require -f openlibrary-db.sql -h localhost -p 5432 -U username postgres
Empty file added data/processed/.gitkeep
Empty file.
Empty file added data/unprocessed/.gitkeep
Empty file.