Show progress when initially loading data #101
orangejulius added a commit that referenced this issue on Aug 3, 2016:
Previously, the WOF importer loaded all records into memory in one stream, then processed and indexed the records in Elasticsearch in a second stream after the first was done. This has several problems:

* It requires that all the data fit into memory. While this is not _so_ bad for WOF admin data, where a reasonably new machine can handle things just fine, it's horrible for venue data, where there are already tens of millions of records.
* It's slower: by separating the disk and network I/O sections, they can't be interleaved to speed things up.
* It gives no feedback that the importer is actually doing something: it sits for several minutes loading records before the dbclient progress logs start displaying.

This change fixes all of those issues by processing all records in a single stream, starting at the highest hierarchy level and finishing at the lowest, so that every record always has the admin data it needs when it is processed.

Fixes #101
Fixes #7
Connects #94
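The ordering idea in the commit message can be sketched in a few lines: emit records one hierarchy level at a time, highest first, so a record's admin parents are always available before the record itself is processed. The function and level names below are illustrative assumptions, not the importer's actual API.

```javascript
// Hedged sketch: flatten records grouped by placetype into a single
// sequence ordered from the highest hierarchy level to the lowest.
// `levels` lists placetypes highest-first (assumed example ordering).
function orderByHierarchy(levels, recordsByLevel) {
  const ordered = [];
  for (const level of levels) {
    // Levels with no records are simply skipped.
    ordered.push(...(recordsByLevel[level] || []));
  }
  return ordered;
}

module.exports = { orderByHierarchy };
```

Feeding one flattened, level-ordered sequence into a single pipeline is what lets disk reads and Elasticsearch writes interleave instead of running as two separate phases.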
Three further commits by orangejulius referenced this issue on Aug 3–4, 2016, each carrying a near-identical version of the commit message above.
The importer is currently very quiet for the first few minutes of a run. In the background it is (hopefully) loading all the WOF admin data into memory, but since there's no way to judge progress, there's no way to know it's working. If you aren't expecting this, it's especially concerning. Perhaps as part of #7 we should improve this.
The download script is similarly quiet.
Thanks to @easherma for bringing this to our attention.