Show progress when initially loading data #101

Closed
orangejulius opened this issue Jun 13, 2016 · 0 comments · Fixed by #119

Comments

@orangejulius
Member

orangejulius commented Jun 13, 2016

The importer is currently very quiet for the first few minutes of a run. In the background it is (hopefully) loading all the WOF admin data into memory, but since there's no way to judge progress, there's no way to know it's working. If you aren't expecting this, it's especially concerning. Perhaps we should improve this as part of #7.

The download script is similarly quiet.

Thanks to @easherma for bringing this to our attention.
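
To illustrate the kind of feedback being asked for, here is a minimal sketch of periodic progress logging while records load. It is not the importer's code: the `with_progress` helper, its `every` interval, and the `locality` label are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def with_progress(
    records: Iterable[T],
    label: str,
    every: int = 1000,
    log: Callable[[str], None] = print,
) -> Iterator[T]:
    """Yield records unchanged, logging a running count every `every` records."""
    start = time.monotonic()
    count = 0
    for record in records:
        count += 1
        if count % every == 0:
            elapsed = time.monotonic() - start
            log(f"{label}: {count} records loaded ({elapsed:.1f}s)")
        yield record
    # Always log a final line so short runs still produce some output.
    log(f"{label}: finished, {count} records total")


if __name__ == "__main__":
    # Hypothetical usage: wrap whatever iterable produces the records.
    for _ in with_progress(range(2500), "locality"):
        pass
```

Because the wrapper only counts and passes records through, it could sit in front of either the loading step or the download step without changing what they do.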

orangejulius added a commit that referenced this issue Aug 3, 2016
Previously, the WOF importer loaded all records into memory in one
stream, and then processed and indexed the records in Elasticsearch in a
second stream after the first stream was done.

This has several problems:
* It requires that all data can fit into memory. While this is not
  _so_ bad for WOF admin data, where a reasonably new machine can handle
  things just fine, it's horrible for venue data, where there are already
  tens of millions of records.
* It's slower: by separating the disk and network I/O sections, they
  can't be interleaved to speed things up.
* It doesn't give good feedback when running the importer that something
  is happening: the importer sits for several minutes loading records
  before the dbclient progress logs start displaying.

This change fixes all of those issues by processing all records in a
single stream, starting at the highest hierarchy level and finishing at
the lowest, so that every record always has the admin data it needs to
be processed.

Fixes #101
Fixes #7
Connects #94
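
The single-stream approach described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the importer's implementation: the `PLACETYPES` ordering, the record fields (`id`, `name`, `placetype`, `parent_id`), and the `read_placetype` and `import_records` helpers are all hypothetical. The sketch also assumes that only admin records are kept as a lookup table, while venues pass straight through.

```python
from typing import Any, Dict, Iterator, List

Record = Dict[str, Any]

# Hypothetical hierarchy, ordered from highest level to lowest.
PLACETYPES = ["country", "region", "locality", "venue"]


def read_placetype(source: List[Record], placetype: str) -> Iterator[Record]:
    """Stand-in for lazily reading one placetype's records from disk."""
    return (r for r in source if r["placetype"] == placetype)


def import_records(source: List[Record]) -> Iterator[Record]:
    """Process every record in one pass, parents before children."""
    parents: Dict[int, Record] = {}  # admin lookup, filled in as we go
    for placetype in PLACETYPES:
        for record in read_placetype(source, placetype):
            # Every ancestor was handled in an earlier, higher placetype,
            # so the full hierarchy can be resolved immediately.
            hierarchy = []
            parent_id = record["parent_id"]
            while parent_id is not None:
                parent = parents[parent_id]
                hierarchy.append(parent["name"])
                parent_id = parent["parent_id"]
            if placetype != "venue":
                # Assumption: venues are never parents, so they are not retained.
                parents[record["id"]] = record
            # Yielding hands the document to the next stage (for example an
            # indexer) while reading continues, so output starts right away.
            yield {
                "name": record["name"],
                "placetype": placetype,
                "hierarchy": hierarchy,
            }


SAMPLE: List[Record] = [
    {"id": 3, "name": "Example Cafe", "placetype": "venue", "parent_id": 2},
    {"id": 2, "name": "New York", "placetype": "locality", "parent_id": 1},
    {"id": 1, "name": "United States", "placetype": "country", "parent_id": None},
]

if __name__ == "__main__":
    for doc in import_records(SAMPLE):
        print(doc)
```

Since the generator emits each document as soon as it is read, downstream progress logging begins with the first record instead of after a separate loading phase.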
orangejulius self-assigned this Aug 3, 2016
orangejulius added a commit that referenced this issue Aug 4, 2016
Previously, the WOF importer loaded all records into memory in one
stream, and then processed and indexed the records in Elasticsearch in a
second stream after the first stream was done.

This has several problems:
* It requires that all data can fit into memory. While this is not
  _so_ bad for WOF admin data, where a reasonably new machine can handle
  things just fine, it's horrible for venue data, where there are already
  tens of millions of records.
* It's slower: by separating the disk and network I/O sections, they
  can't be interleaved to speed things up.
* It doesn't give good feedback when running the importer that something
  is happening: the importer sits for several minutes loading records
  before the dbclient progress logs start displaying.

This change fixes all of those issues by processing all records in a
single stream, starting at the highest hierarchy level and finishing at
the lowest, so that every record always has the admin data it needs to
be processed.

Fixes #101
Connects #7
Connects #94
dianashk added this to the WOF Venues milestone Aug 5, 2016
2 participants