Use a single stream for importing records #119
Conversation
This new importer style requires records to be imported starting at the top of the hierarchy and working on down.
Force-pushed from 1bd19ca to 5f56eb9
```javascript
// how to convert WOF records to Pelias Documents
var documentGenerator = peliasDocGenerators.create(
  hierarchyFinder.hierarchies_walker(wofRecords));

var readStream = readStream.create(directory, types, wofAdminRecords);
```
It's a bit confusing that you redefine `readStream` here and assign `readStream.create()` to it. Would be great if this variable or the one in the require block had a different name.
oh, doh! good catch
Other than the one confusing variable name, code looks solid.
Previously, the WOF importer loaded all records into memory in one stream, and then processed and indexed the records in Elasticsearch in a second stream after the first stream was done. This has several problems:

* It requires that all data can fit into memory. While this is not _so_ bad for WOF admin data, where a reasonably new machine can handle things just fine, it's horrible for venue data, where there are already 10s of millions of records.
* It's slower: by separating the disk and network I/O sections, they can't be interleaved to speed things up.
* It doesn't give good feedback when running the importer that something is happening: the importer sits for several minutes loading records before the dbclient progress logs start displaying.

This change fixes all those issues, by processing all records in a single stream, starting at the highest hierarchy level, and finishing at the lowest, so that all records always have the admin data they need to be processed.

Fixes #101
Connects #7
Connects #94
Force-pushed from 5f56eb9 to c500aaf
Variable name is fixed!
I somehow messed up the order when working on #119. Since `county` records were being loaded and processed before `macrocounty` records, it's possible that some records were missing the `macrocounty` hierarchy elements.
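The ordering bug above can be guarded against by sorting placetypes into hierarchy order before streaming any files. This is a hedged sketch, not the importer's real code; the exact placetype list and its order are assumptions based on the bug described:

```javascript
// top-of-hierarchy first, so e.g. macrocounty is always processed
// before county (order is an assumption for illustration)
const hierarchyOrder = [
  'country', 'region', 'macrocounty', 'county',
  'localadmin', 'locality', 'neighbourhood'
];

// return a copy of the input sorted into hierarchy order;
// unknown placetypes sort first since indexOf returns -1 for them
function sortTypesByHierarchy(types) {
  return types.slice().sort(
    (a, b) => hierarchyOrder.indexOf(a) - hierarchyOrder.indexOf(b));
}
```

Sorting up front means the fix does not depend on callers passing types in the right order.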
Previously, the WOF importer loaded all records into memory in one
stream, and then processed and indexed the records in Elasticsearch in a
second stream after the first stream was done.
This has several problems:

* It requires that all data can fit into memory. While this is not _so_ bad for WOF admin data, where a reasonably new machine can handle things just fine, it's horrible for venue data, where there are already 10s of millions of records and will likely be many more in the future.
* It's slower: by separating the disk and network I/O sections, they can't be interleaved to speed things up.
* It doesn't give good feedback when running the importer that something is happening: the importer sits for several minutes loading records before the dbclient progress logs start displaying.
This change fixes all those issues, by processing all records in a
single stream, starting at the highest hierarchy level, and finishing at
the lowest, so that all records always have the admin data they need to
be processed.
A change like this is necessary to support Who's on First venues, and in fact this code has already been tested by importing about 1M venues from California!
Fixes #101
Connects #7 (it doesn't quite fix it, for that we need to be able to not even store all admin areas at once, for example to import geometries)
Connects #94