Incomplete #1

Open
missinglink opened this issue May 4, 2015 · 0 comments

This lib is currently incomplete, although it is not far off being worthy of publishing.

This lib stands to replace both pelias/dbclient and the older pelias/esclient modules.

The key points of differentiation from other streaming elasticsearch indexers are:

  • batching via the bulk API
  • retrying failed batches
  • flood control: backpressure from Elasticsearch propagates to the upstream streams (most important)

Other libraries are not well suited to large datasets containing complex properties (such as country-sized polygons), which take some time to process on the Java side. As a result, naive indexers cause Elasticsearch to fill up the bulk indexing threadpool, which results in those batches being rejected and data being lost.
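To make the batching and flood-control idea concrete, here is a minimal sketch of a batching writable stream. It is not the code from this lib: `createBatchStream` and `flushBatch` are hypothetical names, and `flushBatch(batch, callback)` stands in for whatever actually sends the batch to the Elasticsearch bulk API. The only Node.js API assumed is `stream.Writable`.

```javascript
'use strict';
const { Writable } = require('stream');

// Hypothetical helper (not part of this lib): groups incoming documents into
// batches and hands each full batch to `flushBatch(batch, callback)`.
// In a real indexer `flushBatch` would wrap a bulk API request.
function createBatchStream(flushBatch, batchSize) {
  let batch = [];

  return new Writable({
    objectMode: true,
    // Assumption: buffer at most one batch worth of documents.
    highWaterMark: batchSize,

    write(doc, _encoding, done) {
      batch.push(doc);
      if (batch.length < batchSize) {
        return done(); // batch not full yet, accept more documents
      }
      const full = batch;
      batch = [];
      // `done` is only called once the flush completes. Until then the
      // stream accepts no further writes, so upstream producers are paused.
      flushBatch(full, done);
    },

    final(done) {
      // Flush the trailing partial batch so no documents are dropped.
      if (batch.length === 0) {
        return done();
      }
      const rest = batch;
      batch = [];
      flushBatch(rest, done);
    }
  });
}
```

A source piped into this stream with `source.pipe(createBatchStream(flushBatch, 500))` is paused by Node.js whenever a flush is still pending and the internal buffer is full, which is the behaviour a naive indexer lacks. The batch size of 500 is only an illustrative value.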

What's left to do:

  • Write readme and explain how concurrency, retries and the cli work
  • Rethink and test the concurrency control mechanism to achieve optimum load
  • Refactor some of the code to emit events
  • Write a stats module which captures Transaction events and emits stat digests

Module Goals:

☑ batched writes
☑ adjustable batch size
☑ partially retry failed batches
☑ backpressure (flood control)
☑ concurrency setting, better highwatermark
☐ actionable error reporting
☑ elasticsearch client injectable
☑ well tested via unit tests & in production
☑ bin file, input streams from cli with id, type mapper
☑ minimal dependencies, dependency injection
☑ usable outside pelias project & not strictly tied to pelias config
☑ ensure no data loss due to ES errors or failure to flush batches
☐ healthcheck via threadpool status
☐ compatibility with different nodejs stream versions
☑ better logging - via winston
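
As an illustration of the "partially retry failed batches" goal, the sketch below picks out only the documents of a batch that failed in a retryable way. It is an assumption-laden example rather than this lib's implementation: `selectRetryable` is a hypothetical name, the batch is assumed to contain exactly one bulk action per document so that `items[i]` in the response corresponds to `batch[i]`, and treating HTTP status 429 and 5xx as retryable is a policy chosen for the example.

```javascript
'use strict';

// Hypothetical helper (not part of this lib): given the documents sent in one
// bulk request and the parsed bulk response body, return the documents that
// should be sent again.
function selectRetryable(batch, bulkResponse) {
  // The bulk response carries a top-level `errors` flag; when it is false
  // every item succeeded and nothing needs retrying.
  if (!bulkResponse.errors) {
    return [];
  }
  return batch.filter((doc, i) => {
    // Each item is keyed by its action type, e.g. { index: { status: 201 } }.
    const item = bulkResponse.items[i];
    const result = item[Object.keys(item)[0]];
    // Assumption: 429 (rejected, queue full) and 5xx are worth retrying,
    // while other 4xx errors (e.g. a mapping error) would fail again.
    return result.status === 429 || result.status >= 500;
  });
}
```

Retrying only the rejected documents avoids re-indexing the ones that already succeeded, and skipping permanent errors keeps a malformed document from being retried forever.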

Issues with dbclient:

☑ badly named, doesn't describe purpose
☑ not abstracted from pelias
☑ strict dependency on other pelias modules
☑ not generally useful to 3rd parties
☑ difficult for 3rd party developers to contribute
☑ untidy code
☑ not fully unit tested
☐ not well documented

Duplication across modules (causing confusion):

- https://github.com/geopipes/elasticsearch-backend
- https://github.com/pelias/esclient
- https://github.com/pelias/dbclient

Dependants:

- dat-elasticsearch-upload
- pelias-geonames
- pelias-openaddresses
- pelias-openstreetmap

Similar projects / implementations:

https://github.com/hmalphettes/elasticsearch-streams
https://www.npmjs.com/package/elasticstream
https://github.com/simianhacker/bunyan-elasticsearch/blob/master/index.js

running unit tests

$> npm test

running integration tests

$> npm run integration