This demo provides an end-to-end example of deploying an NLP inference project with Apache Spark Structured Streaming. The setup includes a ready-to-use local deployment for testing and a cloud deployment on Databricks using Drone. See also the accompanying Medium blog post. For this demo, we use spaCy's open-source Named Entity Recognition model.
To clone the repo, run git clone https://github.com/rashadmoarref/spark-demo.git
- create local demo network to be used by kafka and app docker containers
docker network create demo
- create kafka service
make kafka
- run app
make app
Note: use make app FORCE=true if you need to rebuild the app after changing Dockerfile or requirements.txt
- send input to app's input topic
make produce-input FILE=input.json
- consume from app's result topic
make consume-result
- tear down local deployment
make app-down
make kafka-down
docker network rm demo
- Note: If you run into memory issues, increase the memory allocated to your Docker Engine to 10 GB.
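When repeating the local steps, `docker network create demo` fails if the network is left over from an earlier run. A minimal sketch of an idempotent variant is below. The function name `ensure_network` and the `DOCKER` override variable are hypothetical and not part of this repo; only `docker network inspect` and `docker network create` are real Docker CLI calls.

```shell
#!/bin/sh
# Sketch: create a Docker network only if it does not already exist.
# DOCKER is a hypothetical override (defaults to the real docker CLI),
# which makes the function easy to dry-run with a stub.
ensure_network() {
    name="$1"
    if ${DOCKER:-docker} network inspect "$name" >/dev/null 2>&1; then
        echo "network $name already exists"
    else
        ${DOCKER:-docker} network create "$name" >/dev/null &&
            echo "network $name created"
    fi
}
```

Calling `ensure_network demo` before `make kafka` would then be safe to repeat across runs.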
- drone steps are defined in
.drone.yml
- deployment on databricks is handled using REST API in
deploy/databricks/deploy-job.sh
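For orientation, here is a rough sketch of what creating a job through the Databricks Jobs REST API can look like. It is not the content of deploy/databricks/deploy-job.sh. The function names, the task key, the cluster settings, and the environment variables `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `SPARK_VERSION`, and `NODE_TYPE_ID` are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: the real logic lives in deploy/databricks/deploy-job.sh.
# Every name and value below is a hypothetical placeholder.

# Print a minimal Jobs API 2.1 "create" payload for a Python task.
build_job_payload() {
    job_name="$1"
    python_file="$2"
    cat <<EOF
{
  "name": "${job_name}",
  "tasks": [
    {
      "task_key": "main",
      "spark_python_task": { "python_file": "${python_file}" },
      "new_cluster": {
        "spark_version": "${SPARK_VERSION:-13.3.x-scala2.12}",
        "node_type_id": "${NODE_TYPE_ID:-i3.xlarge}",
        "num_workers": 1
      }
    }
  ]
}
EOF
}

# Submit the payload. Assumes DATABRICKS_HOST (workspace URL) and
# DATABRICKS_TOKEN (personal access token) are set in the environment.
create_job() {
    build_job_payload "$1" "$2" |
        curl --fail --silent --show-error \
            -X POST "${DATABRICKS_HOST}/api/2.1/jobs/create" \
            -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
            -H "Content-Type: application/json" \
            --data @-
}
```

Keeping payload construction separate from the `curl` call means the JSON can be inspected locally, for example with `build_job_payload spark-demo dbfs:/demo/main.py`, before anything is sent to the workspace.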
- the pre-commit hook configs are defined in
.pre-commit-config.yaml
- black and flake8 are tools for enforcing code style
pip install black flake8 pre-commit
pre-commit install
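After installing, it can help to confirm that the tools are actually on your PATH before relying on the hooks. The helper below is a small sketch; `missing_tools` is a hypothetical name and is not part of this repo.

```shell
#!/bin/sh
# Sketch: print the names of any given tools that are not found on PATH,
# separated by spaces. Prints an empty line when everything is available.
missing_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 ||
            missing="${missing}${missing:+ }${tool}"
    done
    echo "$missing"
}
```

For example, `missing_tools black flake8 pre-commit` prints nothing but a blank line when all three are installed. Once they are, `pre-commit run --all-files` runs the configured hooks against the whole repository rather than only the staged files.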