Spark-Jobs: http://IP-address:4040
Spark-Master: http://IP-address:8080
Spark-Worker: http://IP-address:8081
ElasticSearch: http://IP-address:9200
Kibana: http://IP-address:5601
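
As a quick sanity check of the endpoints listed above, the following is a minimal Python sketch that builds the five URLs for a given host and probes whether each one answers. The function names (`service_urls`, `is_reachable`) are hypothetical helpers, not part of this repository, and the host value is a placeholder you need to replace with your VM's address. Note that the Spark-Jobs UI is normally only available while an application is running.

```python
import urllib.error
import urllib.request

# Ports copied from the list above; the names are used only as labels.
SERVICE_PORTS = {
    "Spark-Jobs": 4040,
    "Spark-Master": 8080,
    "Spark-Worker": 8081,
    "ElasticSearch": 9200,
    "Kibana": 5601,
}


def service_urls(host):
    """Return a {service name: URL} dict for the given host or IP address."""
    return {name: f"http://{host}:{port}" for name, port in SERVICE_PORTS.items()}


def is_reachable(url, timeout=3.0):
    """Return True if something answers HTTP on the URL, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the port is reachable.
        return True
    except (urllib.error.URLError, OSError):
        return False
```

For example, `{name: is_reachable(url) for name, url in service_urls("IP-address").items()}` gives one boolean per service once `IP-address` is replaced with the real address.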


Name            Version
Elasticsearch   6.2.2 (oss)
Kibana          6.2.2 (oss)
Spark           2.2.0
Hadoop          1.1.0-hadoop2.8

Use Case

Figure use case

How to use EMB_datastreaming Hub

Figure architecture

(0.) If you start from a clean (new) CentOS 7 VM, you first need to set up the docker and docker-compose environment.

To avoid typing the password for every docker/docker-compose command, we recommend creating two Unix groups, one called docker and another called docker-compose (see the notes in note_add_sudo_docker.txt), and adding your users to them.

We also recommend increasing the virtual memory limit (a requirement for Elasticsearch):

  sysctl -w vm.max_map_count=262144
  1. Start the dockerized architecture (then open a new tab, because the current one will be used for the docker logs)
docker-compose up
  2. Check that all the containers are up and running
docker-compose ps
  3. Create the 'emb' topic for our application. Important: we recommend using ./ , which saves you from creating the topic manually. Nevertheless, here is the script for creating a topic manually.
./ kafka zookeeper:2181 emb

** Note: ./ performs steps 3 and 4.1 together (recommended option)

  4. Simulate readings from sensors (we stream data every 3 seconds, but you can change the streaming rate). Start producing streams to the 'emb' topic: 1 stream per line and file. We have added the sensor_id to each line, so we know which sensor the data is being streamed from. There are two options; we recommend 4.1.
  • 4.1 Using web services - Falcon (recommended option) - This option is included inside "". In a new terminal, start a feeder script that POSTs messages to Falcon. With -s you can indicate which sensor you want to stream data from. We have locally stored data from the 'emb3' and 'emb2' sensors.

    cd sensordata
    ./ -s emb3

    Falcon sends the data to Kafka using a POST request (you can check the POST request in Falcoln/src/ )

  • 4.2 Using a Python script: In a new terminal, type the following command to enter the spark-worker container:

    docker exec -it spark-worker bash

    Inside the container, one script calls another to send data directly to Kafka (you can check pyspark_app/scripts / ). You can change the script to stream data from a different sensor. By default, we have chosen emb3:

    cd /scripts
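
The feeder idea from step 4 (one reading per line, tagged with the sensor id, sent every 3 seconds) can be sketched in Python as below. This is only an illustration under stated assumptions, not the repository's feeder script: the function names, the comma-prefixed message format, the plain-text POST body, and the target URL are all hypothetical, and the actual Falcon endpoint and payload format should be taken from the repository.

```python
import time
import urllib.request


def tag_lines(lines, sensor_id):
    """Yield each non-empty reading prefixed with the sensor id (assumed format)."""
    for line in lines:
        line = line.strip()
        if line:
            yield f"{sensor_id},{line}"


def post_message(url, message, timeout=5.0):
    """POST one message as plain text and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def stream_file(path, sensor_id, url, interval=3.0, send=post_message):
    """Send one tagged line every `interval` seconds; return how many were sent."""
    sent = 0
    with open(path, encoding="utf-8") as handle:
        for message in tag_lines(handle, sensor_id):
            send(url, message)
            sent += 1
            time.sleep(interval)
    return sent
```

Passing `send` as a parameter keeps the pacing loop separate from the transport, so the same loop could feed a web service or be exercised with a stand-in function during testing.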
  5. Check that the streams can be received from the 'emb' topic
./ kafka emb zookeeper:2181
  6. Check that the Elasticsearch index has been correctly created
    (open a new tab - this one will be used for searching data in Elasticsearch)
-index: emb_test
-type: emb
        "sensor_id": {"type": "text"} --> Id of the sensor (e.g. emb2, emb3)
        "date": {"type": "date", "format": "yyyy-MM-dd"} --> (UTC)
        "time": {"type": "date", "format": "HH:mm:ss"} --> (UTC)
        "sec": {"type": "integer"} --> (micro-siemens per centimetre)
        "ph": {"type": "float"}
        "water_level": {"type": "float"} --> Water Level aOD (metre)
        "water_temp": {"type": "float"} --> Water Temperature (Celsius)
        "tdg": {"type": "integer"} --> TDG (millibar)
        "qc": {"type": "text"} --> QualityControl
docker exec -it elasticsearch bash

Inside the container:

cd /opt/create-es-index
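
For reference, the mapping listed in step 6 can be expressed as an Elasticsearch 6.x index-creation request. The Python sketch below only builds that request; it is an assumption about how such an index could be created over the REST API, not a copy of the script shipped in the container. The function names and the base URL are hypothetical.

```python
import json
import urllib.request


def emb_mapping():
    """Return the index body for type 'emb', using the fields listed in step 6."""
    return {
        "mappings": {
            "emb": {
                "properties": {
                    "sensor_id": {"type": "text"},
                    "date": {"type": "date", "format": "yyyy-MM-dd"},
                    "time": {"type": "date", "format": "HH:mm:ss"},
                    "sec": {"type": "integer"},
                    "ph": {"type": "float"},
                    "water_level": {"type": "float"},
                    "water_temp": {"type": "float"},
                    "tdg": {"type": "integer"},
                    "qc": {"type": "text"},
                }
            }
        }
    }


def create_index_request(base_url, index="emb_test"):
    """Build (but do not send) a PUT request that creates the index."""
    body = json.dumps(emb_mapping()).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`, for example with `create_index_request("http://IP-address:9200")` once the placeholder address is replaced.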
  7. Start the Apache Spark application (local option or master-cluster option), which receives streams from Kafka (listening to the 'emb' topic) and stores them in Elasticsearch (index: emb_test, type: emb)
docker exec -it spark-master bash

Two options to submit an application:

7.A) Inside the container:

7.A.1 cd /app/submit_scripts (the pyspark application is at /app)
7.A.2 ./ (local version - ideal for testing) or ./ (master-cluster version - distributed)

7.B) Outside the container (using the master-cluster version) - Submitting an Apache Spark application to the Spark master cluster

You can always open the Spark UIs (8080, 8081, 4040, 4041) to check how the data is being streamed and how the master and the workers perform. Here is an example of the job history UI (port 4040 or 4041, depending on whether you submitted the application with (port 4041) or (port 4040))

Figure streaming UI

  8. Checking/getting data from Elasticsearch:

Two options:

Outside the container (recommended):


Inside the elasticsearch container:

 docker exec -it elasticsearch bash
 cd /opt/create-index
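
For the option of querying from outside the container, a search against the emb_test index could look like the Python sketch below. It assumes Elasticsearch is reachable on port 9200 as listed at the top of this README; the function names and the choice of a simple `match` query on `sensor_id` are illustrative assumptions, not the repository's own query.

```python
import json
import urllib.request


def search_request(base_url, sensor_id, size=5, index="emb_test", doc_type="emb"):
    """Build a _search request returning up to `size` documents for one sensor."""
    query = {"size": size, "query": {"match": {"sensor_id": sensor_id}}}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}/{doc_type}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request, timeout=5.0):
    """Send the request and return the list of matching documents (_source)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [hit["_source"] for hit in payload["hits"]["hits"]]
```

With the stack running, `run_search(search_request("http://IP-address:9200", "emb3"))` would return the stored readings for sensor emb3, after replacing the placeholder address.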
  9. Exploring your data in Kibana:

Open a browser and type: http://IP-address:5601. You have several options to visualize your data from elasticsearch.

Figure kibana


Sensor EMB datastreaming






