SQY-Traffic is a real-time traffic disruption analysis tool based on scraping tweets from transporters' Twitter accounts.
The goals of this project are:
- Allowing people to know faster if there is any problem on their route
- Researching an efficient way to analyze and parse traffic data in micro-blogging text, in this case Tweets (~200 characters)
- Providing a simple & scalable solution to expose real-time data that anyone can use
The solution we designed is divided into 4 parts:
- PostgreSQL database: stores information about the transporters, the scraped tweets, and the analyzed disruptions
- Analyzer: determines whether a tweet is a disruption and extracts the disruption data from it
- Tweet Crawler: scheduled to call the Twitter API every minute to get the transporters' newest tweets and store them in the database
- NestJS TypeScript API: allows users to get the newest disruption information available in real time
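To illustrate the data flowing between these parts, here is a minimal sketch of the record shapes involved; the field names are assumptions for illustration, not the actual database schema:

```typescript
// Hypothetical record shapes illustrating the data flow between the parts.
// Field names are assumptions for illustration, not the actual schema.

// A tweet fetched by the Tweet Crawler and stored in PostgreSQL.
interface StoredTweet {
  id: string;            // Twitter's tweet id
  transporterId: number; // which transporter account it came from
  text: string;          // raw tweet text (~200 characters)
  publishedAt: Date;
}

// A disruption extracted by the Analyzer from a stored tweet.
interface Disruption {
  tweetId: string;       // reference to the originating tweet
  line: string;          // affected public transport line
  start?: Date;          // start of the disruption, when present in the tweet
  end?: Date;            // end of the disruption, when present in the tweet
  description: string;   // human-readable summary exposed by the API
}
```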
There are two ways to run the SQY-Traffic API.

The first is to run it directly; you must have Node >= 15 installed:
$ npm install
$ npm start
The second is to build the Docker image from source:
$ git clone https://github.com/Epitech/sqy-traffic.git sqy-traffic
$ cd sqy-traffic
$ docker build -t "sqy-traffic:latest" .
Now, you can run the API with the following command:
$ docker run --rm -it --name "sqy_api" -p 3000:3000 "sqy-traffic:latest"
You can also run all the parts with Docker Compose:
$ docker-compose up --build
The API can be configured with different environment variables:
- `PORT`: port on which the API listens
- `TW_BEARER_TOKEN`: token delivered by Twitter to access the Twitter API
- `USERNAME_CHECKER`: Twitter account name used to test the connection to the Twitter API
- `TW_API_URL`: URL of the Twitter API
- `PG_CONNECTION_STRING`: connection string describing how to connect to the PostgreSQL database
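As an example, a minimal sketch of how these variables might be read at startup (the variable names come from the list above; the validation logic is an assumption):

```typescript
// Minimal sketch of reading this configuration at startup.
// The variable names come from the list above; the validation is an assumption.
const config = {
  port: Number(process.env.PORT ?? 3000),
  twBearerToken: process.env.TW_BEARER_TOKEN,
  usernameChecker: process.env.USERNAME_CHECKER,
  twApiUrl: process.env.TW_API_URL,
  pgConnectionString: process.env.PG_CONNECTION_STRING,
};

// Fail fast if the variables required to reach Twitter and PostgreSQL are missing.
if (!config.twBearerToken || !config.pgConnectionString) {
  throw new Error('TW_BEARER_TOKEN and PG_CONNECTION_STRING must be set');
}
```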
To provide real-time information, the newest tweets must be fetched frequently so the analyzer can parse them. We created a scheduled task that fetches them from the Twitter API and stores them in the database. To make sure we scrape every tweet published, we set the delay between fetch cycles to 1 minute.
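A minimal sketch of such a scheduled task with NestJS is shown below, assuming the `@nestjs/schedule` package and hypothetical `TwitterClient`/`TweetStore` abstractions; it is an illustration, not the project's actual code:

```typescript
import { Injectable } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';

// Hypothetical abstractions, for illustration only.
export abstract class TwitterClient {
  abstract fetchNewestTweets(): Promise<{ id: string; text: string }[]>;
}
export abstract class TweetStore {
  abstract saveMany(tweets: { id: string; text: string }[]): Promise<void>;
}

@Injectable()
export class TweetCrawlerService {
  constructor(
    private readonly twitter: TwitterClient, // wrapper around the Twitter API
    private readonly store: TweetStore,      // storage layer over PostgreSQL
  ) {}

  // Runs every minute: fetch the transporters' newest tweets and store them.
  @Cron(CronExpression.EVERY_MINUTE)
  async fetchLatestTweets(): Promise<void> {
    const tweets = await this.twitter.fetchNewestTweets();
    await this.store.saveMany(tweets);
  }
}
```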
In order to extract disruption data and update traffic information, we looked for a conventional data format adapted to the type of data we expose. We selected the GTFS-RT format (Documentation). This format was designed specifically to carry real-time updates to public transit information.
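For reference, a disruption in GTFS-RT is exposed as a service alert entity. The TypeScript sketch below approximates that shape for illustration; field names follow the GTFS-RT Alert message, but the real feed is serialized with Protocol Buffers and uses richer types:

```typescript
// Simplified sketch of a GTFS-RT service alert entity (the message type used
// for disruptions). The real format is Protocol Buffers, not plain objects.
interface GtfsRtAlert {
  activePeriod: { start?: number; end?: number }[];        // POSIX timestamps
  informedEntity: { routeId?: string; stopId?: string }[]; // what is affected
  cause?: string;           // e.g. "TECHNICAL_PROBLEM", "ACCIDENT"
  effect?: string;          // e.g. "DETOUR", "SIGNIFICANT_DELAYS"
  headerText?: string;      // short human-readable summary
  descriptionText?: string; // full human-readable description
}

interface GtfsRtFeedEntity {
  id: string;
  alert?: GtfsRtAlert;
}
```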
The Analyzer service is the core of the SQY-Traffic solution, so we explored several research directions in order to reach an extraction accuracy good enough for a production context.
There are several difficulties in parsing tweets to extract data from them. The main one is certainly the non-linearity of the tweet's content.
Indeed, the content can be incomplete, can be written in different formats, and can rely on a context that only a human understands. A tweet is a human-written text addressed to another human. A human can grasp all the information conveyed by the tweet even when some of it is omitted; a computer cannot do this on its own.
The first parsing prototype is based on regular expressions (RegExp):
This solution was viable because, even if we don't know the context of the tweet or some information is missing, the ways of writing specific data such as dates, intervals or the affected public transport line are limited and some patterns are repeated.
Therefore, we wrote sets of regexes to handle as many of the patterns we want to match as possible, as sketched below.
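As an illustration, here is a minimal sketch of this approach on a French-style traffic tweet; the patterns below are hypothetical examples, not the project's actual regex sets:

```typescript
// Hypothetical regex-based extraction on a French-style traffic tweet.
// The patterns are illustrative examples, not the project's actual regex sets.
const tweet = 'Ligne 415 : trafic perturbé entre 17h30 et 19h00 suite à un accident.';

// Affected line, e.g. "Ligne 415" -> "415"
const lineMatch = tweet.match(/ligne\s+([A-Z0-9]+)/i);

// Time interval, e.g. "entre 17h30 et 19h00"
const intervalMatch = tweet.match(/entre\s+(\d{1,2}h\d{2})\s+et\s+(\d{1,2}h\d{2})/i);

console.log({
  line: lineMatch?.[1],      // "415"
  start: intervalMatch?.[1], // "17h30"
  end: intervalMatch?.[2],   // "19h00"
});
```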
Here are some visual examples of what we could parse:
Creating regular expressions for every pattern we could think of is a heavy task, and adding, fixing and updating them frequently would take too much time. Another problem comes from Twitter and micro-blogging texts: they are not always written in plain words and grammatically correct language. They may use emoticons or abbreviations that make sense to a human but break pattern detection.
These issues make regular expressions unsuitable for a production context.
| NLP: Regular Expressions (RegExp) | NLP: Natural Language Understanding (NLU) |
| --- | --- |
| ✔️ | doc |
- `/api/0.0.1/twitter`: get human-readable disruptions (JSON format)
- `/api/0.0.1/disruptions`: get disruptions (GTFS-RT format)
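For example, a client could poll the JSON endpoint like this (a minimal sketch, assuming the API runs locally on port 3000 and a global `fetch` is available, e.g. Node 18+):

```typescript
// Poll the JSON endpoint of a locally running instance of the API.
// Assumes the API listens on port 3000 (see the PORT variable above).
async function getDisruptions(): Promise<unknown> {
  const response = await fetch('http://localhost:3000/api/0.0.1/twitter');
  if (!response.ok) {
    throw new Error(`API returned ${response.status}`);
  }
  return response.json();
}

getDisruptions().then((disruptions) => console.log(disruptions));
```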