Coursera Data Science Capstone Project

This application is the capstone project for the Coursera Data Science specialization, offered by Johns Hopkins University in cooperation with SwiftKey.


Project Objective

The main goal of this capstone project is to build a Shiny application that predicts the next word of a text, based on a corpus called HC Corpora.

All text mining and natural language processing was done with a variety of well-known R packages.


Applied Methods & Models

After data samples are drawn from the HC Corpora data, they are cleaned by removing punctuation, links, extra white space, numbers and all kinds of special characters. The cleaned samples are then tokenized into n-grams.
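The cleaning and tokenizing steps can be sketched roughly as follows. The project itself uses R packages; this is only a minimal Python illustration, and the function names `clean_text` and `ngrams` as well as the exact regular expressions are assumptions, not the project's code.

```python
import re

def clean_text(text):
    """Lowercase the text and strip links, numbers, punctuation and extra white space."""
    text = text.lower()
    # Remove links first, before their punctuation is stripped (assumed pattern).
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Replace everything that is not a letter or white space
    # (numbers, punctuation, special characters) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of white space into a single space.
    return re.sub(r"\s+", " ", text).strip()

def ngrams(text, n):
    """Return all contiguous n-word sequences of the cleaned text as tuples."""
    tokens = clean_text(text).split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(clean_text("Visit http://example.com now!! 42 times."))
print(ngrams("I love New York", 2))
```

Note that this simple cleaner also splits contractions such as "don't" into two tokens; a real pipeline would likely treat apostrophes more carefully.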

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. (Source)

The aggregated bigram and trigram term frequency matrices are then converted into frequency dictionaries.
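One possible shape for such a frequency dictionary is a mapping from the leading words of an n-gram to counts of the word that follows. The layout and the name `build_frequency_dict` are assumptions for illustration (in Python rather than the project's R), not a description of the actual data frames.

```python
from collections import Counter, defaultdict

def build_frequency_dict(sentences, n):
    """Map each (n-1)-word prefix to a Counter of the words observed after it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            table[prefix][next_word] += 1
    return table

# Tiny made-up sample, already cleaned and lowercased.
sample = ["thanks for the follow", "thanks for the help", "thanks for the follow"]
trigrams = build_frequency_dict(sample, 3)
print(dict(trigrams[("for", "the")]))
```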

The resulting data frames are used to predict the next word: the last words of the text entered by the user are looked up in the n-gram tables, and the prediction is chosen according to the frequencies of the matching n-grams.
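A lookup of this kind might look like the sketch below, written in Python for illustration. Falling back from the trigram table to the bigram table when no match is found is an assumption of this sketch, as are the names `build_table` and `predict_next`; the description above does not say how the application combines its tables.

```python
from collections import Counter, defaultdict

def build_table(sentences, n):
    """Map each (n-1)-word prefix to a Counter of the words observed after it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
    return table

def predict_next(text, tables):
    """Return the most frequent follower of the last words of `text`, or None.

    `tables` maps n (2 or larger) to its frequency table. Longer n-grams are
    tried first (assumed fall-back order).
    """
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):
        prefix = tuple(tokens[-(n - 1):])
        if len(prefix) < n - 1:
            continue  # not enough words typed yet for this table
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None

# Tiny made-up sample, already cleaned and lowercased.
sample = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow",
    "the weather is nice",
]
tables = {2: build_table(sample, 2), 3: build_table(sample, 3)}
print(predict_next("Thanks for the", tables))
```

In a Shiny application, a function like this would be called each time the text input changes, so that the displayed prediction stays in sync with what the user types.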


Usage Of The Application

The user interface of this application works as follows:
When text is entered (1), the field with the predicted next word (2) refreshes instantaneously, and the whole text input is displayed together with the suggested completion (3), as shown in the screenshot below.

Application Screenshot


Additional Information