The use of the semantic web and a number of participatory platforms, such as web, text, and opinion mining, online social networks, blogs, wikis, and forums, offers decision-makers in the public domain, government institutions, and civil society stakeholders the potential to bring about major improvements in the way future communities operate. The changing technical landscape is transforming how information is transmitted and how expertise is exchanged among participants in public administration and within civil society. The exponential growth of the internet and the World Wide Web has produced a large volume of machine-readable electronic information. This growing body of text comes from many different domains and has been increasing steadily over time. Around 85% of company knowledge is available in digital text format.

According to Hearst, text mining is the discovery by computer of new, previously unknown knowledge by automatically extracting information from different written sources. A key aspect is linking the extracted information together to form new facts or new hypotheses, which are then investigated further by more conventional means of experimentation. Text mining is a young, primarily interdisciplinary area of computer science, with close links to related areas such as data processing, machine learning, information management, and computational linguistics. Text mining differs from data mining in its input: data mining derives interesting patterns from structured data (databases), whereas text mining processes semi-structured data (XML files) or unstructured data (natural language text) and uncovers hidden, meaningful information.

With this background in mind, we decided to create a project for natural language processing. Our project consists of three phases, namely:
- Text Extraction: The input data for this phase comes in XML file format. Relevant information is extracted automatically from these input files, and the retrieved text is then saved to separate text files.
- Preprocessing / Text Cleansing: The next step is to clean the extracted text. This preprocessing module ensures that the data is ready for analysis. Several preprocessing methods are involved in text cleaning.
- Word Expansion and Matching: A list of keywords is created at the end of the preprocessing step. This list of words originates from a file created by a model operation. A single file cannot provide sufficient information to generate the knowledge elements we need for a domain-specific ontology. To enrich the vocabulary with the necessary elements of understanding, we extract related terms (synonyms) from the WordNet dictionary created by Princeton University.
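
The three phases above can be sketched end to end in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's actual implementation: the XML tag name `description`, the sample document, the stop-word list, and the small synonym table are all hypothetical. The synonym table is a toy stand-in for WordNet so that the sketch runs without external data, and writing the extracted text to separate files is omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "on", "for"}

# Toy stand-in for WordNet lookups (assumption, not real WordNet data).
TOY_SYNONYMS = {
    "car": ["automobile", "auto"],
    "road": ["route"],
}

# Hypothetical input; the real XML schema is not given in the text.
SAMPLE_XML = """<model>
  <element><description>The car is on the road.</description></element>
  <element><description>A road for the car and the bus.</description></element>
</model>"""


def extract_text(xml_string, tag="description"):
    """Phase 1: collect the text of every element with the given tag."""
    root = ET.fromstring(xml_string)
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]


def preprocess(text, stop_words=STOP_WORDS):
    """Phase 2: lowercase, tokenize, drop stop words and duplicates."""
    tokens = re.findall(r"[a-z]+", text.lower())
    keywords = []
    for tok in tokens:
        if tok not in stop_words and tok not in keywords:
            keywords.append(tok)  # keep first-seen order
    return keywords


def expand(keywords, lookup=TOY_SYNONYMS.get):
    """Phase 3: map each keyword to its related terms (possibly none)."""
    return {kw: list(lookup(kw) or []) for kw in keywords}


if __name__ == "__main__":
    texts = extract_text(SAMPLE_XML)
    keywords = preprocess(" ".join(texts))
    print(expand(keywords))
```

The `lookup` parameter keeps the synonym source pluggable. With NLTK installed and its WordNet corpus downloaded, a function built on `nltk.corpus.wordnet.synsets` could be passed in place of the toy table.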