ThiagoGrabe/PowerBI_report-AWS_integration


Data Analysis ENEM 2017

ENEM2017

This project analyzes ENEM 2017 data. All data are public, and none of the information here is confidential.

This notebook is structured as follows:

1. Loading data;
2. Data cleaning and selection using the PySpark framework;
3. Creating a PostgreSQL database on AWS;
4. Data selection and feature engineering;
5. Presentation in Power BI.

Analysis goal

To achieve the proposed milestones, some features will not be used or even analysed. This work analyses two main aspects of ENEM 2017:

* Student scores in all subjects per region and city
* Social aspects and how they relate to performance.

With these two aspects, I intend to shed light on the following questions:

    • What is the average score by city and state?
    • Is there any bias related to the state? Can we relate it to some other feature?
    • Which cities present the best and the worst results?
    • What about population size? Does it affect the results?
    • Looking at some social aspects, how is self-declaration related to the final score?
    • Is gender also an important feature when looking at the scores?
    • Do people from public schools perform at the same level as those from private schools?

At the end of this work, some insights about the topic and the questions are presented. The work then tries to answer all of them, with charts illustrating the results in a Power BI dashboard.

Data Modeling

The data modeling for this project is simple and is based on the following flow.

Flow

In this work, the data is prepared locally using PySpark, so that Apache Spark serves as a distributed framework that can handle big data analysis. I also use PostgreSQL on an AWS RDS instance so that the data can be accessed in the cloud.

After some feature selection, this analysis is based on the final ENEM scores. In other words, students who did not complete the exam, and who are therefore eliminated from the college selection process or cannot use the exam to conclude Brazilian high school, have their scores dropped and are no longer in the database. This approach is important to remove outliers from the data and examine only people who took the exam.
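The filter described above can be sketched in plain Python. This is only a stand-in for the PySpark step, not the notebook's actual code: it assumes that a student who did not complete the exam shows up with at least one empty score column, and that the raw file is semicolon-delimited. The helper names and the sample rows are hypothetical. On a PySpark DataFrame, a comparable filter would be `df.dropna(subset=SCORE_COLUMNS)`.

```python
import csv
import io

# Score columns taken from the feature list in this README.
SCORE_COLUMNS = [
    "NU_NOTA_CN",
    "NU_NOTA_CH",
    "NU_NOTA_LC",
    "NU_NOTA_MT",
    "NU_NOTA_REDACAO",
]


def has_all_scores(row):
    """Return True when every score column holds a non-empty value."""
    return all((row.get(col) or "").strip() != "" for col in SCORE_COLUMNS)


def keep_complete_rows(csv_text, delimiter=";"):
    """Parse CSV text and keep only rows where all five scores are present.

    Assumption: the delimiter is ';'. Adjust it to match the real file.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [row for row in reader if has_all_scores(row)]


# Made-up sample rows, for illustration only (not real ENEM records).
SAMPLE = (
    "NU_INSCRICAO;SG_UF_RESIDENCIA;NU_NOTA_CN;NU_NOTA_CH;"
    "NU_NOTA_LC;NU_NOTA_MT;NU_NOTA_REDACAO\n"
    "1;MG;500.1;520.0;510.5;480.2;600\n"
    "2;SP;;;;;\n"
    "3;BA;450.0;470.3;;430.9;520\n"
)

if __name__ == "__main__":
    kept = keep_complete_rows(SAMPLE)
    print([row["NU_INSCRICAO"] for row in kept])
```

Rows 2 and 3 in the sample are dropped because at least one score is missing, which mirrors the idea of keeping only students with a complete set of results.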

| Feature | Type |
| --- | --- |
| NU_INSCRICAO | float64 |
| SG_UF_RESIDENCIA | string |
| CO_MUNICIPIO_RESIDENCIA | int |
| CO_MUNICIPIO_ESC | int |
| TP_SEXO | int |
| TP_COR_RACA | int |
| TP_ENSINO | int |
| TP_ESCOLA | int |
| TP_ANO_CONCLUIU | int |
| TP_LINGUA | int |
| TP_DEPENDENCIA_ADM_ESC | int |
| NU_NOTA_CN | float |
| NU_NOTA_CH | float |
| NU_NOTA_LC | float |
| NU_NOTA_MT | float |
| NU_NOTA_REDACAO | float |

Database - PostgreSQL

The database is very simple: it has a main table called enem2017 and 5 auxiliary tables that hold attributes and social aspects. The database architecture is described below:

1. enem2017 table:

| Feature | Type |
| --- | --- |
| NU_INSCRICAO | bigint (PK) |
| SG_UF_RESIDENCIA | text |
| CO_MUNICIPIO_RESIDENCIA | numeric |
| CO_MUNICIPIO_ESC | numeric |
| TP_SEXO | numeric |
| TP_COR_RACA | numeric |
| TP_ENSINO | numeric |
| TP_ESCOLA | numeric |
| TP_ANO_CONCLUIU | numeric |
| TP_LINGUA | numeric |
| TP_DEPENDENCIA_ADM_ESC | numeric |
| NU_NOTA_CN | numeric |
| NU_NOTA_CH | numeric |
| NU_NOTA_LC | numeric |
| NU_NOTA_MT | numeric |
| NU_NOTA_REDACAO | numeric |

2. cities table:

| Feature | Type |
| --- | --- |
| CO_MUNICIPIO_RESIDENCIA | bigint (PK) |
| AREA_MUNICIPIO | numeric |
| POP_MUNICIPIO | numeric |
| IDH_MUNICIPIO | numeric |
| INCOME_MUNICIPIO_X1000 | numeric |
| COST_MUNICIPIO_X1000 | numeric |
| PIB_MUNICIPIO_PER_CAPITA | numeric |

3. anoEnem table:

| Feature | Type |
| --- | --- |
| TP_ANO_CONCLUIU | numeric (PK) |
| DESCRICAO | text |

4. ensinoTipo table:

| Feature | Type |
| --- | --- |
| TP_ENSINO | numeric (PK) |
| DESCRICAO | text |

5. corRaca table:

| Feature | Type |
| --- | --- |
| TP_COR_RACA | numeric (PK) |
| DESCRICAO | text |

6. _escolaTipo table:

| Feature | Type |
| --- | --- |
| TP_DEPENDENCIA_ADM_ESC | numeric (PK) |
| DESCRICAO | text |
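To illustrate how the main table and an auxiliary table fit together, here is a minimal sketch of a join that averages one score per city. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server. The reduced column set, the municipality codes, and the inserted rows are made up for illustration, and the query is an assumption rather than one of the queries stored in query.txt.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a reduced version of two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE enem2017 (
            NU_INSCRICAO INTEGER PRIMARY KEY,
            SG_UF_RESIDENCIA TEXT,
            CO_MUNICIPIO_RESIDENCIA INTEGER,
            NU_NOTA_MT REAL
        );
        CREATE TABLE cities (
            CO_MUNICIPIO_RESIDENCIA INTEGER PRIMARY KEY,
            POP_MUNICIPIO REAL,
            IDH_MUNICIPIO REAL
        );
        """
    )
    # Made-up rows, for illustration only (not real ENEM or city data).
    conn.executemany(
        "INSERT INTO cities VALUES (?, ?, ?)",
        [(1, 1000, 0.70), (2, 5000, 0.80)],
    )
    conn.executemany(
        "INSERT INTO enem2017 VALUES (?, ?, ?, ?)",
        [(10, "MG", 1, 400.0), (11, "MG", 1, 500.0), (12, "SP", 2, 600.0)],
    )
    return conn


# Average math score per city, joined with the city's HDI.
AVG_BY_CITY = """
    SELECT e.CO_MUNICIPIO_RESIDENCIA,
           c.IDH_MUNICIPIO,
           AVG(e.NU_NOTA_MT) AS avg_mt
    FROM enem2017 AS e
    JOIN cities AS c
      ON c.CO_MUNICIPIO_RESIDENCIA = e.CO_MUNICIPIO_RESIDENCIA
    GROUP BY e.CO_MUNICIPIO_RESIDENCIA, c.IDH_MUNICIPIO
    ORDER BY e.CO_MUNICIPIO_RESIDENCIA
"""

if __name__ == "__main__":
    conn = build_demo_db()
    for row in conn.execute(AVG_BY_CITY):
        print(row)
    conn.close()
```

The same join-and-aggregate pattern should carry over to the other auxiliary tables, for example joining corRaca on TP_COR_RACA to label groups by their DESCRICAO.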

How To

To access the data and the analysis, download the Power BI report:

ENEM 2017 - POWER BI REPORT

The database is instantiated on AWS RDS. All configuration properties are set in the config.py file.
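The contents of config.py are not shown here, so the following is only a hypothetical sketch of what such a file could look like. It reads connection settings from environment variables and builds a PostgreSQL connection URI. Every variable name, default value, and function name below is an assumption made for illustration.

```python
import os
from urllib.parse import quote

# Hypothetical setting names and defaults; the real config.py may differ.
DEFAULTS = {
    "ENEM_DB_HOST": "localhost",
    "ENEM_DB_PORT": "5432",
    "ENEM_DB_NAME": "enem",
    "ENEM_DB_USER": "postgres",
    "ENEM_DB_PASSWORD": "",
}


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), with defaults."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


def build_url(settings):
    """Build a postgresql:// URI, percent-encoding the user and password."""
    user = quote(settings["ENEM_DB_USER"], safe="")
    password = quote(settings["ENEM_DB_PASSWORD"], safe="")
    auth = f"{user}:{password}" if password else user
    host = settings["ENEM_DB_HOST"]
    port = settings["ENEM_DB_PORT"]
    name = settings["ENEM_DB_NAME"]
    return f"postgresql://{auth}@{host}:{port}/{name}"


if __name__ == "__main__":
    print(build_url(load_settings()))
```

Keeping credentials in environment variables rather than in the file itself is a design choice of this sketch: it avoids committing the RDS password to the repository.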

If you want to recreate the database, the following frameworks/packages are necessary:

Python 3.6 - pyspark - numpy - pandas - matplotlib - xlrd - seaborn - jupyter

PostgreSQL 12 - All queries are stored in the query.txt file.

Power BI Desktop

Results preview

Some results are worth presenting. Brazil is a huge country with many different regions and social realities. It is a unique and beautiful country, but also an unequal one.

A correlation heatmap was created covering social aspects such as HDI, total population, type of school, gender and gender identity.

corrHeatmap

The higher the HDI, the better the average ENEM score. This affects people's lives, because this exam is the one used for admission to all free universities in Brazil, and a large part of the population cannot afford a private university.
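Each cell of a correlation heatmap is a correlation coefficient between two variables. As a minimal sketch, the Pearson coefficient between HDI and average score can be computed by hand as shown below. The numbers are made up for illustration and are not results from this analysis. The notebook itself may well have used pandas and seaborn (both listed in the dependencies), but that is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up city-level values, for illustration only.
hdi = [0.60, 0.65, 0.70, 0.75, 0.80]
avg_score = [470.0, 485.0, 500.0, 520.0, 540.0]

if __name__ == "__main__":
    print(round(pearson(hdi, avg_score), 3))
```

A value close to 1 would indicate that cities with a higher HDI tend to have higher average scores, which is the kind of relationship the heatmap is meant to surface.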

Another important result across Brazil's regions is that the South and Southeast have a higher average ENEM score than the other regions. It can also be seen that São Paulo, Paraná, Belo Horizonte and DF have the highest scores.

BrazilMap-scores

For further insights, we invite you to download the Power BI dashboard and explore the analysis.
