Data Analysis ENEM 2017

This project is an analysis of ENEM 2017 data. All data are public, and none of the information here is confidential.

This notebook is structured as follows:

1. Loading data;
2. Data cleaning and selection using the PySpark framework;
3. Creating a PostgreSQL database on AWS;
4. Data selection and feature engineering;
5. Presentation in Power BI.

Analysis goal

To achieve the proposed milestones, some features will not be used or even analysed. This work analyses two main aspects of ENEM 2017:

* Student scores in all subjects per region and city
* Social aspects and how they relate to performance.

With these two aspects, I intend to address the following questions:

- What is the average score by city and state?
- Is there any bias related to the state? Can it be related to some other feature?
- Which cities present the best and the worst results?
- What about population size? Does it affect the results?
- Looking at social aspects, how is self-declaration related to the final score?
- Is gender also an important feature when looking at the scores?
- Do students from public schools perform at the same level as students from private schools?

At the end of this work, some insights about the topic and the questions are presented. The work then tries to answer all of them, using charts in a Power BI dashboard to illustrate the results.

Data Modeling

The data modeling for this project is simple and it is based on the following flow.

In this work, the data is prepared locally with PySpark, which provides Apache Spark as a distributed framework that can handle Big Data analysis. I also use PostgreSQL on an AWS RDS instance so the data can be accessed in the cloud.

After some feature selection, this analysis is based on the final ENEM scores. In other words, students who did not complete the exam, and who are therefore eliminated from the college selection process or cannot conclude Brazilian high school, have their scores dropped and are no longer in the database. This approach is important to cut outliers from the data and examine only people who attended the exam.
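The filtering rule described above can be sketched as follows. The notebook performs this step with PySpark; this version uses only the Python standard library to illustrate the logic, so it is not the project's actual code. The helper names `has_all_scores` and `keep_complete_rows`, the `;` delimiter, and the sample rows are hypothetical, and the sketch assumes that a missing score appears as an empty string in the raw file.

```python
import csv
import io

# Score columns taken from the feature table in this README.
SCORE_COLUMNS = [
    "NU_NOTA_CN",
    "NU_NOTA_CH",
    "NU_NOTA_LC",
    "NU_NOTA_MT",
    "NU_NOTA_REDACAO",
]


def has_all_scores(row):
    """Return True when every score column holds a non-empty value.

    Assumption: a missing score is an empty string in the CSV export.
    """
    return all((row.get(col) or "").strip() != "" for col in SCORE_COLUMNS)


def keep_complete_rows(rows):
    """Keep only the candidates who have a score in every subject."""
    return [row for row in rows if has_all_scores(row)]


# Made-up sample rows (not real ENEM data).
SAMPLE = """NU_INSCRICAO;NU_NOTA_CN;NU_NOTA_CH;NU_NOTA_LC;NU_NOTA_MT;NU_NOTA_REDACAO
1;500.0;510.5;495.2;600.1;640
2;;;480.0;;
3;450.3;470.0;460.8;520.9;
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))
complete = keep_complete_rows(rows)
print([row["NU_INSCRICAO"] for row in complete])
```

Only the first sample row survives, because the other two are missing at least one score.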

Features	Type
NU_INSCRICAO	float64
SG_UF_RESIDENCIA	string
CO_MUNICIPIO_RESIDENCIA	int
CO_MUNICIPIO_ESC	int
TP_SEXO	int
TP_COR_RACA	int
TP_ENSINO	int
TP_ESCOLA	int
TP_ANO_CONCLUIU	int
TP_LINGUA	int
TP_DEPENDENCIA_ADM_ESC	int
NU_NOTA_CN	float
NU_NOTA_CH	float
NU_NOTA_LC	float
NU_NOTA_MT	float
NU_NOTA_REDACAO	float

Database - Postgresql

The database is very simple: it has a main table called enem2017 and 5 auxiliary tables that hold attributes and social aspects. The database architecture is described below:

enem2017 Table

Feature	Type
NU_INSCRICAO	bigint (PK)
SG_UF_RESIDENCIA	text
CO_MUNICIPIO_RESIDENCIA	numeric
CO_MUNICIPIO_ESC	numeric
TP_SEXO	numeric
TP_COR_RACA	numeric
TP_ENSINO	numeric
TP_ESCOLA	numeric
TP_ANO_CONCLUIU	numeric
TP_LINGUA	numeric
TP_DEPENDENCIA_ADM_ESC	numeric
NU_NOTA_CN	numeric
NU_NOTA_CH	numeric
NU_NOTA_LC	numeric
NU_NOTA_MT	numeric
NU_NOTA_REDACAO	numeric

cities Table:

Feature	Type
CO_MUNICIPIO_RESIDENCIA	bigint (PK)
AREA_MUNICIPIO	numeric
POP_MUNICIPIO	numeric
IDH_MUNICIPIO	numeric
INCOME_MUNICIPIO_X1000	numeric
COST_MUNICIPIO_X1000	numeric
PIB_MUNICIPIO_PER_CAPITA	numeric

anoEnem Table:

Feature	Type
TP_ANO_CONCLUIU	numeric (PK)
DESCRICAO	text

ensinoTipo Table:

Feature	Type
TP_ENSINO	numeric (PK)
DESCRICAO	text

corRaca Table:

Feature	Type
TP_COR_RACA	numeric (PK)
DESCRICAO	text

_escolaTipo Table:

Feature	Type
TP_DEPENDENCIA_ADM_ESC	numeric (PK)
DESCRICAO	text
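As a rough illustration of how the main table and an auxiliary table fit together, the sketch below builds a reduced version of enem2017 and corRaca and joins them. It uses an in-memory SQLite database only as a stand-in for the PostgreSQL instance, keeps just a few columns, and assumes a foreign-key relationship on TP_COR_RACA that this README does not state explicitly. The inserted rows and descriptions are made up.

```python
import sqlite3

# In-memory SQLite is used here only as a stand-in for PostgreSQL on AWS RDS.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE corRaca (
        TP_COR_RACA INTEGER PRIMARY KEY,
        DESCRICAO   TEXT
    );
    CREATE TABLE enem2017 (
        NU_INSCRICAO     INTEGER PRIMARY KEY,
        SG_UF_RESIDENCIA TEXT,
        -- Assumed relationship; the README lists the key but not a constraint.
        TP_COR_RACA      INTEGER REFERENCES corRaca (TP_COR_RACA),
        NU_NOTA_MT       REAL
    );
    """
)

# Made-up rows for illustration only (not real ENEM data or real codes).
conn.executemany(
    "INSERT INTO corRaca VALUES (?, ?)",
    [(1, "label A"), (2, "label B")],
)
conn.executemany(
    "INSERT INTO enem2017 VALUES (?, ?, ?, ?)",
    [(10, "XX", 1, 500.0), (11, "XX", 2, 600.0), (12, "YY", 1, 700.0)],
)

# Average math score per description, resolved through the lookup table.
query = """
    SELECT c.DESCRICAO, AVG(e.NU_NOTA_MT)
    FROM enem2017 AS e
    JOIN corRaca AS c ON c.TP_COR_RACA = e.TP_COR_RACA
    GROUP BY c.DESCRICAO
    ORDER BY c.DESCRICAO
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

The same join pattern would apply to the other auxiliary tables, each keyed on the column it shares with enem2017.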

How To

To access the data and the analysis, download the Power BI report:

ENEM 2017 - POWER BI REPORT

The database is instantiated on AWS RDS. All configuration properties are set in the config.py file.

If you want to recreate the database, the following frameworks/packages are necessary:

Python 3.6: pyspark, numpy, pandas, matplotlib, xlrd, seaborn, jupyter

PostgreSQL 12: all queries are stored in the query.txt file.

Power BI Desktop

Results preview

Some results are important to present. Brazil is a huge country with many different regions and social aspects. It is a unique and beautiful country, but also uneven.

A correlation heatmap was created to relate the scores to social aspects such as HDI, total population, type of school, gender and gender identity.
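Each cell of such a heatmap is typically a correlation coefficient between two columns. The sketch below computes the Pearson coefficient by hand for one pair of columns; whether the project used Pearson or another measure is an assumption, and the function name `pearson` and the input values are hypothetical and synthetic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic values for illustration only; these are not ENEM results.
feature_values = [0.60, 0.65, 0.70, 0.75, 0.80]
score_values = [480.0, 495.0, 510.0, 525.0, 540.0]
r = pearson(feature_values, score_values)
print(round(r, 6))
```

Repeating this for every pair of columns gives the matrix that a heatmap visualises.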

The higher the HDI, the better the average ENEM score. This affects people's lives, because this exam is the one used for admission to all free universities in Brazil, and a large part of the population cannot afford a private university.

Another important result, considering Brazil's regions, is that the South and Southeast have a higher average ENEM score than the other regions. It can also be seen that São Paulo, Paraná, Belo Horizonte and DF have the highest scores.

For further insights, we invite you to download the Power BI dashboard and explore the analysis.

ThiagoGrabe/PowerBI_report-AWS_integration

Folders and files

Latest commit

History

Repository files navigation

Data Analysis ENEM 2017

Analysis goal

Data Modeling

Database - Postgresql

How To

Results preview

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages