This project analyzes ENEM 2017 data. All data are public, and none of the information here is confidential.
This notebook is structured as follows:
1. Loading data;
2. Data cleaning and selection using the PySpark framework;
3. Creating a PostgreSQL database on AWS;
4. Data selection and feature engineering;
5. Presentation in Power BI.
To achieve the proposed milestones, some features will not be used or even analysed. This work consists of an analysis of two main aspects of ENEM 2017:
* Student scores on all subjects per region and city
* Social aspects and how they relate to performance.
With these two aspects, I intend to answer the following questions:
- What is the average score by city and state?
- Is there any bias related to the state? Can we relate it to some other feature?
- Which cities present the best and the worst results?
- What about population size? Does it impact the results?
- Looking at some social aspects, how is self-declaration related to the final score?
- Is gender also an important feature when looking at the scores?
- Do people from public schools perform at the same level as people from private schools?
At the end of this work, some insights about the topic and the questions are presented. The work then tries to answer each question, using charts in a Power BI dashboard to illustrate the results.
The data modeling for this project is simple and it is based on the following flow.
In this work, the data is prepared locally using PySpark, which provides Apache Spark as a distributed framework that can handle big data analysis. I also use PostgreSQL on an AWS RDS instance so that the data can be accessed from the cloud.
After some feature selection, the analysis is restricted to the final ENEM scores. In other words, students who did not complete the exam, and who are therefore eliminated from the college selection process or cannot conclude Brazilian high school, have their scores dropped and are no longer in the database. This approach removes outliers from the data and examines only people who attended the exam.
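The filtering step described above can be sketched as follows. This is a minimal illustration in plain Python so that it runs without a Spark session; it is not the notebook's actual code. It assumes that an incomplete exam shows up as a missing value in at least one score column, and the helper names `has_all_scores` and `keep_complete_exams` are hypothetical. In PySpark, a similar effect could likely be obtained with `DataFrame.dropna(subset=...)` over the same columns.

```python
# Score columns taken from the feature table in this README.
SCORE_COLUMNS = [
    "NU_NOTA_CN", "NU_NOTA_CH", "NU_NOTA_LC", "NU_NOTA_MT", "NU_NOTA_REDACAO",
]


def has_all_scores(row):
    """Return True when every score column holds a non-empty value."""
    return all(row.get(col) not in (None, "") for col in SCORE_COLUMNS)


def keep_complete_exams(rows):
    """Keep only the rows of students with a score in every subject."""
    return [row for row in rows if has_all_scores(row)]


# Made-up rows for illustration only; these are not real ENEM records.
sample_rows = [
    {"NU_INSCRICAO": 1, "NU_NOTA_CN": 500.0, "NU_NOTA_CH": 510.0,
     "NU_NOTA_LC": 520.0, "NU_NOTA_MT": 530.0, "NU_NOTA_REDACAO": 600.0},
    {"NU_INSCRICAO": 2, "NU_NOTA_CN": None, "NU_NOTA_CH": 480.0,
     "NU_NOTA_LC": 470.0, "NU_NOTA_MT": None, "NU_NOTA_REDACAO": 0.0},
]

complete_rows = keep_complete_exams(sample_rows)
print([row["NU_INSCRICAO"] for row in complete_rows])
```

Note that a score of 0.0 counts as present in this sketch; only missing values cause a row to be dropped.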
Features | Type |
---|---|
NU_INSCRICAO | float64 |
SG_UF_RESIDENCIA | string |
CO_MUNICIPIO_RESIDENCIA | int |
CO_MUNICIPIO_ESC | int |
TP_SEXO | int |
TP_COR_RACA | int |
TP_ENSINO | int |
TP_ESCOLA | int |
TP_ANO_CONCLUIU | int |
TP_LINGUA | int |
TP_DEPENDENCIA_ADM_ESC | int |
NU_NOTA_CN | float |
NU_NOTA_CH | float |
NU_NOTA_LC | float |
NU_NOTA_MT | float |
NU_NOTA_REDACAO | float |
The database is very simple: it has a main table called enem2017 and 5 auxiliary tables that hold attributes and social aspects. The database architecture is described below:
- enem2017 Table:
Feature | Type |
---|---|
NU_INSCRICAO | bigint (PK) |
SG_UF_RESIDENCIA | text |
CO_MUNICIPIO_RESIDENCIA | numeric |
CO_MUNICIPIO_ESC | numeric |
TP_SEXO | numeric |
TP_COR_RACA | numeric |
TP_ENSINO | numeric |
TP_ESCOLA | numeric |
TP_ANO_CONCLUIU | numeric |
TP_LINGUA | numeric |
TP_DEPENDENCIA_ADM_ESC | numeric |
NU_NOTA_CN | numeric |
NU_NOTA_CH | numeric |
NU_NOTA_LC | numeric |
NU_NOTA_MT | numeric |
NU_NOTA_REDACAO | numeric |
- cities Table:
Feature | Type |
---|---|
CO_MUNICIPIO_RESIDENCIA | bigint (PK) |
AREA_MUNICIPIO | numeric |
POP_MUNICIPIO | numeric |
IDH_MUNICIPIO | numeric |
INCOME_MUNICIPIO_X1000 | numeric |
COST_MUNICIPIO_X1000 | numeric |
PIB_MUNICIPIO_PER_CAPITA | numeric |
- anoEnem Table:
Feature | Type |
---|---|
TP_ANO_CONCLUIU | numeric (PK) |
DESCRICAO | text |
- ensinoTipo Table:
Feature | Type |
---|---|
TP_ENSINO | numeric (PK) |
DESCRICAO | text |
- corRaca Table:
Feature | Type |
---|---|
TP_COR_RACA | numeric (PK) |
DESCRICAO | text |
- _escolaTipo Table:
Feature | Type |
---|---|
TP_DEPENDENCIA_ADM_ESC | numeric (PK) |
DESCRICAO | text |
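To show how the main table and a lookup table fit together, here is a small sketch that joins enem2017 with corRaca and averages one score per group. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without an RDS instance. Only a few columns are included, the foreign-key reference is an assumption about how the tables relate, and the rows and `DESCRICAO` labels are made-up placeholders, not real ENEM data.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE corRaca (
        TP_COR_RACA INTEGER PRIMARY KEY,
        DESCRICAO   TEXT
    );
    -- Reduced version of enem2017; the REFERENCES clause is an assumption.
    CREATE TABLE enem2017 (
        NU_INSCRICAO     INTEGER PRIMARY KEY,
        SG_UF_RESIDENCIA TEXT,
        TP_COR_RACA      INTEGER REFERENCES corRaca (TP_COR_RACA),
        NU_NOTA_MT       REAL
    );
""")

# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO corRaca VALUES (?, ?)",
    [(1, "group A"), (2, "group B")],
)
conn.executemany(
    "INSERT INTO enem2017 VALUES (?, ?, ?, ?)",
    [(1, "SP", 1, 500.0), (2, "SP", 1, 600.0), (3, "MG", 2, 450.0)],
)

# Average math score per lookup description.
avg_by_group = conn.execute("""
    SELECT c.DESCRICAO, AVG(e.NU_NOTA_MT)
    FROM enem2017 AS e
    JOIN corRaca AS c ON c.TP_COR_RACA = e.TP_COR_RACA
    GROUP BY c.DESCRICAO
    ORDER BY c.DESCRICAO
""").fetchall()
print(avg_by_group)
```

The same join pattern would apply to the other auxiliary tables, each keyed by the code column it shares with enem2017.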
To access the data and the analysis, it is necessary to download the Power BI report.
The database is instantiated on AWS RDS. All configuration properties are set in the config.py file.
If you want to recreate the database, the following frameworks/packages are necessary:
- Python 3.6: pyspark, numpy, pandas, matplotlib, xlrd, seaborn, jupyter
- PostgreSQL 12 (all queries are stored in the query.txt file)
- Power BI Desktop
Some results are worth presenting. Brazil is a huge country with many different regions and social aspects. It is a unique and beautiful country, but also an unequal one.
A correlation heatmap was created covering social aspects such as HDI, total population, type of school, gender and gender identity.
The higher the HDI, the better the average ENEM score. This affects people's lives, because in Brazil this exam is the one used for admission to all free universities, and a large part of the population cannot afford a private university.
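A relationship like the one between HDI and average score can be quantified with a correlation coefficient, which is the kind of value a correlation heatmap typically displays. The sketch below computes a Pearson coefficient in plain Python. It assumes Pearson correlation is the measure of interest, which this README does not state, and the input values are made up purely to exercise the function; they are not results from the analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally sized sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up, perfectly linear values (hypothetical, not ENEM results).
hdi = [0.60, 0.70, 0.80]
avg_score = [480.0, 500.0, 520.0]

r = pearson(hdi, avg_score)
print(round(r, 6))
```

A value close to 1 indicates that the two quantities rise together, a value close to -1 that one falls as the other rises, and a value near 0 that there is no linear relationship.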
Another important result, considering Brazil's regions, is that the South and Southeast have a higher average ENEM score than the other regions. It can also be seen that São Paulo, Paraná, Belo Horizonte and DF have the highest scores.
For further insights, we invite you to download the Power BI dashboard and explore the analysis.