---
title: "Study of Interaction Patterns in Primary School Children Network"
author: "Utsab Shrestha"
date: "May 25, 2020"
output: html_document
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
    encoding=encoding,
    output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
---
```{r global_options, include = FALSE}
library(knitr)
setwd("~/Github Project R/InteractPattern")
read_chunk("proj_data.R")
knitr::opts_chunk$set(echo = FALSE,warning = FALSE,message = FALSE, fig.align="center")
```
## Introduction
Human interaction and behavior have been a fascinating and challenging subject of study in recent times. Social network companies, consumer markets, and the medical industry study human behavior to predict consumer needs, recommend goods, optimize marketing strategies, or, in the medical case, diagnose and study disease patterns. Child psychology and children's interaction behavior differ greatly from adult behavior. Studying children's interaction patterns is important for understanding and improving child development. Such studies have been used to understand child development as well as the spread of highly communicable diseases such as influenza, hepatitis, and measles. Children are prone to these diseases, and studying their interaction patterns helps trace how the diseases propagate and evaluate control measures. Through this study, we want to find how age, gender, and grade affect interaction patterns in children.
## Data
Data for this project was obtained from the SocioPatterns website (sociopatterns.org). The dataset contains two networks, one for each day, of face-to-face interactions between students and teachers. The interactions were collected at a French school in 2009 using radio frequency identification devices that recognize an interaction through proximity sensors. The students in the data are aged 6 to 12 and attend grades 1 to 5. There were 232 children and 10 teachers, involved in 77,602 contacts with each other over the course of two days. There were 10 classes (grades 1 to 5, each with sections A and B), and each class had around 25 students. On average, each child had around 165 interactions and spent 176 minutes in interaction per day. Gender and grade were provided for the children, along with the count and duration of interactions. For a contact to be recorded between two individuals, they must be within a certain proximity for at least 20 seconds. A packet of information is sent every 20 seconds. A contact ends once the two individuals move beyond the defined proximity; if they come into contact again, a new contact is counted and its time is added to the pair's previous duration.
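
The counting rule described above can be sketched in a few lines of base R. This is only an illustration: the table of contact events, its column names (`from`, `to`, `duration_s`), and its values are invented here and are not the SocioPatterns schema.

```r
# Hypothetical contact events: one row per contact between a pair.
# Durations are multiples of the 20-second detection interval.
contacts <- data.frame(
  from       = c("a", "b", "b", "a"),
  to         = c("b", "a", "c", "c"),
  duration_s = c(20, 60, 40, 20),
  stringsAsFactors = FALSE
)

# Treat pairs as undirected by ordering the two ids within each row.
contacts$i <- pmin(contacts$from, contacts$to)
contacts$j <- pmax(contacts$from, contacts$to)

# Each separate contact adds one to the pair's count ...
n_contacts <- aggregate(duration_s ~ i + j, data = contacts, FUN = length)
names(n_contacts)[3] <- "n_contacts"

# ... and its duration is added to the pair's running total.
total_time <- aggregate(duration_s ~ i + j, data = contacts, FUN = sum)
names(total_time)[3] <- "total_s"

pair_summary <- merge(n_contacts, total_time, by = c("i", "j"))
print(pair_summary)
```

In this toy table the pair `a`/`b` meets twice, so it ends up with two contacts and a combined duration of 80 seconds.
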
## Analysis
The analysis focused on finding patterns of interaction by grade and gender to determine whether the students show homophilic behavior. Do children of the opposite sex interact as much as children of the same sex? Which grades or ages are more social? How popular is the most popular student? The following analysis tries to answer these questions.
### Data Exploration
The files from the data source website were in Gephi's GEXF format, one file per day. I loaded the files in Gephi and explored different layouts to visualize the network data.
![](images/day 1 node by degree.png)
<center><em>Degree Distribution for Day 1</em></center>
<br><br>
![](images/node size by degree day2 a.png)
<center><em>Degree Distribution for Day 2</em></center>
<br>
Gephi's community detection found 8 communities on both days, with a modularity of 0.75. Modularity measures how well the network can be decomposed into modular communities. The algorithm does well at separating communities by grade: same-grade students form communities together. Grades 2A and 2B share a community, as do grades 5A and 5B, while grades 1A, 1B, 3A, 3B, 4A, and 4B each form a separate community. This is also a sign of homophilic behavior, since the high level of interaction among same-grade students is what places them in the same community.
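
To make the modularity score concrete, the sketch below computes Q for a given partition of a small toy graph in base R. The graph and the membership vector are invented for illustration, and the code only evaluates a partition; it does not reproduce Gephi's community detection.

```r
# Toy undirected graph (hypothetical): two triangles joined by one edge.
edges <- rbind(c(1, 2), c(1, 3), c(2, 3),
               c(4, 5), c(4, 6), c(5, 6),
               c(3, 4))
n <- 6
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1            # mirror the edges to make A symmetric

# Assumed partition: one community per triangle.
membership <- c(1, 1, 1, 2, 2, 2)

# Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [c_i == c_j]
modularity_q <- function(A, membership) {
  k <- rowSums(A)               # node degrees
  two_m <- sum(A)               # twice the number of edges
  same <- outer(membership, membership, "==")
  sum((A - outer(k, k) / two_m) * same) / two_m
}

q <- modularity_q(A, membership)
print(q)
```

Values of Q near 0 mean the partition is no better than chance at keeping edges inside communities, while larger values indicate denser within-community ties.
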
![](images/modularity.png)
<center><em>Communities detected by Gephi Modularity Algorithm</em></center>
<br>
```{r C1}
```
```{r C2}
```
```{r C3}
```
```{r C4}
```
```{r C5}
```
<br>
```{r C6}
```
<center><em>Degree Distribution by Grade</em></center>
<br>
```{r C6.1}
```
<center><em>Degree Distribution by Gender</em></center>
<br>
![](images/Number of contacts by grades.jpeg)
<br>
<br>
<em>Over the course of two days, grade 1B had the most interactions. By comparison, students in the other section of the same grade (1A) had half as many contacts. Teachers appear to have the fewest contacts, and among the classes grade 4B had the lowest number of contacts.</em>
<br><br>
```{r C7}
```
<center><em>Heatmap of interactions by Grade</em></center>
```{r C8}
```
<br>
```{r C9}
```
<br>
The average weighted degree was 10 and the maximum was 23. This means that on average a student interacted with 10 individuals, and the student with the most interactions connected with 23 other students on the given day.
The histogram shows the distribution of weighted degree by gender. For both male and female students, the distribution looks roughly normal, with most students toward the middle of the chart. It is slightly skewed to the right, since a few students have a high degree.
Comparing boys with girls, boys appear to be more interactive. Among girls, degree 10 has the highest count, meaning it is the most common degree, whereas among boys the most common degree is 14. The person with the highest degree (23) is also a boy.
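
As a small sketch of how these histogram readings can be extracted, the base R snippet below finds the most common degree per gender and the top node. The node table is invented for illustration; its column names and values are assumptions chosen only to echo the figures quoted above.

```r
# Hypothetical node table: one row per student.
nodes <- data.frame(
  id     = 1:8,
  gender = c("F", "F", "F", "F", "M", "M", "M", "M"),
  degree = c(10, 10, 9, 12, 14, 14, 11, 23)
)

# Most common degree in a group (the tallest bar of its histogram).
modal_degree <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
modes <- tapply(nodes$degree, nodes$gender, modal_degree)

# Node with the highest degree overall.
top_node <- nodes[which.max(nodes$degree), ]

print(modes)
print(top_node)
```
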
<br><br>
![](images/screenshot_210431.png)
<center><em>Weighted degree by Grade</em></center>
<br>
In the graph above, nodes are sized by weighted degree and labels are sized to match the nodes. Nodes are colored by grade. There are grades 1 to 5 with 2 sections for each grade. Teacher nodes are colored red.
In this network, students in the same grade are grouped together because they are more likely to form their own community, with most interaction happening among students in the same grade. Grade 1B clearly stands out, with most students in the class having a higher weighted degree and therefore larger nodes. Furthermore, grades 2B, 4A, and 3B each have one student with a relatively large node. These students are likely to be the most social in their classes. In contrast, grades 4B and 2A do not seem to have high-degree nodes. A few dark edges are visible, suggesting that interaction occurred between fewer students but that those interactions were significant (longer, or repeated between the same students).
Interestingly, teachers do not show a high weighted degree. One teacher in the middle of the 1B student nodes appears to be more interactive than the others.
<br>
<br>
![](images/Rplot.jpeg)
<center><em>Centrality Measure correlations</em></center>
<br>
There is a strong correlation between degree and PageRank. This means important nodes are high-degree nodes. Another interesting insight is that betweenness centrality and degree centrality are not strongly correlated. This means high-degree nodes are not necessarily the connections through which most interactions pass. It suggests that the most interactive student may not be the fastest connection between two students.
In addition, there seems to be no correlation between eigenvector and betweenness centrality. This means the most influential students in this network are not the in-between connections for most interactions. There are many students and many interactions, so even the most influential students would not necessarily be the connections for other students' interactions.
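
The degree and PageRank relationship can be illustrated with a short base R sketch. The toy graph is invented, and the power-iteration PageRank below is a simplified stand-in written for this example, not the routine used to produce the figure; it assumes an undirected graph with no isolated nodes and covers only this one pair of measures.

```r
# Toy undirected graph (hypothetical): node 1 is a small hub.
edges <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3), c(4, 5))
n <- 5
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1
degree <- rowSums(A)

# PageRank by power iteration with damping factor d.
# Assumes every node has at least one edge, so no row sum is zero.
page_rank <- function(A, d = 0.85, iters = 200) {
  n <- nrow(A)
  P <- A / rowSums(A)            # row i divided by the degree of node i
  pr <- rep(1 / n, n)
  for (k in seq_len(iters)) {
    pr <- (1 - d) / n + d * as.vector(t(P) %*% pr)
  }
  pr
}

pr <- page_rank(A)
r <- cor(degree, pr)
print(round(pr, 3))
print(r)
```

On an undirected graph a random walker visits nodes roughly in proportion to their degree, which is one reason the two measures tend to move together.
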
<br>
<br>
![](images/EV.png)
<center><em>Eigenvector Centrality</em></center>
<br>
To find the most influential nodes, we check the eigenvector centrality measure. High-degree nodes are not necessarily the most important nodes. Eigenvector centrality is computed from the adjacency matrix of the graph: a node's score is its entry in the eigenvector belonging to the largest eigenvalue, so a node scores highly when its neighbors also score highly.
Node and label sizes are scaled by the eigenvector centrality measure. We again see nodes from grade 1B with high eigenvector centrality. They are the most influential nodes in the network.
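
A minimal base R sketch of this computation follows, using an invented toy graph rather than the school network. It takes the leading eigenvector of the adjacency matrix with `eigen()` and rescales it so the top node scores 1; the rescaling is a convention assumed here for readability.

```r
# Toy undirected graph (hypothetical). Nodes 2 and 4 both have degree 2,
# but node 2 sits in a triangle while node 4 links to a peripheral node.
edges <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3), c(4, 5))
n <- 5
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1

degree <- rowSums(A)

# For a symmetric matrix, eigen() returns eigenvalues in decreasing order,
# so the first column of $vectors pairs with the largest eigenvalue.
e <- eigen(A, symmetric = TRUE)
v <- abs(e$vectors[, 1])         # the sign of an eigenvector is arbitrary
ev_centrality <- v / max(v)      # rescale so the top node scores 1

print(degree)
print(round(ev_centrality, 3))
```

Nodes 2 and 4 have the same degree but different eigenvector centrality, because node 2's neighbors are themselves better connected.
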
<br>
<br>
![](images/Rplot03.jpeg)
<center><em>Highest Weighted Degree nodes</em></center>
<br>
The graph above shows the three highest-degree nodes: 1697, 1890, and 1688. They are from grades 1B and 2B and have weighted degrees of 23 and 22. Nodes are colored by gender, and the red node is the highest-degree node.
Students in the lower grades have higher-degree interaction and are more influential. As children get older, they seem to have smaller friend circles and fewer interactions among them.
<br>
<br>
![](images/Rplot03.jpeg)
<br>
CUG and QAP tests further show that both gender and grade display assortative patterns, which suggests these patterns are not mere products of random association. While gender and grade both exhibit highly assortative behaviour, with reported p-values of 0.83 and 0.93, students of the same grade are more likely to associate than students of the same gender. In the figure above, the assortativity by gender is higher than in all the randomly generated networks. The assortativity of the network is further confirmed by the QAP test, which rearranges the nodes to check whether the observed assortativity could be random, rather than generating random networks as the CUG test does. The CUG test validates the assortativity of the network with high confidence, 93% for grade and 82% for gender, indicating that assortativity by grade is higher than by gender in this network.
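
The idea behind the permutation approach can be sketched in base R. This is a simplified stand-in, not the test routine used for the report: the toy network and labels are invented, and the statistic (the share of edges joining same-attribute nodes) is an assumption chosen for brevity rather than the assortativity coefficient itself.

```r
set.seed(1)

# Toy network (hypothetical): edges mostly join nodes of the same grade.
edges <- rbind(c(1, 2), c(1, 3), c(2, 3),
               c(4, 5), c(4, 6), c(5, 6),
               c(3, 4))
grade <- c("1", "1", "1", "2", "2", "2")

# Test statistic: share of edges whose endpoints carry the same label.
same_share <- function(edges, attr) {
  mean(attr[edges[, 1]] == attr[edges[, 2]])
}

observed <- same_share(edges, grade)

# QAP-style null: keep the edges fixed and shuffle the labels across nodes.
null_stats <- replicate(2000, same_share(edges, sample(grade)))

# Share of permutations with a statistic at least as large as observed.
p_upper <- mean(null_stats >= observed)

print(observed)
print(p_upper)
```

A CUG-style test would instead hold a graph property such as the number of edges fixed and draw new random graphs, then compare the observed statistic with that distribution.
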
## Models
Exponential random graph models (ERGMs) are used to study interaction behavior in the network. With an ERGM, we can experiment with various parameters to better understand the network. While we have seen that both grade and gender influence the probability of interaction between children, the ERGM is effective in showing the network behavior with high confidence. The final model for Day 1 proved much better in terms of statistical p-values and MCMC diagnostics. Even after increasing the burn-in, lowering the interval, and improving the goodness of fit, the Day 2 model did not perform well. The final model for Day 1 takes gender mixing and grade matching into account, as well as several degree terms. All the parameters have low p-values, which means they are statistically significant, and the AIC and BIC are low compared to the other models.
<br>
<br>
![](images/gof.png)
<br>
![](images/ac.png)
<br>
The goodness-of-fit diagnostics show that the parameters all perform well: the observed network, denoted by the thicker line, falls within the quantiles of the simulated networks. The degree distribution also follows the general path of the observed network. Degree 5 and degree 6 fall outside the simulated range, but the overall upward trend still matches the observed network. When degree 5 and degree 6 terms were introduced into the ERGM, AIC increased and the p-values for statistical significance were not available. The sample statistics showed high autocorrelation, which is undesirable because it means the simulated networks were correlated; the individual and joint p-values were high, however, which suggests that the model performed well.
<br>
<br>
<center>![](images/condprob.png)</center>
<br>
<em>Using the model parameter coefficients, the conditional probability of a specific type of edge forming can be determined.</em>
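
The conversion from coefficients to a conditional probability can be sketched in base R as follows. The coefficient names and values are hypothetical placeholders, not the fitted values from this report. The calculation also assumes terms whose change statistic does not depend on the rest of the network, which holds for edge and matching terms but not for degree terms.

```r
# Hypothetical coefficients on the log-odds scale (not the fitted model).
coefs <- c(edges = -4.0, nodematch_grade = 2.5, nodematch_gender = 0.5)

# Probability of a tie given its change statistics, i.e. how much each
# model term would change if the tie were added.
tie_probability <- function(coefs, change_stats) {
  log_odds <- sum(coefs * change_stats[names(coefs)])
  plogis(log_odds)               # inverse logit: 1 / (1 + exp(-log_odds))
}

# Same grade and same gender versus different grade and different gender.
p_same_both <- tie_probability(
  coefs, c(edges = 1, nodematch_grade = 1, nodematch_gender = 1))
p_diff_both <- tie_probability(
  coefs, c(edges = 1, nodematch_grade = 0, nodematch_gender = 0))

print(c(same = p_same_both, different = p_diff_both))
```

With these placeholder values, a tie between two students who match on both attributes is far more likely than a tie between two who match on neither, which is the pattern a positive matching coefficient encodes.
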
## Conclusion
The model strongly supports our analysis that students show homophilic behavior, interacting mostly with students in the same grade. Further, these students are inclined to interact with the same gender rather than the opposite sex. Children do not have the same mentality and relation towards the opposite sex as adults, so they would probably behave differently than an adult would. The analysis of the network data provides important insights into child interaction behaviors. Through the contact counts, degree distribution, modularity community detection, QAP and CUG tests, network visualization, and ERGM modelling, we can see that children show homophilic behavior, which means they are more likely to interact with someone like themselves, be it in grade, age, or gender. In the case of a disease outbreak in a school environment, this information can be of great importance in managing the disease and keeping it from spreading further. If more attributes were available, such as race, nationality, or height, we could perhaps build a better model and have more interesting findings.
For details, see the <a href = "https://sutsabs.github.io/InteractPattern/interactcode.html">data exploration and modelling</a> page.
<br>
PDF version of the <a href = "https://sutsabs.github.io/InteractPattern/pdf.html">report</a>