Methods

Exam Similarity Score (ESS)

Let $Q$ denote the set of all questions on the exam, and let $\lvert Q\rvert$ denote the total number of questions on the exam. Let $w\left(q\right)$ denote the number of students who submitted an incorrect response for a given question $q\in Q$. For a given pair of students $x$ and $y$ who submitted the same incorrect response for a given question $q\in Q$, let $d\left(q,x,y\right)$ denote the number of students who submitted a different incorrect response for $q\in Q$. Let $p\left(q,x,y\right)=\frac{d\left(q,x,y\right)}{w\left(q\right)}$ if students $x$ and $y$ submitted the same incorrect response for question $q\in Q$, otherwise $p\left(q,x,y\right)=0$ (e.g. if $x$ or $y$ submitted correct responses, or if $x$ and $y$ submitted different incorrect responses). We define the Exam Similarity Score (ESS) between students $x$ and $y$ as follows:

$$S\left(x,y\right) = \frac{\sum_{q\in Q}{p\left(q,x,y\right)}}{\lvert Q\rvert}$$

The ESS has a minimum of 0, which implies no shared incorrect responses, and a maximum that approaches 1, which implies that the pair submitted identical incorrect responses on every exam question, with no other students sharing those responses.
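
To make the definition concrete, below is a minimal Python sketch of the ESS computation for a single pair of students. The function name `exam_similarity_score`, the `responses` dictionary (mapping student IDs to per-question response lists), and the `answer_key` list are assumed input formats for this illustration, not necessarily the tool's actual interface.

```python
def exam_similarity_score(responses, answer_key, x, y):
    """Exam Similarity Score S(x, y) between students x and y.

    responses:  dict mapping student ID -> list of responses (one per question)
    answer_key: list of correct responses (one per question)
    """
    num_questions = len(answer_key)
    total = 0.0
    for q in range(num_questions):
        correct = answer_key[q]
        rx, ry = responses[x][q], responses[y][q]
        # p(q,x,y) is nonzero only when x and y share the same incorrect response
        if rx != ry or rx == correct:
            continue
        # w(q): all incorrect responses submitted for question q
        wrong = [resp[q] for resp in responses.values() if resp[q] != correct]
        # d(q,x,y): incorrect responses that differ from the one x and y share
        d = sum(1 for r in wrong if r != rx)
        total += d / len(wrong)
    return total / num_questions
```

Computing the ESS across the whole class is then a loop over `itertools.combinations(responses.keys(), 2)`.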

Similarity Significance Detection

The ESS motivates a simple systematic approach for detecting exam similarity: compute the ESS for every pair of students, sort the pairs in descending order of ESS, and flag pairs with statistically significantly large scores for further investigation as potential cases of collaboration. However, a critical question arises: how does one determine an ESS threshold above which scores are deemed "significantly large"?

In a hypothetical course with infinitely many students, assuming that the vast majority of possible pairs of students did not collaborate on an exam, we would expect the distribution of ESSs computed across all pairs of students in the class to fit the null distribution closely, with pairs of students who did collaborate appearing as outliers with larger ESSs than one would expect by chance. As the number of students decreases, the fit worsens, especially at the tails of the distribution due to reduced sampling. When Kernel Density Estimates (KDEs) of the distributions of Exam Similarity Scores for a given exam are plotted in log-scale, as expected, the tails of the distribution are noisy, but there is a consistently near-linear stretch for central values of the ESS distribution. The Probability Density Function (PDF) of an Exponential distribution with rate parameter $\lambda$ and location parameter $\mu$ is the following:

$$f_{X}\left(x\right) = \lambda e^{-\lambda\left(x-\mu\right)}$$

Therefore, the log of the PDF of an Exponential distribution with rate parameter $\lambda$ and location parameter $\mu$ is the following:

$$\ln\left(f_{X}\left(x\right)\right) = \ln\left(\lambda e^{-\lambda\left(x-\mu\right)}\right) = \ln\left(\lambda\right)-\lambda\left(x-\mu\right)$$

Thus, given a line $y=mx+b$ regressed from the log of the KDE (log-KDE) of samples from an unknown Exponential distribution, the parameters of the Exponential can be estimated as follows:

$$\lambda=-m$$

$$\mu=\frac{\ln\left(\lambda\right)-b}{m}$$
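
For example, using the hypothetical regression values $m=-50$ and $b=5$, these formulas give:

$$\lambda = -\left(-50\right) = 50, \qquad \mu = \frac{\ln\left(50\right)-5}{-50} \approx 0.022$$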

This motivates a simple approach for computing theoretical p-values for ESS values computed from all pairs of students (sketched in code below):

  1. Compute the KDE of the distribution of ESSs
  2. Regress a line from the near-linear segment of the log-KDE
  3. Estimate the Exponential parameters from the line
  4. Compute p-values from the Exponential distribution
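
A minimal Python sketch of steps 1-3 is shown below, using SciPy's Gaussian KDE and a least-squares line fit. The choice of the central window over which the log-KDE is regressed (here, between the 25th and 75th percentiles of the ESSs) and the function name `fit_exponential_null` are assumptions made for this sketch rather than the tool's actual heuristic.

```python
import numpy as np
from scipy import stats

def fit_exponential_null(ess_values, lower_q=0.25, upper_q=0.75, grid_size=1000):
    """Estimate Exponential(lambda, mu) null parameters from ESS samples."""
    ess_values = np.asarray(ess_values, dtype=float)
    # Step 1: KDE of the ESS distribution, evaluated on a grid
    kde = stats.gaussian_kde(ess_values)
    xs = np.linspace(ess_values.min(), ess_values.max(), grid_size)
    log_kde = np.log(kde(xs))
    # Step 2: regress a line y = m*x + b over the central, near-linear stretch
    lo, hi = np.quantile(ess_values, [lower_q, upper_q])
    mask = (xs >= lo) & (xs <= hi)
    m, b = np.polyfit(xs[mask], log_kde[mask], 1)
    # Step 3: convert slope and intercept to the Exponential parameters
    lam = -m
    mu = (np.log(lam) - b) / m
    return lam, mu
```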

The statistical test is one-sided: specifically, the p-value associated with a given ESS $x$ is the probability of observing an ESS greater than or equal to $x$ purely by chance. Therefore, the p-value for a given ESS $x$ is simply the area under the Probability Density Function (PDF) of Exponential($\lambda$, $\mu$) over the range $X \geq x$, i.e. one minus the value of the Cumulative Distribution Function (CDF) at $x$.
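
Under the fitted null distribution, this upper-tail probability is the survival function, which can be computed as in the sketch below; note that SciPy parameterizes the Exponential with `loc` $=\mu$ and `scale` $=1/\lambda$. The function name `ess_p_value` is an assumption for this illustration.

```python
from scipy import stats

def ess_p_value(ess, lam, mu):
    """One-sided p-value: P(X >= ess) under Exponential(lam, mu)."""
    # sf(x) = 1 - CDF(x), i.e. the upper-tail probability
    return stats.expon.sf(ess, loc=mu, scale=1.0 / lam)
```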

Multiple Hypothesis Test Correction

When we compute a p-value for every possible pair of students and check each p-value for statistical significance, we are performing many simultaneous hypothesis tests. To account for this, we can apply a multiple hypothesis test correction, e.g. Bonferroni (which controls the Family-Wise Error Rate) or Benjamini-Hochberg (which controls the False Discovery Rate, FDR), to compute adjusted p-values; FDR-adjusted p-values are also known as q-values. The resulting q-values can be compared against a statistical significance threshold, e.g. $q \leq 0.05$, to provide an automated similarity detection algorithm.
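
As an illustration, a small NumPy sketch of the Benjamini-Hochberg adjustment is shown below; equivalent q-values could also be obtained from an existing implementation such as `statsmodels.stats.multitest.multipletests` with `method='fdr_bh'`.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Convert p-values to Benjamini-Hochberg FDR-adjusted q-values."""
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)                               # sort p-values ascending
    raw = pvals[order] * n / np.arange(1, n + 1)            # p * n / rank
    qvals_sorted = np.minimum.accumulate(raw[::-1])[::-1]   # enforce monotonicity
    qvals = np.empty(n)
    qvals[order] = np.clip(qvals_sorted, 0.0, 1.0)          # q-values cannot exceed 1
    return qvals
```

Applied to the vector of pairwise p-values, `benjamini_hochberg(pvals)` yields the q-values that are then compared against the chosen significance threshold.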
