We consider a large set of CLCT student usage data collected in 2010. Although the tutor was used in several thousand schools across the United States, its full logging capability was activated in only about 20% of the schools that used it. Our initial dataset covered 144,080 registered students in 899 schools and close to 473 million records overall, including activity unrelated to targeted problem-solving, such as signing in, signing out, and working on practice problems. After extracting targeted, substantive problem-solving activity, we arrived at a dataset of 342 schools, 72,082 students, and 88.6 million problem-solving actions.
We queried the National Center for Education Statistics (NCES) and internal data for school metadata: the number of students enrolled (a proxy for a school's relative size), the student-teacher ratio, the number of students eligible for free or reduced-price lunch (a proxy for socio-economic status), and the setting of the school's location (rural, suburban, or urban). Although some of the school metadata from NCES and internal records date from 2011, we assume that year-to-year fluctuations in these numbers are negligible for our analyses. We matched full NCES and internal records for a subset of 232 schools, narrowing our selection to 55,012 students with substantive usage (i.e., attempting more than one unit of instruction) and 67.3 million problem-solving actions.
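A minimal sketch of this filtering-and-matching step is shown below. The actual extraction pipeline is not described in the paper; the file and column names (problem_solving_actions.csv, nces_metadata.csv, student_id, school_id, unit_id) are hypothetical placeholders for whatever layout the logs actually use.

    import pandas as pd

    # Hypothetical layouts; the paper does not specify the log schema.
    actions = pd.read_csv("problem_solving_actions.csv")  # student_id, school_id, unit_id, ...
    nces = pd.read_csv("nces_metadata.csv")               # school_id, enrollment, ratio, lunch, locale

    # Keep students with substantive usage: more than one distinct
    # unit of instruction attempted.
    units_per_student = actions.groupby("student_id")["unit_id"].nunique()
    substantive = units_per_student[units_per_student > 1].index
    actions = actions[actions["student_id"].isin(substantive)]

    # Keep only schools for which full NCES/internal metadata matched.
    actions = actions[actions["school_id"].isin(nces["school_id"])]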
In addition to the school metadata, we computed student performance statistics from our logs. For each school, we computed the average number of distinct units its students attempted and the standard error of that number. To further characterize schools, we ran a mixed-effects logistic regression model on the data (see Eq. (1) and Eq. (2)). Here, θi represents the ability of student i, and βj is a problem-complexity intercept for problem j. For each skill k relevant to problem j, δk is the general easiness of the skill (i.e., a skill intercept), γk is skill k's learning rate, and tik is the number of prior attempts student i has made at skill k.
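Eqs. (1) and (2) themselves are not reproduced in this excerpt; based on the symbol definitions above, a standard additive-factors formulation consistent with them would be (a reconstruction, not necessarily the authors' exact notation):

    \operatorname{logit} p_{ij} = \theta_i + \beta_j
        + \sum_{k \in \mathrm{skills}(j)} \left( \delta_k + \gamma_k \, t_{ik} \right)    (1)

    p_{ij} = \frac{1}{1 + e^{-\operatorname{logit} p_{ij}}}    (2)

where p_{ij} is the probability that student i responds correctly on problem j.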
In this regression model, we treat the student and problem intercepts as random effects. From the regression coefficients, we calculated the following values to describe schools: the average student intercept per school, denoting the relative prior preparation of its students; the average skill intercept, capturing each school's general level of skill difficulty over and above student preparation; and the average skill slope, denoting the relative speed of learning of the school's students.
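As an illustration, these per-school descriptors could be computed from the fitted coefficients roughly as follows. This is a sketch under assumed data layouts; the file and column names (student_coefs.csv, theta, delta, gamma, and so on) are hypothetical, and attributing each skill's coefficients to the schools whose students practiced it is one plausible reading of the text.

    import pandas as pd

    # Hypothetical exports of the fitted coefficients, plus the
    # problem-solving log used to attribute skills to schools.
    students = pd.read_csv("student_coefs.csv")  # student_id, school_id, theta
    skills = pd.read_csv("skill_coefs.csv")      # skill_id, delta, gamma
    log = pd.read_csv("solving_log.csv")         # student_id, school_id, skill_id, ...

    # Average student intercept per school: relative prior preparation.
    prep = (students.groupby("school_id")["theta"]
            .mean().rename("avg_student_intercept"))

    # Attribute each skill's intercept and slope to the schools whose
    # students practiced it, then average per school.
    per_school_skills = (log[["school_id", "skill_id"]]
                         .drop_duplicates()
                         .merge(skills, on="skill_id"))
    diff = (per_school_skills.groupby("school_id")["delta"]
            .mean().rename("avg_skill_intercept"))
    slope = (per_school_skills.groupby("school_id")["gamma"]
             .mean().rename("avg_skill_slope"))

    descriptors = pd.concat([prep, diff, slope], axis=1)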
Thus, overall, we collected four school metadata descriptors and four student performance descriptors per school.
We propose to use the accuracy of student modeling as a measure of similarity in learning between groups of schools and between individual schools. We seek to determine whether, based on one or more descriptive factors, it is possible to effectively separate the schools in our dataset into groups...