# Time series analysis tools in Visual Process Analytics: Cross correlation

 Two time series and their cross-correlation functions
In a previous post, I showed you what the autocorrelation function (ACF) is and how it can be used to detect temporal patterns in student data. The ACF is the correlation of a signal with itself. Naturally, we are also interested in exploring the correlations among different signals.

The cross-correlation function (CCF) is a measure of similarity of two time series as a function of the lag of one relative to the other. The CCF can be imagined as a procedure of overlaying two series printed on transparency films and sliding them horizontally to find possible correlations. For this reason, it is also known as a "sliding dot product."

The upper graph in the figure to the right shows two time series from a student's engineering design process, representing about 45 minutes of her construction (white line) and analysis (green line) activities while designing an energy-efficient house with the goal of cutting the net energy consumption to zero. At first glance, you probably have no clue what these lines represent or how they may be related.

But their CCFs reveal something that stands out. The lower graph shows two curves that peak at certain points. You probably have a lot of questions at this point, so let me explain.

Why are there two curves for depicting the correlation of two time series, say, A and B? This is because there is a difference between "A relative to B" and "B relative to A." Imagine that you print the series on two transparency films and slide one on top of the other. Which one is on the top matters. If you are looking for cause-effect relationships using the CCF, you can treat the antecedent time series as the cause and the subsequent time series as the effect.
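The sliding-dot-product idea, including the asymmetry between "A relative to B" and "B relative to A," can be made concrete with a minimal NumPy sketch. The toy series and the three-step lag below are made up for illustration; this is not VPA's actual implementation:

```python
import numpy as np

def ccf(x, y, max_lag):
    """Normalized cross-correlation of y relative to x at lags 0..max_lag.

    A positive lag k correlates x[t] with y[t + k], i.e. it asks whether
    activity in x tends to be followed by activity in y k steps later.
    """
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    return np.array([np.dot(x[: n - k], y[k:]) / n
                     for k in range(max_lag + 1)])

# Toy data: y echoes x three steps later, plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.roll(x, 3) + 0.1 * rng.normal(size=500)

lead = ccf(x, y, 10)    # "x relative to y": x as the potential antecedent
follow = ccf(y, x, 10)  # "y relative to x": y as the potential antecedent
print(lead.argmax())    # peaks at lag 3: x is followed by y three steps later
```

Note that `lead` peaks sharply at lag 3 while `follow` stays near zero at all positive lags, which is exactly the "which film is on top matters" asymmetry described above.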

What does a peak in the CCF mean, anyway? It points you to where the more interesting things may lie. In the figure in this post, the construction activities of this particular student were significantly followed by analysis activities about four times (two of them within 10 minutes), whereas the analysis activities were significantly followed by construction activities only once (after 10 minutes).

# Time series analysis tools in Visual Process Analytics: Autocorrelation

 Autocorrelation reveals a three-minute periodicity
Digital learning tools such as computer games and CAD software emit a lot of temporal data about what students do while they are engaged with the tools. Analyzing these data may shed light on whether students learned, what they learned, and how they learned. In many cases, however, the data look so messy that many people are skeptical about their meaning. As optimists, we believe that learning signals are likely buried in these noisy data. We just need to use, or invent, some mathematical tricks to dig them out.

In Version 0.2 of our Visual Process Analytics (VPA), I added a few techniques for time series analysis so that researchers can characterize a learning process from different perspectives. Before I show you these visual analysis tools, be aware that their purpose is to reveal the temporal trends of a given process so that we can better describe the student's behavior at that time. Whether these trends are "good" or "bad" for learning likely depends on the context, which often necessitates the analysis of other covariates.

 Correlograms reveal similarity of two time series.
The first tool for time series analysis added to VPA is the autocorrelation function (ACF), a mathematical tool for finding repeating patterns obscured by noise in the data. The shape of the ACF graph, called the correlogram, is often more revealing than the shape of the raw time series. In the extreme case when the process is completely random (i.e., white noise), the ACF is a delta function that spikes at zero time lag and vanishes everywhere else. In the other extreme case when the process is perfectly sinusoidal, the ACF is an oscillatory cosine wave with the same period (in practice, the sample ACF tapers off toward large lags because the data length is finite).
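The two extreme cases are easy to verify numerically. The following is a minimal sketch (not VPA code) in which the ACF of white noise collapses to a spike at zero lag, while the ACF of a pure sine wave reproduces its period:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
noise = rng.normal(size=2000)                    # white noise
wave = np.sin(2 * np.pi * np.arange(2000) / 50)  # sine with a 50-step period

acf_noise = acf(noise, 100)  # 1 at lag 0, near zero everywhere else
acf_wave = acf(wave, 100)    # a cosine that peaks again at lag 50
```

Plotting `acf_wave` would show the damped cosine shape of the correlogram; the slight damping comes from the finite data length, not from the signal itself.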

An interesting question relevant to learning science is whether the process is autoregressive (or under what conditions it can be). Being autoregressive means that the current value of a variable is influenced by its previous values. This could be used to evaluate whether the student learned from past experience -- in the case of engineering design, whether the student's design actions were informed by previous actions. Learning becomes more predictable if the process is autoregressive (to be careful, I am not saying that more predictable learning is necessarily better learning). Different autoregression models, denoted AR(n) with n indicating the memory length, may be characterized by their ACFs. For example, the ACF of an AR(2) process decays more slowly than that of an AR(1) process, as AR(2) depends on more previous points. (In practice, the partial autocorrelation function, or PACF, is often used to detect the order of an AR model.)
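The different decay rates can be checked with a quick simulation. The AR coefficients below (0.5 for the AR(1) model; 0.5 and 0.3 for the AR(2) model) are arbitrary illustrative choices:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(2)
n = 20000
e = rng.normal(size=n)

ar1 = np.zeros(n)  # AR(1): x_t = 0.5 x_{t-1} + e_t
ar2 = np.zeros(n)  # AR(2): x_t = 0.5 x_{t-1} + 0.3 x_{t-2} + e_t
for t in range(2, n):
    ar1[t] = 0.5 * ar1[t - 1] + e[t]
    ar2[t] = 0.5 * ar2[t - 1] + 0.3 * ar2[t - 2] + e[t]

acf_ar1 = acf(ar1, 6)  # decays roughly like 0.5^k
acf_ar2 = acf(ar2, 6)  # larger at every positive lag, i.e. slower decay
```

Comparing the two correlograms lag by lag shows the AR(2) curve sitting above the AR(1) curve throughout, which is the slower decay mentioned above.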

The two figures in this post show the ACF in action within VPA, revealing temporal periodicity and similarity in students' action data that would otherwise remain obscure. The upper graphs of the figures plot the original time series for comparison.

# Seeing student learning with visual analytics

Technology allows us to record almost everything happening in the classroom. The fact that students' interactions with learning environments can be logged in every detail raises the interesting question of whether there is significant meaning and value in those data and how we can make use of them to help students and teachers, as pointed out in a report sponsored by the U.S. Department of Education:
> "New technologies thus bring the potential of transforming education from a data-poor to a data-rich enterprise. Yet while an abundance of data is an advantage, it is not a solution. Data do not interpret themselves and are often confusing — but data can provide evidence for making sound decisions when thoughtfully analyzed." — *Expanding Evidence Approaches for Learning in a Digital World*, Office of Educational Technology, U.S. Department of Education, 2013
 A radar chart of design space exploration.
 A histogram of action intensity.
Here we are not talking about just analyzing students' answers to multiple-choice questions, their scores on quizzes and tests, or how often they log into a learning management system. We are talking about something much more fundamental, something that runs deep in cognition and learning, such as how students conduct a scientific experiment, solve a problem, or design a product. As learning goes deeper in those directions, the data produced by students grow bigger. It is by no means an easy task to analyze large volumes of learner data, which contain a lot of noise that casts uncertainty on assessment. The validity of an assessment inference rests on the strength of evidence, and evidence construction often relies on the search for relations, patterns, and trends in student data. With a lot of data, this calls for sophisticated computation akin to cognitive computing.

Data gathered from highly open-ended inquiry and design activities, key to the authentic science and engineering practices that we want students to learn, are often intensive and "messy." Without analytic tools that can discern systematic learning from a random walk, researchers and teachers are left with nothing but a DRIP ("data rich, information poor") problem.

 A scatter plot of action timeline.
Recognizing the difficulty of analyzing sheer volumes of messy student data, we turned to visual analytics, a category of techniques extensively used in cutting-edge business intelligence systems such as software developed by SAS, IBM, and others. We see interactive, visual process analytics as key to accelerating analysis so that researchers can adjust mining rules easily, view results rapidly, and identify patterns clearly. This kind of visual analytics combines the computational power of the computer, the graphical user interface of the software, and the pattern recognition power of the brain to support complex data analyses in data-intensive educational research.

 A digraph of action transition.
So far, I have written four interactive graphs and charts that can be used to study four different aspects of the design action data we collected from our Energy3D CAD software. Recording several weeks of student work on complex engineering design challenges, these datasets are high-dimensional, meaning that it is inappropriate to view them from a single perspective. For each question we want the student data to answer, we usually need a different representation that captures the features specific to that question. In many cases, multiple representations are needed to address a single question.

In the long run, our objective is to add as many graphical representations as needed as we answer more and more research questions based on our datasets. Given time, this growing library of visual analytics should become powerful enough that it may also help teachers monitor their students' work and thereby conduct formative assessment. To guarantee that our visual analytics runs on all devices, the library is written in JavaScript/HTML/CSS. A number of touch gestures are also supported so that the library can be used on a multi-touch screen. A neat feature is that multiple graphs and charts can be linked together so that when you interact with one of them, the linked ones change at the same time. As the datasets are temporal in nature, you can also animate these graphs to reconstruct and track exactly what students did throughout.

# On the instructional sensitivity of computer-aided design logs

 Figure 1: Hypothetical student responses to an intervention.
In its fourth issue this year, the International Journal of Engineering Education published our 19-page paper on the instructional sensitivity of computer-aided design (CAD) logs. This study was based on our Energy3D software, which helps students learn science and engineering concepts and skills by creating sustainable buildings using a variety of built-in design and analysis tools related to Earth science, heat transfer, and solar energy. The paper proposes an innovative approach that uses response functions -- a concept borrowed from electrical engineering -- to measure instructional sensitivity from data logs (Figure 1).

Many researchers are interested in studying what students learn through complex engineering design projects. CAD logs provide fine-grained empirical data of student activities for assessing learning in engineering design projects. However, the instructional sensitivity of CAD logs, which describes how students respond to interventions with CAD actions, has never been examined, to the best of our knowledge.
 Figure 2. An indicator of statistical reliability.

For the logs to serve as reliable data sources for assessment, they must be instructionally sensitive. Our paper reports the results of our systematic research on this important topic. To guide the research, we first propose a theoretical framework for computer-based assessment based on signal processing. This framework views assessment as detecting signals from the noisy background often present in large temporal learner datasets due to the many uncontrollable factors and events in learning processes. To measure instructional sensitivity, we analyzed nearly 900 megabytes of process data logged by Energy3D as collections of time series. These time-varying data were gathered from 65 high school students who solved a solar urban design challenge with Energy3D over seven class periods, with an intervention occurring in the middle of their design projects.

Our analyses of these data show that the occurrence of the design actions unrelated to the intervention was not affected by it, whereas the occurrence of the design actions that the intervention targeted revealed a continuum of reactions ranging from no response to strong response (Figure 2). From the temporal patterns of these student responses, persistent effects and temporary effects (with different decay rates) were identified. Students' electronic notes taken during the design processes were used to validate their learning trajectories. These results show that an intervention occurring outside a CAD tool can leave a detectable trace in the CAD logs, suggesting that the logs can be used to quantitatively determine how effective an intervention has been for each individual student during an engineering design project.

# The first paper on learning analytics for assessing engineering design?

 Figure 1
The International Journal of Engineering Education published our paper ("A Time Series Analysis Method for Assessing Engineering Design Processes Using a CAD Tool") on learning analytics and educational data mining for assessing student performance in complex engineering design projects. I believe this is the first time learning analytics has been applied to the study of engineering design -- an extremely complicated process that is very difficult to assess with traditional methodologies because of its open-ended and practical nature.

 Figure 2
This paper proposes a novel computational approach based on time series analysis to assess engineering design processes using our Energy3D CAD tool. To collect research data without disrupting a design learning process, design actions and artifacts are continuously logged as time series by the CAD tool behind the scenes, while students are working on an engineering design project such as a solar urban design challenge. These "atomically" fine-grained data can be used to reconstruct, visualize, and analyze the entire design process of a student with extremely high resolution. Results of a pilot study in a high school engineering class suggest that these data can be used to measure the level of student engagement, reveal the gender differences in design behaviors, and distinguish the iterative (Figure 1) and non-iterative (Figure 2) cycles in a design process.

From the perspective of engineering education, this paper contributes to the emerging fields of educational data mining and learning analytics, which aim to expand evidence approaches for learning in a digital world. We are working on a series of papers to advance this research direction and expect to help with the "landscaping" of those fields.

# Visual learning analytics based on graph theory: Part I

All educational research and assessment are based on inference from evidence. Evidence is constructed from learner data. The quality of this construction is, therefore, fundamentally important. Many educational measurements have relied on eliciting, analyzing, and interpreting students' constructed responses to assessment questions. New types of data may engender new opportunities for improving the validity and reliability of educational measurements. In this series of articles, I will show how graph theory can be applied to educational research.

The process of inquiry-based learning with an interactive computer model can be imagined as a trajectory through the problem space spanned by the user interface of the model. Students use various widgets to control different variables, observe the corresponding emergent behaviors, take data, and then reason with the data to draw a conclusion. This sounds obvious, but exactly how do we capture, visualize, and analyze this process?

From the point of view of computational science, the learning space is enormous: if the user interface has 10 controls and each control has five possible inputs, there are 5^10 -- nearly 10 million -- possible ways of setting up the model. To tackle a problem of this magnitude, we can use some mathematics. Graph theory is one tool that we are building into our process analytics. The publication of Leonhard Euler's Seven Bridges of Königsberg in 1736 is commonly considered the birth of graph theory.

 Figure 1: A learning graph made of two subgraphs representing two ideas.
In graph theory, a graph is a collection of vertices connected by edges: G = (V, E). When applied to learning, a vertex represents an indicator that may be related to a certain competency of a student and that can be logged by software. An edge represents the transition from one indicator to another. We call a graph that represents a learning process a learning graph.

A learning graph is always a digraph, G = (V, A) -- that is, it has directed edges, or arrows -- because of the temporal nature of learning. Most likely, it is a multigraph with multiple directed edges between at least one pair of vertices (sometimes called a multidigraph), because the student often needs multiple transitions between indicators to learn their connections. A learning graph often has loops, edges that connect a vertex back to itself, because the student may perform multiple actions related to one indicator consecutively before making a transition. Figure 1 shows a learning graph that includes two sets of indicators, each for an idea.

 Figure 2. The adjacency matrix of the graph in Figure 1.
The size of a learning graph is the number of its arrows, denoted |A(G)|; it represents the number of actions the student takes during learning. The multiplicity of an arrow is the number of arrows sharing the same pair of vertices; the multiplicity of a graph is the maximum multiplicity over its arrows. The multiplicity represents the most frequent transition between two indicators in a learning process. The degree dG(v) of a vertex v in a graph G is the number of edges incident to v, with loops counted twice. A vertex of degree 0 is an isolated vertex; a vertex of degree 1 is a leaf. The degree of a vertex represents the number of times the action related to the corresponding indicator is performed. The maximum degree Δ(G) of a graph G is the largest degree over all vertices; the minimum degree δ(G), the smallest.
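These graph statistics are straightforward to compute from a logged action sequence, in which each consecutive pair of indicators forms one arrow. The indicator labels below are hypothetical:

```python
from collections import Counter

# Hypothetical sequence of logged indicators; each consecutive pair is an arrow.
actions = ["A", "A", "B", "C", "B", "B", "A", "C", "C", "B"]
arrows = list(zip(actions, actions[1:]))

size = len(arrows)                       # |A(G)|: number of actions (transitions)
mult = Counter(arrows)                   # multiplicity of each arrow
graph_multiplicity = max(mult.values())  # the most frequent transition

# Degree of each vertex: arrows incident to it. A loop (u == v) adds 2,
# so loops are automatically counted twice.
degree = Counter()
for u, v in arrows:
    degree[u] += 1
    degree[v] += 1
```

As a sanity check, the degrees always sum to twice the size of the graph, since every arrow contributes to exactly two (possibly identical) endpoints.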

The distance dG(u, v) between two vertices u and v in a graph G is the length of a shortest path between them. When u and v are identical, their distance is 0. When u and v are unreachable from each other, their distance is defined to be infinity (∞). The distance between two indicators may reveal how the related constructs are connected in the learning process.
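Because a learning graph is directed, the distance can be computed with a breadth-first search that follows the arrows. A minimal sketch, with a hypothetical three-vertex graph:

```python
from collections import deque

def distance(adj, u, v):
    """Length of a shortest directed path from u to v; infinity if unreachable."""
    if u == v:
        return 0
    seen, frontier, d = {u}, deque([u]), 0
    while frontier:
        d += 1
        for _ in range(len(frontier)):            # expand one BFS layer
            for w in adj.get(frontier.popleft(), ()):
                if w == v:
                    return d
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
    return float("inf")

# Hypothetical learning graph: A -> B -> C, with nothing leading back.
adj = {"A": ["B"], "B": ["C"], "C": []}
```

Here `distance(adj, "A", "C")` is 2 while `distance(adj, "C", "A")` is infinite, since no arrow leads back from C.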

 Figure 3. A more crosscutting learning trajectory between two ideas.
Two vertices u and v are called adjacent if an edge exists between them, denoted u ~ v. The square adjacency matrix records which vertices of a graph are adjacent to which others. Figure 2 is the adjacency matrix of the graph in Figure 1; its trace (the sum of the diagonal elements) equals the number of loops in the graph. Once we know the adjacency matrix, we can apply spectral graph theory to study the properties of the graph through the characteristic polynomial, eigenvalues, and eigenvectors of the matrix (because a learning graph is directed, its adjacency matrix is generally asymmetric and the eigenvalues are often complex numbers). For example, the eigenvalues of the adjacency matrix may be used to reduce the dimensionality of the dataset into clusters.
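Building the adjacency matrix from a logged action sequence, and reading off the loop count from its trace, takes only a few lines of NumPy. The indicator labels are hypothetical, and the eigenvalue computation is just the starting point for a spectral analysis:

```python
import numpy as np

# Hypothetical indicator sequence; consecutive pairs are the directed arrows.
actions = ["A", "A", "B", "A", "B", "B", "C"]
labels = sorted(set(actions))
index = {v: i for i, v in enumerate(labels)}

# Entry (i, j) counts arrows from vertex i to vertex j.
adj = np.zeros((len(labels), len(labels)))
for u, v in zip(actions, actions[1:]):
    adj[index[u], index[v]] += 1

loops = np.trace(adj)                 # diagonal entries count the loops
eigenvalues = np.linalg.eigvals(adj)  # generally complex for a digraph
```

With this representation, comparing the block structures of two students' matrices, as in Figures 2 and 4, reduces to comparing submatrices of `adj` over the two indicator sets.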

 Figure 4. The adjacency matrix of the graph in Figure 3.
How might learning graphs be useful for analyzing student learning? Figure 3 gives an example that shows a different exploration behavior between two ideas (such as heat and temperature, or pressure and temperature). In this hypothetical case, the student makes more transitions between the two subgraphs that represent the two ideas and their indicator domains. This pattern can potentially result in a better understanding of the connections between the ideas. The adjacency matrix in Figure 4 has a different block structure from the one in Figure 2: the off-diagonal blocks A-B and B-A are much sparser in Figure 2 than in Figure 4. The spectra of these two matrices may be quite different and could be used to characterize the knowledge integration process that fosters the linkage between the two ideas.

Go to Part II.