Tag Archives: Big data

Time series analysis tools in Visual Process Analytics: Cross correlation

Two time series and their cross-correlation functions
In a previous post, I showed you what the autocorrelation function (ACF) is and how it can be used to detect temporal patterns in student data. The ACF is the correlation of a signal with itself. Naturally, we are also interested in exploring the correlations among different signals.

The cross-correlation function (CCF) is a measure of similarity of two time series as a function of the lag of one relative to the other. The CCF can be imagined as a procedure of overlaying two series printed on transparency films and sliding them horizontally to find possible correlations. For this reason, it is also known as a "sliding dot product."
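The "sliding dot product" idea can be sketched in a few lines of plain Python. This is only an illustration, not the code used in VPA: the function name cross_correlation, the normalization, and the toy series are assumptions made for the example.

```python
import math

def cross_correlation(a, b, max_lag):
    """Normalized cross-correlation of two equal-length series.

    Returns a dict mapping each lag k in [-max_lag, max_lag] to the
    correlation between a[t] and b[t + k]. A peak at a positive k
    suggests that b tends to follow a by k steps.
    """
    n = len(a)
    if n != len(b):
        raise ValueError("series must have the same length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if norm == 0:
        raise ValueError("a constant series has no defined correlation")
    ccf = {}
    for k in range(-max_lag, max_lag + 1):
        # Slide b by k steps and take the dot product over the overlap.
        total = 0.0
        for t in range(n):
            if 0 <= t + k < n:
                total += da[t] * db[t + k]
        ccf[k] = total / norm
    return ccf

# Toy data (made up): b repeats the bursts of a two steps later.
a = [0, 3, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0]
b = [0, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 4]
ccf = cross_correlation(a, b, max_lag=4)
best_lag = max(ccf, key=ccf.get)
print(best_lag, round(ccf[best_lag], 3))
```

In this sketch, positive lags describe "b following a" and negative lags describe "a following b", so swapping the two arguments mirrors the curve around lag zero.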

The upper graph in the figure to the right shows two time series from a student's engineering design process, representing about 45 minutes of her construction (white line) and analysis (green line) activities while trying to design an energy-efficient house with the goal to cut down the net energy consumption to zero. At first glance, you probably have no clue about what these lines represent and how they may be related.

But their CCFs reveal something more striking. The lower graph shows two curves that peak at certain lags. You probably have a lot of questions at this point, so let me try to explain below.

Why are there two curves for depicting the correlation of two time series, say, A and B? This is because there is a difference between "A relative to B" and "B relative to A." Imagine that you print the series on two transparency films and slide one on top of the other. Which one is on the top matters. If you are looking for cause-effect relationships using the CCF, you can treat the antecedent time series as the cause and the subsequent time series as the effect.

What does a peak in the CCF mean, anyways? It guides you to where more interesting things may lie. In the figure of this post, the construction activities of this particular student were significantly followed by analysis activities about four times (two of them are within 10 minutes), but the analysis activities were significantly followed by construction activities only once (after 10 minutes).
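One common way to decide which peaks deserve attention is to compare them with an approximate 95% band of 1.96 / sqrt(N), which is what you would expect from two unrelated white-noise series of length N. The sketch below assumes that criterion; the post does not say how VPA judges significance, and the function name and the CCF values are hypothetical.

```python
import math

def significant_peaks(ccf, n, z=1.96):
    """Return (lag, value) pairs that are local maxima above z / sqrt(n).

    ccf maps integer lags to correlation values; n is the series length.
    The first and last lags are skipped because they have only one neighbor.
    """
    threshold = z / math.sqrt(n)
    lags = sorted(ccf)
    peaks = []
    for i in range(1, len(lags) - 1):
        prev_v = ccf[lags[i - 1]]
        v = ccf[lags[i]]
        next_v = ccf[lags[i + 1]]
        if v > threshold and v >= prev_v and v >= next_v:
            peaks.append((lags[i], v))
    return peaks

# Hypothetical CCF values for lags 0..6 (in minutes), series length 100.
example_ccf = {0: 0.05, 1: 0.10, 2: 0.42, 3: 0.15, 4: 0.08, 5: 0.31, 6: 0.12}
print(significant_peaks(example_ccf, n=100))  # [(2, 0.42), (5, 0.31)]
```

A peak found this way is only a pointer to where to look in the raw data; it does not by itself establish a cause-effect relationship.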

Time series analysis tools in Visual Process Analytics: Autocorrelation

Autocorrelation reveals a three-minute periodicity
Digital learning tools such as computer games and CAD software emit a lot of temporal data about what students do when they are deeply engaged in the learning tools. Analyzing these data may shed light on whether students learned, what they learned, and how they learned. In many cases, however, these data look so messy that many people are skeptical about their meaning. As optimists, we believe that there are likely learning signals buried in these noisy data. We just need to use or invent some mathematical tricks to figure them out.

In Version 0.2 of our Visual Process Analytics (VPA), I added a few techniques that can be used to do time series analysis so that researchers can find ways to characterize a learning process from different perspectives. Before I show you these visual analysis tools, be aware that the purpose of these tools is to reveal the temporal trends of a given process so that we can better describe the behavior of the student at that time. Whether these traits are "good" or "bad" for learning likely depends on the context, which often necessitates the analysis of other co-variables.

Correlograms reveal similarity of two time series.
The first tool for time series analysis added to VPA is the autocorrelation function (ACF), a mathematical tool for finding repeating patterns obscured by noise in the data. The shape of the ACF graph, called the correlogram, is often more revealing than just looking at the shape of the raw time series graph. In the extreme case when the process is completely random (i.e., white noise), the ACF will be a Dirac delta function that peaks at zero time lag. In the other extreme case, when the process is purely sinusoidal, the ACF is a cosine wave with the same period; estimated from a finite record, it looks like a damped oscillatory cosine wave with a vanishing tail.
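The two extreme cases are easy to reproduce with synthetic data. The following is a minimal sketch, assuming the standard biased sample estimator of the ACF; the function name and the test signals are made up for illustration and are not taken from VPA.

```python
import math
import random

def autocorrelation(x, max_lag):
    """Sample ACF of x for lags 0..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    var = sum(v * v for v in d)
    return [sum(d[t] * d[t + k] for t in range(n - k)) / var
            for k in range(max_lag + 1)]

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(2000)]            # white noise
wave = [math.sin(2 * math.pi * t / 20) for t in range(400)]  # period of 20 steps

acf_noise = autocorrelation(noise, 40)
acf_wave = autocorrelation(wave, 40)

# White noise: 1 at lag 0, close to 0 everywhere else.
print([round(v, 2) for v in acf_noise[:5]])
# Sinusoid: troughs at lags 10 and 30, peaks at lags 20 and 40.
print([round(acf_wave[k], 2) for k in (0, 10, 20, 30, 40)])
```

With this estimator the peaks of the sinusoid's ACF shrink slowly as the lag grows, because fewer points overlap at larger lags.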

An interesting question relevant to learning science is whether the process is autoregressive (or under what conditions the process can be autoregressive). The quality of being autoregressive means that the current value of a variable is influenced by its previous values. This could be used to evaluate whether the student learned from past experience -- in the case of engineering design, whether the student's design action was informed by previous actions. Learning becomes more predictable if the process is autoregressive (just to be careful, note that I am not saying that more predictable learning is necessarily better learning). Different autoregression models, denoted as AR(n) with n indicating the memory length, may be characterized by their ACFs. For example, the ACF of an AR(2) process often decays more slowly than that of an AR(1) process, as AR(2) depends on more previous points. (In practice, the partial autocorrelation function, or PACF, is often used to detect the order of an AR model.)
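To make the AR(n) idea concrete, here is a small sketch that simulates an AR(1) process and compares its ACF and PACF. Everything in it is an assumption for illustration: the coefficient 0.7, the helper names, and the use of the Durbin-Levinson recursion to obtain the PACF from the ACF.

```python
import random

def simulate_ar1(phi, n, seed=1):
    """Simulate x[t] = phi * x[t-1] + Gaussian noise."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0, 1))
    return x

def autocorrelation(x, max_lag):
    """Sample ACF of x for lags 0..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    var = sum(v * v for v in d)
    return [sum(d[t] * d[t + k] for t in range(n - k)) / var
            for k in range(max_lag + 1)]

def partial_autocorrelation(r, max_lag):
    """PACF for lags 1..max_lag from ACF values r[0..max_lag]
    using the Durbin-Levinson recursion."""
    pacf = []
    prev = []  # prev[j - 1] holds the order-(k-1) coefficient for lag j
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = r[1]
            cur = [phi_kk]
        else:
            num = r[k] - sum(prev[j - 1] * r[k - j] for j in range(1, k))
            den = 1.0 - sum(prev[j - 1] * r[j] for j in range(1, k))
            phi_kk = num / den
            cur = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
            cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

series = simulate_ar1(phi=0.7, n=5000)
acf = autocorrelation(series, 5)
pacf = partial_autocorrelation(acf, 5)
print([round(v, 2) for v in acf])   # decays roughly like 0.7 ** k
print([round(v, 2) for v in pacf])  # large at lag 1, near 0 afterwards
```

The ACF of the simulated AR(1) series decays gradually, while its PACF drops to roughly zero after lag 1, which is why the PACF is the handier tool for reading off the order of an AR model.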

The two figures in this post show the ACF in action within VPA, revealing temporal periodicity and similarity in students' action data that are otherwise obscure. The upper graphs of the figures plot the original time series for comparison.

Visual Process Analytics (VPA) launched


Visual Process Analytics (VPA) is an online analytical processing (OLAP) program that we are developing for visualizing and analyzing student learning from complex, fine-grained process data collected by interactive learning software such as computer-aided design tools. We envision a future in which every classroom would be powered by informatics and infographics such as VPA to support day-to-day learning and teaching at a highly responsive level. In a future when every business person relies on visual analytics every day to stay in business, it would be a shame if teachers still had to read through tons of paper-based work from students to make instructional decisions. The research we are conducting with the support of the National Science Foundation is paving the road to a future in which our educational systems receive analytical support comparable to business analytics and intelligence.

This is the mission of VPA. Today we are announcing the launch of this cyberinfrastructure. We decided that its first version number should be 0.1. This is just a way to indicate that the research and development on this software system will continue as a very long-term effort and what we have done is a very small step towards a very ambitious goal.


VPA is written in plain JavaScript/HTML/CSS. It should run within most browsers -- best on Chrome and Firefox -- but it looks and works like a typical desktop app. This means that while you are in the middle of mining the data, you can save what we call "the perspective" as a file onto your disk (or in the cloud) so that you can keep track of what you have done. Later, you can load the perspective back into VPA. Each perspective opens the datasets that you have worked on, with your latest settings and results. So if you are halfway through your data mining, your work can be saved for further analyses.

So far Version 0.1 has seven analysis and visualization tools, each of which shows a unique aspect of the learning process with a unique type of interactive visualization. We admit that, compared with the daunting high dimension of complex learning, this is a tiny collection. But we will be adding more and more tools as we go. At this point, only one repository -- our own Energy3D process data -- is connected to VPA. But we expect to add more repositories in the future. Meanwhile, more computational tools will be added to support in-depth analyses of the data. This will require a tremendous effort in designing a smart user interface to support various computational tasks that researchers may be interested in defining.

Eventually, we hope that VPA will grow into a versatile platform of data analytics for cutting-edge educational research. As such, VPA represents a critically important step towards marrying learning science with data science and computational science.

Seeing student learning with visual analytics

Technology allows us to record almost everything happening in the classroom. The fact that students' interactions with learning environments can be logged in every detail raises the interesting question about whether or not there is any significant meaning and value in those data and how we can make use of them to help students and teachers, as pointed out in a report sponsored by the U.S. Department of Education:
“New technologies thus bring the potential of transforming education from a data-poor to a data-rich enterprise. Yet while an abundance of data is an advantage, it is not a solution. Data do not interpret themselves and are often confusing -- but data can provide evidence for making sound decisions when thoughtfully analyzed.” -- Expanding Evidence Approaches for Learning in a Digital World, Office of Educational Technology, U.S. Department of Education, 2013
A radar chart of design space exploration.
A histogram of action intensity.
Here we are not talking about just analyzing students' answers to some multiple-choice questions, or their scores in quizzes and tests, or their frequencies of logging into a learning management system. We are talking about something much more fundamental, something that runs deep in cognition and learning, such as how students conduct a scientific experiment, solve a problem, or design a product. As learning goes deeper in those directions, the data produced by students grow bigger. It is by no means an easy task to analyze large volumes of learner data, which contain a lot of noisy elements that cast uncertainty on assessment. The validity of an assessment inference rests on the strength of evidence. Evidence construction often relies on the search for relations, patterns, and trends in student data. With a lot of data, this mandates some sophisticated computation similar to cognitive computing.

Data gathered from highly open-ended inquiry and design activities, key to authentic science and engineering practices that we want students to learn, are often intensive and “messy.” Without analytic tools that can discern systematic learning from random walk, what is provided to researchers and teachers is nothing but a DRIP (“data rich, information poor”) problem.

A scatter plot of action timeline.
Recognizing the difficulty in analyzing the sheer volume of messy student data, we turned to visual analytics, a whole category of techniques extensively used in cutting-edge business intelligence systems such as software developed by SAS, IBM, and others. We see interactive, visual process analytics key to accelerating the analysis procedures so that researchers can adjust mining rules easily, view results rapidly, and identify patterns clearly. This kind of visual analytics optimally combines the computational power of the computer, the graphical user interface of the software, and the pattern recognition power of the brain to support complex data analyses in data-intensive educational research.

A digraph of action transition.
So far, I have written four interactive graphs and charts that can be used to study four different aspects of the design action data that we collected from our Energy3D CAD software. Recording several weeks of student work on complex engineering design challenges, these datasets are high-dimensional, meaning that it is improper to treat them from a single point of view. For each question that we want the student data to answer, we usually need a different representation to capture the outstanding features specific to the question. In many cases, multiple representations are needed to address a question.

In the long run, our objective is to add as many graphic representations as possible as we move along in answering more and more research questions based on our datasets. Given time, this growing library of visual analytics would develop sufficient power to the point that it may also become useful for teachers to monitor their students' work and thereby conduct formative assessment. To guarantee that our visual analytics runs on all devices, this library is written in JavaScript/HTML/CSS. A number of touch gestures are also supported for users to use the library on a multi-touch screen. A neat feature of this library is that multiple graphs and charts can be grouped together so that when you are interacting with one of them, the linked ones also change at the same time. As the datasets are temporal in nature, you can also animate these graphs to reconstruct and track exactly what students do throughout.

The first paper on learning analytics for assessing engineering design?

Figure 1
The International Journal of Engineering Education published our paper ("A Time Series Analysis Method for Assessing Engineering Design Processes Using a CAD Tool") on learning analytics and educational data mining for assessing student performance in complex engineering design projects. I believe this is the first time learning analytics was applied to the study of engineering design -- an extremely complicated process that is very difficult to assess using traditional methodologies because of its open-ended and practical nature.

Figure 2
This paper proposes a novel computational approach based on time series analysis to assess engineering design processes using our Energy3D CAD tool. To collect research data without disrupting a design learning process, design actions and artifacts are continuously logged as time series by the CAD tool behind the scenes, while students are working on an engineering design project such as a solar urban design challenge. These "atomically" fine-grained data can be used to reconstruct, visualize, and analyze the entire design process of a student with extremely high resolution. Results of a pilot study in a high school engineering class suggest that these data can be used to measure the level of student engagement, reveal the gender differences in design behaviors, and distinguish the iterative (Figure 1) and non-iterative (Figure 2) cycles in a design process.
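To illustrate how logged actions can become time series, the sketch below counts the actions of one category in fixed time bins. The log format (pairs of a timestamp in seconds and a category label), the category names, and the function name are hypothetical; the actual Energy3D log format is not described here.

```python
def bin_actions(events, category, bin_seconds, duration_seconds):
    """Count events of one category in consecutive time bins.

    events is a list of (timestamp_in_seconds, category) pairs.
    Returns a list with one count per bin, covering the whole duration.
    """
    n_bins = -(-duration_seconds // bin_seconds)  # ceiling division
    counts = [0] * n_bins
    for timestamp, cat in events:
        if cat == category and 0 <= timestamp < duration_seconds:
            counts[int(timestamp // bin_seconds)] += 1
    return counts

# A made-up three-minute log.
log = [
    (5, "construction"), (20, "construction"), (70, "analysis"),
    (75, "construction"), (130, "analysis"), (170, "analysis"),
]
construction = bin_actions(log, "construction", bin_seconds=60, duration_seconds=180)
analysis = bin_actions(log, "analysis", bin_seconds=60, duration_seconds=180)
print(construction)  # [2, 1, 0]
print(analysis)      # [0, 1, 2]
```

The bin width controls the time resolution: smaller bins preserve more detail but produce sparser, noisier series.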

From the perspective of engineering education, this paper contributes to the emerging fields of educational data mining and learning analytics that aim to expand evidence approaches for learning in a digital world. We are working on a series of papers to advance this research direction and expect to help with the "landscaping" of those fields.

Computational process analytics: Compute-intensive educational research and assessment

Trajectories of building movement (good)
Computational process analytics (CPA) differs from traditional research and assessment methods in that it is not only data-intensive, but also compute-intensive. A unique feature of CPA is that it automatically analyzes the performance of student artifacts (including all the intermediate products) using the same set of science-based computational engines that students used to solve problems. The computational engines encompass every single detail in the artifacts and their complex interactions that are highly relevant to the nature of the problems students solved. They also recreate the scenarios and contexts of student learning (e.g., the calculated results in such a post-processing analysis are exactly the same as those presented as feedback to students while they were solving the problems). As such, the computational engines provide holistic, high-fidelity assessments of students' work that no human evaluator can ever beat -- while no one can track the numerous variables students might have created in long and deep learning processes in a short evaluation time, a computer program can easily do the job. Utilizing disciplinarily intelligent computational engines to do performance assessment is a major breakthrough in CPA, as this approach really has the potential to revolutionize computer-based assessment.

No building movement (bad)
To give an example, this weekend I am busy running all the analysis jobs on my computer to process 1 GB of data logged by our Energy3D CAD software. I am trying to reconstruct and visualize the learning and design trajectories of all the students, projected onto many different axes and planes of the state space. I estimate that this will take 30-40 hours of CPU time on my Lenovo X230 tablet, which is a pretty fast machine. Each step loads up a sequence of artifacts, runs a solar simulation for each artifact, and analyzes the results (since I have automated the entire process, this is actually not as bad as it sounds). Our assumption is that the time evolution of the performance of these artifacts would approximately reflect the time evolution of the performance of their designers. We should be able to tell how well a student was learning by examining whether the performance of her artifacts shows a systematic trend of improvement or is just random. This is far more informative than performance assessment based on just looking at students' final products.
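One simple way to ask whether a performance series shows a systematic trend or is "just random" is to fit a least-squares slope and compare it with the slopes obtained after shuffling the same values in time. This is a sketch of that permutation test, offered as one possible approach rather than the method used in our analysis; the function names and the energy values are invented for the example.

```python
import random

def slope(values):
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def trend_p_value(values, n_shuffles=2000, seed=0):
    """Two-sided permutation p-value for the slope.

    Shuffling destroys any ordering in time, so the shuffled slopes show
    what a trendless series with the same values would produce.
    """
    rng = random.Random(seed)
    observed = abs(slope(values))
    shuffled = list(values)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(slope(shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical net energy use of ten successive design snapshots.
improving = [100, 92, 95, 80, 74, 70, 61, 55, 50, 41]
wandering = [70, 55, 90, 62, 85, 58, 77, 66, 88, 60]

print(slope(improving), trend_p_value(improving))
print(slope(wandering), trend_p_value(wandering))
```

A small p-value suggests that the ordering in time matters, which is consistent with systematic improvement; a large one means the series is indistinguishable from a random arrangement of the same values.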

Once all the intermediate performance data have been retrieved through post-processing the artifacts, we can analyze them using our Process Analyzer -- a visual mining tool being developed to show the analysis results in various visualizations (it is our hope that the Process Analyzer will eventually become a powerful assessment assistant to teachers as it would free teachers from having to deal with an enormous amount of raw data or complicated data mining algorithms). For example, the two images in this post show that one student went through a lot of optimization in her design and the other did not (there is no trajectory in the second image).

National Science Foundation funds research that puts engineering design processes under a big data "microscope"

The National Science Foundation has awarded us $1.5 million to advance big data research on engineering design. In collaboration with Professors Şenay Purzer and Robin Adams at Purdue University, we will conduct a large-scale study involving over 3,000 students in Indiana and Massachusetts in the next five years.

This research will be based on our Energy3D CAD software that can automatically collect large process data behind the scenes while students are working on their designs. Fine-grained CAD logs possess all four characteristics of big data defined by IBM:
  1. High volume: Students can generate a large amount of process data in a complex open-ended engineering design project that involves many building blocks and variables; 
  2. High velocity: The data can be collected, processed, and visualized in real time to provide students and teachers with rapid feedback; 
  3. High variety: The data encompass any type of information provided by a rich CAD system such as all learner actions, events, components, properties, parameters, simulation data, and analysis results; 
  4. High veracity: The data must be accurate and comprehensive to ensure fair and trustworthy assessments of student performance.
These big data provide a powerful "microscope" that can reveal direct, measurable evidence of learning with extremely high resolution and at a statistically significant scale. Automation will make this research approach highly cost-effective and scalable. Automatic process analytics will also pave the road for building adaptive and predictive software systems for teaching and learning engineering design. Such systems, if successful, could become useful assistants to K-12 science teachers.

Why is big data needed in educational research and assessment? Because we all want students to learn more deeply and deep learning generates big data.

In the context of K-12 science education, engineering design is a complex cognitive process in which students learn and apply science concepts to solve open-ended problems with constraints to meet specified criteria. The complexity, open-endedness, and length of an engineering design process often create a large quantity of learner data that makes learning difficult to discern using traditional assessment methods. Engineering design assessment thus requires big data analytics that can track and analyze student learning trajectories over a significant period of time.
Deep learning generates big data.

This differs from research that does not require sophisticated computation to understand the data. For example, in typical pre/post-tests using multiple-choice assessment, the selection data of individual students are directly used as performance indices -- there is basically no depth in these self-evident data. I call this kind of data usage "data picking" -- analyzing them is just like picking up apples already fallen to the ground (as opposed to data mining that requires some computational efforts).

Process data, on the other hand, contain a lot of details that may be opaque to researchers at first glance. In the raw form, they often appear to be stochastic. But any seasoned teacher can tell you that they are able to judge learning by carefully watching how students solve problems. So here is the challenge: How can computer-based assessment accomplish what experienced teachers (human intelligence plus disciplinary knowledge plus some patience) can do based on observation data? This is the thesis of computational process analytics, an emerging subject that we are spearheading to transform educational research and assessment using computation. Thanks to NSF, we are now able to advance this subject.