A time series is a collection of observations made sequentially over time. Such series arise naturally in a wide range of domains, from finance and healthcare to engineering, because we so often need to measure and monitor phenomena that inherently change with time.
But that's not all! Time series analysis also extends beyond traditionally time-dependent data: we can apply it to data that isn't inherently temporal. For instance, handwriting and images can be transformed into a time series format, unlocking new possibilities for image data mining.
Time series data offers a rich playground for various analytical tasks that can provide valuable insights across multiple domains.
The possibilities are virtually limitless, and the need for specialized techniques to manipulate and interpret time series data has never been greater.
Subjectivity: Much of time series analysis hinges on the concept of similarity between different time series. However, defining 'similarity' is not straightforward: what counts as similar is inherently subjective and depends on the application and on which features of the series matter most.
Sampling rates: Dealing with time series that have different sampling rates can be a challenging problem. The sampling rate is the frequency at which data points are collected or recorded; for example, one time series may have data points collected every minute, while another may have data points collected every hour. Comparing the two requires bringing them onto a common time grid, as illustrated in the sketch just after this list of challenges.
Noise and missing values: These are common issues in time series data that can significantly impact the quality of analysis and predictions: noisy measurements can mask the underlying signal, and gaps have to be handled before most similarity measures can be applied at all.
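As a concrete illustration of the sampling-rate challenge, here is a minimal sketch that uses pandas to downsample a minute-level series to hourly means so it can be compared point by point with an hourly series. The data, dates, and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example: one series sampled every minute, another every hour.
rng = np.random.default_rng(0)
minute_index = pd.date_range("2023-01-01", periods=24 * 60, freq="min")
hour_index = pd.date_range("2023-01-01", periods=24, freq="h")
minute_series = pd.Series(rng.normal(size=len(minute_index)), index=minute_index)
hourly_series = pd.Series(rng.normal(size=len(hour_index)), index=hour_index)

# Downsample the minute-level series to hourly means so both series share
# the same sampling rate and can be compared point by point.
minute_as_hourly = minute_series.resample("h").mean()

aligned = pd.concat({"downsampled": minute_as_hourly, "hourly": hourly_series}, axis=1).dropna()
print(aligned.head())
```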
Now that we've identified the challenges, it's time to look at some of the techniques used to measure time series similarity.
The Euclidean distance is one of the most straightforward and commonly used methods for calculating the similarity between two time series: it is the square root of the sum of squared differences between corresponding points in the two series. A single comparison is cheap, but the cost adds up over long sequences and the many pairwise comparisons needed in large datasets, and the square root itself adds work without changing which series is closest.
The Squared Euclidean distance is simply the Euclidean distance without taking the square root:
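$$d_{\text{sq}}(x, y) \;=\; \sum_{i=1}^{n} (x_i - y_i)^2
\qquad\text{compared with}\qquad
d_{\text{euc}}(x, y) \;=\; \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$

where $x$ and $y$ are two time series of equal length $n$.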
Euclidean distance and Squared Euclidean distance are equivalent in the sense that they return the same rankings, clusterings and classifications. Squared Euclidean distance is computationally cheaper because it avoids the square root operation.
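As a minimal sketch (plain NumPy, with illustrative arrays and function names), both measures take only a couple of lines, and they rank candidate matches identically:

```python
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    """Euclidean distance between two equal-length series."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def squared_euclidean(x: np.ndarray, y: np.ndarray) -> float:
    """Squared Euclidean distance: same rankings, no square root."""
    return float(np.sum((x - y) ** 2))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.5, 2.5, 2.5, 4.5])
c = np.array([4.0, 3.0, 2.0, 1.0])

# Both measures agree that b is closer to a than c is.
print(euclidean(a, b), euclidean(a, c))                  # 1.0, ~4.47
print(squared_euclidean(a, b), squared_euclidean(a, c))  # 1.0, 20.0
```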
Dynamic Time Warping (DTW) is a technique used for measuring similarity between two time series which may vary in time or speed. Unlike the Euclidean distance, which assumes a one-to-one mapping between points in the two series being compared, DTW allows for "warping" of the time dimension. This means that the algorithm can find an optimal match between two time series even if they are out of phase.
DTW generally yields superior results across a variety of datasets, but at a greater computational cost: its dynamic-programming recursion compares every point of one series against every point of the other, so the work grows quadratically with series length rather than linearly. In practice, a straightforward DTW implementation is typically two to three orders of magnitude slower than Euclidean distance.
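To make the idea concrete, here is a minimal, unoptimized sketch of the classic dynamic-programming recursion behind DTW (no warping-window constraint, squared point-wise cost); in practice you would reach for an optimized library implementation rather than this pure-Python version.

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Dynamic Time Warping distance via the standard O(len(x) * len(y)) recursion."""
    n, m = len(x), len(y)
    # cost[i, j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2            # point-wise cost
            cost[i, j] = d + min(cost[i - 1, j],      # step down
                                 cost[i, j - 1],      # step right
                                 cost[i - 1, j - 1])  # diagonal match
    return float(np.sqrt(cost[n, m]))

# Two series with the same shape, but out of phase by one step:
a = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0])
print(dtw_distance(a, b))  # 1.0, despite the shift; plain Euclidean distance is ~2.45
```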
Preprocessing techniques like offset translation, amplitude scaling, linear trend removal, and noise reduction can significantly improve the performance and accuracy of distance-based similarity algorithms like Euclidean distance or Dynamic Time Warping (DTW). Here's why each of these preprocessing steps is important:
Time series data often come with a baseline offset that can distort the distance metrics. An offset can add a constant bias in distance calculations, making genuinely similar patterns appear dissimilar. Removing the offset ensures that the focus is on the shape of the series, not on their absolute values.
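A minimal sketch of offset translation, here implemented as subtracting the series mean (subtracting the first value or the minimum are other common choices; the function name is just illustrative):

```python
import numpy as np

def remove_offset(x: np.ndarray) -> np.ndarray:
    """Centre the series on zero so comparisons focus on shape, not level."""
    return x - x.mean()
```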
Different time series may operate on different scales, affecting the distance metrics. High-amplitude series can dominate distance calculations when compared with low-amplitude series. Scaling the amplitude makes it possible to compare the inherent patterns in the time series independent of their magnitudes.
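One common way to handle amplitude scaling is z-normalization: dividing the mean-centred series by its standard deviation. The sketch below assumes the series is not constant (a zero standard deviation would divide by zero):

```python
import numpy as np

def scale_amplitude(x: np.ndarray) -> np.ndarray:
    """Z-normalize: zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()
```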
Some time series have underlying trends that can overshadow shorter-term features or patterns. The presence of a trend can make two similar time series appear dissimilar if one has a trend and the other does not. Detrending the data ensures that the similarity measure focuses on the intrinsic patterns in the data rather than the overall trend.
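Linear trend removal can be done by fitting a least-squares line to the series and subtracting it, as in this sketch (scipy.signal.detrend performs the same linear detrending):

```python
import numpy as np

def remove_linear_trend(x: np.ndarray) -> np.ndarray:
    """Fit a straight line to the series with least squares and subtract it."""
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, deg=1)
    return x - (slope * t + intercept)
```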
Real-world time series data often contain noise that can make similarity detection difficult. Noise can introduce spurious dissimilarities, making it harder to match series effectively. Smoothing out noise focuses the similarity measure on the essential features of the data, improving both accuracy and robustness.
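Noise reduction can be as simple as a moving average; the sketch below uses a box filter whose window length is a tuning choice (too small leaves noise in, too large blurs real features):

```python
import numpy as np

def smooth(x: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving-average smoothing with a simple box filter of the given window."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")
```

Chaining these steps, removing the offset, scaling the amplitude, detrending, then smoothing, before computing Euclidean or DTW distance is a typical preprocessing pipeline.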
Why don't we use all these techniques for digital analytics anomaly detection?
You might wonder, with the array of advanced techniques available for time series analysis, why aren't they commonly applied for daily anomaly detection in digital analytics data? The answer lies in the unique challenges posed by this specific type of data.
Conducting daily similarity checks for anomaly detection in the realm of digital analytics data presents a multitude of challenges that make the process computationally expensive and resource-intensive. One of the foremost issues is the granular nature of the data.
Digital analytics platforms often collect data at an extremely fine level of detail, sometimes down to the millisecond. This granularity means that even a single day's worth of data can contain millions of data points, requiring substantial computational power to aggregate and analyze.
Moreover, digital analytics data often exhibit strong temporal patterns and seasonality. For example, web traffic may spike during certain hours of the day or on specific days of the week, requiring algorithms that can adapt to these cyclical changes. Simple distance metrics may not suffice, and more complex models that account for these temporal fluctuations are needed. This again increases the computational burden.
Then there's the issue of high dimensionality. Digital analytics data typically consist of multiple metrics and dimensions—ranging from page views and click-through rates to more advanced metrics like user engagement scores. Computing similarities in such high-dimensional spaces is not only computationally expensive but also risks diluting the meaningful signals in a sea of noise, a phenomenon known as the "curse of dimensionality."
Finally, the sheer computational resources required to process, clean, and analyze this data can be significant, especially if you aim to do this on a daily basis. Even with optimized algorithms and high-performance computing environments, the task can be time-consuming and expensive, requiring specialized hardware and software resources.
It's worth noting that the field is vast and continuously evolving, with other advanced methods such as Long Short-Term Memory (LSTM) networks for deep learning, the Fourier Transform for frequency-domain analysis, and various ensemble methods that combine multiple similarity metrics. Due to the breadth and depth of the subject, it's challenging to dive into every technique in a single post, but the current state of the field offers a rich set of tools for applications ranging from finance and healthcare to engineering and beyond.
While traditional methods of time series analysis may pose challenges for daily anomaly detection in digital analytics, solutions like the Baresquare platform are designed to address these complexities. Our platform performs 24/7 anomaly detection, tracks site or campaign performance, and surfaces issues before they become problems, providing you with a robust and hassle-free way to keep your digital analytics on track.