Time series data capture a variable's value repeatedly over time, resulting in a series of data points indexed in time order. Time series data has a natural temporal ordering, i.e., the value of a variable at a particular time often depends on its past values.
Traditional machine learning algorithms are not designed to capture the temporal ordering of time series data. A data scientist needs to perform feature engineering to summarize the important characteristics of the data in a few metrics. Generating a large number of time series features and extracting the relevant ones is a time-consuming and tedious task.
This is where the tsfresh package comes into the picture: it can generate hundreds of standard, generic features for your time series data. In this article, we will discuss the usage and implementation of the tsfresh package in depth.
tsfresh:
tsfresh is an open-source package that can generate hundreds of relevant time series features suitable for training a machine learning model. The features generated by tsfresh can be used for classification, forecasting, and outlier detection use cases.
Getting Started:
The tsfresh package offers various capabilities for feature engineering on time series data, including:
- Feature Generation
- Feature Selection
- Compatibility with large data
Installation & Usage:
tsfresh is an open-source Python package that can be installed using:
pip install -U tsfresh
# or
conda install -c conda-forge tsfresh
1) Feature Generation:
The tsfresh package offers an automated feature generation API that can generate 750+ features from a single time series variable. The generated features cover a wide spectrum, including:
- Descriptive statistics (mean, max, correlation, etc.)
- Physics-based indicators for nonlinearity and complexity
- Digital signal processing-related features
- History compressed features
Usage:
A data scientist doesn’t need to spend time on manual feature engineering. The tsfresh.extract_features() function generates 789 features from multiple domains for a single time series variable.
One can go through the tsfresh documentation to get an overview of extracted features.
2) Feature Selection:
The tsfresh package also offers a hypothesis-test-based feature selection implementation that identifies the features relevant to the target variable. To limit the number of irrelevant features, tsfresh deploys the fresh algorithm (fresh stands for FeatuRe Extraction based on Scalable Hypothesis tests).
With the tsfresh.select_features() function, the user can perform the feature selection.
3) Compatibility with Large Data:
When we have many very large time series, tsfresh also offers APIs to scale feature generation/extraction and feature selection to large amounts of data:
- Multiprocessing: by default, the tsfresh package can parallelize feature generation/extraction and feature selection across multiple cores.
- tsfresh’s own distributed framework: for data that fits into a single machine, the feature calculation can be distributed over multiple machines to speed it up.
- Apache Spark or Dask: for data that does not fit into a single machine.
Here are two wonderful articles by Nils Braun explaining how to use tsfresh with Dask (Article part 1, Article part 2).
Conclusion:
tsfresh is a handy package for generating and selecting relevant features from time series data in a few lines of Python code. It automatically extracts and selects from 750+ field-tested features from multiple domains on your time-based data sample. It saves much of the time a data scientist would otherwise spend on feature engineering.
Time series data is usually quite large, and the tsfresh package helps here as well: its APIs can be applied to large data samples using multiprocessing, Dask, or Spark.
References:
[1] tsfresh documentation: https://tsfresh.readthedocs.io/en/latest/
[2] Nils Braun GitHub Article: https://nils-braun.github.io/tsfresh-on-cluster-1/