This notebook presents basic usage examples of the XPandas package.
Example dataset¶
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
import numpy as np
import pandas as pd
import os, sys
import requests
sys.path.insert(0, '..')
from xpandas.data_container import *
from xpandas.transformers import TimeSeriesTransformer, TimeSeriesWindowTransformer
The usage example is based on an open-source time series data set.
The first thing we need to do is read the data. Here, we use the urlopen function from Python’s built-in urllib to download the data set, and then skip the first series_offset lines of the file.
url = "http://timeseriesclassification.com/Downloads/FordA.zip"
series_offset = 505
url = urlopen(url)
zipfile = ZipFile(BytesIO(url.read()))
lines = zipfile.open('FordA/FordA.csv').readlines()
lines = [l.decode('utf-8') for l in lines]
lines = lines[series_offset:]
lines is now a list of strings, each representing a time series in comma-separated format, which we can convert into floats:
lines = [list(map(float, l.split(','))) for l in lines]
lines[0][:10]
[1.1871,
0.4096,
-0.43154,
-1.231,
-1.9055,
-2.3824,
-2.588,
-2.5018,
-2.1353,
-1.574]
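The read-and-parse steps above can be tried offline with a tiny archive built in memory. This is only a sketch: the file name demo/demo.csv, the two header lines, and the numbers are made up for illustration and are not taken from the FordA data set.

```python
from io import BytesIO
from zipfile import ZipFile

# Hypothetical stand-in for the downloaded archive: two header lines
# followed by two comma-separated series (all values are invented).
header = ["@comment line 1", "@comment line 2"]
rows = ["1.0,2.5,-3.0", "0.5,0.25,0.125"]

buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("demo/demo.csv", "\n".join(header + rows) + "\n")

# Same steps as in the notebook: open the zip from bytes, read the lines,
# decode them, skip the leading lines, and convert each row to floats.
header_offset = len(header)
with ZipFile(BytesIO(buffer.getvalue())) as archive:
    with archive.open("demo/demo.csv") as handle:
        raw_lines = handle.readlines()

text_lines = [raw.decode("utf-8") for raw in raw_lines][header_offset:]
parsed = [list(map(float, line.split(","))) for line in text_lines]
print(parsed)
```

float() ignores the trailing newline left by readlines(), which is why the notebook can split on commas without stripping each line first.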
Let’s convert each embedded list into a more convenient pandas.Series object.
lines = [pd.Series(l) for l in lines]
lines[0][:10]
0 1.18710
1 0.40960
2 -0.43154
3 -1.23100
4 -1.90550
5 -2.38240
6 -2.58800
7 -2.50180
8 -2.13530
9 -1.57400
dtype: float64
XPandas: Data structures¶
XSeries¶
XSeries is a 1d data container that can store any objects inside. Using the pandas.Series objects, we can encapsulate the list lines into an XSeries object. The object has a global index over the series and a sub-index within each pandas.Series.
X = XSeries(lines)
X.head()
0 0 1.187100
1 0.409600
2 -0.43154...
1 0 0.094261
1 0.310310
2 0.53060...
2 0 -1.157000
1 -1.592600
2 -1.50960...
3 0 0.356960
1 0.300850
2 0.24314...
4 0 0.307980
1 0.370350
2 0.26015...
dtype: object
data_type: <class 'pandas.core.series.Series'>
The output reveals the data_type property of the XSeries object, which holds the type of the contained objects, in this case pandas.Series. The XSeries is thus built up of pandas.Series objects. Specifically, X supports all methods of pandas.Series.
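To make the "global index plus sub-index" idea concrete, here is a minimal sketch in plain pandas. It is not the XSeries implementation, only an object-dtype pandas.Series whose cells are themselves pandas.Series; the values are invented.

```python
import pandas as pd

# Two inner series of different lengths (made-up values).
inner = [pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0, 5.0])]

# An outer Series with dtype object; each cell holds a whole inner Series.
outer = pd.Series(inner)

# The outer index selects a series, the inner index selects a value in it.
second_series = outer.iloc[1]
third_value = second_series.iloc[2]

# Ordinary Python iteration walks over the contained objects.
lengths = [len(s) for s in outer]
print(outer.dtype, lengths, third_value)
```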
XDataFrame¶
XDataFrame is an abstract 2d container that is based on pandas.DataFrame and stores XSeries objects.
The main feature of the XDataFrame is its columns of XSeries, which can contain and manage any data_type. For example, one may have a data set consisting of series, images, texts, plain numbers, or even custom objects. Ideally, we would want to handle such different data types in a unified 2d data container, for example by applying a chain of transformers to create a simple 2d matrix of training data.
The following example illustrates such an XDataFrame workflow.
Let Y be a vector of labels for each row.
Y = np.random.binomial(1, 0.5, X.shape[0])
Y = XSeries(Y)
df = XDataFrame({
'X': X,
'Y': Y
})
df.head()
 | X | Y |
---|---|---|
0 | 0 1.187100 1 0.409600 2 -0.43154... | 1 |
1 | 0 0.094261 1 0.310310 2 0.53060... | 1 |
2 | 0 -1.157000 1 -1.592600 2 -1.50960... | 0 |
3 | 0 0.356960 1 0.300850 2 0.24314... | 1 |
4 | 0 0.307980 1 0.370350 2 0.26015... | 1 |
Add a new column to the XDataFrame:
df['X_1'] = XSeries([
pd.Series(np.random.normal(size=100))
for _ in range(X.shape[0])
])
XPandas: Transformers¶
A major motivation for this project is the common data science task of extracting features from complex objects (for example, series) before proceeding with machine learning.
Given an XSeries of pandas.Series, one would, for instance, like to extract features from each series. That’s where Transformers play a vital role.
Each Transformer object supports fit and transform methods, just like scikit-learn transformers.
Let’s explore some examples.
TimeSeriesWindowTransformer¶
This transformer calculates a moving average with a given window size.
tr = TimeSeriesWindowTransformer(windows_size=5)
tr.fit(X)
transformed_series = tr.transform(X)
transformed_series.head()
0 4 -0.394268
5 -1.108168
6 -1.70768...
1 4 0.509686
5 0.680500
6 0.80574...
2 4 -1.098344
5 -0.755320
6 -0.21608...
3 4 0.234223
5 0.165730
6 0.09269...
4 4 0.202701
5 0.154336
6 0.14082...
dtype: object
data_type: <class 'pandas.core.series.Series'>
Of course, with windows_size = 5 the moving average is undefined for the first 4 elements, so each transformed series starts at index 4.
transformed_series[0].head(10)
4 -0.394268
5 -1.108168
6 -1.707688
7 -2.121740
8 -2.302600
9 -2.236300
10 -1.942152
11 -1.469980
12 -0.891442
13 -0.287676
dtype: float64
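The numbers above can be checked by hand with plain pandas. The sketch below takes the first ten values of the first series shown earlier and computes a rolling mean with a window of 5. Whether TimeSeriesWindowTransformer uses rolling internally is an assumption; the sketch only shows that the same arithmetic gives the same leading values.

```python
import pandas as pd

# First ten values of the first series, copied from the output above.
values = pd.Series([1.1871, 0.4096, -0.43154, -1.231, -1.9055,
                    -2.3824, -2.588, -2.5018, -2.1353, -1.574])

window = 5
rolling_mean = values.rolling(window).mean()  # first window - 1 entries are NaN
trimmed = rolling_mean.dropna()               # index now starts at window - 1

print(rolling_mean.isna().sum())  # number of undefined leading entries
print(trimmed.head(3))
```

Dropping the NaN entries is what makes the index of the result start at 4 rather than 0.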
TimeSeriesTransformer¶
Let’s try another transformer, probably the most common one. It extracts several quantitative features from each pandas.Series, such as mean, std, and quantiles. You can also pass your own list of features. As a result, we retrieve an XDataFrame object.
tr = TimeSeriesTransformer()
tr.fit(X)
transformed_series = tr.transform(X)
type(transformed_series)
xpandas.data_container.data_container.XDataFrame
transformed_series.head().iloc[:, :3]
 | None_TimeSeriesTransformer_max | None_TimeSeriesTransformer_mean | None_TimeSeriesTransformer_median |
---|---|---|---|
0 | 2.5263 | 0.001995 | 0.011186 |
1 | 2.6291 | 0.001997 | -0.024726 |
2 | 2.6072 | -0.001996 | 0.060685 |
3 | 2.6431 | -0.001997 | -0.022668 |
4 | 3.2398 | -0.001995 | -0.048518 |
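As a rough picture of what per-series feature extraction produces, the following plain-pandas sketch turns a list of series into one row of summary statistics per series. It is not the TimeSeriesTransformer implementation; the choice of features, the column names, and the data are assumptions made for illustration.

```python
import pandas as pd

# Two short made-up series of different lengths.
series_list = [pd.Series([1.0, 2.0, 3.0, 4.0]),
               pd.Series([10.0, 0.0, -10.0])]

# One row per input series, one column per summary statistic.
features = pd.DataFrame({
    "max": [s.max() for s in series_list],
    "mean": [s.mean() for s in series_list],
    "median": [s.median() for s in series_list],
    "std": [s.std() for s in series_list],  # sample standard deviation (ddof=1)
})
print(features)
```

The result is an ordinary 2d table of numbers, which is the shape that downstream scikit-learn estimators expect.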
We can also make use of the TSFresh transformer:
from xpandas.transformers import TsFreshSeriesTransformer
tr = TsFreshSeriesTransformer()
tr.fit(X.head())
transformed_series = tr.transform(X.head())
transformed_series.head().iloc[:, :3]
 | None__abs_energy | None__absolute_sum_of_changes | None__agg_autocorrelation__f_agg_"mean" |
---|---|---|---|
0 | 500.000126 | 134.513280 | -0.012049 |
1 | 499.999290 | 114.289925 | 0.003075 |
2 | 500.001514 | 164.089622 | -0.013172 |
3 | 499.999445 | 103.510040 | -0.005639 |
4 | 500.003011 | 154.299542 | 0.001552 |
Custom inline Transformer¶
One can also create a custom transformer inline, like this:
from xpandas.transformers import XSeriesTransformer
my_awesome_transfomer = XSeriesTransformer(transform_function=lambda x: x.std())
my_awesome_transfomer.fit(X)
XSeriesTransformer(data_types=None, name='XSeriesTransformer',
transform_function=<function <lambda> at 0x11929ad90>)
my_awesome_transfomer.transform(X).head()
0 0.999998
1 0.999997
2 1.000000
3 0.999997
4 1.000001
dtype: float64
data_type: <class 'numpy.float64'>
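The pattern behind such an inline transformer is small enough to sketch without the library. The class below, FunctionTransformerSketch, is a hypothetical stand-in and not part of XPandas: it follows the scikit-learn convention of a fit method that returns self and a transform method that applies a user-supplied function to every element.

```python
import pandas as pd


class FunctionTransformerSketch:
    """Hypothetical element-wise transformer with a fit/transform interface."""

    def __init__(self, transform_function):
        self.transform_function = transform_function

    def fit(self, X, y=None):
        # Nothing to learn for a stateless function; return self for chaining.
        return self

    def transform(self, X):
        # Apply the function to each contained object and collect the results.
        return pd.Series([self.transform_function(item) for item in X])


# Made-up input: two short series.
series_list = [pd.Series([1.0, 2.0, 3.0]), pd.Series([2.0, 2.0, 2.0])]

std_transformer = FunctionTransformerSketch(lambda s: s.std())
result = std_transformer.fit(series_list).transform(series_list)
print(result)
```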
If you want to create a custom transformer with more complex logic, please take a look at the internal implementation of the transformers.
To transform an XDataFrame, one has to specify the transformation logic for the columns that should be transformed, using an XDataFrameTransformer.
The constructor of XDataFrameTransformer takes a mapping dictionary of the form {col_name: XSeries transformer}.
For example, let’s apply TimeSeriesWindowTransformer to the X column and TimeSeriesTransformer to the X_1 column.
When a transformation is applied to a column, the column is replaced with its transformed version.
from xpandas.transformers import XDataFrameTransformer
df_transformer = XDataFrameTransformer({
'X': TimeSeriesWindowTransformer(windows_size=4),
'X_1': TimeSeriesTransformer()
})
df_transformer.fit(df)
XDataFrameTransformer(transformations={'X': [TimeSeriesWindowTransformer(windows_size=4)], 'X_1': [TimeSeriesTransformer(features=None)]})
transformed_df = df_transformer.transform(df)
transformed_df.head().iloc[:, :3]
 | X_TimeSeriesWindowTransformer | Y | X_1_TimeSeriesTransformer_max |
---|---|---|---|
0 | 3 -0.016460 4 -0.789610 5 -1.48761... | 1 | 2.383478 |
1 | 3 0.416408 4 0.613542 5 0.77304... | 1 | 2.451725 |
2 | 3 -1.315175 4 -1.083680 5 -0.54600... | 0 | 2.164009 |
3 | 3 0.268788 4 0.203539 5 0.13194... | 1 | 2.951486 |
4 | 3 0.255629 4 0.176381 5 0.10033... | 1 | 2.453836 |
Well, that’s a nice transformer, but can I create pipelines as in scikit-learn?
Sure! Let’s look at an example where we combine TimeSeriesTransformer and TimeSeriesWindowTransformer into a single pipeline using a PipeLineChain.
First, let’s see an example of PipeLineChain with XSeries, and then with XDataFrame.
from xpandas.transformers import PipeLineChain
chain = PipeLineChain([
('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
('extract features', TimeSeriesTransformer())
])
chain.fit(X)
PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None))])
chain.get_params
<bound method Pipeline.get_params of PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None))])>
transformed_X = chain.transform(X)
transformed_X.head().iloc[:, :2]
 | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_max | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_mean |
---|---|---|
0 | 2.16144 | 0.002078 |
1 | 2.39636 | -0.002229 |
2 | 2.32512 | 0.005656 |
3 | 2.44430 | 0.000632 |
4 | 2.64094 | -0.001295 |
All right! Let’s try adding a scikit-learn transformer to the PipeLineChain. For example, let’s do PCA on transformed_X.
from sklearn.decomposition import PCA
chain = PipeLineChain([
('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
('extract features', TimeSeriesTransformer()),
('pca', PCA(n_components=5))
])
chain.fit(X)
PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None)), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
svd_solver='auto', tol=0.0, whiten=False))])
transformed_X = chain.transform(X)
transformed_X.head()
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | -0.133152 | -0.242552 | 0.097523 | -0.004435 | -0.009747 |
1 | -0.125413 | 0.076021 | -0.089267 | 0.010531 | 0.017437 |
2 | -0.028607 | -0.088828 | 0.205043 | 0.098009 | 0.032338 |
3 | 0.071478 | -0.058813 | -0.247669 | -0.023550 | -0.052968 |
4 | 0.200611 | 0.110884 | 0.064200 | 0.012187 | -0.038497 |
Let’s do even more interesting things: adding a scikit-learn estimator at the end of the PipeLineChain!
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, Y)
Make sure that the types of X_train and X_test are XSeries.
print(type(X_train))
print(type(X_test))
<class 'xpandas.data_container.data_container.XSeries'>
<class 'xpandas.data_container.data_container.XSeries'>
chain = PipeLineChain([
('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
('extract features', TimeSeriesTransformer()),
('pca', PCA(n_components=5)),
('logit_regression', LogisticRegression())
])
chain = chain.fit(X_train, y_train)
prediction = chain.predict(X_test)
accuracy_score(y_test, prediction)
0.5004061738424046
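For comparison, the same three-stage shape (feature extraction, PCA, logistic regression) can be sketched with plain scikit-learn on synthetic data. This is an analogue under stated assumptions, not the PipeLineChain itself: the helper summary_features, the random data, and the random labels are all made up, and the series are stored as rows of a NumPy array instead of an XSeries.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def summary_features(series_matrix):
    """Hypothetical helper: map each row (one series) to summary statistics."""
    return np.column_stack([
        series_matrix.max(axis=1),
        series_matrix.min(axis=1),
        series_matrix.mean(axis=1),
        series_matrix.std(axis=1),
        np.median(series_matrix, axis=1),
    ])


rng = np.random.default_rng(0)
series_matrix = rng.normal(size=(200, 50))  # 200 synthetic series of length 50
labels = rng.integers(0, 2, size=200)       # random labels, as in the notebook

pipeline = Pipeline([
    ("extract features", FunctionTransformer(summary_features)),
    ("pca", PCA(n_components=3)),
    ("logit_regression", LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    series_matrix, labels, random_state=0)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions[:10])
```

Because the labels are drawn at random and carry no signal, accuracy near 0.5 is the expected outcome here, just as in the notebook result above.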
Let’s now try PipeLineChain with XDataFrameTransformer.
Imagine a data set with the feature columns gender (0 or 1), age (int), and series (pandas.Series), plus a target (0 or 1). Let’s create a PipeLineChain that extracts features from series, performs PCA over the whole feature set, and then performs LogisticRegression classification.
n = 100
df_features = XDataFrame({
'gender': XSeries(np.random.binomial(1, 0.7, n)),
'age': XSeries(np.random.poisson(25, n)),
'series': XSeries([
pd.Series(np.random.normal(size=500))
for _ in range(n)  # one independent series per row; `[...] * n` would repeat a single series
])
})
target = XSeries(np.random.binomial(1, 0.45, n))
features_transformer = XDataFrameTransformer({
'series': TimeSeriesTransformer()
})
pipe_line = PipeLineChain([
('extract_from_series', features_transformer),
('pca', PCA(n_components=5)),
('logit_regression', LogisticRegression())
])
df_features_train, df_features_test, \
y_train, y_test = train_test_split(df_features, target)
pipe_line.fit(df_features_train, y_train)
PipeLineChain(steps=[('extract_from_series', XDataFrameTransformer(transformations={'series': [TimeSeriesTransformer(features=None)]})), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('logit_regression', LogisticRegression(C=1.0, cla...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))])
pipe_line.predict(df_features_test)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])