This notebook presents basic usage examples of the XPandas package.

Example dataset

from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

import numpy as np
import pandas as pd
import os, sys
import requests

sys.path.insert(0, '..')

from xpandas.data_container import *
from xpandas.transformers import TimeSeriesTransformer, TimeSeriesWindowTransformer

The usage example shown here is based on an open-source time series data set.

The first thing we need to do is read the data. Here, we use the urlopen function from Python’s built-in urllib to download the data set, and then skip the first series_offset lines of the file.

url = "http://timeseriesclassification.com/Downloads/FordA.zip"
series_offset = 505

url = urlopen(url)
zipfile = ZipFile(BytesIO(url.read()))
lines = zipfile.open('FordA/FordA.csv').readlines()
lines = [l.decode('utf-8') for l in lines]
lines = lines[series_offset:]

lines is now a list of strings, each representing a time series in a comma-separated format, which we can convert into floats:

lines = [list(map(float, l.split(','))) for l in lines]

lines[0][:10]

[1.1871,
 0.4096,
 -0.43154,
 -1.231,
 -1.9055,
 -2.3824,
 -2.588,
 -2.5018,
 -2.1353,
 -1.574]
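The download-and-parse steps above can also be tried offline. The following is only a minimal sketch: it builds a tiny zip archive in memory and parses it the same way. The archive member `demo/demo.csv`, its two header lines, and the helper `parse_series_lines` are made up for illustration and are not part of the FordA data set or of XPandas.

```python
from io import BytesIO
from zipfile import ZipFile


def parse_series_lines(raw_lines, offset):
    """Decode byte lines, skip the first `offset` lines, and parse floats."""
    text_lines = [raw.decode('utf-8') for raw in raw_lines][offset:]
    return [list(map(float, line.split(','))) for line in text_lines]


# Build a tiny archive in memory (hypothetical content: 2 header lines + 2 series).
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('demo/demo.csv',
                     'header a\nheader b\n1.0,2.5,-3.0\n0.5,0.25,4.0\n')

# Read it back the same way the notebook reads FordA.zip.
buffer.seek(0)
with ZipFile(buffer) as archive:
    raw_lines = archive.open('demo/demo.csv').readlines()

demo_series = parse_series_lines(raw_lines, offset=2)
print(demo_series)
```

Note that float() tolerates the trailing newline on the last field of each line, which is why no explicit strip is needed.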

Let’s convert each embedded list into a more convenient pandas.Series object.

lines = [pd.Series(l) for l in lines]

lines[0][:10]

0    1.18710
1    0.40960
2   -0.43154
3   -1.23100
4   -1.90550
5   -2.38240
6   -2.58800
7   -2.50180
8   -2.13530
9   -1.57400
dtype: float64

XPandas: Data structures

XSeries

XSeries is a 1d data container that can store any objects inside.

Using the pandas.Series objects, we can encapsulate the list lines in an XSeries object. The object has a global index over the series and a sub-index for each pandas.Series.

X = XSeries(lines)

X.head()

0    0    1.187100
     1    0.409600
     2   -0.43154...
1    0    0.094261
     1    0.310310
     2    0.53060...
2    0   -1.157000
     1   -1.592600
     2   -1.50960...
3    0    0.356960
     1    0.300850
     2    0.24314...
4    0    0.307980
     1    0.370350
     2    0.26015...
dtype: object
data_type: <class 'pandas.core.series.Series'>

The output reveals the data_type property of the XSeries object, which holds the type of the contained objects, in this case pandas.Series. The XSeries is thus built up of pandas.Series objects. X itself also supports the methods of pandas.Series, such as head().
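To make the idea of a global index plus a sub-index concrete, here is a small sketch that uses a plain pandas.Series of pandas.Series objects as a stand-in. It is not the XPandas class itself, and the `element_types` set is only a rough analogue of the data_type property.

```python
import pandas as pd

# Plain-pandas stand-in for an XSeries of pandas.Series; this is NOT the XPandas class.
outer = pd.Series([
    pd.Series([1.0, 2.0, 3.0]),
    pd.Series([10.0, 20.0, 30.0]),
])

# The global index selects a whole series; the sub-index selects a value inside it.
first_series = outer[0]
value = outer[1][2]

# Rough analogue of the data_type property: the set of element types.
element_types = {type(element) for element in outer}

print(type(first_series).__name__, value, element_types)
```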

XDataFrame

XDataFrame is an abstract 2d container that is based on pandas.DataFrame and stores XSeries objects.

The main feature of the XDataFrame is its columns of XSeries, which can contain and manage any data_type. For example, one may have a data set consisting of series, images, texts, plain numbers, or even custom objects. Ideally, we would want to handle such different data types in a unified 2d data container and, for example, apply a chain of transformers to create a simple 2d matrix of training data.

The following example illustrates such an XDataFrame workflow.

Let Y be a vector of labels for each row.

Y = np.random.binomial(1, 0.5, X.shape[0])
Y = XSeries(Y)

df = XDataFrame({
    'X': X,
    'Y': Y
})

df.head()

	X	Y
0	0 1.187100 1 0.409600 2 -0.43154...	1
1	0 0.094261 1 0.310310 2 0.53060...	1
2	0 -1.157000 1 -1.592600 2 -1.50960...	0
3	0 0.356960 1 0.300850 2 0.24314...	1
4	0 0.307980 1 0.370350 2 0.26015...	1

Add a new column to the XDataFrame:

df['X_1'] = XSeries([
    pd.Series(np.random.normal(size=100))
    for _ in range(X.shape[0])
])

XPandas: Transformers

A major motivation for this project is the common data science task of extracting features from complex objects (for example, series) before proceeding with machine learning.

Given an XSeries of pandas.Series, one would, for instance, like to extract features from each series. That’s where Transformers play a vital role.

Each Transformer object supports the fit and transform methods, just like scikit-learn transformers.

Let’s explore some examples.

TimeSeriesWindowTransformer

This transformer calculates a moving average with a given window size.

tr = TimeSeriesWindowTransformer(windows_size=5)
tr.fit(X)
transformed_series = tr.transform(X)

transformed_series.head()

0    4   -0.394268
     5   -1.108168
     6   -1.70768...
1    4    0.509686
     5    0.680500
     6    0.80574...
2    4   -1.098344
     5   -0.755320
     6   -0.21608...
3    4    0.234223
     5    0.165730
     6    0.09269...
4    4    0.202701
     5    0.154336
     6    0.14082...
dtype: object
data_type: <class 'pandas.core.series.Series'>

Note that with windows_size = 5 the first 4 positions have no complete window, so each transformed series starts at index 4.

transformed_series[0].head(10)

4    -0.394268
5    -1.108168
6    -1.707688
7    -2.121740
8    -2.302600
9    -2.236300
10   -1.942152
11   -1.469980
12   -0.891442
13   -0.287676
dtype: float64
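The first few values above can be reproduced with plain pandas. This sketch assumes that the transformer behaves like a rolling mean with the incomplete windows dropped; that assumption is consistent with the printed output but is not taken from the XPandas source. The input is the first ten values of the first series, as printed earlier.

```python
import pandas as pd

# First ten values of the first FordA series, copied from the output shown earlier.
values = pd.Series([1.1871, 0.4096, -0.43154, -1.231, -1.9055,
                    -2.3824, -2.588, -2.5018, -2.1353, -1.574])

# Assumption: a rolling mean whose incomplete leading windows are dropped.
window_size = 5
moving_average = values.rolling(window=window_size).mean().dropna()

print(moving_average.round(6))
```

With ten input values and a window of five, six averages remain, and the first one sits at index 4.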

TimeSeriesTransformer

Let’s try another transformer, probably the most common one. It extracts several quantitative features from each pandas.Series, such as the mean, standard deviation, and quantiles. You can also pass your own list of features. As a result we get an XDataFrame object.

tr = TimeSeriesTransformer()
tr.fit(X)
transformed_series = tr.transform(X)

type(transformed_series)

xpandas.data_container.data_container.XDataFrame

transformed_series.head().iloc[:, :3]

	None_TimeSeriesTransformer_max	None_TimeSeriesTransformer_mean	None_TimeSeriesTransformer_median
0	2.5263	0.001995	0.011186
1	2.6291	0.001997	-0.024726
2	2.6072	-0.001996	0.060685
3	2.6431	-0.001997	-0.022668
4	3.2398	-0.001995	-0.048518
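Conceptually, this kind of feature extraction turns each series into one row of scalar features. The sketch below shows the idea with plain pandas; the helper `extract_features` and its three features are illustrative choices and do not reflect how TimeSeriesTransformer is implemented.

```python
import pandas as pd


def extract_features(series):
    """Summarise one series as a row of scalar features (hypothetical helper)."""
    return pd.Series({
        'max': series.max(),
        'mean': series.mean(),
        'median': series.median(),
    })


# Two short toy series instead of the FordA data.
collection = [pd.Series([1.0, 2.0, 6.0]), pd.Series([-1.0, 0.0, 4.0])]

# One row per input series, one column per feature.
features = pd.DataFrame([extract_features(s) for s in collection])
print(features)
```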

We can also make use of the TSFresh transformer:

from xpandas.transformers import TsFreshSeriesTransformer

tr = TsFreshSeriesTransformer()
tr.fit(X.head())
transformed_series = tr.transform(X.head())

transformed_series.head().iloc[:, :3]

	None__abs_energy	None__absolute_sum_of_changes	None__agg_autocorrelation__f_agg_"mean"
0	500.000126	134.513280	-0.012049
1	499.999290	114.289925	0.003075
2	500.001514	164.089622	-0.013172
3	499.999445	103.510040	-0.005639
4	500.003011	154.299542	0.001552

Custom inline Transformer

One can also create a custom transformer inline like this:

from xpandas.transformers import XSeriesTransformer

my_awesome_transfomer = XSeriesTransformer(transform_function=lambda x: x.std())

my_awesome_transfomer.fit(X)

XSeriesTransformer(data_types=None, name='XSeriesTransformer',
          transform_function=<function <lambda> at 0x11929ad90>)

my_awesome_transfomer.transform(X).head()

0    0.999998
1    0.999997
2    1.000000
3    0.999997
4    1.000001
dtype: float64
data_type: <class 'numpy.float64'>

If you want to create a custom transformer with more complex logic, please take a look at the internal implementation of the transformers.
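As a rough picture of what a function-wrapping transformer does, here is a minimal sketch of the fit/transform pattern in plain Python and pandas. The class `FunctionTransformerSketch` is hypothetical and written only for this illustration; it is not the XSeriesTransformer implementation.

```python
import pandas as pd


class FunctionTransformerSketch:
    """Hypothetical stand-in: wraps a function in a fit/transform interface."""

    def __init__(self, transform_function):
        self.transform_function = transform_function

    def fit(self, X, y=None):
        # A stateless function has nothing to learn; return self to allow chaining.
        return self

    def transform(self, X):
        # Apply the function to every element and keep the original index.
        return pd.Series([self.transform_function(element) for element in X],
                         index=X.index)


collection = pd.Series([pd.Series([1.0, 2.0, 3.0]), pd.Series([2.0, 4.0, 6.0])])
std_transformer = FunctionTransformerSketch(transform_function=lambda s: s.std())
result = std_transformer.fit(collection).transform(collection)
print(result)
```

Returning self from fit is the convention that lets scikit-learn style objects be chained as fit(...).transform(...).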

To transform an XDataFrame, one has to specify the transformation logic for the columns that should be transformed, using an XDataFrameTransformer.

The constructor of XDataFrameTransformer takes a dictionary that maps {col_name: XSeries transformer}.

For example, let’s apply TimeSeriesWindowTransformer to the X column and TimeSeriesTransformer to the X_1 column.

When a transformation is applied to a column, the column is replaced by its transformed version.

from xpandas.transformers import XDataFrameTransformer

df_transformer = XDataFrameTransformer({
    'X': TimeSeriesWindowTransformer(windows_size=4),
    'X_1': TimeSeriesTransformer()
})

df_transformer.fit(df)

XDataFrameTransformer(transformations={'X': [TimeSeriesWindowTransformer(windows_size=4)], 'X_1': [TimeSeriesTransformer(features=None)]})

transformed_df = df_transformer.transform(df)

transformed_df.head().iloc[:, :3]

	X_TimeSeriesWindowTransformer	Y	X_1_TimeSeriesTransformer_max
0	3 -0.016460 4 -0.789610 5 -1.48761...	1	2.383478
1	3 0.416408 4 0.613542 5 0.77304...	1	2.451725
2	3 -1.315175 4 -1.083680 5 -0.54600...	0	2.164009
3	3 0.268788 4 0.203539 5 0.13194...	1	2.951486
4	3 0.255629 4 0.176381 5 0.10033...	1	2.453836
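The column-wise behaviour seen above (mapped columns are replaced, unmapped columns pass through, and one column may expand into several prefixed ones) can be sketched with plain pandas. The helper `apply_column_transformers`, the `summarise` function, and the exact column naming are assumptions made for this illustration, not the XDataFrameTransformer implementation.

```python
import pandas as pd


def apply_column_transformers(frame, transformations):
    """Replace each mapped column with its transformed output (hypothetical helper)."""
    pieces = []
    for column in frame.columns:
        if column not in transformations:
            pieces.append(frame[[column]])  # unmapped columns pass through unchanged
            continue
        transformed = transformations[column](frame[column])
        if isinstance(transformed, pd.DataFrame):
            # One input column may expand into several prefixed feature columns.
            pieces.append(transformed.add_prefix(column + '_'))
        else:
            pieces.append(transformed.to_frame(column))
    return pd.concat(pieces, axis=1)


def summarise(column):
    """Turn a column of series into a frame of scalar features."""
    return pd.DataFrame({
        'max': [s.max() for s in column],
        'mean': [s.mean() for s in column],
    }, index=column.index)


frame = pd.DataFrame({
    'series': pd.Series([pd.Series([1.0, 3.0]), pd.Series([2.0, 6.0])]),
    'label': pd.Series([0, 1]),
})
result = apply_column_transformers(frame, {'series': summarise})
print(result)
```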

Well, that’s a nice transformer, but can I create pipelines as in scikit-learn?

Sure! Let’s look at an example where we combine TimeSeriesTransformer and TimeSeriesWindowTransformer into a single pipeline using a PipeLineChain.

First, let’s see an example of PipeLineChain with an XSeries, and then one with an XDataFrame.

from xpandas.transformers import PipeLineChain

chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer())
])
chain.fit(X)

PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None))])

chain.get_params

<bound method Pipeline.get_params of PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None))])>

transformed_X = chain.transform(X)

transformed_X.head().iloc[:, :2]

	None_TimeSeriesWindowTransformer_TimeSeriesTransformer_max	None_TimeSeriesWindowTransformer_TimeSeriesTransformer_mean
0	2.16144	0.002078
1	2.39636	-0.002229
2	2.32512	0.005656
3	2.44430	0.000632
4	2.64094	-0.001295
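The chaining idea itself is simple: the output of one step becomes the input of the next. The sketch below shows this with two plain functions and a hypothetical `run_chain` helper. It works element by element and leaves out the fit step, so it is only an illustration of the data flow, not of how PipeLineChain works internally.

```python
import pandas as pd


def moving_average(series, window_size):
    """Rolling mean with incomplete leading windows dropped."""
    return series.rolling(window=window_size).mean().dropna()


def summarise(series):
    """Reduce one series to a row of scalar features."""
    return pd.Series({'max': series.max(), 'mean': series.mean()})


def run_chain(collection, steps):
    """Feed the output of each step into the next one (hypothetical helper)."""
    rows = []
    for series in collection:
        for step in steps:
            series = step(series)
        rows.append(series)
    return pd.DataFrame(rows)


collection = [pd.Series([1.0, 3.0, 5.0, 7.0]), pd.Series([2.0, 2.0, 2.0, 2.0])]
steps = [lambda s: moving_average(s, window_size=2), summarise]
features = run_chain(collection, steps)
print(features)
```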

All right! Let’s try adding a scikit-learn transformer to the PipeLineChain. For example, let’s run PCA on transformed_X.

from sklearn.decomposition import PCA

chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5))
])
chain.fit(X)

PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None)), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False))])

transformed_X = chain.transform(X)

transformed_X.head()

	0	1	2	3	4
0	-0.133152	-0.242552	0.097523	-0.004435	-0.009747
1	-0.125413	0.076021	-0.089267	0.010531	0.017437
2	-0.028607	-0.088828	0.205043	0.098009	0.032338
3	0.071478	-0.058813	-0.247669	-0.023550	-0.052968
4	0.200611	0.110884	0.064200	0.012187	-0.038497

Let’s do something even more interesting: add a scikit-learn estimator at the end of the PipeLineChain!

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, Y)

Make sure that X_train and X_test are both XSeries objects.

print(type(X_train))
print(type(X_test))

<class 'xpandas.data_container.data_container.XSeries'>
<class 'xpandas.data_container.data_container.XSeries'>

chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5)),
    ('logit_regression', LogisticRegression())
])
chain = chain.fit(X_train, y_train)

prediction = chain.predict(X_test)

accuracy_score(y_test, prediction)

0.5004061738424046

An accuracy of about 0.5 is expected here, because the labels Y were drawn at random and carry no information about the series.

Let’s now try PipeLineChain with XDataFrameTransformer.

Imagine a data set with the feature columns gender (0 or 1), age (int), and series (pandas.Series), plus a target (0 or 1). Let’s create a PipeLineChain that extracts features from series, performs PCA over the whole feature set, and then performs LogisticRegression classification.

n = 100

df_features = XDataFrame({
    'gender': XSeries(np.random.binomial(1, 0.7, n)),
    'age': XSeries(np.random.poisson(25, n)),
    # Build n independent random series; `[series] * n` would repeat one object n times.
    'series': XSeries([
        pd.Series(np.random.normal(size=500))
        for _ in range(n)
    ])
})

target = XSeries(np.random.binomial(1, 0.45, n))

features_transformer = XDataFrameTransformer({
    'series': TimeSeriesTransformer()
})

pipe_line = PipeLineChain([
    ('extract_from_series', features_transformer),
    ('pca', PCA(n_components=5)),
    ('logit_regression', LogisticRegression())
])

df_features_train, df_features_test, \
        y_train, y_test = train_test_split(df_features, target)

pipe_line.fit(df_features_train, y_train)

PipeLineChain(steps=[('extract_from_series', XDataFrameTransformer(transformations={'series': [TimeSeriesTransformer(features=None)]})), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logit_regression', LogisticRegression(C=1.0, cla...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

pipe_line.predict(df_features_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0])