My talk on scikit-learn at Statistical Programming DC

Modeling with scikit-learn: common workflows for reproducible results

Overview

  • Describe an example problem
  • Build a couple of basic models
  • Common sklearn workflows:
    • Evaluating performance
    • Improving the model (data transformations, parameter search, etc.)
    • Solidifying the experiments into a repeatable data pipeline
  • Using outside tools

Motivation

What we want our tools to be

Legos

What we have right now

R packages

The way ahead for R?

Carrot R library

The `caret` package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. [...] There are many different modeling functions in R. Some have different syntax for model training and/or prediction. The package started off as a way to provide a uniform interface to the functions themselves, as well as a way to standardize common tasks (such as parameter tuning and variable importance).

scikit-learn's approach

Scikit Learn logo

Consistent API with a cohesive "grammar" of modeling actions:

Nouns

  • Model classes
  • Metrics classes
  • Preprocessing classes
  • Data (as numpy arrays)
  • ...

Verbs

  • fit
  • transform
  • fit_transform
  • score
  • predict
  • ...
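To make the grammar concrete, here's a minimal sketch of how the same verbs apply across otherwise very different classes (the tiny arrays here are stand-ins; real data follows below):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])  # toy feature matrix
y_toy = np.array([0, 0, 1, 1])                              # toy labels

scaler = StandardScaler()               # a preprocessing "noun"
X_scaled = scaler.fit_transform(X_toy)  # verb: fit, then transform, in one call

clf_toy = LogisticRegression()          # a model "noun"
clf_toy.fit(X_scaled, y_toy)            # verb: learn parameters from the data
print clf_toy.predict(X_scaled)         # verb: generate predictions
print clf_toy.score(X_scaled, y_toy)    # verb: evaluate with the default metric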

Classifier comparison

Example problem description:

We'll be using data on blood donations from the UC Irvine Machine Learning Repository. Given a few summary statistics about each donor's previous donations, we'll try to predict a binary outcome: whether the donor gave blood in a particular month (March 2007).

Here's the official description:

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes their blood transfusion service bus to one university in Hsin-Chu City to gather blood donated about every three months. To build an RFMTC model, we selected 748 donors at random from the donor database. These 748 donor data, each one included R (Recency - months since last donation), F (Frequency - total number of donation), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).

Source: https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center

Get the data

From beginning to end, we'll treat the raw data as immutable and base all of our analyses on views and transformations of this "ground truth" data set.

In [1]:
!wget -nc --directory-prefix data \
    https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data
File ‘data/transfusion.data’ already there; not retrieving.

In [2]:
!head data/transfusion.data
Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"
2 ,50,12500,98 ,1
0 ,13,3250,28 ,1
1 ,16,4000,35 ,1
2 ,20,5000,45 ,1
1 ,24,6000,77 ,0
4 ,4,1000,4 ,0
2 ,7,1750,14 ,1
1 ,12,3000,35 ,0
2 ,9,2250,22 ,1

Import the data as a pandas DataFrame

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv('data/transfusion.data')
df.head()
Out[3]:
Recency (months) Frequency (times) Monetary (c.c. blood) Time (months) whether he/she donated blood in March 2007
0 2 50 12500 98 1
1 0 13 3250 28 1
2 1 16 4000 35 1
3 2 20 5000 45 1
4 1 24 6000 77 0
In [4]:
df.shape
Out[4]:
(748, 5)
In [5]:
df.dtypes
Out[5]:
Recency (months)                              int64
Frequency (times)                             int64
Monetary (c.c. blood)                         int64
Time (months)                                 int64
whether he/she donated blood in March 2007    int64
dtype: object
In [6]:
# save the current names in case we need them later
original_column_names = df.columns

# make the names less ugly
names = ['recency', 'frequency', 'cc', 'time', 'donated']
df.columns = names

df.head()
Out[6]:
recency frequency cc time donated
0 2 50 12500 98 1
1 0 13 3250 28 1
2 1 16 4000 35 1
3 2 20 5000 45 1
4 1 24 6000 77 0

Some exploratory visualization

In [7]:
# import our graphics tools
%matplotlib inline
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns  # nice defaults for matplotlib styles
set2 = sns.color_palette('Set2')

# add on some settings from 'Bayesian Methods for Hackers'
plt.style.use('bmh')

# set larger default fonts for presentation-friendliness
mpl.rc('figure', figsize=(10, 8))
mpl.rc('axes', labelsize=16, titlesize=20)
In [8]:
from pandas.tools.plotting import scatter_matrix

axeslist = scatter_matrix(df, alpha=0.8, figsize=(10, 10))
for ax in axeslist.flatten():
    ax.grid(False)
In [9]:
import numpy as np

# create a figure with 4 subplots
fig, axs = plt.subplots(nrows=2, ncols=2)

feature_column_names = df.columns[:-1]
label_column_name = df.columns[-1]

for i, col in enumerate(feature_column_names):
    
    # get the current subplot to work on
    ax = axs.ravel()[i]
    
    # create some random y jitter to add
    jitter = np.random.uniform(low=-0.05, high=0.05, size=len(df))
    
    # plot the data
    ax.scatter(x=df[col], y=df[label_column_name] + jitter,
               c=df.donated, cmap='coolwarm', alpha=0.5)
    
    # label the axes
    ax.set_xlabel(col)
    ax.set_ylabel(label_column_name)
    
plt.tight_layout()
plt.show()

Common tasks with scikit-learn:

Split the data into train and test

In [10]:
from sklearn.cross_validation import train_test_split

# using conventional sklearn variable names
X = df[feature_column_names].astype(float)
y = df.donated.ravel()

# break up the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
In [11]:
X_train.shape
Out[11]:
(374, 4)
In [12]:
y_train.shape
Out[12]:
(374,)
In [13]:
X_test.shape
Out[13]:
(374, 4)
In [14]:
y_test.shape
Out[14]:
(374,)

Fit a simple model using the training data

The basic modeling workflow in sklearn involves instantiating a model object with the desired settings, and then fitting it to given data.

Decision tree

In [15]:
from sklearn.tree import DecisionTreeClassifier

clf_tree = DecisionTreeClassifier(max_depth=3)
clf_tree.fit(X_train, y_train)
Out[15]:
DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=3, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=None, splitter='best')
In [16]:
print 'Score:', clf_tree.score(X_test, y_test)
Score: 0.794117647059
In [17]:
# viz adapted from http://scikit-learn.org/stable/modules/tree.html
import pydot
from sklearn.externals.six import StringIO
from sklearn.tree import export_graphviz

dot_data = StringIO() 
export_graphviz(clf_tree, out_file=dot_data) 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
graph.write_png('output/decision_tree.png')
Out[17]:
True

Decision tree

Logistic regression

In [18]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l2', fit_intercept=True)
clf.fit(X_train, y_train)
Out[18]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
In [19]:
print 'Score:', clf.score(X_test, y_test)
Score: 0.772727272727

Trying a bunch of different models

sklearn tries to use the same syntax for most modeling tasks, so it's pretty easy to plug and play.

Note: in this example we're not putting much thought into which model to use and we're sticking with the default model parameters. The point is not that we would *want* to do this. The point is that all of these very different models have such similar structure and conventions that we can iterate through a bunch of them, fit, and evaluate them while changing none of the syntax.

In [20]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

other_clfs = {
    'Random forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'Naive Bayes': MultinomialNB(),
    'Linear SVC': LinearSVC(),
    'KNN': KNeighborsClassifier(5),
}

# iterating through all of these models we want to fit ...
for name, other_clf in other_clfs.iteritems():
    
    # fit the model with the training data
    print other_clf.fit(X_train, y_train)
    
    # accuracy on the held-out test set
    print '---\nScore:', other_clf.score(X_test, y_test)
    print '\n'
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')
---
Score: 0.756684491979


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0)
---
Score: 0.299465240642


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
---
Score: 0.729946524064


RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)
---
Score: 0.756684491979


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
---
Score: 0.775401069519


Anatomy of a scikit-learn model

The object itself:

In [21]:
clf
Out[21]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Coefficients (if fit):

In [22]:
clf.coef_
Out[22]:
array([[ -9.74764830e-02,   1.49194232e-06,   3.72985587e-04,
         -1.17108498e-02]])

Intercept (if fit):

In [23]:
clf.intercept_
Out[23]:
array([-0.53074433])

Model parameters:

In [24]:
clf.get_params()
Out[24]:
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'penalty': 'l2',
 'random_state': None,
 'tol': 0.0001}

Making predictions for $\hat y$:

In [25]:
clf.predict(X_test)
Out[25]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

Predicting probabilities, $p(y_i = 1)$:

In [26]:
pd.DataFrame(clf.predict_proba(X_test))\
    .head(10)
Out[26]:
0 1
0 0.691294 0.308706
1 0.952584 0.047416
2 0.726055 0.273945
3 0.770998 0.229002
4 0.680134 0.319866
5 0.705628 0.294372
6 0.637364 0.362636
7 0.888941 0.111059
8 0.559386 0.440614
9 0.936608 0.063392

Challenge: evaluating model performance

The default metric for logistic regression is mean accuracy:

$$\mathrm{accuracy}(y, \hat y) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(\hat y_i = y_i)$$
In [27]:
clf.score(X_test, y_test)
Out[27]:
0.77272727272727271
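Under the hood this is just the fraction of correct predictions; here's a quick sanity check (a sketch reusing the fitted clf from above):

y_hat = clf.predict(X_test)
print 'manual accuracy:', (y_hat == y_test).mean()  # matches clf.score(X_test, y_test)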

But what if we want a more holistic view of how well this model works?

Common solution: use cross validation tools and other metrics

Cross validation

Our data set here is fairly small $(N=748)$, so what if we happened to randomly pick a non-representative chunk for testing?

Instead of summarizing the performance just on the test data we want to use $k$-fold cross-validation, splitting the data randomly into chunks — "folds" — and evaluating on each to get an idea of the variation in performance.

In [28]:
from sklearn import cross_validation

# come up with random folds of the data
kf = cross_validation.KFold(len(X), n_folds=5, shuffle=True)

def plot_scores(scores):
    N = len(scores)
    plt.bar(np.arange(1, N + 1) - 0.4, scores, color=set2[2])
    plt.title('{}-fold cross-validation scores'.format(N), fontsize=18)
    plt.xlabel('fold', fontsize=14)
    plt.ylabel('score', fontsize=14)
    plt.xlim(0.5, N + 0.5)
    plt.ylim(0, 1)
    plt.show()

# fit and score the model on each fold in turn; returns one score per fold
scores = cross_validation.cross_val_score(clf, X, y, cv=kf, n_jobs=1)

print 'scores:', scores
print 'average score:', np.mean(scores)
plot_scores(scores)
scores: [ 0.78        0.76666667  0.81333333  0.75167785  0.73154362]
average score: 0.768644295302

Other metrics

We can try all sorts of other metrics, though, such as log loss:

$$\textrm{LogLoss}(y, \hat y) = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]$$
In [29]:
from sklearn.metrics import log_loss

log_loss(y_test, clf.predict_proba(X_test))
Out[29]:
0.45710527540996404

Or the F1 score, the harmonic mean of precision (the fraction of predicted positives that are truly positive) and recall (the fraction of actual positives that are predicted positive):

$$\textrm{F1} = 2\, \frac{\textrm{precision} \cdot \textrm{recall}}{\textrm{precision} + \textrm{recall}}$$
In [30]:
from sklearn.metrics import f1_score

f1_score(y_test, clf.predict(X_test))
Out[30]:
0.12371134020618556

And if we want more than a single summary number, sklearn comes with all sorts of helpful tools like confusion_matrix:

In [31]:
from itertools import permutations
from sklearn.metrics import confusion_matrix

# get the raw confusion matrix
cm = confusion_matrix(y_test, clf.predict(X_test))

# create a dataframe
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'pred {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)
cmdf
Out[31]:
pred 0 pred 1
actual 0 283 5
actual 1 80 6

there are many available ...

metrics.accuracy_score, metrics.auc, metrics.average_precision_score, metrics.classification_report, metrics.confusion_matrix, metrics.f1_score, metrics.fbeta_score, metrics.hamming_loss, metrics.hinge_loss, metrics.jaccard_similarity_score, metrics.log_loss, metrics.matthews_corrcoef, metrics.precision_recall_curve, metrics.precision_recall_fscore_support, metrics.precision_score, metrics.recall_score, metrics.roc_auc_score, metrics.roc_curve, metrics.zero_one_loss, metrics.explained_variance_score, metrics.mean_absolute_error, metrics.mean_squared_error, metrics.r2_score, metrics.adjusted_mutual_info_score, metrics.adjusted_rand_score, metrics.completeness_score, metrics.homogeneity_completeness_v_measure, metrics.homogeneity_score, metrics.mutual_info_score, metrics.normalized_mutual_info_score, metrics.silhouette_score, metrics.silhouette_samples, metrics.v_measure_score, metrics.consensus_score, metrics.pairwise.additive_chi2_kernel, metrics.pairwise.chi2_kernel, metrics.pairwise.distance_metrics, metrics.pairwise.euclidean_distances, metrics.pairwise.kernel_metrics, metrics.pairwise.linear_kernel, metrics.pairwise.manhattan_distances, metrics.pairwise.pairwise_distances, metrics.pairwise.pairwise_kernels, metrics.pairwise.polynomial_kernel, metrics.pairwise.rbf_kernel, metrics.pairwise_distances, metrics.pairwise_distances_argmin, metrics.pairwise_distances_argmin_min

... and they all use the same basic syntax.
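For instance, most of the classification metrics share the metric(y_true, y_pred) calling convention, so swapping one for another is a one-line change (a sketch reusing the fitted clf from above):

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_hat = clf.predict(X_test)
for metric in (accuracy_score, precision_score, recall_score):
    print metric.__name__, metric(y_test, y_hat)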


Refining the model and preparing the data

Challenge: finding reasonable model parameters

The regularization constant, $C$, is unknown. Let's say we want to try several different orders of magnitude, as well as experimenting with L1 regularization (the default for sklearn's logistic regression is L2):

$$C \in \{0.01, 0.1, 1, 10, 100\}$$

$$\textrm{penalty} \in \{\textrm{L1}, \textrm{L2}\}$$

We know that the outcome can differ depending on which settings we choose.

sklearn provides built-in grid search for exploring this parameter space. Using grid search cross-validation, we can try all the combinations and automatically keep the best-performing one.

In [33]:
from sklearn.grid_search import GridSearchCV

params_to_try = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

gs = GridSearchCV(clf, param_grid=params_to_try, cv=5)
gs.fit(X, y)
Out[33]:
GridSearchCV(cv=5,
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)
In [34]:
print "Best parameters:", gs.best_params_
print "Best score:", gs.best_score_
Best parameters: {'penalty': 'l1', 'C': 0.01}
Best score: 0.791443850267

Challenge: data transformations often needed

One thing we might want to do is scale all of the data before fitting any models. As the sklearn docs point out:

Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit: they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

Common solution: use plug-and-play data prep tools

It's painful and error-prone to wing it every time we want to do common transformations. The sklearn toolbox comes with many useful tools for doing this in a consistent way.

Scaling

In [35]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_standardized = scaler.fit_transform(X_train.astype(np.float))
X_train_standardized
Out[35]:
array([[ 1.15376999,  0.35655147,  0.35655147,  2.38280942],
       [-0.56644764,  6.41792654,  6.41792654,  2.50245107],
       [ 0.22749896, -0.60050775, -0.60050775, -0.9671566 ],
       ..., 
       [-0.03714991, -0.12197814, -0.12197814, -0.64811222],
       [ 0.62447226, -0.44099788, -0.44099788, -0.01002345],
       [-0.83109651,  0.03753173,  0.03753173, -0.56835112]])
In [36]:
print 'column means:', np.round(X_train_standardized.mean(axis=0))
print 'column variances:', np.round(X_train_standardized.var(axis=0))
column means: [-0.  0.  0.  0.]
column variances: [ 1.  1.  1.  1.]

Now that the scaler is fit on the training data, we can use it to transform new data:

In [37]:
X_new = np.array([25., 35., 9200., 90.])

scaler.transform(X_new)
Out[37]:
array([ 2.08004102,  4.66331797,  4.95043573,  2.18340669])

Dimensionality reduction

We could also reduce the dimensionality with PCA and whiten the data:

In [38]:
from sklearn.decomposition import PCA

# instantiate the PCA transformation object
pca = PCA(n_components=2, whiten=True)

# fit the PCA object on and transform the training data
X_train_pca = pca.fit_transform(X_train_standardized)

# create a figure and axes for the 2-d projection
fig = plt.figure()
ax = fig.add_subplot(111)

# scatterplot the PCA points
ax.scatter(*np.hsplit(X_train_pca, 2), c=y_train, s=40,
           cmap='coolwarm')

# annotate and show the figure
ax.set_xlabel('component 1')
ax.set_ylabel('component 2')

plt.show()

Challenge: all the different steps add up into a big jumbled mess

We have all seen the data science version of this:

Cluttered desk photo

But there's no reason for **data_aug_13_v2_scaled_pca_3_dims_WHITENED_plus_some_tweaks_final_V4_FINAL.csv**, right?

We shouldn't have to do that.

Common solution: putting all the steps together with a Pipeline

One big takeaway: it's all about documented, repeatable data-to-model pipelines.

At some point, especially for tasks we will be doing more than once, we might want to codify all of the exploratory work we've done. The standard way to express a beginning-to-end data transformation and model fitting workflow in sklearn is the Pipeline.

In [39]:
from sklearn.pipeline import Pipeline

# define a pipeline with some transforms and a simple classifier
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dim', PCA()),
    ('clf', LogisticRegression()),
])
# enumerate all of the different settings we wish to try out
parameters = {
    'reduce_dim__n_components': (1, 2, 3),
    'reduce_dim__whiten': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (1e-3, 1e-2, 1, 1e1, 1e2, 1e3),
}

# grid search the parameter space
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

print("Performing grid search...\n")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)
grid_search.fit(X, y)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.2s
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:    0.7s
[Parallel(n_jobs=1)]: Done 216 out of 216 | elapsed:    0.8s finished
Performing grid search...

('pipeline:', ['scale', 'reduce_dim', 'clf'])
parameters:
{'clf__penalty': ('l1', 'l2'), 'clf__C': (0.001, 0.01, 1, 10.0, 100.0, 1000.0), 'reduce_dim__whiten': (True, False), 'reduce_dim__n_components': (1, 2, 3)}
Fitting 3 folds for each of 72 candidates, totalling 216 fits
Best score: 0.777
Best parameters set:
	clf__C: 0.01
	clf__penalty: 'l2'
	reduce_dim__n_components: 2
	reduce_dim__whiten: True

Lego brick

Takeaways for using sklearn in the real world

  • API consistency is the "killer feature" of sklearn
  • Treat the original data as immutable; everything downstream is a view into or a transformation of that original, so every step stays explicit
  • To that end, building transparent and repeatable dataflows using Pipelines cuts down on black magic (see the sketch below)
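As one last illustration of that takeaway: once the grid search above has refit the winning pipeline on the full data set, the entire data-to-model workflow can be persisted and restored as a single object (a sketch; the file path is illustrative):

from sklearn.externals import joblib  # joblib ships inside sklearn in this era

# grid_search refit the best pipeline (scale -> PCA -> classifier) on all the data
best_pipeline = grid_search.best_estimator_

# persist every step of the workflow to disk, then load it back later
joblib.dump(best_pipeline, 'output/best_pipeline.pkl')
restored = joblib.load('output/best_pipeline.pkl')

# the restored pipeline applies scaling, PCA, and the classifier to raw feature rows
print restored.predict(X_test)[:10]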

Public service announcement: using other tools

Why choose? Use whichever one is best for the task at hand.

In [40]:
# load the %R cell magic extension
%load_ext rmagic

# send the dataframe over to the R instance
%Rpush df
In [41]:
%%R
library(ggplot2)
qplot(log(time), log(cc), data=df, color=donated)
In [42]:
%%R
blood.glm <- glm(donated ~ log(cc) + log(time), data=df, family="binomial")
print(summary(blood.glm))

par(mfrow=c(2, 2))
plot(blood.glm)
Call:
glm(formula = donated ~ log(cc) + log(time), family = "binomial", 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5214  -0.7665  -0.5352  -0.2782   2.5447  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -7.9523     0.8749  -9.089  < 2e-16 ***
log(cc)       1.4448     0.1693   8.535  < 2e-16 ***
log(time)    -1.0278     0.1454  -7.069 1.56e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 820.89  on 747  degrees of freedom
Residual deviance: 728.63  on 745  degrees of freedom
AIC: 734.63

Number of Fisher Scoring iterations: 5

... and we can pull data back from R to Python:

In [43]:
r_coeffs = %R coef(blood.glm)
r_coeffs
Out[43]:
array([-7.95228363,  1.44481591, -1.02784654])

Even better for those in the Julia community ... you can replicate this whole experience when working in an IJulia notebook:

IJulia logo

In fact, IPython notebooks are not just for Python anymore. The core functionality has been generalized as Project Jupyter, and language kernels are available or under active development for Julia, Haskell, and R.

Jupyter logo


Questions?

Revolution against petal sepal data meme