# Modeling with scikit-learn: common workflows for reproducible results¶

## Overview¶

• Describe an example problem
• Build a couple basic models
• Common sklearn workflows:
• Evaluate performance
• Tools to improve the model (data transformations, parameter search, etc)
• Solidifying the experiments into a repeatable data pipeline
• Using outside tools

## Motivation¶

### The way ahead for R?¶

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. [...] There are many different modeling functions in R. Some have different syntax for model training and/or prediction. The package started off as a way to provide a uniform interface the functions themselves, as well as a way to standardize common tasks (such parameter tuning and variable importance).

### scikit-learn's approach¶

Consistent API with a cohesive "grammar" of modeling actions:

Nouns

• Model classes
• Metrics classes
• Preprocessing classes
• Data (as numpy arrays)
• ...

Verbs

• fit
• transform
• fit_transform
• score
• predict
• ...

## Example problem description:¶

We'll be using data on blood donations from the UC Irvine Machine Learning Repository. Given some data on some blood donors' previous donations, we'll be trying to predict a binary outcome of whether the donor gave blood in a certain time period.

Here's the official description:

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes their blood transfusion service bus to one university in Hsin-Chu City to gather blood donated about every three months. To build a FRMTC model, we selected 748 donors at random from the donor database. These 748 donor data, each one included R (Recency - months since last donation), F (Frequency - total number of donation), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stand for donating blood; 0 stands for not donating blood).

### Get the data¶

From beginning to end, we'll treat the raw data as immutable and base all of our analyses on views and transformations of this "ground truth" data set.

In [1]:
!wget -nc --directory-prefix data \
https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data

File ‘data/transfusion.data’ already there; not retrieving.


In [2]:
!head data/transfusion.data

Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"
2 ,50,12500,98 ,1
0 ,13,3250,28 ,1
1 ,16,4000,35 ,1
2 ,20,5000,45 ,1
1 ,24,6000,77 ,0
4 ,4,1000,4 ,0
2 ,7,1750,14 ,1
1 ,12,3000,35 ,0
2 ,9,2250,22 ,1


### Import the data as a pandas DataFrame¶

In [3]:
import numpy as np
import pandas as pd


Out[3]:
Recency (months) Frequency (times) Monetary (c.c. blood) Time (months) whether he/she donated blood in March 2007
0 2 50 12500 98 1
1 0 13 3250 28 1
2 1 16 4000 35 1
3 2 20 5000 45 1
4 1 24 6000 77 0
In [4]:
df.shape

Out[4]:
(748, 5)
In [5]:
df.dtypes

Out[5]:
Recency (months)                              int64
Frequency (times)                             int64
Monetary (c.c. blood)                         int64
Time (months)                                 int64
whether he/she donated blood in March 2007    int64
dtype: object
In [6]:
# save the current names in case we need them later
original_column_names = df.columns

# make the names less ugly
names = ['recency', 'frequency', 'cc', 'time', 'donated']
df.columns = names


Out[6]:
recency frequency cc time donated
0 2 50 12500 98 1
1 0 13 3250 28 1
2 1 16 4000 35 1
3 2 20 5000 45 1
4 1 24 6000 77 0

### Some exploratory visualization¶

In [7]:
# import our graphics tools
%matplotlib inline
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns  # nice defaults for matplotlib styles
set2 = sns.color_palette('Set2')

# add on some settings from 'Bayesian Methods for Hackers'
plt.style.use('bmh')

# set larger default fonts for presentation-friendliness
mpl.rc('figure', figsize=(10, 8))
mpl.rc('axes', labelsize=16, titlesize=20)

In [8]:
from pandas.tools.plotting import scatter_matrix

axeslist = scatter_matrix(df, alpha=0.8, figsize=(10, 10))
for ax in axeslist.flatten():
ax.grid(False)

In [9]:
import numpy as np

# create a figure with 4 subplots
fig, axs = plt.subplots(nrows=2, ncols=2)

feature_column_names = df.columns[:-1]
label_column_name = df.columns[-1]

for i, col in enumerate(feature_column_names):

# get the current subplot to work on
ax = axs.ravel()[i]

# create some random y jitter to add
jitter = np.random.uniform(low=-0.05, high=0.05, size=len(df))

# plot the data
ax.scatter(x=df[col], y=df[label_column_name] + jitter,
c=df.donated, cmap='coolwarm', alpha=0.5)

# label the axes
ax.set_xlabel(col)
ax.set_ylabel(label_column_name)

plt.tight_layout()
plt.show()


## Common tasks with scikit-learn:¶

### Split the data into train and test¶

In [10]:
from sklearn.cross_validation import train_test_split

# using conventional sklearn variable names
X = df[feature_column_names].astype(float)
y = df.donated.ravel()

# break up the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [11]:
X_train.shape

Out[11]:
(374, 4)
In [12]:
y_train.shape

Out[12]:
(374,)
In [13]:
X_test.shape

Out[13]:
(374, 4)
In [14]:
y_test.shape

Out[14]:
(374,)

### Fit a simple model using the training data¶

The basic modeling workflow in sklearn involves instantiating a model object with the desired settings, and then fitting it to given data.

#### Decision tree¶

In [15]:
from sklearn.tree import DecisionTreeClassifier

clf_tree = DecisionTreeClassifier(max_depth=3)
clf_tree.fit(X_train, y_train)

Out[15]:
DecisionTreeClassifier(compute_importances=None, criterion='gini',
max_depth=3, max_features=None, max_leaf_nodes=None,
min_density=None, min_samples_leaf=1, min_samples_split=2,
random_state=None, splitter='best')
In [16]:
print 'Score:', clf_tree.score(X_test, y_test)

Score: 0.794117647059

In [17]:
# viz adapted from http://scikit-learn.org/stable/modules/tree.html
import pydot
from sklearn.externals.six import StringIO
from sklearn.tree import export_graphviz

dot_data = StringIO()
export_graphviz(clf_tree, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png('output/decision_tree.png')

Out[17]:
True

#### Logistic regression¶

In [18]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l2', fit_intercept=True)
clf.fit(X_train, y_train)

Out[18]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
In [19]:
print 'Score:', clf.score(X_test, y_test)

Score: 0.772727272727


### Trying a bunch of different models¶

sklearn tries to use the same syntax for most modeling tasks, so it's pretty easy to plug and play.

Note: in this example we're not putting much thought into which model to use and we're sticking with the default model parameters. The point is not that we would *want* to do this. The point is that all of these very different models have such similar structure and conventions that we can iterate through a bunch of them, fit, and evaluate them while changing none of the syntax.

In [20]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

other_clfs = {
'Random forest': RandomForestClassifier(),
'Naive Bayes': MultinomialNB(),
'Linear SVC': LinearSVC(),
'KNN': KNeighborsClassifier(5),
}

# iterating through all of these models we want to fit ...
for name, other_clf in other_clfs.iteritems():

# fit the model with the training data
print other_clf.fit(X_train, y_train)

# cross validation score
print '---\nScore:', other_clf.score(X_test, y_test)
print '\n'

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=5, p=2, weights='uniform')
---
Score: 0.756684491979

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
random_state=None, tol=0.0001, verbose=0)
---
Score: 0.299465240642

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
---
Score: 0.729946524064

RandomForestClassifier(bootstrap=True, compute_importances=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
min_samples_split=2, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0)
---
Score: 0.756684491979

learning_rate=1.0, n_estimators=50, random_state=None)
---
Score: 0.775401069519



### Anatomy of a scikit-learn model¶

The object itself:

In [21]:
clf

Out[21]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Coefficients (if fit):

In [22]:
clf.coef_

Out[22]:
array([[ -9.74764830e-02,   1.49194232e-06,   3.72985587e-04,
-1.17108498e-02]])

Intercept (if fit):

In [23]:
clf.intercept_

Out[23]:
array([-0.53074433])

Model parameters:

In [24]:
clf.get_params()

Out[24]:
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'penalty': 'l2',
'random_state': None,
'tol': 0.0001}

Making predictions for $\hat y$:

In [25]:
clf.predict(X_test)

Out[25]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0])

Predicting probabilities, $p(y_i = 1)$

In [26]:
pd.DataFrame(clf.predict_proba(X_test))\

Out[26]:
0 1
0 0.691294 0.308706
1 0.952584 0.047416
2 0.726055 0.273945
3 0.770998 0.229002
4 0.680134 0.319866
5 0.705628 0.294372
6 0.637364 0.362636
7 0.888941 0.111059
8 0.559386 0.440614
9 0.936608 0.063392

### Challenge: evaluating model performance¶

The default metric for logistic regression is mean accuracy:

$$\mathrm{accuracy}(y, \hat y) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(\hat y_i = y_i)$$
In [27]:
clf.score(X_test, y_test)

Out[27]:
0.77272727272727271

But what if we want a more holistic view of how well this model works?

### Common solution: use cross validation tools and other metrics¶

#### Cross validation¶

Our data set here is fairly small $(N=748)$, so what if we happened to randomly pick a non-representative chunk for testing?

Instead of summarizing the performance just on the test data we want to use $k$-fold cross-validation, splitting the data randomly into chunks — "folds" — and evaluating on each to get an idea of the variation in performance.

In [28]:
from sklearn import cross_validation

# come up with random folds of the data
kf = cross_validation.KFold(len(X), n_folds=5, shuffle=True)

def plot_scores(scores):
N = len(scores)
plt.bar(np.arange(1, N + 1) - 0.4, scores, color=set2[2])
plt.title('{}-fold cross-validation scores'.format(N), fontsize=18)
plt.xlabel('fold', fontsize=14)
plt.ylabel('score', fontsize=14)
plt.xlim(0.5, N + 0.5)
plt.ylim(0, 1)
plt.show()

# evaluate the fitted model on each fold in turn, returns a score for each fold
scores = cross_validation.cross_val_score(clf, X, y, cv=kf, n_jobs=1)

print 'scores:', scores
print 'average score:', np.mean(scores)
plot_scores(scores)

scores: [ 0.78        0.76666667  0.81333333  0.75167785  0.73154362]
average score: 0.768644295302


#### Other metrics¶

We can try all sorts of other metrics, though, such as log loss:

$$\textrm{LogLoss}(y, \hat y) = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]$$
In [29]:
from sklearn.metrics import log_loss

log_loss(y_test, clf.predict_proba(X_test))

Out[29]:
0.45710527540996404

Or the F1 score:

$$\textrm{F1} = 2 \frac{\textrm{precision} * \textrm{recall}}{\textrm{precision} + \textrm{recall}}$$
In [30]:
from sklearn.metrics import f1_score

f1_score(y_test, clf.predict(X_test))

Out[30]:
0.12371134020618556

And if we want more than a single quantity summarizing score, sklearn comes with all sorts of helpful tools like confusion_matrix:

In [31]:
from itertools import permutations
from sklearn.metrics import confusion_matrix

# get the raw confusion matrix
cm = confusion_matrix(y_test, clf.predict(X_test))

# create a dataframe
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'pred {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)
cmdf

Out[31]:
pred 0 pred 1
actual 0 283 5
actual 1 80 6

there are many available ...

metrics.accuracy_score, metrics.auc, metrics.average_precision_score, metrics.classification_report, metrics.confusion_matrix, metrics.f1_score, metrics.fbeta_score, metrics.hamming_loss, metrics.hinge_loss, metrics.jaccard_similarity_score, metrics.log_loss, metrics.matthews_corrcoef, metrics.precision_recall_curve, metrics.precision_recall_fscore_support, metrics.precision_score, metrics.recall_score, metrics.roc_auc_score, metrics.roc_curve, metrics.zero_one_loss, metrics.explained_variance_score, metrics.mean_absolute_error, metrics.mean_squared_error, metrics.r2_score, metrics.adjusted_mutual_info_score, metrics.adjusted_rand_score, metrics.completeness_score, metrics.homogeneity_completeness_v_measure, metrics.homogeneity_score, metrics.mutual_info_score, metrics.normalized_mutual_info_score, metrics.silhouette_score, metrics.silhouette_samples, metrics.v_measure_score, metrics.consensus_score, metrics.pairwise.additive_chi2_kernel, metrics.pairwise.chi2_kernel, metrics.pairwise.distance_metrics, metrics.pairwise.euclidean_distances, metrics.pairwise.kernel_metrics, metrics.pairwise.linear_kernel, metrics.pairwise.manhattan_distances, metrics.pairwise.pairwise_distances, metrics.pairwise.pairwise_kernels, metrics.pairwise.polynomial_kernel, metrics.pairwise.rbf_kernel, metrics.pairwise_distances, metrics.pairwise_distances_argmin, metrics.pairwise_distances_argmin_min

... and they all use the same basic syntax.

## Refining the model and preparing the data¶

### Challenge: finding reasonable model parameters¶

The regularization constant, $C$, is unknown. Let's say we want to try some different orders of magnitude as well experimenting with L1 regularization (default for sklearn logistic regression is L2):

$$C \in \{0.01, 0.1, 1, 10, 100\}$$$$\textrm{penalty} \in \{\textrm{L1}, \textrm{L2}\}$$

We know that the outcome can be different depending on what settings we choose.

From sklearn, we get built in grid search for exploring the space. Using grid search cross validation, we can try all combinations, and automatically keep the most performant.

In [33]:
from sklearn.grid_search import GridSearchCV

params_to_try = {
'C': [0.01, 0.1, 1, 10, 100, 100],
'penalty': ['l1', 'l2']
}

gs = GridSearchCV(clf, param_grid=params_to_try, cv=5)
gs.fit(X, y)

Out[33]:
GridSearchCV(cv=5,
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001),
fit_params={}, iid=True, loss_func=None, n_jobs=1,
param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100, 100]},
pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
verbose=0)
In [34]:
print "Best parameters:", gs.best_params_
print "Best score:", gs.best_score_

Best parameters: {'penalty': 'l1', 'C': 0.01}
Best score: 0.791443850267


### Challenge: data transformations often needed¶

One thing we might want to do is scale all of the data before fitting any models. As the sklearn docs point out:

Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit: they might behave badly if the individual feature do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

### Common solution: use plug-and-play data prep tools¶

It's painful and error prone to wing it every time we want to do common transformations. The sklearn toolbox comes with many useful tools for doing this in a consistent way.

#### Scaling¶

In [35]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_standardized = scaler.fit_transform(X_train.astype(np.float))
X_train_standardized

Out[35]:
array([[ 1.15376999,  0.35655147,  0.35655147,  2.38280942],
[-0.56644764,  6.41792654,  6.41792654,  2.50245107],
[ 0.22749896, -0.60050775, -0.60050775, -0.9671566 ],
...,
[-0.03714991, -0.12197814, -0.12197814, -0.64811222],
[ 0.62447226, -0.44099788, -0.44099788, -0.01002345],
[-0.83109651,  0.03753173,  0.03753173, -0.56835112]])
In [36]:
print 'column means:', np.round(X_train_standardized.mean(axis=0))
print 'column variances:', np.round(X_train_standardized.var(axis=0))

column means: [-0.  0.  0.  0.]
column variances: [ 1.  1.  1.  1.]


Now that the scaler is fit on the training data, we can use it to transform new data:

In [37]:
X_new = np.array([25., 35., 9200., 90.])

scaler.transform(X_new)

Out[37]:
array([ 2.08004102,  4.66331797,  4.95043573,  2.18340669])

#### Dimensionality reduction¶

We could also reduce the dimensionality with PCA and whiten the data:

In [38]:
from sklearn.decomposition import PCA

# instantiate the PCA transformation object
pca = PCA(n_components=2, whiten=True)

# fit the PCA object on and transform the training data
X_train_pca = pca.fit_transform(X_train_standardized)

# create a 3d figure
fig = plt.figure()

# scatterplot the PCA points
ax.scatter(*np.hsplit(X_train_pca, 2), c=y_train, s=40,
cmap='coolwarm')

# annotate and show the figure
ax.set_xlabel('component 1')
ax.set_ylabel('component 2')

plt.show()


### Challenge: all the different steps add up into a big jumbled mess¶

We have all seen the data science version of this:

But there's no reason for **data_aug_13_v2_scaled_pca_3_dims_WHITENED_plus_some_tweaks_final_V4_FINAL.csv**, right?

We shouldn't have to do that.

### Common solution: putting all the steps together with a Pipeline¶

One big takeaway: it's all about documented, repeatable data-to-model pipelines.

At some point, especially for tasks we will be doing more than once, we might want to codify all of the exploratory work we've done. The standard way to express a beginning-to-end data transformation and model fitting workflow in sklearn is the Pipeline.

In [39]:
from sklearn.pipeline import Pipeline

# define a pipeline with some transforms and a simple classifier
pipeline = Pipeline([
('scale', StandardScaler()),
('reduce_dim', PCA()),
('clf', LogisticRegression()),
])
m
# enumerate all of the different settings we wish to try out
parameters = {
'reduce_dim__n_components': (1, 2, 3),
'reduce_dim__whiten': (True, False),
'clf__penalty': ('l1', 'l2'),
'clf__C': (1e-3, 1e-2, 1, 1e1, 1e2, 1e3),
}

# grid search the parameter space
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

print("Performing grid search...\n")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)
grid_search.fit(X, y)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print("\t%s: %r" % (param_name, best_parameters[param_name]))

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.2s
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:    0.7s
[Parallel(n_jobs=1)]: Done 216 out of 216 | elapsed:    0.8s finished

Performing grid search...

('pipeline:', ['scale', 'reduce_dim', 'clf'])
parameters:
{'clf__penalty': ('l1', 'l2'), 'clf__C': (0.001, 0.01, 1, 10.0, 100.0, 1000.0), 'reduce_dim__whiten': (True, False), 'reduce_dim__n_components': (1, 2, 3)}
Fitting 3 folds for each of 72 candidates, totalling 216 fits
Best score: 0.777
Best parameters set:
clf__C: 0.01
clf__penalty: 'l2'
reduce_dim__n_components: 2
reduce_dim__whiten: True


## Takeaways for using sklearn in the real world¶

• API consistency is the "killer feature" of sklearn
• Treat the original data as immutable, everything later is a view into or a transformation of the original and all the steps are obvious
• To that end, building transparent and repeatable dataflows using Pipelines cuts down on black magic

## Public service announcement: using other tools¶

Why choose? Use whichever one is best for the task at hand.

In [40]:
# load the %R cell magic extension

# send the dataframe over to the R instance
%Rpush df

In [41]:
%%R
library(ggplot2)
qplot(log(time), log(cc), data=df, color=donated)

In [42]:
%%R
blood.glm <- glm(donated ~ log(cc) + log(time), data=df, family="binomial")
print(summary(blood.glm))

par(mfrow=c(2, 2))
plot(blood.glm)

Call:
glm(formula = donated ~ log(cc) + log(time), family = "binomial",
data = df)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.5214  -0.7665  -0.5352  -0.2782   2.5447

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -7.9523     0.8749  -9.089  < 2e-16 ***
log(cc)       1.4448     0.1693   8.535  < 2e-16 ***
log(time)    -1.0278     0.1454  -7.069 1.56e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 820.89  on 747  degrees of freedom
Residual deviance: 728.63  on 745  degrees of freedom
AIC: 734.63

Number of Fisher Scoring iterations: 5



... and we can pull data back from R to Python:

In [43]:
r_coeffs = %R coef(blood.glm)
r_coeffs

Out[43]:
array([-7.95228363,  1.44481591, -1.02784654])

Even better for those in the Julia community ... you can replicate this whole experience when working in an IJulia notebook:

In fact - IPython notebooks are not just for Python anymore. The core functionality has been generalized as Project Jupyter, and language kernels are available or under active development for Julia, Haskell, and R.

# Questions?¶

Any comments or suggestions? Let me know.