Using LogisticRegression in a Pipeline

This notebook shows how to use LogisticRegression as the final step of a scikit-learn Pipeline, preceded by a preprocessing step (here StandardScaler), for binary classification.

Setup

We use a synthetic dataset generated with a fixed random_state so that the results are reproducible.

[1]:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from glmpynet import LogisticRegression

# Generate synthetic dataset
X, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logistic_net', LogisticRegression())
])
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Pipeline Accuracy: {accuracy:.2f}")

Pipeline Accuracy: 0.88
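
To make explicit what the pipeline in the cell above does, the following sketch reproduces it by hand and checks that both routes give the same predictions. It is an illustration under one stated assumption: scikit-learn's own LogisticRegression is used as a stand-in so the snippet runs without glmpynet installed. Swapping the import back to glmpynet should work the same way, provided the estimator follows the scikit-learn fit/predict interface.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # stand-in (assumption), not glmpynet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data and split as in the cell above
X, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Manual route: fit the scaler on the training split only,
# then reuse the fitted scaler on the test split
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

# Pipeline route: the same two steps wrapped in one estimator
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logistic_net', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# Both routes should agree on every test sample
same_predictions = np.array_equal(manual_pred, pipeline_pred)
print(f"Same predictions: {same_predictions}")
```

The practical benefit of the pipeline route is that the scaler can never be fitted on test data by accident, because fit and predict always run the steps in order.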

Explanation

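The pipeline chains StandardScaler for feature scaling with LogisticRegression for classification. Calling fit learns the scaling statistics from the training data only, and predict applies that same scaling to new data before classifying it.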
The dataset is the same as in the basic example, ensuring consistency.
Accuracy is similar to the basic example but may improve slightly due to scaling.
With a glmnet-based solver, the pipeline integration should be comparable, and performance may be better on high-dimensional data.
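
The claim about scaling can be checked rather than assumed. The sketch below compares 5-fold cross-validated accuracy with and without the StandardScaler step. As an assumption for illustration, it uses scikit-learn's LogisticRegression as a stand-in for the glmpynet estimator, and max_iter=1000 is chosen here only to give the unscaled model room to converge. Features from make_classification are already on broadly similar scales, so any difference on this dataset is likely to be small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # stand-in (assumption), not glmpynet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in the Setup cell
X, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=42)

# Model without any preprocessing
unscaled = LogisticRegression(max_iter=1000)

# Same model behind a scaling step; the scaler is refitted inside each fold
scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('logistic_net', LogisticRegression(max_iter=1000)),
])

unscaled_scores = cross_val_score(unscaled, X, y, cv=5, scoring='accuracy')
scaled_scores = cross_val_score(scaled, X, y, cv=5, scoring='accuracy')

print(f"Unscaled mean accuracy: {unscaled_scores.mean():.2f}")
print(f"Scaled mean accuracy:   {scaled_scores.mean():.2f}")
```

Passing the whole pipeline to cross_val_score, rather than scaling X once up front, keeps each fold's held-out data out of the scaling statistics.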