Using LogisticRegression in a Pipeline

This notebook shows how to use glmpynet's LogisticRegression as the final step of a scikit-learn Pipeline, after a preprocessing step such as StandardScaler, for binary classification.

Setup

We use a synthetic dataset generated with a fixed random_state so that the results are reproducible.

[1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from glmpynet import LogisticRegression

# Generate synthetic dataset
X, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logistic_net', LogisticRegression())
])
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Pipeline Accuracy: {accuracy:.2f}")
Pipeline Accuracy: 0.88
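
To make the pipeline's behaviour concrete, the sketch below fits the same two steps by hand and checks that the predictions agree with the pipeline's. It assumes scikit-learn's own sklearn.linear_model.LogisticRegression as a stand-in for glmpynet.LogisticRegression, so that it runs without glmpynet installed; the names scaler, clf, and manual_pred are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # assumption: stand-in for glmpynet.LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data and split as in the cell above
X, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline version: fit() scales the training data, then fits the classifier
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_net", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Manual version: the scaler learns its statistics from the training data only,
# and the same fitted scaler is reused to transform the test data
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

same_predictions = np.array_equal(pipeline.predict(X_test), manual_pred)
print(f"Pipeline matches manual steps: {same_predictions}")

# Fitted steps remain accessible by the names given in the Pipeline
print(pipeline.named_steps["scaler"].mean_.shape)
```

Keeping both steps in one Pipeline object means the scaler is never fitted on test data by accident, which is the main practical reason to prefer it over the manual version.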

Explanation

  • The pipeline combines StandardScaler for feature scaling and LogisticRegression for classification.

  • The dataset is the same as in the basic example, ensuring consistency.

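  • Accuracy is similar to the basic example. Scaling does not guarantee a higher score, but it puts all features on a common scale, which matters for regularized models because the penalty is applied to every coefficient alike.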

  • With glmnet, expect the same Pipeline integration, with potentially better performance on high-dimensional data.
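
The effect of scaling noted above can be checked rather than assumed. The following is a minimal sketch that compares cross-validated accuracy with and without StandardScaler. As an assumption, it uses scikit-learn's sklearn.linear_model.LogisticRegression as a stand-in for glmpynet.LogisticRegression; the 5-fold setting, max_iter=1000, and the variable names are illustrative choices, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # assumption: stand-in for glmpynet.LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in the example above
X, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=42)

# Scaled variant: the scaler is refitted inside each training fold,
# so no statistics from the held-out fold leak into preprocessing
scaled_model = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_net", LogisticRegression(max_iter=1000)),
])

# Unscaled baseline: the classifier sees the raw features
unscaled_model = LogisticRegression(max_iter=1000)

scaled_scores = cross_val_score(scaled_model, X, y, cv=5, scoring="accuracy")
unscaled_scores = cross_val_score(unscaled_model, X, y, cv=5, scoring="accuracy")

print(f"With scaling:    {scaled_scores.mean():.3f} +/- {scaled_scores.std():.3f}")
print(f"Without scaling: {unscaled_scores.mean():.3f} +/- {unscaled_scores.std():.3f}")
```

The size of the gap depends on the data. Features produced by make_classification are already on roughly comparable scales, so the difference here may be small; datasets whose features span very different ranges are where scaling is more likely to matter.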