Handling Sparse Data with LogisticRegression

This notebook shows how LogisticRegression handles sparse input data, such as text features or high-dimensional datasets.

Setup

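We generate a synthetic dataset with make_classification and convert it to a SciPy CSR matrix. The values themselves are dense; only the storage format is sparse.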

[1]:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from scipy.sparse import csr_matrix
import numpy as np
from glmpynet import LogisticRegression

# Generate a synthetic dataset and convert it to CSR (sparse storage) format
X, y = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)
X_sparse = csr_matrix(X)
X_train, X_test, y_train, y_test = train_test_split(X_sparse, y, test_size=0.2, random_state=42)

# Fit and predict with sparse data
model = LogisticRegression(penalty='l1')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Sparse Data Accuracy: {accuracy:.2f}")

# Check sparsity of coefficients
print(f"Number of non-zero coefficients: {np.sum(model.coef_ != 0)}")

Sparse Data Accuracy: 0.85
Number of non-zero coefficients: 100

/home/rolf/anaconda3/envs/glmpynet/lib/python3.11/site-packages/sklearn/linear_model/_sag.py:348: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(
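
The ConvergenceWarning above means the solver hit max_iter, so the reported coefficient count may come from an unconverged fit. The following is a minimal sketch of how one might check this. It assumes scikit-learn's own LogisticRegression as a stand-in, since glmpynet's constructor arguments are not shown in this notebook. The solver="saga" setting and the max_iter values are illustrative choices, and fit_l1 is a hypothetical helper.

```python
import warnings

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression  # stand-in, NOT glmpynet's class
from sklearn.model_selection import train_test_split

# Same data and split as the cell above.
X, y = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    csr_matrix(X), y, test_size=0.2, random_state=42
)


def fit_l1(max_iter):
    """Fit an L1-penalized model and report whether the solver converged."""
    model = LogisticRegression(
        penalty="l1", solver="saga", max_iter=max_iter, random_state=0
    )
    # Record warnings so that convergence can be checked programmatically.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_train, y_train)
    converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, converged


# Compare a small iteration budget with a much larger one (assumed values).
for max_iter in (100, 10000):
    model, converged = fit_l1(max_iter)
    nonzero = int(np.count_nonzero(model.coef_))
    print(f"max_iter={max_iter}: converged={converged}, non-zero coefficients={nonzero}")
```

If the larger budget converges and reports fewer non-zero coefficients, the count of 100 above likely reflects early stopping rather than the effect of the L1 penalty.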

Explanation

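The dataset is converted to a sparse CSR matrix so that the estimator is exercised on sparse input. Because make_classification produces dense values, the CSR matrix still stores nearly every entry, so this tests the sparse code path rather than genuinely sparse, high-dimensional data.
LogisticRegression with penalty='l1' is intended to promote sparsity in the coefficients. In this run, however, all 100 coefficients are non-zero and the solver raised a ConvergenceWarning, so the count may reflect an unconverged fit rather than the behavior of the penalty.
The accuracy of 0.85 has no dense baseline in this notebook, so its similarity to dense input remains to be verified. Coefficient sparsity is the key quantity for the glmnet comparison.
With glmnet, the expectation is sparser coefficients and potentially better performance on sparse data; neither has been measured here.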
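
Two points from the explanation can be checked directly: how sparse the converted matrix really is, and whether dense and CSR inputs lead to the same model. The sketch below again assumes scikit-learn's LogisticRegression as a stand-in for the glmpynet estimator. The density helper and the scipy.sparse.random matrix are illustrative additions, not part of the original notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # stand-in, NOT glmpynet's class
from sklearn.model_selection import train_test_split


def density(matrix):
    """Return the fraction of entries explicitly stored in a sparse matrix."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


# Converting dense data to CSR changes the storage format, not the content.
X, y = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)
X_csr = csr_matrix(X)
print(f"Density after csr_matrix(X): {density(X_csr):.2f}")

# For contrast, a matrix in which only about 5% of entries are stored.
X_true_sparse = sparse_random(200, 100, density=0.05, format="csr", random_state=42)
print(f"Density of sparse_random matrix: {density(X_true_sparse):.2f}")

# Fit the same model on dense and CSR views of identical data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dense_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
sparse_model = LogisticRegression(max_iter=1000).fit(csr_matrix(X_train), y_train)

dense_pred = dense_model.predict(X_test)
sparse_pred = sparse_model.predict(csr_matrix(X_test))
agreement = float(np.mean(dense_pred == sparse_pred))
max_coef_gap = float(np.max(np.abs(dense_model.coef_ - sparse_model.coef_)))
print(f"Prediction agreement: {agreement:.2f}")
print(f"Largest coefficient difference: {max_coef_gap:.2e}")
```

A density close to 1.0 for the converted matrix would confirm that the notebook exercises the sparse input path without any real sparsity in the data. To study behavior on genuinely sparse features, a generator such as scipy.sparse.random or a text vectorizer would be a more representative source.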