Handling Sparse Data with LogisticRegression
This notebook shows how LogisticRegression handles sparse input data, such as text features or high-dimensional datasets.
Setup
We generate a synthetic dataset with make_classification and store it in SciPy's sparse CSR format for testing.
[1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from scipy.sparse import csr_matrix
import numpy as np
from glmpynet import LogisticRegression
# Generate synthetic dataset and convert to sparse
X, y = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)
X_sparse = csr_matrix(X)
X_train, X_test, y_train, y_test = train_test_split(X_sparse, y, test_size=0.2, random_state=42)
# Fit and predict with sparse data
model = LogisticRegression(penalty='l1')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Sparse Data Accuracy: {accuracy:.2f}")
# Check sparsity of coefficients
print(f"Number of non-zero coefficients: {np.sum(model.coef_ != 0)}")
Sparse Data Accuracy: 0.85
Number of non-zero coefficients: 100
/home/rolf/anaconda3/envs/glmpynet/lib/python3.11/site-packages/sklearn/linear_model/_sag.py:348: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
warnings.warn(
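The ConvergenceWarning above is raised from scikit-learn's _sag.py, which suggests the fit used a SAG-family solver that stopped at its iteration cap. The sketch below uses scikit-learn's own LogisticRegression as a stand-in to show how the iteration count and the warning can be inspected for two values of max_iter. Whether glmpynet's LogisticRegression accepts the same solver and max_iter arguments is an assumption, and the value 5000 is only illustrative.

```python
import warnings

from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression  # scikit-learn stand-in, not glmpynet
from sklearn.model_selection import train_test_split

# Same data recipe as the cell above
X, y = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    csr_matrix(X), y, test_size=0.2, random_state=42
)

results = {}
for max_iter in (100, 5000):  # 5000 is an illustrative choice
    # Record warnings instead of printing them so they can be counted
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = LogisticRegression(
            penalty="l1", solver="saga", max_iter=max_iter, random_state=0
        )
        clf.fit(X_train, y_train)
    n_conv = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
    n_iter = int(clf.n_iter_.max())
    results[max_iter] = (n_iter, n_conv)
    print(f"max_iter={max_iter}: iterations used={n_iter}, convergence warnings={n_conv}")
```

If the larger cap still triggers the warning, the coefficient count printed afterwards should be read with the same caution as the one above.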
Explanation
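The dataset is converted to a CSR matrix to exercise the sparse-input code path. Because make_classification produces dense floating-point values, the conversion changes the storage format rather than the number of stored entries.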
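One way to check how sparse a matrix actually is: compare its number of stored entries (nnz) to its total size. The sketch below does this for the CSR conversion used in this notebook and, for contrast, for a copy in which most entries are zeroed by a random mask. The masking step and its 5% keep rate are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification

X, _ = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)
X_csr = sparse.csr_matrix(X)


def density(m):
    """Fraction of matrix entries that are explicitly stored (non-zero)."""
    return m.nnz / (m.shape[0] * m.shape[1])


print(f"CSR of make_classification output: density = {density(X_csr):.3f}")

# Illustrative assumption: keep roughly 5% of the entries and zero the rest
rng = np.random.default_rng(42)
mask = rng.random(X.shape) < 0.05
X_masked = sparse.csr_matrix(X * mask)
print(f"After random masking:              density = {density(X_masked):.3f}")
```

Only the masked matrix benefits from sparse storage; the unmasked CSR matrix stores every entry plus the index arrays.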
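LogisticRegression with penalty='l1' promotes sparsity in the coefficients. In the run above, however, all 100 coefficients are non-zero and the solver stopped at max_iter without converging, so the reported count does not yet show that effect.
Accuracy is expected to be similar to that on dense data (no dense baseline is measured here), but coefficient sparsity is the key quantity for the glmnet comparison.
With glmnet, expect enhanced sparsity and potentially better performance on sparse data.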
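To make the link between L1 regularization strength and coefficient sparsity concrete, the sketch below sweeps the inverse regularization strength C and counts the non-zero coefficients at each value. It uses scikit-learn's LogisticRegression with the liblinear solver as a stand-in; it is not the glmpynet or glmnet implementation, and the grid of C values is an illustrative assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # scikit-learn stand-in, not glmpynet
from sklearn.model_selection import train_test_split

# Same data recipe as the cell above
X, y = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    csr_matrix(X), y, test_size=0.2, random_state=42
)

nonzero_counts = {}
for C in (0.001, 0.01, 0.1, 1.0):  # smaller C means a stronger L1 penalty
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=0)
    clf.fit(X_train, y_train)
    nonzero_counts[C] = int(np.count_nonzero(clf.coef_))
    acc = clf.score(X_test, y_test)
    print(f"C={C:<6} non-zero coefficients: {nonzero_counts[C]:3d}  test accuracy: {acc:.2f}")
```

A sweep like this gives a sparsity-versus-accuracy curve that a glmnet regularization path could later be compared against.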