{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Handling Sparse Data with LogisticRegression\n", "\n", "This notebook shows how `LogisticRegression` handles sparse input data, such as text features or high-dimensional datasets.\n", "\n", "## Setup\n", "We generate a sparse synthetic dataset for testing." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-07-22T14:42:45.160492Z", "start_time": "2025-07-22T14:42:43.737676Z" } }, "source": [ "from sklearn.datasets import make_classification\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score\n", "from scipy.sparse import csr_matrix\n", "import numpy as np\n", "from glmpynet import LogisticRegression\n", "\n", "# Generate synthetic dataset and convert to sparse\n", "X, y = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)\n", "X_sparse = csr_matrix(X)\n", "X_train, X_test, y_train, y_test = train_test_split(X_sparse, y, test_size=0.2, random_state=42)\n", "\n", "# Fit and predict with sparse data\n", "model = LogisticRegression(penalty='l1')\n", "model.fit(X_train, y_train)\n", "y_pred = model.predict(X_test)\n", "accuracy = accuracy_score(y_test, y_pred)\n", "print(f\"Sparse Data Accuracy: {accuracy:.2f}\")\n", "\n", "# Check sparsity of coefficients\n", "print(f\"Number of non-zero coefficients: {np.sum(model.coef_ != 0)}\")" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sparse Data Accuracy: 0.85\n", "Number of non-zero coefficients: 100\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/rolf/anaconda3/envs/glmpynet/lib/python3.11/site-packages/sklearn/linear_model/_sag.py:348: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge\n", " warnings.warn(\n" ] } ], "execution_count": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "- The dataset is converted to a sparse CSR matrix to 
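simulate the storage format of high-dimensional data; `make_classification` produces dense values, so almost every entry is non-zero.\n", "- `LogisticRegression` with `penalty='l1'` is intended to drive some coefficients to exactly zero. In this run all 100 coefficients are non-zero and the solver raised a `ConvergenceWarning`, so the fit had not converged when `max_iter` was reached.\n", "- Test accuracy is 0.85. It should be close to the accuracy on dense input, since the CSR matrix holds the same values, but this notebook does not fit a dense baseline. Coefficient sparsity is the key quantity for the `glmnet` comparison.\n", "- With `glmnet`, the expectation is sparser coefficients and potentially better performance on sparse data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 2 }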