{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Handling Sparse Data with LogisticRegression\n",
    "\n",
    "This notebook shows how `LogisticRegression` handles sparse input data, such as text features or high-dimensional datasets.\n",
    "\n",
    "## Setup\n",
    "We generate a sparse synthetic dataset for testing."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-07-22T14:42:45.160492Z",
     "start_time": "2025-07-22T14:42:43.737676Z"
    }
   },
   "source": [
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "from scipy.sparse import csr_matrix\n",
    "import numpy as np\n",
    "from glmpynet import LogisticRegression\n",
    "\n",
    "# Generate synthetic dataset and convert to sparse\n",
    "X, y = make_classification(n_samples=200, n_features=100, n_classes=2, random_state=42)\n",
    "X_sparse = csr_matrix(X)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_sparse, y, test_size=0.2, random_state=42)\n",
    "\n",
    "# Fit and predict with sparse data\n",
    "model = LogisticRegression(penalty='l1')\n",
    "model.fit(X_train, y_train)\n",
    "y_pred = model.predict(X_test)\n",
    "accuracy = accuracy_score(y_test, y_pred)\n",
    "print(f\"Sparse Data Accuracy: {accuracy:.2f}\")\n",
    "\n",
    "# Check sparsity of coefficients\n",
    "print(f\"Number of non-zero coefficients: {np.sum(model.coef_ != 0)}\")"
   ],
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sparse Data Accuracy: 0.85\n",
      "Number of non-zero coefficients: 100\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/rolf/anaconda3/envs/glmpynet/lib/python3.11/site-packages/sklearn/linear_model/_sag.py:348: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "execution_count": 1
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explanation\n",
    "- The dataset is converted to a sparse CSR matrix to simulate high-dimensional data.\n",
    "- `LogisticRegression` with `penalty='l1'` promotes sparsity in coefficients.\n",
    "- Accuracy is similar to dense data, but coefficient sparsity is key for `glmnet` comparison.\n",
    "- With `glmnet`, expect enhanced sparsity and potentially better performance on sparse data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}