Phase 3: Scikit-learn Compatible API

Objective: To create a high-level, user-friendly Python class that matches the Scikit-learn API and uses the C++ binding for its computations.

Model Name: LogisticRegression

The primary user-facing class will be named LogisticRegression. The name is a deliberate choice: it signals to users that the class is intended as a direct, high-performance, drop-in replacement for sklearn.linear_model.LogisticRegression.

While this creates the potential for a name conflict if both are imported into the same namespace, this is a standard and well-understood aspect of the Python import system. Users who need to compare both can use a standard aliasing convention:

    from glmpynet import LogisticRegression as GlmnetLogisticRegression
    from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression

The benefit of immediate user familiarity and seamless integration into existing Scikit-learn workflows far outweighs this manageable risk.

API Design Philosophy: A Hybrid Approach

A key design challenge for this project is balancing the user familiarity of the Scikit-learn API with the unique, high-performance capabilities of the underlying glmnet engine. This project will adopt a hybrid API to provide the best of both worlds: simplicity by default, with power on demand.

Rationale:

This approach is superior because it serves two distinct user groups without compromising the experience for either:

  1. The Scikit-learn User: For the majority of users, the goal is seamless integration. They want to use our LogisticRegression class in their existing Pipeline and GridSearchCV workflows. By accepting standard parameters like C and penalty, we provide a frictionless, “drop-in” experience.

  2. The `glmnet` Power User: A user familiar with the R glmnet package knows that its real power lies in computing the entire regularization path efficiently. Our API provides an “escape hatch” for these users, allowing them to bypass the Scikit-learn conventions and pass glmnet-native parameters directly to the C++ engine for maximum performance and control.

Implementation:

The fit method will translate the user-provided parameters into the format required by the C++ binding. When glmnet-native parameters (alpha, lambda_path) are provided, they take priority over the Scikit-learn-style parameters (penalty, C).
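The translation step might look like the sketch below. The helper name resolve_glmnet_params and its return format are hypothetical, and the precedence rules for partially specified glmnet parameters are an assumption, not a settled design. The mapping lambda = 1 / (n_samples * C) follows from comparing the Scikit-learn objective (penalty plus C times the summed log-loss) with the glmnet objective (mean log-loss plus lambda times the penalty); it only holds when both sides penalize coefficients on the same feature scale, so standardize=True would need separate handling.

```python
def resolve_glmnet_params(n_samples, penalty="l2", C=1.0,
                          alpha=None, lambda_path=None, nlambda=100):
    """Hypothetical helper: map constructor arguments to glmnet-style settings."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")

    # Mixing weight: an explicit glmnet alpha takes priority over `penalty`.
    if alpha is not None:
        resolved_alpha = float(alpha)
    elif penalty == "l2":
        resolved_alpha = 0.0  # pure ridge
    elif penalty == "l1":
        resolved_alpha = 1.0  # pure lasso
    else:
        raise ValueError(f"Unsupported penalty: {penalty!r}")
    if not 0.0 <= resolved_alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Regularization strength(s).
    if lambda_path is not None:
        # Explicit path supplied by the user, sorted largest first.
        lambdas = sorted((float(v) for v in lambda_path), reverse=True)
        n_lambdas = len(lambdas)
    elif alpha is not None:
        # glmnet mode without an explicit path: let the engine build one
        # (assumed behaviour of the binding).
        lambdas = None
        n_lambdas = int(nlambda)
    else:
        # Scikit-learn mode: a single lambda derived from C.
        if C <= 0:
            raise ValueError("C must be positive")
        lambdas = [1.0 / (n_samples * C)]
        n_lambdas = 1

    return {"alpha": resolved_alpha, "lambda_path": lambdas, "nlambda": n_lambdas}
```

With this rule, a caller who sets only penalty and C gets a single-lambda fit, while setting alpha or lambda_path switches to path-style fitting without touching the Scikit-learn-style arguments.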

API Contract

The class will implement the standard Scikit-learn estimator interface.

  • `__init__(self, …)`: The constructor will accept both Scikit-learn and glmnet-style parameters.

    def __init__(self,
                 # --- Scikit-learn Style Parameters (The Default) ---
                 penalty='l2',
                 C=1.0,
    
                 # --- Glmnet-Style Parameters (The "Escape Hatch") ---
                 alpha=None,
                 lambda_path=None,
                 nlambda=100,
    
                 # --- Other Glmnet Features ---
                 standardize=True,
                 # ... other glmnet parameters
                 ):
        # ...
    
  • Core Methods: The class will implement all standard methods:
    • fit(self, X, y)
    • predict(self, X)
    • predict_proba(self, X)
    • get_params() / set_params()
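The parameter-handling part of this contract can be illustrated with the dependency-free sketch below. ParamsMixin is a hypothetical stand-in written only for illustration; in the real package these two methods would more likely come from inheriting sklearn.base.BaseEstimator. The point it shows is the convention the constructor must follow: store every argument unmodified under its own name, so that get_params() can rebuild an equivalent estimator.

```python
import inspect


class ParamsMixin:
    """Illustrative stand-in for the parameter handling of a Scikit-learn estimator."""

    @classmethod
    def _param_names(cls):
        # Parameter names are read from the constructor signature.
        sig = inspect.signature(cls.__init__)
        return sorted(name for name in sig.parameters if name != "self")

    def get_params(self, deep=True):
        # `deep` is accepted for signature compatibility; there are no
        # nested estimators in this sketch, so it is unused.
        return {name: getattr(self, name) for name in self._param_names()}

    def set_params(self, **params):
        valid = self._param_names()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self  # returning self allows chained calls


class LogisticRegression(ParamsMixin):
    def __init__(self, penalty="l2", C=1.0, alpha=None, lambda_path=None,
                 nlambda=100, standardize=True):
        # Store arguments as given; validation and translation belong in fit().
        self.penalty = penalty
        self.C = C
        self.alpha = alpha
        self.lambda_path = lambda_path
        self.nlambda = nlambda
        self.standardize = standardize


est = LogisticRegression(C=0.1, alpha=0.5)
# Rebuilding from get_params() is essentially how tools such as
# GridSearchCV copy an estimator before refitting it.
clone = LogisticRegression(**est.get_params())
```

Deferring validation to fit() keeps the constructor free of side effects, which is what allows set_params() to change a value after construction without leaving the object in an inconsistent state.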

Future Expansion

The glmnet library is capable of more than just logistic regression (e.g., linear, Poisson, Cox regression). The API will be designed with this in mind.

A potential future architecture would involve:

  • A base class, GlmNetEstimator, that handles the common logic of parameter translation and interaction with the C++ binding.

  • Specific child classes for different models, such as LogisticRegression, ElasticNet, etc., that inherit from the base class.

This ensures that as we expand the library’s functionality, we can do so in a clean, modular, and maintainable way.
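As a rough illustration of that layout, the sketch below puts the shared glmnet settings in the base class and lets each child declare only its model family. The method name _engine_kwargs and the shape of the dictionary passed to the C++ binding are assumptions made for this example; the family strings follow the names used by the R glmnet package.

```python
class GlmNetEstimator:
    """Hypothetical base class: shared parameter handling for all glmnet models."""

    family = None  # each subclass sets the glmnet family name

    def __init__(self, alpha=None, lambda_path=None, nlambda=100, standardize=True):
        self.alpha = alpha
        self.lambda_path = lambda_path
        self.nlambda = nlambda
        self.standardize = standardize

    def _engine_kwargs(self):
        # Collect the settings that would be handed to the C++ binding
        # (the binding's real interface is not defined here).
        if self.family is None:
            raise NotImplementedError("subclasses must define `family`")
        return {
            "family": self.family,
            "alpha": self.alpha,
            "lambda_path": self.lambda_path,
            "nlambda": self.nlambda,
            "standardize": self.standardize,
        }


class LogisticRegression(GlmNetEstimator):
    family = "binomial"


class ElasticNet(GlmNetEstimator):
    family = "gaussian"
```

Adding a Poisson or Cox model would then amount to one more small subclass, plus whatever model-specific prediction logic it needs.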