glmpynet: Development Roadmap

Project Vision

glmpynet is a Python package that provides a high-performance logistic regression estimator backed by the glmnetpp C++ library. Its API is designed to mirror scikit-learn’s LogisticRegression so that it feels familiar to existing users.

Key objectives:

Performance: Utilize glmnetpp’s optimized C++ solvers for fast logistic regression.
Reliability: Ensure robustness through comprehensive testing, error handling, and bug isolation.
Reproducibility: Use Bazel and Conda for consistent builds and environments.
User Familiarity: Provide a scikit-learn-like API that works with minimal configuration.

Development Strategy

The project will follow a phased, incremental development model that prioritizes a stable foundation and clear design before implementation.

Foundation First: We will begin by rigorously analyzing, building, and testing the core glmnetpp C++ engine to ensure a reliable foundation.
API-First Design: For each subsequent layer (the Python binding and the scikit-learn API), we will first write documentation that defines the API contract, so that the implementation follows an agreed design rather than shaping it after the fact.
Incremental Implementation: Each phase is broken down into small, testable tasks that can be completed and verified independently.

Phased Roadmap

Phase 1: glmnetpp Foundation Verification

Goal: To systematically verify, debug, and confirm that the core C++ engine can be reliably built, tested, and benchmarked.
Key Activities:
- Perform a static analysis of the Bazel build system to uncover all true dependencies and requirements.
- Follow a detailed action plan to verify the environment, generate the .bazelrc, and achieve a successful build.
- Execute the full C++ unit test suite to confirm correctness.
- Execute the C++ benchmark suite to establish a performance baseline.
Deliverable: A stable, reproducible build of the glmnetpp library and a complete set of passing tests.
Detailed Plan: See Phase 1: glmnetpp Foundation Analysis and Phase 1: glmnetpp Foundation Action Plan.

Phase 2: Python-C++ Binding

Goal: To create a stable, low-level binding that exposes the necessary glmnetpp functions to Python.
Key Activities:
- Design the binding’s API, focusing initially on the functions required for logistic regression.
- Implement the binding using pybind11, with a focus on efficient data marshaling between NumPy and Eigen.
- Write unit tests to verify that the bound C++ functions can be called correctly from Python.
Deliverable: A compiled Python extension module that wraps the core glmnetpp solver.
Detailed Plan: See Phase 2: Python-C++ Binding.
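
The data-marshaling concern above can be illustrated from the Python side. Eigen matrices are column-major by default, and pybind11's Eigen support can generally reference NumPy memory without copying only when the dtype and storage order match the Eigen type. The sketch below assumes the binding will accept double-precision, column-major matrices, which has not yet been confirmed for glmnetpp; the helper name to_eigen_layout is hypothetical.

```python
import numpy as np


def to_eigen_layout(X):
    """Return X as a 2-D float64 array in column-major (Fortran) order.

    Assumption: the future binding takes column-major double matrices,
    which is Eigen's default layout. This helper is illustrative only.
    """
    # asfortranarray copies only when dtype or memory order must change.
    X = np.asfortranarray(X, dtype=np.float64)
    if X.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {X.ndim} dimension(s)")
    return X
```

An array that is already float64 and Fortran-ordered passes through without a copy, so callers who prepare their data in that layout pay no conversion cost at the binding boundary.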

Phase 3: Scikit-learn API Implementation

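Goal: To create a high-level, user-friendly LogisticNet class that mirrors the interface of scikit-learn’s LogisticRegression. The initial release targets binary classification; multi-class support and the remaining parameters are listed under Future Enhancements.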
Key Activities:
- Implement the fit, predict, and predict_proba methods, using the C++ binding for all computations.
- Verify compatibility with the scikit-learn ecosystem by passing the checks run by sklearn.utils.estimator_checks.check_estimator.
- Implement robust error handling and input validation.
Deliverable: A fully tested and documented LogisticNet Python class.
Detailed Plan: See Phase 3: Scikit-learn Compatible API.
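
As a rough illustration of the shape this class might take, the sketch below follows scikit-learn's conventions for fit, predict, and predict_proba, including the fitted attributes classes_, coef_, and intercept_. It is not the planned implementation: the class name LogisticNetSketch and the function _placeholder_solver are hypothetical, and the solver is plain unpenalized gradient descent in NumPy standing in for the glmnetpp binding, which does not exist yet. A real estimator would also need to inherit from scikit-learn's BaseEstimator and ClassifierMixin to pass check_estimator; that is omitted here to keep the example dependency-light.

```python
import numpy as np


def _placeholder_solver(X, y01, n_iter=500, lr=0.1):
    """Hypothetical stand-in for the glmnetpp binding.

    Runs unpenalized gradient descent on the mean logistic loss.
    """
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    intercept = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ coef + intercept)))
        residual = prob - y01
        coef -= lr * (X.T @ residual) / n_samples
        intercept -= lr * residual.mean()
    return coef, intercept


class LogisticNetSketch:
    """Illustrative binary classifier with scikit-learn-style methods."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y)
        # Input validation: shapes, finiteness, and exactly two classes.
        if X.ndim != 2:
            raise ValueError("X must be a 2-D array")
        if y.ndim != 1 or y.shape[0] != X.shape[0]:
            raise ValueError("y must be 1-D with one label per row of X")
        if not np.all(np.isfinite(X)):
            raise ValueError("X contains NaN or infinite values")
        self.classes_ = np.unique(y)
        if self.classes_.size != 2:
            raise ValueError("this sketch supports binary targets only")
        # Encode the second class (in sorted order) as the positive class.
        y01 = (y == self.classes_[1]).astype(np.float64)
        coef, intercept = _placeholder_solver(X, y01)
        self.coef_ = coef.reshape(1, -1)
        self.intercept_ = np.array([intercept])
        return self

    def decision_function(self, X):
        X = np.asarray(X, dtype=np.float64)
        return X @ self.coef_.ravel() + self.intercept_[0]

    def predict_proba(self, X):
        p1 = 1.0 / (1.0 + np.exp(-self.decision_function(X)))
        # Columns follow the order of classes_, as in scikit-learn.
        return np.column_stack([1.0 - p1, p1])

    def predict(self, X):
        positive = (self.decision_function(X) > 0).astype(int)
        return self.classes_[positive]
```

Keeping the solver behind a single function call means the placeholder can later be swapped for the compiled extension module without changing the public methods.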

Phase 4: CI/CD and Public Release

Goal: To automate the build and test process and deliver a stable, installable package to users.
Key Activities:
- Configure a CircleCI pipeline to replicate the Conda/Bazel environment and run all C++ and Python tests on every commit.
- Package the project for distribution on PyPI.
- Publish the first official version.
Deliverable: A publicly available package on PyPI and a green CI pipeline.

Known Risks & Open Questions

This project involves integrating with a complex C++ library whose build system has already proven to be incompletely documented. This introduces several known risks and open questions that will be addressed during Phase 1.

Build System Instability: The initial build failures demonstrate that the Bazel configuration is fragile and may contain other undiscovered issues.
Test Suite Integrity: A foundational assumption is that the glmnetpp test suite is complete and correct. We may discover that the tests themselves are broken or that the most critical logic is not covered.
Binding Complexity: The exact glmnetpp functions to bind for logistic regression are not yet identified. This will require code-level analysis of the C++ library.
API Alignment: A key challenge in Phase 3 will be aligning glmnetpp’s internal parameters (e.g., for regularization) with scikit-learn’s user-facing parameters (e.g., C).
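
To make the API alignment question concrete, here is a minimal sketch of one possible mapping. scikit-learn's elastic-net logistic objective is C * sum(loss) + l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2, while glmnet's documented objective is (1 / n) * sum(loss) + lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2). Dividing the first by C * n suggests lambda = 1 / (C * n) and alpha = l1_ratio. This assumes unit observation weights, no internal standardization of features, and an unpenalized intercept; whether glmnetpp exposes lambda and alpha in exactly this form still has to be confirmed by the code-level analysis. The function name is hypothetical.

```python
def sklearn_to_glmnet_penalty(C, n_samples, l1_ratio=0.0):
    """Map scikit-learn's (C, l1_ratio) to glmnet-style (lambda, alpha).

    Assumptions (not yet verified against glmnetpp): unit observation
    weights, no feature standardization, and an unpenalized intercept.
    """
    if C <= 0:
        raise ValueError("C must be positive")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must lie in [0, 1]")
    # Rescaling scikit-learn's objective by 1 / (C * n) matches glmnet's form.
    lam = 1.0 / (C * n_samples)
    alpha = float(l1_ratio)
    return lam, alpha
```

Note that the mapping depends on the number of training samples, so the conversion has to happen inside fit rather than at construction time.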

Future Enhancements

Once the core functionality is delivered, future work will focus on expanding the library’s capabilities.

Full API Support: Extend the binding to support all scikit-learn LogisticRegression parameters (e.g., C, penalty).
Multi-Class Support: Implement support for multi-class classification.
Additional Models: Add support for other models available in glmnet, such as linear or Poisson regression.