glmpynet: Development Roadmap

Project Vision

glmpynet is a Python package that provides a high-performance logistic regression estimator built on the glmnetpp C++ library. Its API is designed to mirror scikit-learn’s LogisticRegression so that existing scikit-learn users can adopt it with little change.

Key objectives:

  • Performance: Utilize glmnetpp’s optimized C++ solvers for fast logistic regression.

  • Reliability: Ensure robustness through comprehensive testing, error handling, and bug isolation.

  • Reproducibility: Use Bazel and Conda for consistent builds and environments.

  • User Familiarity: Provide a scikit-learn-like API that works with minimal configuration.

Development Strategy

The project will follow a phased, incremental development model that prioritizes a stable foundation and clear design before implementation.

  1. Foundation First: We will begin by rigorously analyzing, building, and testing the core glmnetpp C++ engine to ensure a reliable foundation.

  2. API-First Design: For each subsequent layer (the Python binding and the scikit-learn API), we will first write documentation that defines the API contract, so that the implementation follows a clear, agreed design.

  3. Incremental Implementation: Each phase is broken down into small, testable tasks that can be completed and verified independently.

Phased Roadmap

Phase 1: glmnetpp Foundation Verification

  • Goal: To confirm that the core C++ engine can be reliably built, tested, and benchmarked, debugging the build wherever it fails.

  • Key Activities:
    • Perform a static analysis of the Bazel build system to uncover all true dependencies and requirements.

    • Follow a detailed action plan to verify the environment, generate the .bazelrc, and achieve a successful build.

    • Execute the full C++ unit test suite to confirm correctness.

    • Execute the C++ benchmark suite to establish a performance baseline.

  • Deliverable: A stable, reproducible build of the glmnetpp library and a complete set of passing tests.

  • Detailed Plan: See Phase 1: glmnetpp Foundation Analysis and Phase 1: glmnetpp Foundation Action Plan.

Phase 2: Python-C++ Binding

  • Goal: To create a stable, low-level binding that exposes the necessary glmnetpp functions to Python.

  • Key Activities:
    • Design the binding’s API, focusing initially on the functions required for logistic regression.

    • Implement the binding using pybind11, with a focus on efficient data marshaling between NumPy and Eigen.

    • Write unit tests to verify that the bound C++ functions can be called correctly from Python.

  • Deliverable: A compiled Python extension module that wraps the core glmnetpp solver.

  • Detailed Plan: See Phase 2: Python-C++ Binding.
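
To illustrate the data-marshaling concern on the Python side, here is a minimal sketch of preparing a NumPy array before it is handed to an Eigen-based binding. It assumes the binding accepts double-precision, column-major matrices (Eigen's default storage order), in which case pybind11 can pass the buffer by reference instead of copying it. The helper name as_eigen_compatible is hypothetical and is not part of glmnetpp or pybind11.

```python
import numpy as np


def as_eigen_compatible(X):
    """Return X as a float64, column-major (Fortran-ordered) array.

    Hypothetical helper: the layout a given binding expects depends on
    how its Eigen argument types are declared, which is an assumption here.
    """
    # np.asfortranarray converts only when needed; an array that is already
    # float64 and F-contiguous keeps its existing memory.
    return np.asfortranarray(X, dtype=np.float64)


# A typical C-ordered integer array is converted once, up front.
X = np.arange(6).reshape(2, 3)
X_ready = as_eigen_compatible(X)
print(X_ready.dtype, X_ready.flags.f_contiguous)
```

Doing the conversion once in the Python wrapper keeps the cost visible and avoids a hidden copy on every call into the extension module.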

Phase 3: Scikit-learn API Implementation

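  • Goal: To create a high-level, user-friendly LogisticNet class that can stand in for scikit-learn’s LogisticRegression in binary classification tasks. Full parameter parity and multi-class support are deferred to Future Enhancements.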

  • Key Activities:
    • Implement the fit, predict, and predict_proba methods, using the C++ binding for all computations.

    • Verify compatibility with the scikit-learn ecosystem by passing the check_estimator suite (sklearn.utils.estimator_checks).

    • Implement robust error handling and input validation.

  • Deliverable: A fully tested and documented LogisticNet Python class.

  • Detailed Plan: See Phase 3: Scikit-learn Compatible API.
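
As a rough sketch of the prediction side of this phase, the class below shows how predict_proba and predict could be derived from fitted coefficients for the binary case. It is illustrative only: the class name LogisticNetSketch is hypothetical, the coefficients are passed in directly rather than produced by the C++ solver, and the attributes are simplified to 1-D (scikit-learn stores coef_ with shape (1, n_features)).

```python
import numpy as np


class LogisticNetSketch:
    """Illustrative binary classifier; not the glmpynet implementation."""

    def __init__(self, coef, intercept, classes=(0, 1)):
        # In a real estimator, fit() would set these from the solver output.
        self.coef_ = np.asarray(coef, dtype=np.float64)
        self.intercept_ = float(intercept)
        self.classes_ = np.asarray(classes)

    def decision_function(self, X):
        # Linear score for each row of X.
        return np.asarray(X, dtype=np.float64) @ self.coef_ + self.intercept_

    def predict_proba(self, X):
        # Logistic (sigmoid) transform of the score gives P(y = classes_[1]).
        p1 = 1.0 / (1.0 + np.exp(-self.decision_function(X)))
        # Column order follows classes_, matching scikit-learn's convention.
        return np.column_stack([1.0 - p1, p1])

    def predict(self, X):
        # A strictly positive score selects the second class.
        idx = (self.decision_function(X) > 0).astype(int)
        return self.classes_[idx]


model = LogisticNetSketch(coef=[1.0, -1.0], intercept=0.0)
print(model.predict_proba([[2.0, 0.0], [0.0, 2.0]]))
print(model.predict([[2.0, 0.0], [0.0, 2.0]]))
```

A class meant to pass check_estimator would additionally need to derive from scikit-learn's base classes and validate its inputs, which this sketch leaves out.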

Phase 4: CI/CD and Public Release

  • Goal: To automate the build and test process and deliver a stable, installable package to users.

  • Key Activities:
    • Configure a CircleCI pipeline to replicate the Conda/Bazel environment and run all C++ and Python tests on every commit.

    • Package the project for distribution on PyPI.

    • Publish the first official version.

  • Deliverable: A publicly available package on PyPI and a green CI pipeline.

Known Risks & Open Questions

This project involves integrating with a complex C++ library whose build system has already proven to be incompletely documented. This introduces several known risks and open questions that will be addressed during Phase 1.

  • Build System Instability: The initial build failures demonstrate that the Bazel configuration is fragile and may contain other undiscovered issues.

  • Test Suite Integrity: A foundational assumption is that the glmnetpp test suite is complete and correct. We may discover that the tests themselves are broken or that the most critical logic is not covered.

  • Binding Complexity: The exact glmnetpp functions to bind for logistic regression are not yet identified. This will require code-level analysis of the C++ library.

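  • API Alignment: A key challenge in Phase 3 will be aligning glmnetpp’s internal regularization parameters (in glmnet terms, the penalty strength lambda and the elastic-net mixing parameter alpha) with scikit-learn’s user-facing parameters (e.g., C, the inverse of regularization strength).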
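
As an illustration of the alignment problem, the following sketch converts scikit-learn's C into a glmnet-style penalty strength. It assumes the standard glmnet objective (mean log-loss plus lambda times the elastic-net penalty) and scikit-learn's objective (C times the summed log-loss plus the penalty), under which lambda = 1 / (C * n_samples) and glmnet's alpha plays the role of l1_ratio. The function names are hypothetical, and the mapping has not been checked against glmnetpp's internals (for example, its handling of standardization or observation weights).

```python
def c_to_lambda(C, n_samples):
    """Convert scikit-learn's C to a glmnet-style lambda (assumed mapping)."""
    if C <= 0:
        raise ValueError("C must be positive")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Dividing scikit-learn's objective by C * n_samples yields the
    # glmnet form, leaving 1 / (C * n_samples) in front of the penalty.
    return 1.0 / (C * n_samples)


def lambda_to_c(lam, n_samples):
    """Inverse of c_to_lambda under the same assumptions."""
    if lam <= 0:
        raise ValueError("lambda must be positive")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return 1.0 / (lam * n_samples)


# With 100 training samples, the scikit-learn default C=1.0 maps to:
print(c_to_lambda(1.0, 100))
```

Because the conversion depends on the number of training samples, it can only happen inside fit, once the data is known, rather than at construction time.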

Future Enhancements

Once the core functionality is delivered, future work will focus on expanding the library’s capabilities.

  • Full API Support: Extend the binding to support all scikit-learn LogisticRegression parameters (e.g., C, penalty).

  • Multi-Class Support: Implement support for multi-class classification.

  • Additional Models: Add support for other models available in glmnet, such as linear or Poisson regression.