glmpynet: Development Roadmap
Project Vision
glmpynet is a Python package delivering a high-performance LogisticRegression implementation using the glmnetpp C++ library, designed to mirror scikit-learn’s LogisticRegression API for user familiarity.
Key objectives:
Performance: Utilize glmnetpp’s optimized C++ solvers for fast logistic regression.
Reliability: Ensure robustness through comprehensive testing, error handling, and bug isolation.
Reproducibility: Use Bazel and Conda for consistent builds and environments.
User Familiarity: Provide a scikit-learn-like API that works with minimal configuration.
Development Strategy
The project will follow a phased, incremental development model that prioritizes a stable foundation and clear design before implementation.
Foundation First: We will begin by rigorously analyzing, building, and testing the core glmnetpp C++ engine to ensure a reliable foundation.
API-First Design: For each subsequent layer (the Python binding and the Scikit-learn API), we will first create documentation that defines the API contract. This ensures the implementation is guided by a clear and well-understood design.
Incremental Implementation: Each phase is broken down into small, testable tasks that can be completed and verified independently.
Phased Roadmap
Phase 1: glmnetpp Foundation Verification
Goal: To systematically verify, debug, and confirm that the core C++ engine can be reliably built, tested, and benchmarked.
- Key Activities:
Perform a static analysis of the Bazel build system to uncover all true dependencies and requirements.
Follow a detailed action plan to verify the environment, generate the .bazelrc, and achieve a successful build.
Execute the full C++ unit test suite to confirm correctness.
Execute the C++ benchmark suite to establish a performance baseline.
Deliverable: A stable, reproducible build of the glmnetpp library and a complete set of passing tests.
Detailed Plan: See Phase 1: glmnetpp Foundation Analysis and Phase 1: glmnetpp Foundation Action Plan.
Phase 2: Python-C++ Binding
Goal: To create a stable, low-level binding that exposes the necessary glmnetpp functions to Python.
- Key Activities:
Design the binding’s API, focusing initially on the functions required for logistic regression.
Implement the binding using pybind11, with a focus on efficient data marshaling between NumPy and Eigen.
Write unit tests to verify that the bound C++ functions can be called correctly from Python.
Deliverable: A compiled Python extension module that wraps the core glmnetpp solver.
Detailed Plan: See Phase 2: Python-C++ Binding.
Phase 3: Scikit-learn API Implementation
Goal: To create a high-level, user-friendly LogisticNet class that is a drop-in replacement for scikit-learn’s LogisticRegression.
- Key Activities:
Implement the fit, predict, and predict_proba methods, using the C++ binding for all computations.
Ensure full compatibility with the Scikit-learn ecosystem by passing the check_estimator tests.
Implement robust error handling and input validation.
Deliverable: A fully tested and documented LogisticNet Python class.
Detailed Plan: See Phase 3: Scikit-learn Compatible API.
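To make the Phase 3 API contract more concrete, the following is a minimal sketch of what a LogisticNet skeleton might look like. Only the class name and the fit, predict, and predict_proba method names come from this roadmap. Everything else is an assumption for illustration: the C constructor parameter, the classes_, coef_, and intercept_ attribute names (borrowed from scikit-learn conventions), and above all _placeholder_solver, a pure-Python gradient-descent stand-in for the C++ binding, which does not exist yet. The sketch uses plain lists rather than NumPy arrays so it runs with the standard library alone.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _placeholder_solver(X, y, lam, n_iter=500, lr=0.5):
    """Hypothetical stand-in for the C++ binding: L2-penalized gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [lam * wj for wj in w], 0.0  # intercept is not penalized
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


class LogisticNet:
    """Illustrative skeleton only; not the project's actual implementation."""

    def __init__(self, C=1.0):
        self.C = C

    def fit(self, X, y):
        # Input validation, as called for in the Phase 3 activities.
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of samples")
        self.classes_ = sorted(set(y))
        if len(self.classes_) != 2:
            raise ValueError("only binary classification is supported")
        targets = [1.0 if yi == self.classes_[1] else 0.0 for yi in y]
        lam = 1.0 / (self.C * len(X))  # assumed mapping from C to a penalty strength
        self.coef_, self.intercept_ = _placeholder_solver(X, targets, lam)
        return self  # scikit-learn convention: fit returns the estimator

    def predict_proba(self, X):
        rows = []
        for xi in X:
            p1 = _sigmoid(sum(w * x for w, x in zip(self.coef_, xi)) + self.intercept_)
            rows.append([1.0 - p1, p1])
        return rows

    def predict(self, X):
        return [self.classes_[int(row[1] >= 0.5)] for row in self.predict_proba(X)]
```

A call such as LogisticNet(C=10.0).fit(X, y).predict(X) then mirrors the familiar scikit-learn workflow. In the real class, the placeholder solver would be replaced by a call into the Phase 2 extension module, and passing check_estimator would additionally require inheriting from scikit-learn's base classes.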
Phase 4: CI/CD and Public Release
Goal: To automate the build and test process and deliver a stable, installable package to users.
- Key Activities:
Configure a CircleCI pipeline to replicate the Conda/Bazel environment and run all C++ and Python tests on every commit.
Package the project for distribution on PyPI.
Publish the first official version.
Deliverable: A publicly available package on PyPI and a green CI pipeline.
Known Risks & Open Questions
This project involves integrating with a complex C++ library whose build system has already proven to be incompletely documented. This introduces several known risks and open questions that will be addressed during Phase 1.
Build System Instability: The initial build failures demonstrate that the Bazel configuration is fragile and may contain other undiscovered issues.
Test Suite Integrity: A foundational assumption is that the glmnetpp test suite is complete and correct. We may discover that the tests themselves are broken or that the most critical logic is not covered.
Binding Complexity: The exact glmnetpp functions to bind for logistic regression are not yet identified. This will require code-level analysis of the C++ library.
API Alignment: A key challenge in Phase 3 will be aligning glmnetpp’s internal parameters (e.g., for regularization) with scikit-learn’s user-facing parameters (e.g., C).
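As an illustration of the API Alignment question, the sketch below shows one plausible mapping between the two parameterizations. It assumes the objective documented for glmnet, (1/n) * sum(loss) + lambda * [(1 - alpha)/2 * ||w||_2^2 + alpha * ||w||_1], and scikit-learn's form, C * sum(loss) + penalty, with unit observation weights and no feature standardization. Dividing the scikit-learn objective by C * n gives lambda = 1 / (C * n) and alpha = l1_ratio. Whether glmnetpp exposes its parameters in this form is still an open question for Phase 1 and Phase 2, and the function name sklearn_to_glmnet is hypothetical.

```python
def sklearn_to_glmnet(C, n_samples, l1_ratio=0.0):
    """Map scikit-learn style (C, l1_ratio) to glmnet style (lambda, alpha).

    Assumes the textbook objectives described in the lead-in, not a
    verified glmnetpp interface.
    """
    if C <= 0:
        raise ValueError("C must be positive")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must lie in [0, 1]")
    lam = 1.0 / (C * n_samples)  # larger C means weaker regularization
    alpha = l1_ratio             # 0.0 is pure L2 (ridge), 1.0 is pure L1 (lasso)
    return lam, alpha
```

Because lambda depends on the number of samples, the conversion can only happen inside fit, once the training data is known, rather than in the constructor.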
Future Enhancements
Once the core functionality is delivered, future work will focus on expanding the library’s capabilities.
Full API Support: Extend the binding to support all scikit-learn LogisticRegression parameters (e.g., C, penalty).
Multi-Class Support: Implement support for multi-class classification.
Additional Models: Add support for other models available in glmnet, such as linear or Poisson regression.