.. _development_roadmap: glmpynet: Development Roadmap ============================= Project Vision -------------- `glmpynet` is a Python package delivering a high-performance ``LogisticRegression`` implementation using the ``glmnetpp`` C++ library, designed to mirror ``scikit-learn``’s ``LogisticRegression`` API for user familiarity. Key objectives: * **Performance**: Utilize ``glmnetpp``’s optimized C++ solvers for fast logistic regression. * **Reliability**: Ensure robustness through comprehensive testing, error handling, and bug isolation. * **Reproducibility**: Use Bazel and Conda for consistent builds and environments. * **User Familiarity**: Provide a ``scikit-learn``-like API that works with minimal configuration. Development Strategy -------------------- The project will follow a phased, incremental development model that prioritizes a stable foundation and clear design before implementation. 1. **Foundation First:** We will begin by rigorously analyzing, building, and testing the core ``glmnetpp`` C++ engine to ensure a reliable foundation. 2. **API-First Design:** For each subsequent layer (the Python binding and the Scikit-learn API), we will first create documentation that defines the API contract. This ensures the implementation is guided by a clear and well-understood design. 3. **Incremental Implementation:** Each phase is broken down into small, testable tasks that can be completed and verified independently. Phased Roadmap -------------- Phase 1: `glmnetpp` Foundation Verification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * **Goal:** To systematically verify, debug, and confirm that the core C++ engine can be reliably built, tested, and benchmarked. * **Key Activities:** * Perform a static analysis of the Bazel build system to uncover all true dependencies and requirements. * Follow a detailed action plan to verify the environment, generate the ``.bazelrc``, and achieve a successful build. * Execute the full C++ unit test suite to confirm correctness. * Execute the C++ benchmark suite to establish a performance baseline. * **Deliverable:** A stable, reproducible build of the ``glmnetpp`` library and a complete set of passing tests. * **Detailed Plan:** See :doc:`phase_1_analysis` and :doc:`phase_1_action_plan`. Phase 2: Python-C++ Binding ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * **Goal:** To create a stable, low-level binding that exposes the necessary ``glmnetpp`` functions to Python. * **Key Activities:** * Design the binding's API, focusing initially on the functions required for logistic regression. * Implement the binding using ``pybind11``, with a focus on efficient data marshaling between NumPy and Eigen. * Write unit tests to verify that the bound C++ functions can be called correctly from Python. * **Deliverable:** A compiled Python extension module that wraps the core ``glmnetpp`` solver. * **Detailed Plan:** See :doc:`phase_2_binding`. Phase 3: Scikit-learn API Implementation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * **Goal:** To create a high-level, user-friendly ``LogisticNet`` class that is a drop-in replacement for ``scikit-learn``’s ``LogisticRegression``. * **Key Activities:** * Implement the ``fit``, ``predict``, and ``predict_proba`` methods, using the C++ binding for all computations. * Ensure full compatibility with the Scikit-learn ecosystem by passing the ``check_estimator`` tests. * Implement robust error handling and input validation. * **Deliverable:** A fully tested and documented ``LogisticNet`` Python class. * **Detailed Plan:** See :doc:`phase_3_sklearn_api`. Phase 4: CI/CD and Public Release ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * **Goal:** To automate the build and test process and deliver a stable, installable package to users. * **Key Activities:** * Configure a CircleCI pipeline to replicate the Conda/Bazel environment and run all C++ and Python tests on every commit. * Package the project for distribution on PyPI. * Publish the first official version. * **Deliverable:** A publicly available package on PyPI and a green CI pipeline. Known Risks & Open Questions ---------------------------- This project involves integrating with a complex C++ library whose build system has already proven to be incompletely documented. This introduces several known risks and open questions that will be addressed during Phase 1. * **Build System Instability:** The initial build failures demonstrate that the Bazel configuration is fragile and may contain other undiscovered issues. * **Test Suite Integrity:** A foundational assumption is that the ``glmnetpp`` test suite is complete and correct. We may discover that the tests themselves are broken or that the most critical logic is not covered. * **Binding Complexity:** The exact ``glmnetpp`` functions to bind for logistic regression are not yet identified. This will require code-level analysis of the C++ library. * **API Alignment:** A key challenge in Phase 3 will be aligning ``glmnetpp``’s internal parameters (e.g., for regularization) with ``scikit-learn``’s user-facing parameters (e.g., ``C``). Future Enhancements ------------------- Once the core functionality is delivered, future work will focus on expanding the library's capabilities. * **Full API Support:** Extend the binding to support all ``scikit-learn`` ``LogisticRegression`` parameters (e.g., ``C``, ``penalty``). * **Multi-Class Support:** Implement support for multi-class classification. * **Additional Models:** Add support for other models available in ``glmnet``, such as linear or Poisson regression.