diff --git a/doc/source/development/index.rst b/doc/source/development/index.rst
index 69f04494a271c..aa7e7845bfa7a 100644
--- a/doc/source/development/index.rst
+++ b/doc/source/development/index.rst
@@ -23,5 +23,4 @@ Development
     extending
     developer
     policies
-    roadmap
     community
diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
deleted file mode 100644
index 2d142453fb735..0000000000000
--- a/doc/source/development/roadmap.rst
+++ /dev/null
@@ -1,250 +0,0 @@
-.. _roadmap:
-
-=======
-Roadmap
-=======
-
-This page provides an overview of the major themes in pandas' development. Each of
-these items requires a relatively large amount of effort to implement. These may
-be achieved more quickly with dedicated funding or interest from contributors.
-
-An item being on the roadmap does not mean that it will *necessarily* happen, even
-with unlimited funding. During the implementation period we may discover issues
-preventing the adoption of the feature.
-
-Additionally, an item *not* being on the roadmap does not exclude it from inclusion
-in pandas. The roadmap is intended for larger, fundamental changes to the project that
-are likely to take months or years of developer time. Smaller-scoped items will continue
-to be tracked on our `issue tracker `__.
-
-See :ref:`roadmap.evolution` for proposing changes to this document.
-
-Extensibility
--------------
-
-pandas :ref:`extending.extension-types` allow for extending NumPy types with custom
-data types and array storage. pandas uses extension types internally, and provides
-an interface for 3rd-party libraries to define their own custom data types.
-
-Many parts of pandas still unintentionally convert data to a NumPy array.
-These problems are especially pronounced for nested data.
-
-We'd like to improve the handling of extension arrays throughout the library,
-making their behavior more consistent with the handling of NumPy arrays. We'll do this
-by cleaning up pandas' internals and adding new methods to the extension array interface.
-
-String data type
-----------------
-
-Currently, pandas stores text data in an ``object`` -dtype NumPy array.
-The current implementation has two primary drawbacks: First, ``object`` -dtype
-is not specific to strings: any Python object can be stored in an ``object`` -dtype
-array, not just strings. Second: this is not efficient. The NumPy memory model
-isn't especially well-suited to variable width text data.
-
-To solve the first issue, we propose a new extension type for string data. This
-will initially be opt-in, with users explicitly requesting ``dtype="string"``.
-The array backing this string dtype may initially be the current implementation:
-an ``object`` -dtype NumPy array of Python strings.
-
-To solve the second issue (performance), we'll explore alternative in-memory
-array libraries (for example, Apache Arrow). As part of the work, we may
-need to implement certain operations expected by pandas users (for example
-the algorithm used in, ``Series.str.upper``). That work may be done outside of
-pandas.
-
-Consistent missing value handling
----------------------------------
-
-Currently, pandas handles missing data differently for different data types. We
-use different types to indicate that a value is missing (``np.nan`` for
-floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically
-strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
-data). Integer data cannot store missing data or are cast to float. In addition,
-pandas 1.0 introduced a new missing value sentinel, ``pd.NA``, which is being
-used for the experimental nullable integer, boolean, and string data types.
-
-These different missing values have different behaviors in user-facing
-operations. Specifically, we introduced different semantics for the nullable
-data types for certain operations (e.g. propagating in comparison operations
-instead of comparing as False).
-
-Long term, we want to introduce consistent missing data handling for all data
-types. This includes consistent behavior in all operations (indexing, arithmetic
-operations, comparisons, etc.). There has been discussion of eventually making
-the new semantics the default.
-
-This has been discussed at :issue:`28095` (and
-linked issues), and described in more detail in this
-`design doc `__.
-
-Apache Arrow interoperability
------------------------------
-
-`Apache Arrow `__ is a cross-language development
-platform for in-memory data. The Arrow logical types are closely aligned with
-typical pandas use cases.
-
-We'd like to provide better-integrated support for Arrow memory and data types
-within pandas. This will let us take advantage of its I/O capabilities and
-provide for better interoperability with other languages and libraries
-using Arrow.
-
-Block manager rewrite
----------------------
-
-We'd like to replace pandas current internal data structures (a collection of
-1 or 2-D arrays) with a simpler collection of 1-D arrays.
-
-pandas internal data model is quite complex. A DataFrame is made up of
-one or more 2-dimensional "blocks", with one or more blocks per dtype. This
-collection of 2-D arrays is managed by the BlockManager.
-
-The primary benefit of the BlockManager is improved performance on certain
-operations (construction from a 2D array, binary operations, reductions across the columns),
-especially for wide DataFrames. However, the BlockManager substantially increases the
-complexity and maintenance burden of pandas.
-
-By replacing the BlockManager we hope to achieve
-
-* Substantially simpler code
-* Easier extensibility with new logical types
-* Better user control over memory use and layout
-* Improved micro-performance
-* Option to provide a C / Cython API to pandas' internals
-
-See `these design documents `__
-for more.
-
-Decoupling of indexing and internals
-------------------------------------
-
-The code for getting and setting values in pandas' data structures needs refactoring.
-In particular, we must clearly separate code that converts keys (e.g., the argument
-to ``DataFrame.loc``) to positions from code that uses these positions to get
-or set values. This is related to the proposed BlockManager rewrite. Currently, the
-BlockManager sometimes uses label-based, rather than position-based, indexing.
-We propose that it should only work with positional indexing, and the translation of keys
-to positions should be entirely done at a higher level.
-
-Indexing is a complicated API with many subtleties. This refactor will require care
-and attention. The following principles should inspire refactoring of indexing code and
-should result on cleaner, simpler, and more performant code.
-
-1. **Label indexing must never involve looking in an axis twice for the same label(s).**
-This implies that any validation step must either:
-
-  * limit validation to general features (e.g. dtype/structure of the key/index), or
-  * reuse the result for the actual indexing.
-
-2. **Indexers must never rely on an explicit call to other indexers.**
-For instance, it is OK to have some internal method of ``.loc`` call some
-internal method of ``__getitem__`` (or of their common base class),
-but never in the code flow of ``.loc`` should ``the_obj[something]`` appear.
-
-3. **Execution of positional indexing must never involve labels** (as currently, sadly, happens).
-That is, the code flow of a getter call (or a setter call in which the right hand side is non-indexed)
-to ``.iloc`` should never involve the axes of the object in any way.
-
-4. **Indexing must never involve accessing/modifying values** (i.e., act on ``._data`` or ``.values``) **more than once.**
-The following steps must hence be clearly decoupled:
-
-  * find positions we need to access/modify on each axis
-  * (if we are accessing) derive the type of object we need to return (dimensionality)
-  * actually access/modify the values
-  * (if we are accessing) construct the return object
-
-5. As a corollary to the decoupling between 4.i and 4.iii, **any code which deals on how data is stored**
-(including any combination of handling multiple dtypes, and sparse storage, categoricals, third-party types)
-**must be independent from code that deals with identifying affected rows/columns**,
-and take place only once step 4.i is completed.
-
-  * In particular, such code should most probably not live in ``pandas/core/indexing.py``
-  * ... and must not depend in any way on the type(s) of axes (e.g. no ``MultiIndex`` special cases)
-
-6. As a corollary to point 1.i, **``Index`` (sub)classes must provide separate methods for any desired validity check of label(s) which does not involve actual lookup**,
-on the one side, and for any required conversion/adaptation/lookup of label(s), on the other.
-
-7. **Use of trial and error should be limited**, and anyway restricted to catch only exceptions
-which are actually expected (typically ``KeyError``).
-
-  * In particular, code should never (intentionally) raise new exceptions in the ``except`` portion of a ``try... exception``
-
-8. **Any code portion which is not specific to setters and getters must be shared**,
-and when small differences in behavior are expected (e.g. getting with ``.loc`` raises for
-missing labels, setting still doesn't), they can be managed with a specific parameter.
-
-Numba-accelerated operations
-----------------------------
-
-`Numba `__ is a JIT compiler for Python code. We'd like to provide
-ways for users to apply their own Numba-jitted functions where pandas accepts user-defined functions
-(for example, :meth:`Series.apply`, :meth:`DataFrame.apply`, :meth:`DataFrame.map`,
-and in groupby and window contexts). This will improve the performance of
-user-defined-functions in these operations by staying within compiled code.
-
-Performance monitoring
-----------------------
-
-pandas uses `airspeed velocity `__ to
-monitor for performance regressions. ASV itself is a fabulous tool, but requires
-some additional work to be integrated into an open source project's workflow.
-
-The `asv-runner `__ organization, currently made up
-of pandas maintainers, provides tools built on top of ASV. We have a physical
-machine for running a number of project's benchmarks, and tools managing the
-benchmark runs and reporting on results.
-
-We'd like to fund improvements and maintenance of these tools to
-
-* Be more stable. Currently, they're maintained on the nights and weekends when
-  a maintainer has free time.
-* Tune the system for benchmarks to improve stability, following
-  https://pyperf.readthedocs.io/en/latest/system.html
-* Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
-  benchmarks are only run nightly.
-
-.. _roadmap.evolution:
-
-Roadmap evolution
------------------
-
-pandas continues to evolve. The direction is primarily determined by community
-interest. Everyone is welcome to review existing items on the roadmap and
-to propose a new item.
-
-Each item on the roadmap should be a short summary of a larger design proposal.
-The proposal should include
-
-1. Short summary of the changes, which would be appropriate for inclusion in
-   the roadmap if accepted.
-2. Motivation for the changes.
-3. An explanation of why the change is in scope for pandas.
-4. Detailed design: Preferably with example-usage (even if not implemented yet)
-   and API documentation
-5. API Change: Any API changes that may result from the proposal.
-
-That proposal may then be submitted as a GitHub issue, where the pandas maintainers
-can review and comment on the design. The `pandas mailing list `__
-should be notified of the proposal.
-
-When there's agreement that an implementation
-would be welcome, the roadmap should be updated to include the summary and a
-link to the discussion issue.
-
-Completed items
----------------
-
-This section records now completed items from the pandas roadmap.
-
-Documentation improvements
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-We improved the pandas documentation
-
-* The pandas community worked with others to build the `pydata-sphinx-theme`_,
-  which is now used for https://pandas.pydata.org/docs/ (:issue:`15556`).
-* :ref:`getting_started` contains a number of resources intended for new
-  pandas users coming from a variety of backgrounds (:issue:`26831`).
-
-.. _pydata-sphinx-theme: https://github.com/pydata/pydata-sphinx-theme