Commit 459a789

meeseeksmachine and Rik-de-Kort authored and committed
Backport PR #29836: ENH: XLSB support (#31166)
Co-authored-by: Rik-de-Kort <[email protected]>
1 parent ecb0527 commit 459a789

32 files changed, +185 -14 lines changed

ci/deps/azure-37-locale.yaml

Lines changed: 3 additions & 0 deletions
@@ -34,3 +34,6 @@ dependencies:
   - xlsxwriter
   - xlwt
   - pyarrow>=0.15
+  - pip
+  - pip:
+    - pyxlsb

ci/deps/azure-macos-36.yaml

Lines changed: 1 addition & 0 deletions
@@ -33,3 +33,4 @@ dependencies:
   - pip
   - pip:
     - pyreadstat
+    - pyxlsb

ci/deps/azure-windows-37.yaml

Lines changed: 3 additions & 0 deletions
@@ -35,3 +35,6 @@ dependencies:
   - xlsxwriter
   - xlwt
   - pyreadstat
+  - pip
+  - pip:
+    - pyxlsb

ci/deps/travis-36-cov.yaml

Lines changed: 1 addition & 0 deletions
@@ -51,3 +51,4 @@ dependencies:
     - coverage
     - pandas-datareader
     - python-dateutil
+    - pyxlsb

doc/source/getting_started/install.rst

Lines changed: 1 addition & 0 deletions
@@ -264,6 +264,7 @@ pyarrow 0.12.0 Parquet, ORC (requires 0.13.0), and
 pymysql                   0.7.11             MySQL engine for sqlalchemy
 pyreadstat                                   SPSS files (.sav) reading
 pytables                  3.4.2              HDF5 reading / writing
+pyxlsb                    1.0.5              Reading for xlsb files
 qtpy                                         Clipboard I/O
 s3fs                      0.3.0              Amazon S3 access
 tabulate                  0.8.3              Printing in Markdown-friendly format (see `tabulate`_)

doc/source/user_guide/io.rst

Lines changed: 27 additions & 2 deletions
@@ -23,7 +23,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
     text;`JSON <https://www.json.org/>`__;:ref:`read_json<io.json_reader>`;:ref:`to_json<io.json_writer>`
     text;`HTML <https://en.wikipedia.org/wiki/HTML>`__;:ref:`read_html<io.read_html>`;:ref:`to_html<io.html>`
     text; Local clipboard;:ref:`read_clipboard<io.clipboard>`;:ref:`to_clipboard<io.clipboard>`
-    binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
+    ;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
     binary;`OpenDocument <http://www.opendocumentformat.org>`__;:ref:`read_excel<io.ods>`;
     binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
     binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
@@ -2768,7 +2768,8 @@ Excel files

 The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``)
 files using the ``xlrd`` Python module. Excel 2007+ (``.xlsx``) files
-can be read using either ``xlrd`` or ``openpyxl``.
+can be read using either ``xlrd`` or ``openpyxl``. Binary Excel (``.xlsb``)
+files can be read using ``pyxlsb``.
 The :meth:`~DataFrame.to_excel` instance method is used for
 saving a ``DataFrame`` to Excel. Generally the semantics are
 similar to working with :ref:`csv<io.read_csv_table>` data.
@@ -3229,6 +3230,30 @@ OpenDocument spreadsheets match what can be done for `Excel files`_ using
    Currently pandas only supports *reading* OpenDocument spreadsheets. Writing
    is not implemented.

+.. _io.xlsb:
+
+Binary Excel (.xlsb) files
+--------------------------
+
+.. versionadded:: 1.0.0
+
+The :func:`~pandas.read_excel` method can also read binary Excel files
+using the ``pyxlsb`` module. The semantics and features for reading
+binary Excel files mostly match what can be done for `Excel files`_ using
+``engine='pyxlsb'``. ``pyxlsb`` does not recognize datetime types
+in files and will return floats instead.
+
+.. code-block:: python
+
+   # Returns a DataFrame
+   pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
+
+.. note::
+
+   Currently pandas only supports *reading* binary Excel files. Writing
+   is not implemented.
+
+
 .. _io.clipboard:

 Clipboard
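
Note on the `.xlsb` documentation added in this file: it states that ``pyxlsb`` does not recognize datetime types and returns floats instead. Assuming those floats are Excel serial dates in the default 1900 date system (an assumption; workbooks saved with the 1904 date system need a different epoch), a minimal sketch of converting them back might look like the following. The helper name `excel_serial_to_datetime` is hypothetical and is not part of pandas or pyxlsb.

```python
from datetime import datetime, timedelta

# Assumption: the workbook uses Excel's 1900 date system. Using 1899-12-30
# as day zero absorbs Excel's fictitious 1900-02-29, so the result is
# correct for serial dates from 1900-03-01 onward.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial date (days since the epoch) to a datetime.

    Hypothetical helper for illustration; the fractional part of the
    serial number encodes the time of day.
    """
    return EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_datetime(43831.5))  # 2020-01-01 12:00:00
```

For a whole column, the same conversion is commonly written in pandas as `pd.to_datetime(values, unit="D", origin="1899-12-30")`, under the same date-system assumption.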

doc/source/whatsnew/v1.0.0.rst

Lines changed: 2 additions & 1 deletion
@@ -215,7 +215,8 @@ Other enhancements
 - :meth:`Styler.format` added the ``na_rep`` parameter to help format the missing values (:issue:`21527`, :issue:`28358`)
 - Roundtripping DataFrames with nullable integer, string and period data types to parquet
   (:meth:`~DataFrame.to_parquet` / :func:`read_parquet`) using the `'pyarrow'` engine
-  now preserve those data types with pyarrow >= 0.16.0 (:issue:`20612`, :issue:`28371`).
+  now preserve those data types with pyarrow >= 1.0.0 (:issue:`20612`).
+- :func:`read_excel` now can read binary Excel (``.xlsb``) files by passing ``engine='pyxlsb'``. For more details and example usage, see the :ref:`Binary Excel files documentation <io.xlsb>`. Closes :issue:`8540`.
 - The ``partition_cols`` argument in :meth:`DataFrame.to_parquet` now accepts a string (:issue:`27117`)
 - :func:`pandas.read_json` now parses ``NaN``, ``Infinity`` and ``-Infinity`` (:issue:`12213`)
 - :func:`to_parquet` now appropriately handles the ``schema`` argument for user defined schemas in the pyarrow engine. (:issue:`30270`)

pandas/compat/_optional.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
     "pyarrow": "0.13.0",
     "pytables": "3.4.2",
     "pytest": "5.0.1",
+    "pyxlsb": "1.0.5",
     "s3fs": "0.3.0",
     "scipy": "0.19.0",
     "sqlalchemy": "1.1.4",

pandas/core/config_init.py

Lines changed: 8 additions & 0 deletions
@@ -479,6 +479,7 @@ def use_inf_as_na_cb(key):
 _xlsm_options = ["xlrd", "openpyxl"]
 _xlsx_options = ["xlrd", "openpyxl"]
 _ods_options = ["odf"]
+_xlsb_options = ["pyxlsb"]


 with cf.config_prefix("io.excel.xls"):
@@ -515,6 +516,13 @@ def use_inf_as_na_cb(key):
         validator=str,
     )

+with cf.config_prefix("io.excel.xlsb"):
+    cf.register_option(
+        "reader",
+        "auto",
+        reader_engine_doc.format(ext="xlsb", others=", ".join(_xlsb_options)),
+        validator=str,
+    )

 # Set up the io.excel specific writer configuration.
 writer_engine_doc = """

pandas/io/excel/_base.py

Lines changed: 12 additions & 5 deletions
@@ -35,8 +35,9 @@
     """
 Read an Excel file into a pandas DataFrame.

-Support both `xls` and `xlsx` file extensions from a local filesystem or URL.
-Support an option to read a single sheet or a list of sheets.
+Supports `xls`, `xlsx`, `xlsm`, `xlsb`, and `odf` file extensions
+read from a local filesystem or URL. Supports an option to read
+a single sheet or a list of sheets.

 Parameters
 ----------
@@ -789,15 +790,21 @@ class ExcelFile:
         If a string or path object, expected to be a path to xls, xlsx or odf file.
     engine : str, default None
         If io is not a buffer or path, this must be set to identify io.
-        Acceptable values are None, ``xlrd``, ``openpyxl`` or ``odf``.
+        Acceptable values are None, ``xlrd``, ``openpyxl``, ``odf``, or ``pyxlsb``.
         Note that ``odf`` reads tables out of OpenDocument formatted files.
     """

     from pandas.io.excel._odfreader import _ODFReader
     from pandas.io.excel._openpyxl import _OpenpyxlReader
     from pandas.io.excel._xlrd import _XlrdReader
-
-    _engines = {"xlrd": _XlrdReader, "openpyxl": _OpenpyxlReader, "odf": _ODFReader}
+    from pandas.io.excel._pyxlsb import _PyxlsbReader
+
+    _engines = {
+        "xlrd": _XlrdReader,
+        "openpyxl": _OpenpyxlReader,
+        "odf": _ODFReader,
+        "pyxlsb": _PyxlsbReader,
+    }

     def __init__(self, io, engine=None):
         if engine is None:
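
Note on the `_engines` change in this file: the mapping works as a name-to-class registry, so adding a reader amounts to adding one entry. The sketch below illustrates that dispatch pattern in isolation. The classes `XlrdReader` and `PyxlsbReader`, the function `make_reader`, and the fallback to `"xlrd"` are all hypothetical stand-ins chosen for illustration; they are not pandas' actual implementation, which is cut off in the hunk above.

```python
class XlrdReader:
    """Hypothetical stand-in for an .xls/.xlsx reader class."""

    def __init__(self, path):
        self.path = path


class PyxlsbReader:
    """Hypothetical stand-in for an .xlsb reader class."""

    def __init__(self, path):
        self.path = path


# Registry mapping an engine name to the class that implements it.
ENGINES = {
    "xlrd": XlrdReader,
    "pyxlsb": PyxlsbReader,
}


def make_reader(path, engine=None):
    """Look up the reader class for `engine` and instantiate it."""
    if engine is None:
        engine = "xlrd"  # assumed default, for this sketch only
    if engine not in ENGINES:
        raise ValueError(f"Unknown engine: {engine}")
    return ENGINES[engine](path)


reader = make_reader("path_to_file.xlsb", engine="pyxlsb")
print(type(reader).__name__)  # PyxlsbReader
```

Keeping the lookup in one dictionary means an unsupported engine name fails early with a clear error instead of surfacing later during parsing.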
