
BUG: read_csv - file left open after UnicodeDecodeError when sep=None #39024

@davemfish

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import tempfile
import textwrap
import pandas
import os

workspace_dir = tempfile.mkdtemp()
csv_file = os.path.join(workspace_dir, 'non-utf8.csv')
# Encode as ISO 8859-5 (Cyrillic) and include non-ASCII characters so that
# decoding the file as UTF-8 raises UnicodeDecodeError.
with open(csv_file, 'w', encoding='iso8859_5') as file_obj:
    file_obj.write(textwrap.dedent(
        """
        header,
        fЮЮ,
        bar
        """
    ).strip())

try:
    dataframe = pandas.read_csv(csv_file, sep=None)
except UnicodeDecodeError:
    # On Windows this raises PermissionError because the file is still open.
    os.remove(csv_file)
    raise

Problem description

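On Windows, os.remove raises a PermissionError, apparently because the file handle that read_csv opened is still open after the UnicodeDecodeError. This happens only when sep=None is passed. Without that kwarg, os.remove succeeds and only the UnicodeDecodeError propagates, which is the expected behavior.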
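Until the handle is closed on error, one possible caller-side workaround is to open the file yourself and pass the handle in, so that a `with` block owns its lifetime. The sketch below is only an illustration: `read_with_owned_handle` and `demo` are hypothetical names, and the demo uses a stand-in reader instead of pandas so it runs with the standard library alone. It assumes the reader accepts an open text handle, which `pandas.read_csv` does.

```python
import os
import tempfile


def read_with_owned_handle(path, reader, encoding="utf-8", **kwargs):
    """Open `path` ourselves and pass the handle to `reader`.

    The `with` block closes the handle even if `reader` raises, so the
    caller can delete the file afterwards. `reader` is any callable that
    accepts an open text handle, for example `pandas.read_csv`.
    """
    with open(path, encoding=encoding, newline="") as handle:
        return reader(handle, **kwargs)


def demo():
    """Trigger the decode failure with a stand-in reader, then clean up."""
    workspace_dir = tempfile.mkdtemp()
    csv_file = os.path.join(workspace_dir, "non-utf8.csv")
    with open(csv_file, "w", encoding="iso8859_5") as file_obj:
        # "\u042e" is the Cyrillic letter used in the report's sample file
        file_obj.write("header,\nf\u042e\u042e,\nbar")

    def first_line(handle, **_unused):
        # Stand-in for pandas.read_csv: decoding happens on the first read.
        return handle.readline()

    try:
        read_with_owned_handle(csv_file, first_line)
    except UnicodeDecodeError:
        os.remove(csv_file)  # the handle is already closed at this point
        os.rmdir(workspace_dir)
        return True
    return False
```

With pandas installed, the call would look like `read_with_owned_handle(csv_file, pandas.read_csv, sep=None)`. This has not been verified against every pandas version.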

Expected Output

Traceback (most recent call last):
  File "..\scratch\pandas_file_handle.py", line 19, in <module>
    dataframe = pandas.read_csv(csv_file, sep=None)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 605, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 457, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 814, in __init__
    self._engine = self._make_engine(self.engine)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 1045, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 2291, in __init__
    self._make_reader(self.handles.handle)
  File "C:\Users\dmf\projects\invest\env\lib\site-packages\pandas\io\parsers.py", line 2412, in _make_reader
    line = f.readline()
  File "C:\Users\dmf\projects\invest\env\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 10: invalid continuation byte
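
The traceback shows the exception coming from f.readline() inside _make_reader, which runs while the parser object is still being constructed. At that point the file is already open, but nothing has been returned that the caller could close. The following is a simplified model of that pattern and of one way to guard against it. It is not pandas' actual code, and the names `sniff_leaky`, `sniff_safe`, and `closed_after_failure` are hypothetical.

```python
import os
import tempfile


def sniff_leaky(path, opened):
    """Open the file and read one line, with no cleanup on failure."""
    handle = open(path, encoding="utf-8")
    opened.append(handle)  # keep a reference so the example can inspect it
    handle.readline()      # raises UnicodeDecodeError for non-UTF-8 bytes
    return handle


def sniff_safe(path, opened):
    """Same as sniff_leaky, but close the handle before re-raising."""
    handle = open(path, encoding="utf-8")
    opened.append(handle)
    try:
        handle.readline()
    except Exception:
        handle.close()
        raise
    return handle


def closed_after_failure(sniff):
    """Return True if `sniff` left its handle closed after a decode error."""
    workspace_dir = tempfile.mkdtemp()
    path = os.path.join(workspace_dir, "non-utf8.csv")
    with open(path, "wb") as file_obj:
        # 0xCE is the ISO 8859-5 byte named in the traceback above
        file_obj.write(b"header,\nf\xce\xce,\nbar")

    opened = []
    try:
        sniff(path, opened)
    except UnicodeDecodeError:
        pass
    was_closed = opened[0].closed
    opened[0].close()  # release the handle so the file can be removed on Windows
    os.remove(path)
    os.rmdir(workspace_dir)
    return was_closed


if __name__ == "__main__":
    print("leaky:", closed_after_failure(sniff_leaky))
    print("safe:", closed_after_failure(sniff_safe))
```

In the leaky variant the handle stays open until something else closes it, which on Windows is enough to make os.remove fail.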

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.7.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.2.0
numpy : 1.19.2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : None
