Skip to content

[Windows bug?] when appending to an already-open file to_csv() treats first data row in dataframe as the header #21577

@jeremymatt

Description

@jeremymatt

TL;DR version

to_csv() ignores the header row and instead treats the first line of data as the header when using a file object in write mode to:

  1. write a few lines of text to the file, and then
  2. use pd.to_csv() to append a dataframe to the file

This problem does appear on Win7, Win10, and a Mint VM when pandas 0.23.1 is installed.
This problem does not appear on Win10 or Mint VM when pandas 0.22.0 is installed

All testing I've done has been using either Spyder or Jupyter Notebook installed as part of Anaconda. The use of Spyder vs Jupyter does not seem to matter. The version of Spyder (3.2.6 vs. 3.2.8) and the version of Jupyter (5.4.0 vs. 5.5.0) does not seem to matter either.

Task background

I'm trying to output a dataframe containing weather data (temperature and wind speed) to a txt file. The first few lines of the file are metadata about the station (name, number, height, etc.). After the meta data, I want to store the time-series weather data. I want to write all information (station metadata and the data block) for a particular station, close the file, and move onto outputting the file for the next station.

Problem Description:

pd.to_csv() is ignoring the header information in the dataframe and is instead treating the first line of data as the header.

The problem occurs when using the following steps to write to file:

  • Open the file (mode='w')
  • Write the three lines of header information
  • call to_csv() on the dataframe

Code Sample

import pandas as pd

#make sample datafame
temp = pd.Series([20.1,20.2,20.3])
wind = pd.Series([10,20,30])
timestamp = pd.date_range(start='1/1/2018',periods = 3,freq='3min')
df = pd.DataFrame({'timestamp':timestamp,'temp':temp,'wind':wind})

#TEST1: Produces incorrect output
filename = 'sample_output1.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file = open(filename,'r')
print(file.read())
file.close()

Gives the following output:

StationName:station1
version:3
height:10

00,20.1,10
1,2018-01-01 00:03:00,20.2,20
2,2018-01-01 00:06:00,20.3,30

Expected Output

example of the correct output:

StationName:station1
version:3
height:10

index,timestamp,temp,wind
0,2018-01-01 00:00:00,20.1,10
1,2018-01-01 00:03:00,20.2,20
2,2018-01-01 00:06:00,20.3,30

Side Note:

The output file is formatted correctly if I add an extraneous close/open just before calling to_csv() as follows:

  • Open the file (mode='w')
  • Write the three lines of header information
  • close file
  • open file (mode='a')
  • call to_csv() on the dataframe

The code below produces the expected output:

#TEST2: Produces correct output
filename = 'sample_output2.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))

#Note the extraneous close/open of the file
file.close()
file = open(filename,'a')

df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file.close()
file = open(filename,'r')
print(file.read())
file.close()

Dependency Information

As noted above, the problem shows up on Win7, Win10, and Mint when pandas 0.23.1 is installed but does not show up on Win10 or Mint when pandas 0.22.0 is installed.

To further test this, I did a fresh install of anaconda navigator (which ships with pandas 0.22.0) on a Mint VM and then ran conda update pandas. The updates performed are as follows (everything other than pandas were updated as part of the environment solution):

    anaconda:        5.1.0-py36_2          --> custom-py36hbbc8b67_0
    ca-certificates: 2017.08.26-h1d4fec5_0 --> 2018.03.07-0         
    certifi:         2018.1.18-py36_0      --> 2018.4.16-py36_0     
    openssl:         1.0.2n-hb7f436b_0     --> 1.0.2o-h20670df_0    
    pandas:          0.22.0-py36hf484d3e_0 --> 0.23.1-py36h637b7d7_0

Before conda update pandas, pandas worked as expected for both sample cases (with and without the extraneous file.close()/file=open().
After updating, pandas began ignoring the dataframe header and treating the first line of data as the header information.

Prior to update, pd.show_versions() returned this:

INSTALLED VERSIONS ------------------ commit: None python: 3.6.4.final.0 python-bits: 64 OS: Linux OS-release: 4.10.0-38-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

After the update, pd.show_versions() returned this:

INSTALLED VERSIONS ------------------ commit: None python: 3.6.4.final.0 python-bits: 64 OS: Linux OS-release: 4.10.0-38-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.23.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    IO DataIO issues that don't fit into a more specific label

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions