-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
Description
TL;DR version
to_csv() ignores the header row and instead treats the first line of data as the header when using a file object in write mode to:
- write a few lines of text to the file, and then
- use pd.to_csv() to append a dataframe to the file
This problem does appear on Win7, Win10, and a Mint VM when pandas 0.23.1 is installed.
This problem does not appear on Win10 or Mint VM when pandas 0.22.0 is installed
All testing I've done has been using either Spyder or Jupyter Notebook installed as part of Anaconda. The use of Spyder vs Jupyter does not seem to matter. The version of Spyder (3.2.6 vs. 3.2.8) and the version of Jupyter (5.4.0 vs. 5.5.0) does not seem to matter either.
Task background
I'm trying to output a dataframe containing weather data (temperature and wind speed) to a txt file. The first few lines of the file are metadata about the station (name, number, height, etc.). After the meta data, I want to store the time-series weather data. I want to write all information (station metadata and the data block) for a particular station, close the file, and move onto outputting the file for the next station.
Problem Description:
pd.to_csv() is ignoring the header information in the dataframe and is instead treating the first line of data as the header.
The problem occurs when using the following steps to write to file:
- Open the file (mode='w')
- Write the three lines of header information
- call to_csv() on the dataframe
Code Sample
import pandas as pd
#make sample datafame
temp = pd.Series([20.1,20.2,20.3])
wind = pd.Series([10,20,30])
timestamp = pd.date_range(start='1/1/2018',periods = 3,freq='3min')
df = pd.DataFrame({'timestamp':timestamp,'temp':temp,'wind':wind})
#TEST1: Produces incorrect output
filename = 'sample_output1.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file = open(filename,'r')
print(file.read())
file.close()
Gives the following output:
StationName:station1
version:3
height:10
00,20.1,10
1,2018-01-01 00:03:00,20.2,20
2,2018-01-01 00:06:00,20.3,30
Expected Output
example of the correct output:
StationName:station1
version:3
height:10
index,timestamp,temp,wind
0,2018-01-01 00:00:00,20.1,10
1,2018-01-01 00:03:00,20.2,20
2,2018-01-01 00:06:00,20.3,30
Side Note:
The output file is formatted correctly if I add an extraneous close/open just before calling to_csv() as follows:
- Open the file (mode='w')
- Write the three lines of header information
- close file
- open file (mode='a')
- call to_csv() on the dataframe
The code below produces the expected output:
#TEST2: Produces correct output
filename = 'sample_output2.txt'
file = open(filename,'w')
file.write('StationName:station1'+chr(10))
file.write('version:3'+chr(10))
file.write('height:10'+chr(10)+chr(10))
#Note the extraneous close/open of the file
file.close()
file = open(filename,'a')
df.to_csv(file,mode='a',header=True,index=True,index_label='index')
file.close()
file = open(filename,'r')
print(file.read())
file.close()
Dependency Information
As noted above, the problem shows up on Win7, Win10, and Mint when pandas 0.23.1 is installed but does not show up on Win10 or Mint when pandas 0.22.0 is installed.
To further test this, I did a fresh install of anaconda navigator (which ships with pandas 0.22.0) on a Mint VM and then ran conda update pandas. The updates performed are as follows (everything other than pandas were updated as part of the environment solution):
anaconda: 5.1.0-py36_2 --> custom-py36hbbc8b67_0
ca-certificates: 2017.08.26-h1d4fec5_0 --> 2018.03.07-0
certifi: 2018.1.18-py36_0 --> 2018.4.16-py36_0
openssl: 1.0.2n-hb7f436b_0 --> 1.0.2o-h20670df_0
pandas: 0.22.0-py36hf484d3e_0 --> 0.23.1-py36h637b7d7_0
Before conda update pandas, pandas worked as expected for both sample cases (with and without the extraneous file.close()/file=open().
After updating, pandas began ignoring the dataframe header and treating the first line of data as the header information.
Prior to update, pd.show_versions() returned this:
pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
After the update, pd.show_versions() returned this:
pandas: 0.23.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None