Status: Closed
Labels: IO CSV (read_csv, to_csv), Performance (Memory or execution speed)
Description
Hi,
I noticed that datetime parsing for large SQL/CSV tables is really slow. Would it be acceptable to use a technique where repeated parses are cached? For example, instead of:
import datetime

# FMT is a strptime format string defined elsewhere, e.g. "%Y-%m-%d"
def parse_date(date_str):
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    # calls strptime once per row, even for repeated values
    return [parse_date(date_str) for date_str in str_col]
use:

def parse_date(date_str):
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    # parse each distinct string only once; duplicates hit the cache
    cache = {}
    for date_str in str_col:
        if date_str not in cache:
            cache[date_str] = parse_date(date_str)
    return [cache[date_str] for date_str in str_col]
The reason this works is that string hashing, comparison, and dictionary insertion are much, much faster than strptime. For tables where the same dates appear many times, this can yield orders-of-magnitude speedups.
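For concreteness, here is a minimal self-contained benchmark sketch of the two approaches (not from the original report; the format string, the 100,000-row size, and the 100 distinct dates are illustrative assumptions):

import datetime
import timeit

FMT = "%Y-%m-%d"  # assumed format for this sketch

def parse_date(date_str):
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    return [parse_date(s) for s in str_col]

def parse_date_col_cached(str_col):
    cache = {}
    for s in str_col:
        if s not in cache:
            cache[s] = parse_date(s)
    return [cache[s] for s in str_col]

# 100,000 rows drawn from only 100 distinct dates, so values repeat heavily
base = datetime.date(2012, 1, 1)
dates = [(base + datetime.timedelta(days=i % 100)).strftime(FMT)
         for i in range(100_000)]

print(timeit.timeit(lambda: parse_date_col(dates), number=1))         # strptime on every row
print(timeit.timeit(lambda: parse_date_col_cached(dates), number=1))  # strptime once per distinct date

A cached parser like this can also be handed to pandas directly, since read_csv accepts a date_parser callable alongside parse_dates; the file name and column name here are placeholders:

import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["date"], date_parser=parse_date_col_cached)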
Thanks
Charles