BUG: allow lex string comparisons #6158

Merged
merged 1 commit into from
Jan 29, 2014
34 changes: 26 additions & 8 deletions doc/source/enhancingperf.rst
@@ -464,19 +464,20 @@ evaluate an expression in the "context" of a ``DataFrame``.

 Any expression that is a valid :func:`~pandas.eval` expression is also a valid
 ``DataFrame.eval`` expression, with the added benefit that *you don't have to
-prefix the name of the* ``DataFrame`` *to the column you're interested in
+prefix the name of the* ``DataFrame`` *to the column(s) you're interested in
 evaluating*.

-In addition, you can perform in-line assignment of columns within an expression.
-This can allow for *formulaic evaluation*. Only a signle assignement is permitted.
-It can be a new column name or an existing column name. It must be a string-like.
+In addition, you can perform assignment of columns within an expression.
+This allows for *formulaic evaluation*. Only a single assignment is permitted.
+The assignment target can be a new column name or an existing column name, and
+it must be a valid Python identifier.

 .. ipython:: python

-   df = DataFrame(dict(a = range(5), b = range(5,10)))
-   df.eval('c=a+b')
-   df.eval('d=a+b+c')
-   df.eval('a=1')
+   df = DataFrame(dict(a=range(5), b=range(5, 10)))
+   df.eval('c = a + b')
+   df.eval('d = a + b + c')
+   df.eval('a = 1')
    df

 Local Variables
@@ -616,3 +617,20 @@ different engines.

 This plot was created using a ``DataFrame`` with 3 columns each containing
 floating point values generated using ``numpy.random.randn()``.
+
+Technical Minutia
+~~~~~~~~~~~~~~~~~
+- Expressions that would result in an object dtype (including simple
+  variable evaluation) have to be evaluated in Python space. The main reason
+  for this behavior is to maintain backwards compatibility with versions of
+  numpy < 1.7. In those versions of ``numpy`` a call to ``ndarray.astype(str)``
+  will truncate any strings that are more than 60 characters in length. Second,
+  we can't pass ``object`` arrays to ``numexpr`` thus string comparisons must
+  be evaluated in Python space.
+- The upshot is that this *only* applies to object-dtype'd expressions. So,
+  if you have an expression--for example--that's a string comparison
+  ``and``-ed together with another boolean expression that's from a numeric
+  comparison, the numeric comparison will be evaluated by ``numexpr``. In fact,
+  in general, :func:`~pandas.query`/:func:`~pandas.eval` will "pick out" the
+  subexpressions that are ``eval``-able by ``numexpr`` and those that must be
+  evaluated in Python space transparently to the user.
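
The "pick out the subexpressions" idea in the second bullet can be pictured with a small standard-library sketch. This is not how pandas implements the split: `classify_comparisons` is a hypothetical helper, and using "has a string literal operand" as a stand-in for "would produce an object dtype" is a simplifying assumption made only for illustration.

```python
import ast


def classify_comparisons(expr):
    """Return (source, engine) pairs for each comparison in ``expr``.

    Hypothetical illustration only: a comparison with a string literal
    operand is labelled 'python', every other comparison 'numexpr'.
    """
    tree = ast.parse(expr, mode='eval')
    labels = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Compare):
            operands = [node.left] + node.comparators
            has_string = any(
                isinstance(operand, ast.Constant)
                and isinstance(operand.value, str)
                for operand in operands
            )
            # Recover the comparison's text from the original expression.
            source = ast.get_source_segment(expr, node)
            labels.append((source, 'python' if has_string else 'numexpr'))
    return labels


labels = classify_comparisons('(X < "d") & (Y > 3)')
for source, engine in labels:
    print(source, '->', engine)
```

In this toy model the string comparison is routed to Python while the numeric comparison stays eligible for ``numexpr``, which mirrors the behavior the bullet describes for a mixed expression.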
2 changes: 2 additions & 0 deletions doc/source/release.rst
@@ -168,6 +168,8 @@ Bug Fixes
 - Bug in DataFrame construction with recarray and non-ns datetime dtype (:issue:`6140`)
 - Bug in ``.loc`` setitem indexing with a dataframe on rhs, multiple item setting, and
   a datetimelike (:issue:`6152`)
+- Fixed a stack overflow bug in ``query``/``eval`` during lexicographic
+  string comparisons (:issue:`6155`).

 pandas 0.13.0
 -------------
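
For readers unsure what "lexicographic string comparison" covers in the release note, here is a plain-Python illustration of the four operators involved. It makes no pandas calls; the sample values are made up and only the built-in string ordering is shown.

```python
import operator

# The four ordering operators, keyed by their symbol.
ops = {'<': operator.lt, '>': operator.gt,
       '<=': operator.le, '>=': operator.ge}
values = ['a', 'b', 'c', 'd', 'e']

# For each operator, keep the values that satisfy ``value <op> 'd'``.
selected = {symbol: [v for v in values if func(v, 'd')]
            for symbol, func in ops.items()}

# Strings compare by code point, so uppercase ASCII sorts before lowercase.
uppercase_first = 'Z' < 'a'
# Comparison runs character by character; a proper prefix sorts first.
prefix_first = 'ab' < 'abc'

print(selected)
print(uppercase_first, prefix_first)
```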
3 changes: 2 additions & 1 deletion pandas/computation/expr.py
@@ -508,7 +508,8 @@ def _possibly_eval(self, binop, eval_in_python):

     def _possibly_evaluate_binop(self, op, op_class, lhs, rhs,
                                  eval_in_python=('in', 'not in'),
-                                 maybe_eval_in_python=('==', '!=')):
+                                 maybe_eval_in_python=('==', '!=', '<', '>',
+                                                       '<=', '>=')):
         res = op(lhs, rhs)

         if self.engine != 'pytables':
19 changes: 19 additions & 0 deletions pandas/tests/test_frame.py
@@ -12841,6 +12841,25 @@ def test_query_with_nested_string(self):
         for parser, engine in product(PARSERS, ENGINES):
             yield self.check_query_with_nested_strings, parser, engine

+    def check_query_lex_compare_strings(self, parser, engine):
+        tm.skip_if_no_ne(engine=engine)
+        import operator as opr
+
+        a = Series(tm.choice(list('abcde'), 20))
+        b = Series(np.arange(a.size))
+        df = DataFrame({'X': a, 'Y': b})
+
+        ops = {'<': opr.lt, '>': opr.gt, '<=': opr.le, '>=': opr.ge}
+
+        for op, func in ops.items():
+            res = df.query('X %s "d"' % op, engine=engine, parser=parser)
+            expected = df[func(df.X, 'd')]
+            assert_frame_equal(res, expected)
+
+    def test_query_lex_compare_strings(self):
+        for parser, engine in product(PARSERS, ENGINES):
+            yield self.check_query_lex_compare_strings, parser, engine
+
 class TestDataFrameEvalNumExprPandas(tm.TestCase):

     @classmethod
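
The behavior the new test checks can be reproduced outside the test suite roughly as follows. This is a minimal sketch, assuming a recent pandas release and the default ``engine`` and ``parser``; the frame contents are made up, and the fixed column values replace the random choice used in the test so the result is deterministic.

```python
import operator

import pandas as pd

# Small deterministic frame: a string column and an integer column.
df = pd.DataFrame({'X': list('abcdeabcde'), 'Y': range(10)})

ops = {'<': operator.lt, '>': operator.gt,
       '<=': operator.le, '>=': operator.ge}

# For each operator, compare the query result with plain boolean indexing.
results = {}
for symbol, func in ops.items():
    queried = df.query('X %s "d"' % symbol)
    masked = df[func(df['X'], 'd')]
    results[symbol] = queried.equals(masked)

print(results)
```

Each entry in ``results`` records whether ``query`` and boolean indexing selected the same rows for that operator, which is the equivalence the test asserts with ``assert_frame_equal``.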