[mypyc] Switch to table-driven imports for smaller IR #14917
Conversation
Table-driven imports can initialize some rather large RArrays so let's make sure they stay readable. Thankfully, all it takes is moving and reusing c_array_initializer().
It won't be read anywhere as the internal statics are only used to avoid runaway recursion with import cycles in our module init functions.
Add CPyImport_ImportFromMany() which imports a module and a tuple of names, placing them in the globals dict in one go. Previously, each name would be imported and placed in globals manually in IR, leading to some pretty verbose code.
The other option, collecting all from imports and performing them all at once in the helper, would remove even more ops; however, it has some major downsides:
- It wouldn't be able to be done in IRBuild directly, instead being handled in the prebuild visitor and codegen... which all sounds really involved.
- It would cause from imports to be performed eagerly, potentially causing circular imports (especially in functions whose imports are probably there to avoid a circular import!).
The latter is the nail in the coffin for this idea.
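As a rough mental model, here is a Python-level sketch of what such a helper does (illustrative only; the real helper is written in C, and its exact signature, "as" name handling, and error paths differ):

# Python-level model of CPyImport_ImportFromMany() (illustrative sketch, not the
# actual C implementation): import the module once, then bind every requested name
# into the target globals dict in a single call.
def import_from_many(module_id: str, names: tuple[str, ...],
                     globals_dict: dict) -> object:
    # Same shape as "from module_id import name1, name2, ..." at module level.
    module = __import__(module_id, globals_dict, None, list(names), 0)
    for name in names:
        # "as" targets and the submodule fallback a real from-import performs
        # are omitted here for brevity.
        globals_dict[name] = getattr(module, name)
    return module

For example, import_from_many("typing", ("Any", "Optional"), globals()) would mirror "from typing import Any, Optional".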
Change how imports (not from imports!) are processed so they can be table-driven (tuple-driven, really) and compact. Here's how it works:
Import nodes are divided into groups (in the prebuild visitor). Each group consists of consecutive Import nodes:

    import mod   <| group 1
    import mod2   |

    def foo() -> None:
        import mod3  <- group 2

    import mod4      <- group 3

Every time we encounter the first import of a group, build IR to call CPyImport_ImportMany(), which will perform all of the group's imports in one go. Previously, each module would be imported and placed in globals manually in IR, leading to some pretty verbose code. The other option, collecting all imports and performing them all at once in the helper, would remove even more ops; however, it's problematic for the same reasons as in the previous commit (spoiler: it's not safe).
Implementation notes:
- I had to add support for loading the address of a static directly, so I shoehorned in LoadStatic support for LoadAddress.
- Unfortunately, by replacing multiple import nodes with a single function call at the IR level, if any import within a group fails, the traceback line number is static and will probably be wrong (pointing to the first import node in my original impl.). To fix this, I had to make CPyImport_ImportMany() add the traceback entry itself on failure (instead of letting codegen handle it automatically). This is admittedly ugly.
Overall, this doesn't speed up initialization. The only real speed impact is that back-to-back imports in a tight loop are 10-20% slower. I believe that's acceptable given the code reduction.
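For intuition, a similarly hedged Python-level sketch of what one grouped call amounts to (the function name, tuple layout, and "as" handling here are assumptions; the real CPyImport_ImportMany() is C and also updates mypyc's per-module statics):

# Python-level model of handling one import group (illustrative sketch only).
from __future__ import annotations

import importlib

def import_many(group: tuple[tuple[str, str | None], ...],
                globals_dict: dict) -> None:
    # Each entry is (fully qualified module name, "as" name or None).
    for fullname, as_name in group:
        module = importlib.import_module(fullname)
        if as_name is not None:
            # "import a.b as m" binds the submodule itself.
            globals_dict[as_name] = module
        else:
            # Plain "import a.b" binds the top-level package "a".
            top = fullname.split(".")[0]
            globals_dict[top] = importlib.import_module(top)

# e.g. group 1 in the diagram above would correspond to:
# import_many((("mod", None), ("mod2", None)), globals())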
Just like with long arrays, table-driven imports can lead to some massive literals being loaded... let's keep the C readable.
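For context, this is the kind of wrapping being referred to. The sketch below is a hypothetical helper, not mypyc's actual c_array_initializer(), whose exact behavior may differ:

# Wrap a long C array initializer across lines so the generated C stays readable
# (illustrative sketch under assumed formatting rules).
def wrap_array_initializer(components: list[str], indent: str = "    ",
                           max_width: int = 72) -> str:
    lines: list[str] = []
    current: list[str] = []
    cur_len = len(indent)
    for c in components:
        # Flush the current line when adding another component would overflow it.
        if current and cur_len + len(c) + 2 > max_width:
            lines.append(indent + ", ".join(current) + ",")
            current = []
            cur_len = len(indent)
        current.append(c)
        cur_len += len(c) + 2
    if current:
        lines.append(indent + ", ".join(current))
    return "{\n" + "\n".join(lines) + "\n}"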
For reference, this is the file I used to measure looped import performance:
import time, statistics
import dataclasses, enum, typing

times = []
for _ in range(10):
    t0 = time.perf_counter()
    for _ in range(10000):
        import dataclasses
        # import enum
        # import typing
    elapsed = (time.perf_counter() - t0) * 1000
    times.append(elapsed)
    print(f"{elapsed:.2f}")

mean = statistics.mean(times)
stdev = statistics.stdev(times)
print(f"mean: {mean:.1f} (stdev: {stdev:.3f}) | min: {min(times):.1f} | max: {max(times):.1f}")
On master (best):
1.04
1.03
1.01
1.01
1.02
1.02
1.10
1.02
1.03
1.01
mean: 1.0 (stdev: 0.026) | min: 1.0 | max: 1.1
This PR (most common result):
1.18
1.19
1.20
1.19
1.19
1.22
1.22
1.21
1.21
1.23
mean: 1.2 (stdev: 0.018) | min: 1.2 | max: 1.2
If the slowdown is unacceptable, it's possible to cut it in half by unrolling the call to single_import()
in IR (only within functions where performance could matter).
I haven't measured whether from imports are any slower.
Here's a GitHub Gist with the old and new C code for the file I've been using to test my implementation. The new code is roughly 3000 LOC smaller! It's not a real example so the usual caveats apply, but long from imports and "import groups" aren't uncommon in the real world. (I stole the long from import from some mypyc file IIRC.)
I haven't done a full review yet, but based on a quick experiment the impact to self-compiling is pretty huge -- ~8% smaller generated code, and ~7% faster compilation when using -O0 (running an interpreted mypyc). And once a few additional bottlenecks in compilation speed are fixed, the relative impact will be bigger.
Overall looks good, thanks! Left some comments, the main questions are about the possibility of reverting to the old behavior within functions. If the performance impact would be negligible, this is not important.
module = builder.gen_import_from(id, globals, imported_names, node.line)

# Copy everything into our module's dict.
builder.imports[id] = None
Similar to above, what about only using the table-based option outside functions? Imports within functions are already something of a performance bottleneck, so it would be nice if we could keep them as fast as they are currently. (Eventually I'd like us to make nested imports faster, but that might take a while.)
I can't measure any slowdown running `from typing import Any` in a tight loop with the table-driven code.[1] In fact, the table-driven code can be faster if there are several names. (I tried six.)
**one name**
[new] mean: 11.7 (stdev: 0.273) | min 11.4 | max: 12.2
[new] mean: 12.0 (stdev: 0.530) | min 11.3 | max: 13.0
[old] mean: 12.8 (stdev: 0.492) | min 11.8 | max: 13.5
[old] mean: 12.9 (stdev: 0.606) | min 12.0 | max: 13.8
**six names**
[new] mean: 15.9 (stdev: 0.543) | min 14.9 | max: 16.7
[new] mean: 15.0 (stdev: 0.365) | min 14.6 | max: 15.6
[old] mean: 18.8 (stdev: 0.358) | min 18.3 | max: 19.3
[old] mean: 18.4 (stdev: 0.451) | min 17.6 | max: 19.1
Since the table-driven code doesn't seem to introduce any slowdown, I won't bring back the old from import
code.
Footnotes
[1] My data suggests it's actually faster, but that seems suspect. I won't call it any faster as the speedup is within the margin of error.
The prebuild visitor would mark these two imports as part of the same group, which isn't right:

def f() -> None:
    import x
import y
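For what it's worth, a minimal sketch of scope-safe grouping (hypothetical code, not mypyc's actual prebuild visitor): computing runs of consecutive Import statements within a single statement list means imports from different scopes can never share a group.

# Illustrative sketch only: group runs of consecutive Import statements within
# one statement list, so nested and module-level imports never end up together.
from __future__ import annotations

from mypy.nodes import Import, Statement

def import_groups(body: list[Statement]) -> list[list[Import]]:
    groups: list[list[Import]] = []
    run: list[Import] = []
    for stmt in body:
        if isinstance(stmt, Import):
            run.append(stmt)
        elif run:
            # Any other statement ends the current run of imports.
            groups.append(run)
            run = []
    if run:
        groups.append(run)
    return groups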
Also, when you say "compile time" or "compile speed", do you mean the time mypyc + gcc/clang/msvc takes OR just the time mypyc takes? I'd guess the latter, but it has never been clear to me...
I mean the overall time, i.e. mypyc + C compiler. For larger programs this is usually dominated by C compilation speed when using optimizations.
Thanks for the updates! This looks great now.
And thank you for the review! :)
Add CPyImport_ImportFromMany() which imports a module and a tuple of
names, placing them in the globals dict in one go.
Previously, each name would be imported and placed in globals manually
in IR, leading to some pretty verbose code.
The other option, collecting all from imports and performing them all
at once in the helper, would remove even more ops; however, it has some
major downsides:
- It wouldn't be able to be done in IRBuild directly, instead being
  handled in the prebuild visitor and codegen... which all sounds
  really involved.
- It would cause from imports to be performed eagerly, potentially
  causing circular imports (especially in functions whose imports are
  probably there to avoid a circular import!).
The latter is the nail in the coffin for this idea.
Change how imports (not from imports!) are processed so they can be
table-driven (tuple-driven, really) and compact. Here's how it works:
Import nodes are divided into groups (in the prebuild visitor). Each group
consists of consecutive Import nodes:

    import mod   <| group 1
    import mod2   |

    def foo() -> None:
        import mod3  <- group 2 (*)

    import mod4      <- group 3

Every time we encounter the first import of a group, build IR to call
CPyImport_ImportMany(), which will perform all of the group's imports in
one go.
(*) Imports in functions or classes are still transformed into the original,
verbose IR, as speed is more important than code size there.
Previously, each module would be imported and placed in globals manually
in IR, leading to some pretty verbose code.
The other option, collecting all imports and performing them all at once in
the helper, would remove even more ops; however, it's problematic for the
same reasons as in the previous commit (spoiler: it's not safe).
Implementation notes:
- I had to add support for loading the address of a static directly,
  so I shoehorned in LoadStatic support for LoadAddress (a rough sketch
  of the resulting IR follows these notes).
- Unfortunately, by replacing multiple import nodes with a single
  function call at the IR level, if any import within a group fails,
  the traceback line number is static and will probably be wrong
  (pointing to the first import node in my original impl.).
  To fix this, I had to make CPyImport_ImportMany() add the traceback
  entry itself on failure (instead of letting codegen handle it
  automatically). This is admittedly ugly.
Overall, this doesn't speed up initialization. The only real speed
impact is that back-to-back imports in a tight loop seem to be 10-20%
slower. I believe that's acceptable given the code size reduction.
Other changes:
- Don't declare the internal static for non-compiled modules. It won't be
  read anywhere, as the internal statics are only used to avoid runaway
  recursion with import cycles in our module init functions.
- Wrap long RArray initializers and long annotations in codegen.
  Table-driven imports can load some rather large RArrays and tuple
  literals, so this was needed to keep the generated C readable.
- Add an LLBuilder helper for setting up an RArray.

Resolves mypyc/mypyc#591.