[IMP] snippets.convert_html_columns: a batch processing story #94


Closed

Conversation

KangOl
Contributor

@KangOl KangOl commented Jun 6, 2024

TLDR: RTFM

Once upon a time, on a countryside farm in Belgium...

At first, the upgrade of databases was straightforward. But as time passed, the databases grew, and some CPU-intensive computations took so much time that a solution had to be found. Fortunately, the Python standard library has the perfect module for this task: `concurrent.futures`.
Then Python 3.10 appeared, and `ProcessPoolExecutor` sometimes started to hang for no apparent reason. Soon, our hero found out he wasn't the only one to suffer from this issue[^1]. Unfortunately, the proposed solution looked like overkill. Still, it revealed that the issue had already been known[^2] for a few years. Although an official patch wasn't ready to be committed, the discussion about its legitimacy[^3] led our hero to a nicer solution.

By default, `ProcessPoolExecutor.map` submits elements one by one to the pool. This is pretty inefficient when there are many elements to process. It can be changed by passing a large value for the *chunksize* argument.

Who would have thought that a bigger chunk size would solve a performance issue?
As always, the answer was in the documentation[^4].
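The fix described above can be sketched as follows. This is a minimal, self-contained example, not the actual patch: `convert` and the item count are hypothetical stand-ins for the real HTML-column conversion work.

```python
from concurrent.futures import ProcessPoolExecutor

def convert(item):
    # Stand-in for a CPU-intensive computation (hypothetical).
    return item * item

def main():
    items = range(1_000)
    with ProcessPoolExecutor() as executor:
        # chunksize=1 (the default) sends one item per inter-process
        # round trip; a larger chunk amortizes that overhead by
        # submitting items to the worker processes in batches.
        results = list(executor.map(convert, items, chunksize=100))
    return results

if __name__ == "__main__":
    main()
```

Per the documentation, *chunksize* only has an effect for `ProcessPoolExecutor` (it is ignored by `ThreadPoolExecutor`), and with very long iterables a large value can significantly improve throughput compared to the default of 1.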

Footnotes

[^1]: https://stackoverflow.com/questions/74633896/processpoolexecutor-using-map-hang-on-large-load
[^2]: https://github.com/python/cpython/issues/74028
[^3]: https://github.com/python/cpython/pull/114975#pullrequestreview-1867070041
[^4]: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map

@KangOl KangOl requested review from nseinlet and aj-fuentes June 6, 2024 14:31
@robodoo
Contributor

robodoo commented Jun 6, 2024

@nseinlet
Contributor

nseinlet commented Jun 6, 2024

robodoo r+

@robodoo robodoo closed this in 5f83f3a Jun 6, 2024
@robodoo robodoo added the 17.4 label Jun 6, 2024
@KangOl KangOl deleted the master-processpool-chunksize-chs branch June 6, 2024 18:32