
Allow reading multiple files simultaneously from TensorBoard #2346


Closed

Conversation

bcoopers

  • Motivation for features / changes
    Enables setting multiple evaluation workers in TF Estimator. Each worker writes to its own event file, and all of them are now read by TensorBoard, allowing greater statistical significance and/or a higher rate of points written.

  • Technical description of changes
    Reads from all files, publishing events in order of step number; for identical step numbers, the event with the earlier wall time is published first (see the sketch after this list).

  • Screenshots of UI changes
    Please reach out to me for these; I don't think I can share them externally.

  • Detailed steps to verify changes work correctly (as executed by you)
    Ran training and observed the results in TensorBoard. Ran the unit test; it passes.

  • Alternate designs / implementations considered
    None
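
For concreteness, here is a minimal Python sketch of that merge order. It is my own illustration, not the code in this PR; `merge_event_streams` is a hypothetical name, and it assumes events expose `step` and `wall_time` attributes as on TensorFlow's `Event` proto.

```python
import heapq

def merge_event_streams(streams):
    """Merge per-file event streams into a single stream.

    Events are yielded in ascending step order; ties on step are broken
    by the earlier wall time. Assumes each input stream is individually
    sorted, which holds when each file is appended by a single writer.
    """
    return heapq.merge(*streams, key=lambda e: (e.step, e.wall_time))
```

`heapq.merge` is lazy, so no more than one pending event per file is buffered at a time.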

bcoopers added 4 commits June 13, 2019 18:29
Allows using multiple eval workers at once.

They write events to TensorBoard independently.

Make TensorBoard support reading from multiple files simultaneously. Note that under the assumptions of the old class, the behavior remains the same, so existing jobs with one eval worker will not be affected.
@wchargin
Contributor

Hi @bcoopers—thank you for the PR! We’re definitely interested in this
kind of functionality; see #1063 for the canonical issue, and #1867 for
a shelved (but not abandoned) prior attempt.

I believe that in the past we’ve run into performance issues when
continuously reading from all files under the logdir, because the number
of such files is unbounded and access over a network file system can be
expensive. The PR description of #1867 has some discussion about this.

The best person to review this is @nfelt; he’s OOO right now, but will
be back on Monday. I’ll see if I can take a preliminary look tomorrow.

@wchargin
Contributor

For Googlers, cf. http://cl/253027277.

@bcoopers
Author

If you want, I can add a max-number-of-files parameter (say, a thousand) and start expiring the least-recently-updated files once we reach that limit.
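
A hedged sketch of what that could look like, tracking last-update order with an `OrderedDict`; the class name and the 1000-file default are illustrative assumptions, not part of this PR:

```python
import collections
import time

class ActiveFileSet:
    """Caps the number of event files being polled at `max_files`.

    When the cap is exceeded, the least-recently-updated file is
    expired and dropped from polling.
    """

    def __init__(self, max_files=1000):
        self._max_files = max_files
        self._last_updated = collections.OrderedDict()  # path -> timestamp

    def mark_updated(self, path):
        """Record that `path` just got new data; return newly expired paths."""
        self._last_updated[path] = time.time()
        self._last_updated.move_to_end(path)  # most recently updated at the end
        expired = []
        while len(self._last_updated) > self._max_files:
            stale_path, _ = self._last_updated.popitem(last=False)  # oldest first
            expired.append(stale_path)
        return expired  # The caller stops polling these files.

    def active_paths(self):
        return list(self._last_updated)
```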

@nfelt nfelt self-requested a review June 18, 2019 01:33
@bcoopers
Author

Friendly ping

@nfelt
Contributor

nfelt commented Jun 29, 2019

Hi, and thanks for your interest in contributing; apologies for taking so long to get to this PR.

I have a few reservations about this approach, relative to the approach in #1867.

  1. We do need a way to limit the total number of files we're polling to avoid resource blowup. My preference is to age out files based on last write times (like #1867; sketched after this list) rather than simply a fixed file count, since I think this better captures what we're actually trying to achieve and avoids polling very old files unnecessarily, even if we have "capacity" to poll them.

  2. Even with a mitigation for that case, turning this behavior on by default could cause significant regressions for large logdirs with many small event files. This is why in #1867 it's a separate implementation that's off by default behind a flag.

  3. Reconciling parallel streams of data across files is more complicated than just merging in sorted order by step. For example, sometimes people write two successive separate "runs" to the same directory by accident; I believe this code would merge them into a single logical run sorted by step, which could look very misleading. In that case it's arguably better to keep them in the order they would appear in today's TensorBoard (data from the first file, then the second), because at least this shows up in TensorBoard as a "zigzag" plot doubling back on itself, which indicates something went wrong. In general, I think the "reconciliation across data streams" problem is better solved at a higher level in the code, where we have a little more context.
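
To make point 1 concrete, here is a rough sketch of age-based expiry; the one-day threshold and the function name are illustrative assumptions on my part, not what #1867 actually does:

```python
import os
import time

# Illustrative threshold: ignore files not written to within the last day.
INACTIVE_SECS = 24 * 60 * 60

def active_event_files(paths, now=None):
    """Return only the event files recently written to.

    Files age out by last write time rather than by a fixed count, so
    very old files are skipped even when there is spare polling capacity.
    """
    now = time.time() if now is None else now
    active = []
    for path in paths:
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            continue  # File vanished between listing and stat; skip it.
        if now - mtime <= INACTIVE_SECS:
            active.append(path)
    return active
```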

I should be able to finish off #1867 in mid-July. I realize it's been a long time coming, and I was hoping I could get to it this week, but unfortunately I didn't and I'll be OOO next week. If you'll bear with us until then, I think that will be the best path to fixing the underlying issue #1063.

@nfelt
Contributor

nfelt commented Jul 30, 2019

I've submitted #1867; please take a look and feel free to provide feedback. I believe it should satisfy the use case for this PR, so I will go ahead and close this PR, but let me know if you'd like it re-opened.

Note that our sync into Google is currently blocked on some unrelated issues, but once those are resolved this should land internally soon.

@nfelt nfelt closed this Jul 30, 2019