Skip to content

[SYCL] Fix kernel program cache for multiple devices and refactor some unit tests #5017

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Jan 11, 2022

Conversation

maximdimakov
Copy link
Contributor

@maximdimakov maximdimakov commented Nov 23, 2021

When multiple devices are used in the same context, the problem of piProgram double free is arised.
In program_manager::build() created program is shared with multiple devices and is cached in KernelProgramCache as many times as there are devices. When cache is destructed, the same program will be released for every device, but piProgramRetain was called only for one device, which leads to double free. Adding piProgramRetain for each device solves this problem.
Deleting kernel from cache after its release prevents double free in the case a kernel was created for a program that shared with multiple devices.
Unittest with emulating multiple gpu is added.
Signed-off-by: mdimakov [email protected]

@maximdimakov maximdimakov requested a review from a team as a code owner November 23, 2021 13:24
s-kanaev
s-kanaev previously approved these changes Nov 26, 2021
Copy link
Contributor

@s-kanaev s-kanaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch looks good.

@maximdimakov , could you, please, add regression test in unit-tests or intel/llvm-test-suite?

cperkinsintel
cperkinsintel previously approved these changes Dec 2, 2021
@bader
Copy link
Contributor

bader commented Dec 3, 2021

/verify with intel/llvm-test-suite#594

bader
bader previously requested changes Dec 4, 2021
Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch, doesn't pass the test.

@maximdimakov
Copy link
Contributor Author

There is problem in the test that it can't create multiple devices. I am investigating to it

@s-kanaev
Copy link
Contributor

s-kanaev commented Dec 6, 2021

There is problem in the test that it can't create multiple devices. I am investigating to it

@maximdimakov , I suggest having a unit-test for this change as it doesn't require specific H/W or its features.

@maximdimakov maximdimakov dismissed stale reviews from cperkinsintel and s-kanaev via deaa97c December 9, 2021 13:15
@maximdimakov
Copy link
Contributor Author

@bader unit test passes checks

@bader
Copy link
Contributor

bader commented Dec 15, 2021

@bader unit test passes checks

Good, but I'm expected you to fix llvm-test-suite tests you added for this patch.

@maximdimakov
Copy link
Contributor Author

@bader I closed llvm-test-suite PR

@bader
Copy link
Contributor

bader commented Dec 15, 2021

@bader I closed llvm-test-suite PR

Why? I don't think merging untested patch is a good idea.

@maximdimakov
Copy link
Contributor Author

@bader I replaced test in test suite by unit test. I can emulate multiple devices with CreateMultipleRootDevices environment variable locally on machine, but Jenkins precommit machine can't. I think it is better to write unittest that doesn't depend on machine on which it runs.

@bader bader dismissed their stale review December 15, 2021 09:36

Withdraw my request for changes and let Chris and Sergei to validate how well the patch is covered by the unittest.


class MultTestKernel {
public:
void operator()(cl::sycl::item<1>){};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a const operator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this operator

Comment on lines 87 to 88
devices[0] = reinterpret_cast<pi_device>(1111);
devices[1] = reinterpret_cast<pi_device>(2222);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only be performed if devices is not nil.

Copy link
Contributor

@cperkinsintel cperkinsintel Dec 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And not just not nullptr. If I'm not mistaken, the maximum possible device entries is bound by num_entries so the subscripts should check to make sure they are less than num_entries. Though, in the case of mocks and unit tests, it's less clear to me where this value for num_entries comes from and if it is a problem to blindly set devices[1].

Would this test fail if run with the SYCL_DEVICE_FILTER set to a single device? (And, do we even care?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added check for nullptr and for num_entries.
piDevicesGet first time is called with num_entries=0 and not null num_devices. Ather the call num_entries takes value of num_devices. With second call num_devices parameter passes with nullptr, so we can create num_entries devices. I rewrite num_devices to 2, so when redefinedDevicesGet is called a second time RT thinks that we have two devices.
This test emulates two devices, so we don't care about real count of devices.

EXPECT_EQ(KernelCache.size(), (size_t)2) << "Expect 2 kernels in cache";
}
// Cache is cleared here, check kernel release
EXPECT_EQ(KernelReleaseCounter, 3) << "Expect 3 piKernelRelease calls";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add comment describing how number 3 was retrieved. Why is it expected and not 2 or four?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why not test to see if KernelReleaseCounter is equal to the RetainCounter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added clarifying comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why not test to see if KernelReleaseCounter is equal to the RetainCounter?

AFAIR, there's some mem-leak test which checks for retain-release parity.
Although, I'm not sure it checks for kernels.

cperkinsintel
cperkinsintel previously approved these changes Dec 17, 2021
Copy link
Contributor

@cperkinsintel cperkinsintel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

s-kanaev
s-kanaev previously approved these changes Dec 20, 2021
Copy link
Contributor

@s-kanaev s-kanaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bader
Copy link
Contributor

bader commented Dec 23, 2021

@cperkinsintel, ping

@maximdimakov maximdimakov changed the title [SYCL] Fix kernel program cache for multiple devices [SYCL] Fix kernel program cache for multiple devices and refactor some unit tests Jan 10, 2022
Copy link
Contributor

@cperkinsintel cperkinsintel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@s-kanaev s-kanaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bader bader merged commit 64c2d35 into intel:sycl Jan 11, 2022
@maximdimakov maximdimakov deleted the fix_CP_cache branch March 24, 2022 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants