receiver/azuremonitor - Added maximumResourcesPerBatch config #40752

Open · wants to merge 6 commits into main
Conversation

suvaidkhan

Description

Added a new config option `maximum_resources_per_batch` with a default and maximum value of 50.

Link to tracking issue

Fixes #40112

Testing

Added unit tests for config validation.

Documentation

Added details about the config in receiver/azuremonitorreceiver/README.md
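
For context, a minimal sketch of how the option might appear in a collector config. The field names follow this PR and the receiver's existing `use_batch_api` option; other required fields such as credentials are elided:

```yaml
receivers:
  azuremonitor:
    use_batch_api: true
    # New in this PR; defaults to 50, the documented maximum for the batch API.
    maximum_resources_per_batch: 50
```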

@suvaidkhan suvaidkhan requested a review from a team as a code owner June 16, 2025 22:18
@suvaidkhan suvaidkhan requested a review from evan-bradley June 16, 2025 22:18

linux-foundation-easycla bot commented Jun 16, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@celian-garcia left a comment (Member)

I'm curious: would you be a user of that config field?
I mentioned in the issue the possibility of configuring this per resource type. If that would be interesting for you, maybe you could implement it that way instead of a global field, WDYT?

Also @andrewegel @ahurtaud, I would like your opinion on this, since you are at the origin of that idea. #40078 (comment)

@@ -28,6 +28,7 @@ var (
errMissingClientSecret = errors.New(`"ClientSecret" is not specified in config`)
errMissingFedTokenFile = errors.New(`"FederatedTokenFile" is not specified in config`)
errInvalidCloud = errors.New(`"Cloud" is invalid`)
errInvalidMaxResPerBatch = errors.New(`"MaximumResourcesPerBatch" is invalid`)
Member:
Suggested change
errInvalidMaxResPerBatch = errors.New(`"MaximumResourcesPerBatch" is invalid`)
errInvalidMaxResPerBatch = errors.New(`"MaximumResourcesPerBatch" is invalid. It should be between 1 and 50`)

@ahurtaud
Copy link

ahurtaud commented Jun 17, 2025

I'm curious, would you be a user of that config field? Because I mentioned in the issue the possibility to configure that by resource type. If this is something that could be interesting for you maybe you could implement it instead of a global field WDYT?

Also @andrewegel @ahurtaud I would like your opinion on that, since you are in the origins of that idea. #40078 (comment)

I think it is fine, even more that we dont have much configuration per "resource type".
I would advise users of this feature, if it is require to fine-tune, to create multiple receiver instances for the specifics resources.

something like:

receivers:
  azuremonitor/default:
    use_batch_api: true
    maximum_resources_per_batch: 50
    services:
    - Microsoft.Compute/virtualMachines
    - blahblah
  azuremonitor/throttled:
    use_batch_api: true
    maximum_resources_per_batch: 10
    services:
    - Microsoft.Network/loadBalancers   

@@ -329,5 +333,9 @@ func (c Config) Validate() (err error) {
err = multierr.Append(err, errInvalidCloud)
}

if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > defaultMaximumResourcesPerBatch || c.MaximumResourcesPerBatch < 0) {


Suggested change
if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > defaultMaximumResourcesPerBatch || c.MaximumResourcesPerBatch < 0) {
if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > defaultMaximumResourcesPerBatch || c.MaximumResourcesPerBatch < 1) {


Also, I am not sure we want to reference defaultMaximumResourcesPerBatch for the maximum check here; it only works as long as we don't change the default. Since the limit of 50 is clearly stated in the Microsoft documentation, should we hardcode 50 here?

Member:

You mean enforce it here, and not have the condition in the code later? I think that could be better, yes.

@ahurtaud commented Jun 17, 2025:
I meant:

Suggested change
if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > defaultMaximumResourcesPerBatch || c.MaximumResourcesPerBatch < 0) {
if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > 50 || c.MaximumResourcesPerBatch < 1) {

Member:

I'd prefer to have it named; otherwise it's a magic number.

Member:

Oh yes, I see it now: the default is actually to not set the field. That might change my mind about whether we should even error on this. Let's let Azure decide, so an error is surfaced only if Azure considers it one.

Author:

@celian-garcia Do you want me to remove the config validation check altogether or just change it to hardcoded values?

Member:

Sorry @suvaidkhan for that change of mind 🙏🏻, but yes, please remove the config validation altogether. That will simplify the code and let Azure fail if needed.

Could you also add that link in the README.md for reference? https://learn.microsoft.com/en-us/azure/azure-monitor/metrics/migrate-to-batch-api?tabs=individual-response . This is where the 50 is mentioned.
It's not written in a straightforward way in the REST API spec 😞 https://learn.microsoft.com/en-us/rest/api/monitor/metrics-batch/batch?view=rest-monitor-2023-10-01&tabs=HTTP

Author:

@celian-garcia Sure, that's not a problem! It does make more sense to have no validation, now that I think about it.

Author:

@celian-garcia changes done, ready for review!


celian-garcia commented Jun 17, 2025

> I'm curious, would you be a user of that config field? Because I mentioned in the issue the possibility to configure that by resource type. If this is something that could be interesting for you maybe you could implement it instead of a global field WDYT?
> Also @andrewegel @ahurtaud I would like your opinion on that, since you are in the origins of that idea. #40078 (comment)

> I think it is fine, even more that we dont have much configuration per "resource type". I would advise users of this feature, if it is require to fine-tune, to create multiple receiver instances for the specifics resources.
>
> something like:
>
> receivers:
>   azuremonitor/default:
>     use_batch_api: true
>     maximum_resources_per_batch: 50
>     services:
>     - Microsoft.Compute/virtualMachines
>     - blahblah
>   azuremonitor/throttled:
>     use_batch_api: true
>     maximum_resources_per_batch: 10
>     services:
>     - Microsoft.Network/loadBalancers

Having config per resource type is actually something that we already do:

With split dimensions:

receivers:
  azuremonitor:
    dimensions:
      enabled: true
      overrides:
        "Microsoft.Network/azureFirewalls":
          # Real example of an Azure limitation here:
          # Dimensions exposed are Reason, Status, Protocol,
          # but when selecting Protocol in the filters, it returns nothing.
          # Note here that the metric display name is ``Network rules hit count`` but its programmatic value is ``NetworkRuleHit``
          # Ref: https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-network-azurefirewalls-metrics
          "NetworkRuleHit": [Reason, Status]

With metrics aggregation specifications:

receivers:
  azuremonitor:
    resource_groups:
      - ${resource_groups}
    services:
      - Microsoft.EventHub/namespaces
      - Microsoft.AAD/DomainServices # scraper will fetch all metrics from this namespace since there are no limits under the "metrics" option
    metrics:
      "microsoft.eventhub/namespaces": # scraper will fetch only the metrics listed below:
        IncomingMessages: [total]     # metric IncomingMessages with aggregation "Total"
        NamespaceCpuUsage: [*]        # metric NamespaceCpuUsage with all known aggregations

So, to align in that direction and for consistency for users, it might be better to have it per resource type here as well. I don't know whether that matters, or how much this feature will be used. We would not use it ourselves, which is why I was asking people who would ^^.

But yeah, knowing that there is an easy workaround and that the feature will probably be used in limited conditions, let's not overthink it and do it like you said @ahurtaud.


andrewegel commented Jun 18, 2025

I've updated #40078 (comment). TL;DR: I don't think a maximum_resources_per_batch config option is the way to go. Instead, try to catch the QueryThrottledException, and then recursively split the resource list in half into two separate requests until the problematic resource(s) are found. While that's not a fix for problematic resources (in our case, it was a single LB with 250+ health probes on it, which increased the cardinality of the DipAvailability metric beyond what Azure's APIs will allow to be returned), it does prevent one resource from blocking the scraping of other resources, which IMO is a good thing in the telemetry world. I wound up writing my own full implementation using inspiration from the receiver; my code is as follows:

func (r *metricReceiver) tryScrapeLocationResourceType(
	ctx context.Context,
	client *azmetrics.Client,
	resourceType string,
	metricNames []string,
	opts azmetrics.QueryResourcesOptions,
	resourceSet []string) ([]azmetrics.QueryResourcesResponse, error) {

	responses := make([]azmetrics.QueryResourcesResponse, 0)
	var errs error
	resourceSetQueue := list.New()
	resourceSetQueue.PushBack(resourceSet)
	for resourceSetQueue.Len() > 0 {
		currentSet := resourceSetQueue.Remove(resourceSetQueue.Front()).([]string)
		response, err := client.QueryResources(
			ctx,
			r.cfg.SubscriptionID,
			resourceType,
			metricNames,
			azmetrics.ResourceIDList{ResourceIDs: currentSet},
			&opts)
		if err != nil {
			var respErr *azcore.ResponseError
			if errors.As(err, &respErr) {
				// Its unclear if this is a QueryTooExpensive error, but we assume it is
				// because the error code is empty and the status code is 529.
				if respErr.StatusCode == 529 && respErr.ErrorCode == "" {
					// QueryTooExpensive, we need to retry with a smaller set of resources
					r.logger.Warn("Received 529 QueryTooExpensive based on StatusCode and ErrorCode values, breaking up currentSet by half",
						zap.String("resourceType", resourceType),
						zap.Strings("metricNames", metricNames),
						zap.String("aggregationString", *opts.Aggregation),
						zap.Stringp("filterString", opts.Filter),
						zap.Int("queueLength", resourceSetQueue.Len()),
						zap.Int("len(currentSet)", len(currentSet)),
					)
					if len(currentSet) > 1 {
						// Split the current set into two halves and retry
						mid := len(currentSet) / 2
						resourceSetQueue.PushBack(currentSet[:mid])
						resourceSetQueue.PushBack(currentSet[mid:])
					} else {
						// If we only have one resource, we can't split it further, so we log an error as we can't scrape it
						r.logger.Error("Cannot split QueryResources request resourceSet set any further",
							zap.String("resourceType", resourceType),
							zap.Strings("metricNames", metricNames),
							zap.String("aggregationString", *opts.Aggregation),
							zap.Stringp("filterString", opts.Filter),
							zap.Int("len(currentSet)", len(currentSet)),
						)
						errs = errors.Join(errs, fmt.Errorf("received 529 QueryTooExpensive for resource type %s with single resource %s", resourceType, currentSet[0]))
					}
				} else {
					r.settings.Logger.Error("azcore.ResponseError: failed to get Azure Metrics values data",
						zap.String("resourceType", resourceType),
						zap.String("Error", respErr.Error()),
						zap.Int("StatusCode", respErr.StatusCode),
						zap.String("ErrorCode", respErr.ErrorCode),
					)
					errs = errors.Join(errs, respErr)
				}
			} else {
				r.settings.Logger.Error("generic error: failed to get Azure Metrics values data", zap.String("resourceType", resourceType), zap.Error(err))
				errs = errors.Join(errs, err)
				return responses, errs
			}
		} else {
			responses = append(responses, response)
		}
	}
	return responses, errs
}

@celian-garcia (Member):

> I've updated #40078 (comment) ; TL;DR: I don't think a maximum_resources_per_batch config option is the way to go, instead, try to catch the QueryThrottledException, and then recursively break down the resource list by half into two separate requests until the problematic resource(s) are found; ...

Thanks @andrewegel ! I answered directly in the issue. This is not incompatible IMHO.

@celian-garcia left a comment (Member)

Pretty neat! Looks completely good to me. I can't trigger the approval workflow, but I approve the code.

@suvaidkhan suvaidkhan requested a review from ahurtaud June 19, 2025 17:02
@andrewegel:

Correct, not incompatible with this change, but I did go a different route.

@ahurtaud left a comment

LGTM, ready to merge. Thank you!

Development

Successfully merging this pull request may close these issues.

[receiver/azuremonitor] Config to reduce the number of resource ids in an Azure Batch API call
6 participants