receiver/azuremonitor - Added maximumResourcesPerBatch config #40752

Open · wants to merge 6 commits into main
Conversation

suvaidkhan

Description

Added a new config option `maximum_resources_per_batch` with a default and maximum value of 50.

Link to tracking issue

Fixes #40112

Testing

Added unit tests for config validation.

Documentation

Added details about the config in receiver/azuremonitorreceiver/README.md
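
For context, a minimal sketch of how the option might appear in a collector config. The field names follow this PR and the receiver's existing `use_batch_api` option; other required fields such as credentials are elided:

```yaml
receivers:
  azuremonitor:
    use_batch_api: true
    # New in this PR; defaults to 50, the documented maximum for the batch API.
    maximum_resources_per_batch: 50
```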

@suvaidkhan suvaidkhan requested a review from a team as a code owner June 16, 2025 22:18
@suvaidkhan suvaidkhan requested a review from evan-bradley June 16, 2025 22:18

linux-foundation-easycla bot commented Jun 16, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@celian-garcia left a comment (Member)

I'm curious: would you be a user of that config field?
I mentioned in the issue the possibility of configuring this per resource type. If that would be interesting for you, maybe you could implement it that way instead of a global field, WDYT?

Also @andrewegel @ahurtaud, I would like your opinion on this, since you are at the origin of that idea. #40078 (comment)

@@ -28,6 +28,7 @@ var (
errMissingClientSecret = errors.New(`"ClientSecret" is not specified in config`)
errMissingFedTokenFile = errors.New(`"FederatedTokenFile" is not specified in config`)
errInvalidCloud = errors.New(`"Cloud" is invalid`)
errInvalidMaxResPerBatch = errors.New(`"MaximumResourcesPerBatch" is invalid`)
Member:
Suggested change
errInvalidMaxResPerBatch = errors.New(`"MaximumResourcesPerBatch" is invalid`)
errInvalidMaxResPerBatch = errors.New(`"MaximumResourcesPerBatch" is invalid. It should be between 1 and 50`)

@ahurtaud
Copy link

ahurtaud commented Jun 17, 2025

I'm curious, would you be a user of that config field? Because I mentioned in the issue the possibility to configure that by resource type. If this is something that could be interesting for you maybe you could implement it instead of a global field WDYT?

Also @andrewegel @ahurtaud I would like your opinion on that, since you are in the origins of that idea. #40078 (comment)

I think it is fine, even more that we dont have much configuration per "resource type".
I would advise users of this feature, if it is require to fine-tune, to create multiple receiver instances for the specifics resources.

something like:

receivers:
  azuremonitor/default:
    use_batch_api: true
    maximum_resources_per_batch: 50
    services:
    - Microsoft.Compute/virtualMachines
    - blahblah
  azuremonitor/throttled:
    use_batch_api: true
    maximum_resources_per_batch: 10
    services:
    - Microsoft.Network/loadBalancers   

@@ -329,5 +333,9 @@ func (c Config) Validate() (err error) {
err = multierr.Append(err, errInvalidCloud)
}

if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > defaultMaximumResourcesPerBatch || c.MaximumResourcesPerBatch < 0) {


Suggested change
if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > defaultMaximumResourcesPerBatch || c.MaximumResourcesPerBatch < 0) {
if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > defaultMaximumResourcesPerBatch || c.MaximumResourcesPerBatch < 1) {


Also, I am not sure we want to reference defaultMaximumResourcesPerBatch for the maximum check here; it only works as long as we don't change the default. Since the limit of 50 is clearly stated in the Microsoft documentation, should we hardcode 50 here?

Member:

You mean enforce it here, and not have the condition in the code later? I think that could be better, yes.

@ahurtaud commented Jun 17, 2025:
I meant:

Suggested change
if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > defaultMaximumResourcesPerBatch || c.MaximumResourcesPerBatch < 0) {
if c.UseBatchAPI && c.MaximumResourcesPerBatch != 0 && (c.MaximumResourcesPerBatch > 50 || c.MaximumResourcesPerBatch < 1) {

Member:

I'd prefer to have it named; otherwise it's a magic number.

Member:

Oh yes, I see it now: the default is actually to not set the field. That might change my mind about whether we should even error on this. Let's let Azure decide, so an error is surfaced only if Azure considers it one.

Author:

@celian-garcia Do you want me to remove the config validation check altogether or just change it to hardcoded values?

Member:

Sorry @suvaidkhan for that change of mind 🙏🏻, but yes, please remove the config validation altogether. That will simplify the code and let Azure fail if needed.

Could you also add that link in the README.md for reference? https://learn.microsoft.com/en-us/azure/azure-monitor/metrics/migrate-to-batch-api?tabs=individual-response . This is where the 50 is mentioned.
It's not written in a straightforward way in the REST API spec 😞 https://learn.microsoft.com/en-us/rest/api/monitor/metrics-batch/batch?view=rest-monitor-2023-10-01&tabs=HTTP

Author:

@celian-garcia Sure, that's not a problem! It does make more sense to have no validation, now that I think about it.

Author:

@celian-garcia changes done, ready for review!


celian-garcia commented Jun 17, 2025

> I'm curious, would you be a user of that config field? Because I mentioned in the issue the possibility to configure that by resource type. If this is something that could be interesting for you maybe you could implement it instead of a global field WDYT?
> Also @andrewegel @ahurtaud I would like your opinion on that, since you are in the origins of that idea. #40078 (comment)

> I think it is fine, even more that we dont have much configuration per "resource type". I would advise users of this feature, if it is require to fine-tune, to create multiple receiver instances for the specifics resources.
>
> something like:
>
> receivers:
>   azuremonitor/default:
>     use_batch_api: true
>     maximum_resources_per_batch: 50
>     services:
>     - Microsoft.Compute/virtualMachines
>     - blahblah
>   azuremonitor/throttled:
>     use_batch_api: true
>     maximum_resources_per_batch: 10
>     services:
>     - Microsoft.Network/loadBalancers

Having config per resource type is actually something that we already do:

With split dimensions:

receivers:
  azuremonitor:
    dimensions:
      enabled: true
      overrides:
        "Microsoft.Network/azureFirewalls":
          # Real example of an Azure limitation here:
          # Dimensions exposed are Reason, Status, Protocol,
          # but when selecting Protocol in the filters, it returns nothing.
          # Note here that the metric display name is ``Network rules hit count`` but its programmatic value is ``NetworkRuleHit``
          # Ref: https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-network-azurefirewalls-metrics
          "NetworkRuleHit": [Reason, Status]

With metrics aggregation specifications:

receivers:
  azuremonitor:
    resource_groups:
      - ${resource_groups}
    services:
      - Microsoft.EventHub/namespaces
      - Microsoft.AAD/DomainServices # scraper will fetch all metrics from this namespace since there are no limits under the "metrics" option
    metrics:
      "microsoft.eventhub/namespaces": # scraper will fetch only the metrics listed below:
        IncomingMessages: [total]     # metric IncomingMessages with aggregation "Total"
        NamespaceCpuUsage: [*]        # metric NamespaceCpuUsage with all known aggregations

So, to align in that direction and for consistency for users, it might be better to have it per resource type here as well. I don't know whether that matters, or how much this feature will be used. We would not use it ourselves, which is why I was asking people who would ^^.

But yeah, knowing that there is an easy workaround and that the feature will probably be used in limited conditions, let's not overthink it and do it like you said @ahurtaud.


andrewegel commented Jun 18, 2025

I've updated #40078 (comment). TL;DR: I don't think a maximum_resources_per_batch config option is the way to go. Instead, try to catch the QueryThrottledException, and then recursively split the resource list in half into two separate requests until the problematic resource(s) are found. While that's not a fix for problematic resources (in our case, it was a single LB with 250+ health probes on it, which increased the cardinality of the DipAvailability metric beyond what Azure's APIs will allow to be returned), it does prevent one resource from blocking the scraping of other resources, which IMO is a good thing in the telemetry world. I wound up writing my own full implementation using inspiration from the receiver; my code is as follows:

func (r *metricReceiver) tryScrapeLocationResourceType(
	ctx context.Context,
	client *azmetrics.Client,
	resourceType string,
	metricNames []string,
	opts azmetrics.QueryResourcesOptions,
	resourceSet []string) ([]azmetrics.QueryResourcesResponse, error) {

	responses := make([]azmetrics.QueryResourcesResponse, 0)
	var errs error
	resourceSetQueue := list.New()
	resourceSetQueue.PushBack(resourceSet)
	for resourceSetQueue.Len() > 0 {
		currentSet := resourceSetQueue.Remove(resourceSetQueue.Front()).([]string)
		response, err := client.QueryResources(
			ctx,
			r.cfg.SubscriptionID,
			resourceType,
			metricNames,
			azmetrics.ResourceIDList{ResourceIDs: currentSet},
			&opts)
		if err != nil {
			var respErr *azcore.ResponseError
			if errors.As(err, &respErr) {
				// Its unclear if this is a QueryTooExpensive error, but we assume it is
				// because the error code is empty and the status code is 529.
				if respErr.StatusCode == 529 && respErr.ErrorCode == "" {
					// QueryTooExpensive, we need to retry with a smaller set of resources
					r.logger.Warn("Received 529 QueryTooExpensive based on StatusCode and ErrorCode values, breaking up currentSet by half",
						zap.String("resourceType", resourceType),
						zap.Strings("metricNames", metricNames),
						zap.String("aggregationString", *opts.Aggregation),
						zap.Stringp("filterString", opts.Filter),
						zap.Int("queueLength", resourceSetQueue.Len()),
						zap.Int("len(currentSet)", len(currentSet)),
					)
					if len(currentSet) > 1 {
						// Split the current set into two halves and retry
						mid := len(currentSet) / 2
						resourceSetQueue.PushBack(currentSet[:mid])
						resourceSetQueue.PushBack(currentSet[mid:])
					} else {
						// If we only have one resource, we can't split it further, so we log an error as we can't scrape it
						r.logger.Error("Cannot split QueryResources request resourceSet set any further",
							zap.String("resourceType", resourceType),
							zap.Strings("metricNames", metricNames),
							zap.String("aggregationString", *opts.Aggregation),
							zap.Stringp("filterString", opts.Filter),
							zap.Int("len(currentSet)", len(currentSet)),
						)
						errs = errors.Join(errs, fmt.Errorf("received 529 QueryTooExpensive for resource type %s with single resource %s", resourceType, currentSet[0]))
					}
				} else {
					r.settings.Logger.Error("azcore.ResponseError: failed to get Azure Metrics values data",
						zap.String("resourceType", resourceType),
						zap.String("Error", respErr.Error()),
						zap.Int("StatusCode", respErr.StatusCode),
						zap.String("ErrorCode", respErr.ErrorCode),
					)
					errs = errors.Join(errs, respErr)
				}
			} else {
				r.settings.Logger.Error("generic error: failed to get Azure Metrics values data", zap.String("resourceType", resourceType), zap.Error(err))
				errs = errors.Join(errs, err)
				return responses, errs
			}
		} else {
			responses = append(responses, response)
		}
	}
	return responses, errs
}

@celian-garcia (Member):

> I've updated #40078 (comment) ; TL;DR: I don't think a maximum_resources_per_batch config option is the way to go, instead, try to catch the QueryThrottledException, and then recursively break down the resource list by half into two separate requests until the problematic resource(s) are found; ...

Thanks @andrewegel ! I answered directly in the issue. This is not incompatible IMHO.

@celian-garcia left a comment (Member)

Pretty neat! Looks completely good to me. I can't trigger the approval workflow, but I approve the code.

@suvaidkhan suvaidkhan requested a review from ahurtaud June 19, 2025 17:02
@andrewegel:

Correct, not incompatible with this change, but I did go a different route.

@ahurtaud left a comment

LGTM, ready to merge. Thank you!

Development

Successfully merging this pull request may close these issues.

[receiver/azuremonitor] Config to reduce the number of resource ids in an Azure Batch API call
6 participants