[pkg/ottl] Support for grok patterns #32593

Closed
@michalpristas

Description

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

This is a copy of a comment I wrote on another issue, where I thought this was being discussed.

Why should we support grok?
In my opinion, grok is much more readable than raw regex, and it is very familiar to our users.
What I have in mind also includes custom pattern definitions, with an ExtractGrokPattern signature like this:

ExtractGrokPattern(source, pattern, custom_patterns)

custom_patterns is a map.

Given the input string
my beagle is colored BLUE

you could do

ExtractGrokPattern(source, "my %{FAVORITE_DOG:dog} is colored %{RGB:color}", {
  "FAVORITE_DOG": "beagle",
  "RGB": "RED|GREEN|BLUE"
})

and this would result in

{
  "dog": "beagle",
  "color": "BLUE"
}

While this example is not very realistic, the nginx example from our pipeline shows the beauty of it:

patterns:
  - (%{NGINX_HOST} )?"?(?:%{NGINX_ADDRESS_LIST:result.access.remote_ip_list}|%{NOTSPACE:source.address})
    - (-|%{DATA:user.name}) \[%{HTTPDATE:result.access.time}\] "%{DATA:result.access.info}"
    %{NUMBER:http.response.status_code:long} %{NUMBER:http.response.body.bytes:long}
    "(-|%{DATA:http.request.referrer})" "(-|%{DATA:user_agent.original})" %{NUMBER:result.access.http.request.length:long}
    %{NUMBER:result.access.http.request.time:double} \[%{DATA:result.access.upstream.name}\]
    \[%{DATA:result.access.upstream.alternative_name}\] (%{UPSTREAM_ADDRESS_LIST:result.access.upstream_address_list}|-)
    (%{UPSTREAM_RESPONSE_LENGTH_LIST:result.access.upstream.response.length_list}|-) (%{UPSTREAM_RESPONSE_TIME_LIST:result.access.upstream.response.time_list}|-)
    (%{UPSTREAM_RESPONSE_STATUS_CODE_LIST:result.access.upstream.response.status_code_list}|-) %{GREEDYDATA:result.access.http.request.id}
pattern_definitions:
  NGINX_HOST: (?:%{IP:destination.ip}|%{NGINX_NOTSEPARATOR:destination.domain})(:%{NUMBER:destination.port})?
  NGINX_NOTSEPARATOR: "[^\t ,:]+"
  NGINX_ADDRESS_LIST: (?:%{IP}|%{WORD})("?,?\s*(?:%{IP}|%{WORD}))*
  UPSTREAM_ADDRESS_LIST: (?:%{IP}(:%{NUMBER})?)("?,?\s*(?:%{IP}(:%{NUMBER})?))*
  UPSTREAM_RESPONSE_LENGTH_LIST: (?:%{NUMBER})("?,?\s*(?:%{NUMBER}))*
  UPSTREAM_RESPONSE_TIME_LIST: (?:%{NUMBER})("?,?\s*(?:%{NUMBER}))*
  UPSTREAM_RESPONSE_STATUS_CODE_LIST: (?:%{NUMBER})("?,?\s*(?:%{NUMBER}))*
  IP: (?:\[?%{IPV6}\]?|%{IPV4})

This pattern is complex, and writing it as a plain regex would be ugly.

Describe the solution you'd like

ExtractGrokPattern(source, pattern, custom_patterns), on top of ExtractPattern, to give users an option.

Grok uses regex under the hood anyway, but it provides a better experience.

Describe alternatives you've considered

No response

Additional context

No response
