feat: Add hot-tier feature for distributed deployments #852
Merged
Since the path itself has hot_tier, we can remove hot_tier from JSON keys. e.g.
{
  "size": "1GiB",
  "used_size": "1019.41 MiB",
  "available_size": "4.59 MiB",
  "start_date": "2024-07-16",
  "end_date": "2024-07-19"
}
Also let's be consistent with
31bb23a
to
007325c
- Env var P_HOT_TIER_DIR to store the files in the query server
- PUT /logstream/{logstream}/hottier with below JSON body to set hottier for stream
  `{ "hotTierSize": "1GiB", "hotTierStartDate": "2024-07-16", "hotTierEndDate": "2024-07-19" }`
- GET /logstream/{logstream}/hottier to fetch the current state of hottier for a stream
  Response JSON -
  `{ "hotTierSize": "1GiB", "hot_tier_used_size": "1019.41 MiB", "hot_tier_available_size": "4.59 MiB", "hotTierStartDate": "2024-07-16", "hotTierEndDate": "2024-07-19" }`

Ingestion flow completed -
- Query server periodically (every 1 minute) downloads the parquet files and corresponding manifest file from S3 for the range provided in the hot tier
- deletes files if downloaded file size exceeds the hot tier size
1. updated request JSON body for PUT /logstream/{logstream}/hottier
   `{ "size": "1GiB", "start_date": "1days", "end_date": "now" }`

Validations in place to set hottier for a stream -
1. Hot tier can be enabled in distributed mode by Querier node only
2. Stream should exist
3. Hot tier cannot be enabled for streams having time partitions
4. Minimum size of hot tier is set to 1GiB
5. end_date can be set to now for current date
6. if end_date is not given as now, start and end date should be in format yyyy-mm-dd
7. You can set hot tier for a minimum of 2 days duration i.e. end_date > start_date
8. Size given in request will be matched against the total size of files to be downloaded; if size of hot tier is found to be less than the total size of files, validation fails

Scheduler is set to run every 1 min to verify and download new files from S3
If total size of hot tier is exhausted, oldest date entry in the hot tier (start date) will be deleted and hot tier start date will be updated to start date + 1
If hot tier is updated for an existing hot tier, all downloaded files will be deleted and files in the range of dates given in PUT request will be downloaded fresh from S3
The used size and available size in hot tier get updated as and when files get downloaded/deleted from hot tier
The same can be fetched using GET /logstream/{logstream}/hottier
The response JSON will be like
`{ "size": "1GiB", "used_size": "540MiB", "available_size": "460MiB", "start_date": "1days", "end_date": "now" }`

Added Query flow -
1. Server gets the stream.json and related manifest.json files from storage based on the query time range provided
2. then gets the list of parquet file paths from manifest.json
3. it then checks for a list of parquet files available in hot tier
4. datafusion serves the results from hot tier for files available in hot tier
5. then serves the remaining files (not available in hot tier) from S3
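Steps 3 to 5 of the query flow amount to splitting the manifest's parquet paths into "already in hot tier" and "still only in S3". The sketch below illustrates that split with the standard library only; `split_by_hot_tier` and its parameters are hypothetical names, not taken from the PR, and the real code may track more than plain path strings.

```rust
use std::collections::HashSet;

/// Splits the parquet paths listed in a manifest into two groups:
/// (files present in the hot tier, files that must still be read from S3).
/// The order of `manifest_files` is preserved within each group.
pub fn split_by_hot_tier(
    manifest_files: &[String],
    hot_tier_files: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    manifest_files
        .iter()
        .cloned()
        // `partition` sends paths for which the predicate is true to the
        // first collection and everything else to the second.
        .partition(|path| hot_tier_files.contains(path))
}
```

The first list would be handed to the local (hot tier) reader and the second to the object-store reader, so each file is served from exactly one place.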
each ingestor writes its own manifest file to storage
query server should download all manifest files to hot tier and download all corresponding parquet files from storage

fix in PUT /hottier
complete all validation first, then create hot tier directory, then update in stream json and in memory
below checks are added -
- if hot tier range does not have today's date and the manifest and respective parquet files are already downloaded, no need to download the manifest again
- if stream is initialised today and S3 contains only today's data and the hot tier size is exhausted, today's data cannot be deleted; log error and skip the download
in case of no manifest for a particular date in the date range, remove that date from the range
- check for disk availability
- delete older files from hot tier if disk usage exceeds threshold
- download the latest entries from S3 to hot tier
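The cleanup step can be pictured as "evict the oldest dates until the next download fits, but never delete the only remaining date". The following is a minimal sketch under stated assumptions: usage is tracked per `yyyy-mm-dd` date in a `BTreeMap`, the limit is a byte capacity rather than a disk-usage percentage, and `evict_oldest` is a hypothetical helper, not the PR's function.

```rust
use std::collections::BTreeMap;

/// Plans and applies eviction of the oldest dates so that `incoming` bytes
/// fit within `capacity`. Keys are `yyyy-mm-dd` strings, which sort
/// chronologically as plain strings.
/// Returns the evicted dates, or `None` (leaving the map untouched) when the
/// download cannot fit without deleting the newest date.
pub fn evict_oldest(
    used_by_date: &mut BTreeMap<String, u64>,
    capacity: u64,
    incoming: u64,
) -> Option<Vec<String>> {
    let mut used: u64 = used_by_date.values().sum();
    let last_index = used_by_date.len().saturating_sub(1);
    let mut to_evict = Vec::new();

    // Walk from the oldest date forward, planning evictions without mutating.
    for (i, (date, size)) in used_by_date.iter().enumerate() {
        if used.saturating_add(incoming) <= capacity {
            break;
        }
        if i == last_index {
            // Only the newest date is left; the caller should log and skip.
            return None;
        }
        used -= *size;
        to_evict.push(date.clone());
    }
    if used.saturating_add(incoming) > capacity {
        return None; // covers the empty-map case
    }
    for date in &to_evict {
        used_by_date.remove(date);
    }
    Some(to_evict)
}
```

Planning the evictions before removing anything keeps the map unchanged when the download has to be skipped, which matches the "log error and skip the download" behaviour described for a stream that only has today's data.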
create list of hot tier files where path matches manifest files
remaining files (not available in hot tier) will be set as remainder to be served from S3
- hot tier size to be set for each stream in the API call PUT /logstream/{logstream}/hottier with the body
  `{ "size": "20GiB" }`
- validations added
  1. max disk usage (current used disk space + hot tier size of all the streams) should not exceed 80%
  2. minimum hottier size for each stream is set to 10GiB
- download to start from the latest date present in S3 till the hottier size is exhausted
- to keep the latest data, delete oldest entry in hottier
- maintain one manifest file for hottier for each date
- current used size, available size and oldest datetime entry in the hottier can be fetched from the API call GET /logstream/{logstream}/hottier
  response body -
  `{ "size": "5GiB", "used_size": "4.24 GiB", "available_size": "775 MiB", "oldest_date_time_entry": "2024-07-28 21:32:00" }`
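The two validations listed above can be sketched as a single check. This is an illustration only: the function name, the error type, and the choice to pass disk figures in as plain byte counts are assumptions, and it assumes `disk_used` does not already include the hot tier reservations being added to it.

```rust
const GIB: u64 = 1024 * 1024 * 1024;
/// Minimum per-stream hot tier size stated in the commit message (10GiB).
const MIN_HOT_TIER_SIZE: u64 = 10 * GIB;

#[derive(Debug, PartialEq)]
pub enum HotTierSizeError {
    BelowMinimum { requested: u64 },
    DiskUsageExceeded { projected_percent: f64 },
}

/// Validates a requested hot tier size (all values in bytes).
/// `other_streams_total` is the sum of hot tier sizes already configured
/// for other streams.
pub fn validate_hot_tier_size(
    requested: u64,
    other_streams_total: u64,
    disk_used: u64,
    disk_total: u64,
    max_disk_usage_percent: f64,
) -> Result<(), HotTierSizeError> {
    if requested < MIN_HOT_TIER_SIZE {
        return Err(HotTierSizeError::BelowMinimum { requested });
    }
    // Projected usage = current used space + hot tier size of all streams.
    let projected = disk_used
        .saturating_add(other_streams_total)
        .saturating_add(requested);
    let projected_percent = if disk_total == 0 {
        f64::INFINITY // an unknown or empty disk can never satisfy the limit
    } else {
        projected as f64 / disk_total as f64 * 100.0
    };
    if projected_percent > max_disk_usage_percent {
        return Err(HotTierSizeError::DiskUsageExceeded { projected_percent });
    }
    Ok(())
}
```

For example, requesting 20GiB on a 200GiB disk with 100GiB already used projects to 60% and passes an 80% limit, while the same request with 120GiB used and 30GiB reserved by other streams projects to 85% and is rejected.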
8f6cd60
to
8e3aca8
removed check for hot tier size while processing the files as this check is done when setting the hot tier
8b1a6f1
to
3fdd4bb
allow increasing the size of the existing hot tier
restrict from reducing the size
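The grow-only rule is small enough to show directly. A minimal sketch, assuming sizes are compared in bytes and that a request equal to the current size is accepted; `validate_resize` and `ResizeError` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ResizeError {
    ShrinkNotAllowed { current: u64, requested: u64 },
}

/// Returns the size to store for the stream's hot tier.
/// `current` is `None` when no hot tier has been set for the stream yet.
pub fn validate_resize(current: Option<u64>, requested: u64) -> Result<u64, ResizeError> {
    match current {
        // Reducing an existing hot tier is rejected.
        Some(current) if requested < current => {
            Err(ResizeError::ShrinkNotAllowed { current, requested })
        }
        // A first-time setup, an unchanged size, or an increase is accepted.
        _ => Ok(requested),
    }
}
```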
3fdd4bb
to
df4cde7
- added api to delete the hot tier for a stream DELETE /logstream/{logstream}/hottier
- exposed max disk usage to env var P_MAX_DISK_USAGE defaulted to 80.0
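The defaulting behaviour of the new setting can be illustrated without the CLI layer. This sketch only uses the standard library; the function name is hypothetical, and the range check (greater than 0, at most 100) is an assumption rather than something the PR states.

```rust
/// Default maximum disk usage percentage, as stated in the commit message.
const DEFAULT_MAX_DISK_USAGE: f64 = 80.0;

/// Resolves the max disk usage percentage from an optional raw setting.
/// `raw` is the value of the environment variable, if it was set.
pub fn max_disk_usage_percent(raw: Option<&str>) -> Result<f64, String> {
    let raw = match raw {
        None => return Ok(DEFAULT_MAX_DISK_USAGE),
        Some(raw) => raw.trim(),
    };
    let value: f64 = raw
        .parse()
        .map_err(|e| format!("invalid percentage {:?}: {}", raw, e))?;
    // Assumed sanity check: a percentage must fall in (0, 100].
    if value > 0.0 && value <= 100.0 {
        Ok(value)
    } else {
        Err(format!("percentage out of range: {}", value))
    }
}
```

A caller could feed it the process environment with `max_disk_usage_percent(std::env::var("P_MAX_DISK_USAGE").ok().as_deref())`.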
nitisht reviewed Aug 3, 2024
server/src/cli.rs
Outdated
.arg(
    Arg::new(Self::MAX_DISK_USAGE)
        .long(Self::MAX_DISK_USAGE)
        .env("P_MAX_DISK_USAGE")
Suggested change
- .env("P_MAX_DISK_USAGE")
+ .env("P_MAX_DISK_USAGE_PERCENT")
parmesant pushed a commit to parmesant/parseable that referenced this pull request Aug 4, 2024