feat: Add hot-tier feature for distributed deployments #852


Merged

merged 13 commits into parseablehq:main from hot-tier on Aug 4, 2024

Conversation

nikhilsinhaparseable (Contributor)

  • Env var P_HOT_TIER_DIR to store the files on the query server

  • PUT /logstream/{logstream}/hottier with the JSON body below to set the hot tier for a stream:
    `{ "hotTierSize": "1GiB", "hotTierStartDate": "2024-07-16", "hotTierEndDate": "2024-07-19" }`

  • GET /logstream/{logstream}/hottier to fetch the current state of the hot tier for a stream. Response JSON:
    `{ "hotTierSize": "1GiB", "hot_tier_used_size": "1019.41 MiB", "hot_tier_available_size": "4.59 MiB", "hotTierStartDate": "2024-07-16", "hotTierEndDate": "2024-07-19" }`

Ingestion flow completed:

  • The query server periodically (every 1 minute) downloads the parquet files and the corresponding manifest file from S3 for the range configured in the hot tier
  • Files are deleted if the downloaded file size exceeds the hot tier size

@nikhilsinhaparseable nikhilsinhaparseable self-assigned this Jul 20, 2024
@nikhilsinhaparseable nikhilsinhaparseable marked this pull request as draft July 20, 2024 05:13
nitisht (Member) commented Jul 20, 2024

Since the path itself has hot_tier, we can remove hot_tier from the JSON keys, e.g.

```json
{
    "size": "1GiB",
    "used_size": "1019.41 MiB",
    "available_size": "4.59 MiB",
    "start_date": "2024-07-16",
    "end_date": "2024-07-19"
}
```

nitisht (Member) commented Jul 20, 2024

Also, let's be consistent with either camelCase or snake_case. IMHO snake_case is easier to read, and JSON doesn't have an official guideline on naming.

@nikhilsinhaparseable nikhilsinhaparseable marked this pull request as ready for review July 24, 2024 09:26
@nikhilsinhaparseable nikhilsinhaparseable force-pushed the hot-tier branch 7 times, most recently from 31bb23a to 007325c on July 29, 2024 07:38

1. Updated the request JSON body for PUT /logstream/{logstream}/hottier:

```json
{
    "size": "1GiB",
    "start_date": "1days",
    "end_date": "now"
}
```
Validations in place to set the hot tier for a stream (a rough sketch of these checks follows the list):
1. In distributed mode, the hot tier can be enabled by the Querier node only
2. The stream should exist
3. The hot tier cannot be enabled for streams that have a time partition
4. The minimum hot tier size is 1GiB
5. end_date can be set to now for the current date
6. If end_date is not given as now, the start and end dates should be in yyyy-mm-dd format
7. The hot tier must cover a minimum duration of 2 days, i.e. end_date > start_date
8. The size given in the request is matched against the total size of the files to be downloaded; if the hot tier size is found to be less than the total size of the files, validation fails
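
A minimal sketch of checks 4 through 8 is shown below; the struct, field names, and the use of the chrono crate are illustrative assumptions, not the actual Parseable code:

```rust
use chrono::NaiveDate;

/// Hypothetical shape of the PUT /logstream/{logstream}/hottier body.
struct HotTierRequest {
    size_bytes: u64,    // parsed from a human-readable size such as "1GiB"
    start_date: String, // "yyyy-mm-dd" (the relative "1days" form is not handled here)
    end_date: String,   // "yyyy-mm-dd" or the literal "now"
}

const MIN_HOT_TIER_BYTES: u64 = 1 << 30; // 1GiB minimum (check 4)

fn validate(req: &HotTierRequest, total_files_bytes: u64) -> Result<(), String> {
    // check 4: minimum size
    if req.size_bytes < MIN_HOT_TIER_BYTES {
        return Err("hot tier size must be at least 1GiB".into());
    }

    // checks 5/6: end_date may be "now"; otherwise both dates must be yyyy-mm-dd
    let end = if req.end_date == "now" {
        chrono::Utc::now().date_naive()
    } else {
        NaiveDate::parse_from_str(&req.end_date, "%Y-%m-%d")
            .map_err(|_| "end_date must be yyyy-mm-dd or 'now'".to_string())?
    };
    let start = NaiveDate::parse_from_str(&req.start_date, "%Y-%m-%d")
        .map_err(|_| "start_date must be yyyy-mm-dd".to_string())?;

    // check 7: minimum duration of 2 days, i.e. end_date strictly after start_date
    if end <= start {
        return Err("end_date must be after start_date".into());
    }

    // check 8: the requested size must cover the files to be downloaded
    if req.size_bytes < total_files_bytes {
        return Err("hot tier size is smaller than the total size of files".into());
    }

    Ok(())
}
```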

A scheduler is set to run every 1 minute to verify and download new files from S3 (a rough sketch of this loop follows below).

If the total hot tier size is exhausted, the oldest date entry in the hot tier (the start date) is deleted and the hot tier start date is updated to start date + 1.
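
A rough sketch of that loop, assuming a tokio runtime; the HotTier struct and helper names below are hypothetical and only outline the behaviour described above:

```rust
use std::time::Duration;

// Illustrative per-stream state; fields and names are hypothetical.
struct HotTier {
    configured_size: u64,
    used_size: u64,
    // per-date usage, oldest first: ("2024-07-16", bytes)
    dates: Vec<(String, u64)>,
}

impl HotTier {
    // Placeholder for the real download step (manifests + parquet files from S3).
    async fn download_new_files(&mut self) {}

    // When the configured size is exhausted, drop the oldest date entry;
    // the real code also moves the hot tier start date forward by one day.
    fn evict_oldest_until_fits(&mut self) {
        while self.used_size > self.configured_size && !self.dates.is_empty() {
            let (oldest_date, bytes) = self.dates.remove(0);
            println!("evicting hot tier entry for {oldest_date}");
            self.used_size = self.used_size.saturating_sub(bytes);
        }
    }
}

async fn hot_tier_sync_loop(mut tier: HotTier) {
    let mut interval = tokio::time::interval(Duration::from_secs(60)); // every 1 minute
    loop {
        interval.tick().await;
        tier.download_new_files().await;
        tier.evict_oldest_until_fits();
    }
}
```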

If the hot tier is updated for an existing hot tier, all downloaded files are deleted and the files in the date range given in the PUT request are downloaded fresh from S3.

The used size and available size of the hot tier are updated as files are downloaded to or deleted from the hot tier. Both can be fetched using GET /logstream/{logstream}/hottier. The response JSON looks like:

```json
{
    "size": "1GiB",
    "used_size": "540MiB",
    "available_size": "460MiB",
    "start_date": "1days",
    "end_date": "now"
}
```

Added query flow:
1. The server gets stream.json and the related manifest.json files from storage based on the query time range provided
2. It then gets the list of parquet file paths from manifest.json
3. It checks that list against the parquet files available in the hot tier
4. datafusion serves the results from the hot tier for files available there
5. The remaining files (not available in the hot tier) are served from S3

Each ingestor writes its own manifest file to storage, so the query server should download all manifest files to the hot tier and download all corresponding parquet files from storage. (A simplified sketch of the hot tier/S3 split is shown below.)
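
A simplified sketch of that split, assuming the hot tier directory mirrors the object-store layout under P_HOT_TIER_DIR; the function and parameter names are hypothetical:

```rust
use std::path::{Path, PathBuf};

/// Given the parquet paths listed in the manifests and the contents of the
/// hot tier directory, decide which files datafusion can read locally and
/// which still have to be fetched from S3.
fn split_hot_and_remote(
    manifest_files: &[String], // object-store keys taken from manifest.json
    hot_tier_dir: &Path,       // per-stream directory under P_HOT_TIER_DIR
) -> (Vec<PathBuf>, Vec<String>) {
    let mut local = Vec::new();
    let mut remote = Vec::new();

    for key in manifest_files {
        let candidate = hot_tier_dir.join(key);
        if candidate.exists() {
            local.push(candidate); // served by datafusion from the hot tier
        } else {
            remote.push(key.clone()); // remainder is served from S3
        }
    }
    (local, remote)
}
```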

Fix in PUT /hottier: complete all validations first, then create the hot tier directory, then update the stream JSON and the in-memory state. The below checks are added (a sketch of these download checks follows):

- If the hot tier range does not include today's date and the manifest and respective parquet files are already downloaded, there is no need to download the manifest again
- If the stream was initialised today, S3 contains only today's data, and the hot tier size is exhausted, today's data cannot be deleted; log an error and skip the download
- If there is no manifest for a particular date in the date range, remove that date from the range
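
A sketch of those download checks under an assumed per-date layout (one manifest.json per date directory); the function and parameter names are illustrative:

```rust
use std::path::Path;

/// Returns the dates that still need a download. A date whose manifest is
/// already present in the hot tier is skipped (unless it is today, since
/// today's manifest can still grow), and a date with no manifest in storage
/// is dropped from the range entirely.
fn dates_to_download(
    date_range: &[String], // "yyyy-mm-dd" entries in the hot tier range
    today: &str,
    hot_tier_dir: &Path,
    storage_has_manifest: impl Fn(&str) -> bool, // stand-in for an object-store lookup
) -> Vec<String> {
    date_range
        .iter()
        .filter(|date| {
            let local_manifest = hot_tier_dir.join(date.as_str()).join("manifest.json");
            // already downloaded and not today's date: no need to download again
            if date.as_str() != today && local_manifest.exists() {
                return false;
            }
            // no manifest for this date in storage: drop it from the range
            storage_has_manifest(date.as_str())
        })
        .cloned()
        .collect()
}
```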
- Check for disk availability
- Delete older files from the hot tier if disk usage exceeds the threshold
- Download the latest entries from S3 into the hot tier
- Build the list of hot tier files whose paths match the manifest entries; the remaining manifest files are set as the remainder to be served from S3
- The hot tier size is to be set for each stream via the API call PUT /logstream/{logstream}/hottier with the body

```json
{
    "size": "20GiB"
}
```

- Validations added (a sketch of the disk-usage check appears after the GET response below):
   1. Max disk usage (current used disk space + hot tier size of all the streams) should not exceed 80%
   2. The minimum hot tier size for each stream is set to 10GiB

- The download starts from the latest date present in S3 and continues until the hot tier size is exhausted
- To keep the latest data, the oldest entry in the hot tier is deleted
- One manifest file is maintained in the hot tier for each date

- The current used size, available size, and oldest datetime entry in the hot tier can be fetched via the API call GET /logstream/{logstream}/hottier; response body:
```json
{
    "size": "5GiB",
    "used_size": "4.24 GiB",
    "available_size": "775 MiB",
    "oldest_date_time_entry": "2024-07-28 21:32:00"
}
```
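
A sketch of the max-disk-usage validation mentioned above, with hypothetical parameters standing in for the disk stats and per-stream totals:

```rust
const MAX_DISK_USAGE_PERCENT: f64 = 80.0; // later exposed via the P_MAX_DISK_USAGE env var

/// Illustrative check: the space already used on the disk plus the hot tier
/// sizes configured across all streams (including the size being requested
/// in this PUT call) must stay under the usage threshold.
fn disk_usage_within_limit(
    disk_total: u64,           // capacity of the disk backing P_HOT_TIER_DIR
    disk_used: u64,            // space already in use on that disk
    all_streams_hot_tier: u64, // sum of hot tier sizes configured for all streams
    requested_size: u64,       // size requested in this PUT call
) -> bool {
    let projected = disk_used + all_streams_hot_tier + requested_size;
    (projected as f64 / disk_total as f64) * 100.0 <= MAX_DISK_USAGE_PERCENT
}
```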
Removed the check for hot tier size while processing the files, as this check is already done when setting the hot tier.
@nikhilsinhaparseable nikhilsinhaparseable force-pushed the hot-tier branch 4 times, most recently from 8b1a6f1 to 3fdd4bb on August 1, 2024 06:37
Allow increasing the size of an existing hot tier; restrict reducing the size.
- Added an API to delete the hot tier for a stream: DELETE /logstream/{logstream}/hottier
- Exposed max disk usage as the env var P_MAX_DISK_USAGE, defaulted to 80.0

A review comment was left on this snippet (excerpt):

```rust
.arg(
    Arg::new(Self::MAX_DISK_USAGE)
        .long(Self::MAX_DISK_USAGE)
        .env("P_MAX_DISK_USAGE")
```
Suggested change:

```diff
-        .env("P_MAX_DISK_USAGE")
+        .env("P_MAX_DISK_USAGE_PERCENT")
```

@nitisht nitisht merged commit e6f1e2a into parseablehq:main Aug 4, 2024
8 checks passed
parmesant pushed a commit to parmesant/parseable that referenced this pull request Aug 4, 2024