feat: Add hot-tier feature for distributed deployments #852


Merged

merged 13 commits into parseablehq:main from hot-tier on Aug 4, 2024

Conversation

nikhilsinhaparseable (Contributor)

  • Env var P_HOT_TIER_DIR to store the files on the query server

  • PUT /logstream/{logstream}/hottier with the JSON body below to set the hot tier for a stream:
    `{ "hotTierSize": "1GiB", "hotTierStartDate": "2024-07-16", "hotTierEndDate": "2024-07-19" }`

  • GET /logstream/{logstream}/hottier to fetch the current state of the hot tier for a stream. Response JSON:
    `{ "hotTierSize": "1GiB", "hot_tier_used_size": "1019.41 MiB", "hot_tier_available_size": "4.59 MiB", "hotTierStartDate": "2024-07-16", "hotTierEndDate": "2024-07-19" }`

Ingestion flow completed:

  • The query server periodically (every 1 minute) downloads the parquet files and the corresponding manifest file from S3 for the range configured in the hot tier
  • Files are deleted if the downloaded file size exceeds the hot tier size

@nikhilsinhaparseable nikhilsinhaparseable self-assigned this Jul 20, 2024
@nikhilsinhaparseable nikhilsinhaparseable marked this pull request as draft July 20, 2024 05:13
nitisht (Member) commented Jul 20, 2024

Since the path itself has hot_tier, we can remove hot_tier from the JSON keys, e.g.

```json
{
    "size": "1GiB",
    "used_size": "1019.41 MiB",
    "available_size": "4.59 MiB",
    "start_date": "2024-07-16",
    "end_date": "2024-07-19"
}
```

nitisht (Member) commented Jul 20, 2024

Also, let's be consistent with either camelCase or snake_case. IMHO snake_case is easier to read, and JSON doesn't have an official guideline on naming.

@nikhilsinhaparseable nikhilsinhaparseable marked this pull request as ready for review July 24, 2024 09:26
@nikhilsinhaparseable nikhilsinhaparseable force-pushed the hot-tier branch 7 times, most recently from 31bb23a to 007325c on July 29, 2024 07:38

1. Updated the request JSON body for PUT /logstream/{logstream}/hottier:

```json
{
    "size": "1GiB",
    "start_date": "1days",
    "end_date": "now"
}
```
Validations in place to set the hot tier for a stream (a rough sketch of these checks follows the list):
1. In distributed mode, the hot tier can be enabled by the Querier node only
2. The stream should exist
3. The hot tier cannot be enabled for streams that have a time partition
4. The minimum hot tier size is 1GiB
5. end_date can be set to now for the current date
6. If end_date is not given as now, the start and end dates should be in yyyy-mm-dd format
7. The hot tier must cover a minimum duration of 2 days, i.e. end_date > start_date
8. The size given in the request is matched against the total size of the files to be downloaded; if the hot tier size is found to be less than the total size of the files, validation fails
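
A minimal sketch of checks 4 through 8 is shown below; the struct, field names, and the use of the chrono crate are illustrative assumptions, not the actual Parseable code:

```rust
use chrono::NaiveDate;

/// Hypothetical shape of the PUT /logstream/{logstream}/hottier body.
struct HotTierRequest {
    size_bytes: u64,    // parsed from a human-readable size such as "1GiB"
    start_date: String, // "yyyy-mm-dd" (the relative "1days" form is not handled here)
    end_date: String,   // "yyyy-mm-dd" or the literal "now"
}

const MIN_HOT_TIER_BYTES: u64 = 1 << 30; // 1GiB minimum (check 4)

fn validate(req: &HotTierRequest, total_files_bytes: u64) -> Result<(), String> {
    // check 4: minimum size
    if req.size_bytes < MIN_HOT_TIER_BYTES {
        return Err("hot tier size must be at least 1GiB".into());
    }

    // checks 5/6: end_date may be "now"; otherwise both dates must be yyyy-mm-dd
    let end = if req.end_date == "now" {
        chrono::Utc::now().date_naive()
    } else {
        NaiveDate::parse_from_str(&req.end_date, "%Y-%m-%d")
            .map_err(|_| "end_date must be yyyy-mm-dd or 'now'".to_string())?
    };
    let start = NaiveDate::parse_from_str(&req.start_date, "%Y-%m-%d")
        .map_err(|_| "start_date must be yyyy-mm-dd".to_string())?;

    // check 7: minimum duration of 2 days, i.e. end_date strictly after start_date
    if end <= start {
        return Err("end_date must be after start_date".into());
    }

    // check 8: the requested size must cover the files to be downloaded
    if req.size_bytes < total_files_bytes {
        return Err("hot tier size is smaller than the total size of files".into());
    }

    Ok(())
}
```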

A scheduler is set to run every 1 minute to verify and download new files from S3 (a rough sketch of this loop follows below).

If the total hot tier size is exhausted, the oldest date entry in the hot tier (the start date) is deleted and the hot tier start date is updated to start date + 1.
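
A rough sketch of that loop, assuming a tokio runtime; the HotTier struct and helper names below are hypothetical and only outline the behaviour described above:

```rust
use std::time::Duration;

// Illustrative per-stream state; fields and names are hypothetical.
struct HotTier {
    configured_size: u64,
    used_size: u64,
    // per-date usage, oldest first: ("2024-07-16", bytes)
    dates: Vec<(String, u64)>,
}

impl HotTier {
    // Placeholder for the real download step (manifests + parquet files from S3).
    async fn download_new_files(&mut self) {}

    // When the configured size is exhausted, drop the oldest date entry;
    // the real code also moves the hot tier start date forward by one day.
    fn evict_oldest_until_fits(&mut self) {
        while self.used_size > self.configured_size && !self.dates.is_empty() {
            let (oldest_date, bytes) = self.dates.remove(0);
            println!("evicting hot tier entry for {oldest_date}");
            self.used_size = self.used_size.saturating_sub(bytes);
        }
    }
}

async fn hot_tier_sync_loop(mut tier: HotTier) {
    let mut interval = tokio::time::interval(Duration::from_secs(60)); // every 1 minute
    loop {
        interval.tick().await;
        tier.download_new_files().await;
        tier.evict_oldest_until_fits();
    }
}
```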

If the hot tier is updated for an existing hot tier, all downloaded files are deleted and the files in the date range given in the PUT request are downloaded fresh from S3.

The used size and available size of the hot tier are updated as files are downloaded to or deleted from the hot tier. Both can be fetched using GET /logstream/{logstream}/hottier. The response JSON looks like:

```json
{
    "size": "1GiB",
    "used_size": "540MiB",
    "available_size": "460MiB",
    "start_date": "1days",
    "end_date": "now"
}
```

Added query flow:
1. The server gets stream.json and the related manifest.json files from storage based on the query time range provided
2. It then gets the list of parquet file paths from manifest.json
3. It checks that list against the parquet files available in the hot tier
4. datafusion serves the results from the hot tier for files available there
5. The remaining files (not available in the hot tier) are served from S3

Each ingestor writes its own manifest file to storage, so the query server should download all manifest files to the hot tier and download all corresponding parquet files from storage. (A simplified sketch of the hot tier/S3 split is shown below.)
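
A simplified sketch of that split, assuming the hot tier directory mirrors the object-store layout under P_HOT_TIER_DIR; the function and parameter names are hypothetical:

```rust
use std::path::{Path, PathBuf};

/// Given the parquet paths listed in the manifests and the contents of the
/// hot tier directory, decide which files datafusion can read locally and
/// which still have to be fetched from S3.
fn split_hot_and_remote(
    manifest_files: &[String], // object-store keys taken from manifest.json
    hot_tier_dir: &Path,       // per-stream directory under P_HOT_TIER_DIR
) -> (Vec<PathBuf>, Vec<String>) {
    let mut local = Vec::new();
    let mut remote = Vec::new();

    for key in manifest_files {
        let candidate = hot_tier_dir.join(key);
        if candidate.exists() {
            local.push(candidate); // served by datafusion from the hot tier
        } else {
            remote.push(key.clone()); // remainder is served from S3
        }
    }
    (local, remote)
}
```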

Fix in PUT /hottier: complete all validations first, then create the hot tier directory, then update the stream JSON and the in-memory state. The below checks are added (a sketch of these download checks follows):

- If the hot tier range does not include today's date and the manifest and respective parquet files are already downloaded, there is no need to download the manifest again
- If the stream was initialised today, S3 contains only today's data, and the hot tier size is exhausted, today's data cannot be deleted; log an error and skip the download
- If there is no manifest for a particular date in the date range, remove that date from the range
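
A sketch of those download checks under an assumed per-date layout (one manifest.json per date directory); the function and parameter names are illustrative:

```rust
use std::path::Path;

/// Returns the dates that still need a download. A date whose manifest is
/// already present in the hot tier is skipped (unless it is today, since
/// today's manifest can still grow), and a date with no manifest in storage
/// is dropped from the range entirely.
fn dates_to_download(
    date_range: &[String], // "yyyy-mm-dd" entries in the hot tier range
    today: &str,
    hot_tier_dir: &Path,
    storage_has_manifest: impl Fn(&str) -> bool, // stand-in for an object-store lookup
) -> Vec<String> {
    date_range
        .iter()
        .filter(|date| {
            let local_manifest = hot_tier_dir.join(date.as_str()).join("manifest.json");
            // already downloaded and not today's date: no need to download again
            if date.as_str() != today && local_manifest.exists() {
                return false;
            }
            // no manifest for this date in storage: drop it from the range
            storage_has_manifest(date.as_str())
        })
        .cloned()
        .collect()
}
```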
- Check for disk availability
- Delete older files from the hot tier if disk usage exceeds the threshold
- Download the latest entries from S3 into the hot tier
- Build the list of hot tier files whose paths match the manifest entries; the remaining manifest files are set as the remainder to be served from S3
- The hot tier size is to be set for each stream via the API call PUT /logstream/{logstream}/hottier with the body

```json
{
    "size": "20GiB"
}
```

- Validations added (a sketch of the disk-usage check appears after the GET response below):
   1. Max disk usage (current used disk space + hot tier size of all the streams) should not exceed 80%
   2. The minimum hot tier size for each stream is set to 10GiB

- The download starts from the latest date present in S3 and continues until the hot tier size is exhausted
- To keep the latest data, the oldest entry in the hot tier is deleted
- One manifest file is maintained in the hot tier for each date

- The current used size, available size, and oldest datetime entry in the hot tier can be fetched via the API call GET /logstream/{logstream}/hottier; response body:
```json
{
    "size": "5GiB",
    "used_size": "4.24 GiB",
    "available_size": "775 MiB",
    "oldest_date_time_entry": "2024-07-28 21:32:00"
}
```
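
A sketch of the max-disk-usage validation mentioned above, with hypothetical parameters standing in for the disk stats and per-stream totals:

```rust
const MAX_DISK_USAGE_PERCENT: f64 = 80.0; // later exposed via the P_MAX_DISK_USAGE env var

/// Illustrative check: the space already used on the disk plus the hot tier
/// sizes configured across all streams (including the size being requested
/// in this PUT call) must stay under the usage threshold.
fn disk_usage_within_limit(
    disk_total: u64,           // capacity of the disk backing P_HOT_TIER_DIR
    disk_used: u64,            // space already in use on that disk
    all_streams_hot_tier: u64, // sum of hot tier sizes configured for all streams
    requested_size: u64,       // size requested in this PUT call
) -> bool {
    let projected = disk_used + all_streams_hot_tier + requested_size;
    (projected as f64 / disk_total as f64) * 100.0 <= MAX_DISK_USAGE_PERCENT
}
```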
Removed the check for hot tier size while processing the files, as this check is already done when setting the hot tier.
@nikhilsinhaparseable nikhilsinhaparseable force-pushed the hot-tier branch 4 times, most recently from 8b1a6f1 to 3fdd4bb on August 1, 2024 06:37
Allow increasing the size of an existing hot tier; restrict reducing the size.
- Added an API to delete the hot tier for a stream: DELETE /logstream/{logstream}/hottier
- Exposed max disk usage as the env var P_MAX_DISK_USAGE, defaulted to 80.0

A review comment was left on this snippet (excerpt):

```rust
.arg(
    Arg::new(Self::MAX_DISK_USAGE)
        .long(Self::MAX_DISK_USAGE)
        .env("P_MAX_DISK_USAGE")
```
Suggested change:

```diff
-        .env("P_MAX_DISK_USAGE")
+        .env("P_MAX_DISK_USAGE_PERCENT")
```

@nitisht nitisht merged commit e6f1e2a into parseablehq:main Aug 4, 2024
8 checks passed
parmesant pushed a commit to parmesant/parseable that referenced this pull request Aug 4, 2024