
Design SQLFlow syntax extension for data transform. #1664

@brightcoder01

Description


The root of the discussion series is #1670

The following transform functions are commonly used. We can support them in the first stage.

| Name | Transformation | Statistical Parameters | Input Type | Output Type |
| --- | --- | --- | --- | --- |
| NORMALIZE(x) | Scale the inputs to the range [0, 1]: out = (x - x_min) / (x_max - x_min) | x_min, x_max | number | float64 |
| STANDARDIZE(x) | Scale the inputs to z-scores: subtract the mean and divide by the standard deviation. out = (x - x_mean) / x_stddev | x_mean, x_stddev | number | float64 |
| BUCKETIZE(x, num_buckets, boundaries) | Transform the numeric features into categorical ids using a set of thresholds. | boundaries | number | int64 |
| HASH_BUCKET(x, hash_bucket_size) | Map the inputs into a finite number of buckets by hashing: out_id = Hash(input_feature) % bucket_size | hash_bucket_size | string, int32, int64 | int64 |
| VOCABULARIZE(x) | Map the inputs to integer ids by looking them up in the vocabulary. | vocabulary_list | string, int32, int64 | int64 |
| EMBEDDING(x, dimension) | Map the inputs to embedding vectors. | N/A | int32, int64 | float32 |
| CROSS(x1, x2, ..., xn, hash_bucket_size) | Hash(cartesian product of features) % hash_bucket_size | N/A | string, number | int64 |
| CONCAT(x1, x2, ..., xn) | Concatenate multiple tensors representing categorical ids into one tensor. | N/A | int32, int64 | int64 |
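The scalar semantics of the first few transforms can be sketched in plain Python (the function and parameter names follow the table; the `hash()` call is illustrative only, not the production hash):

```python
def normalize(x, x_min, x_max):
    # Scale x to the range [0, 1]: out = (x - x_min) / (x_max - x_min)
    return (x - x_min) / (x_max - x_min)

def standardize(x, x_mean, x_stddev):
    # Scale x to a z-score: out = (x - x_mean) / x_stddev
    return (x - x_mean) / x_stddev

def bucketize(x, boundaries):
    # Return the index of the first boundary greater than x,
    # so len(boundaries) thresholds yield len(boundaries) + 1 bucket ids.
    for i, b in enumerate(boundaries):
        if x < b:
            return i
    return len(boundaries)

def hash_bucket(x, hash_bucket_size):
    # out_id = Hash(input_feature) % bucket_size; Python's built-in
    # hash() stands in for the real hash function here.
    return hash(str(x)) % hash_bucket_size
```

For example, `normalize(5, 0, 10)` yields `0.5`, and `bucketize(3, [1, 2, 5])` yields bucket id `2`.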

There are three options for the style of the generated transform code:

  1. Feature Column API. Integrate it with the model definition using tf.keras.layers.DenseFeatures;
  2. Customized Keras layers provided by ElasticDL. Their functionality should cover all the commonly used feature engineering operations above;
  3. Keras preprocessing layers. These will be ready in TF 2.2;
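Option 1 could generate code along these lines (a minimal sketch; the feature names `age` and `occupation` and the boundary/bucket values are hypothetical, chosen only to exercise BUCKETIZE, HASH_BUCKET, and EMBEDDING):

```python
import tensorflow as tf

# BUCKETIZE: numeric feature split by a set of thresholds.
age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 35, 65])

# HASH_BUCKET + EMBEDDING: hash a string feature, then embed the ids.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=100)
occupation_emb = tf.feature_column.embedding_column(occupation, dimension=8)

# DenseFeatures glues the transform into the Keras model definition.
feature_layer = tf.keras.layers.DenseFeatures([age_buckets, occupation_emb])
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(1),
])
```

Here the bucketized column is one-hot encoded (four buckets for three boundaries) and concatenated with the 8-dimensional embedding, so `feature_layer` emits a 12-dimensional dense tensor per example.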
