The root of the discussion series is #1670
The following transform functions are commonly used. We can support them in the first stage.
Name | Transformation | Statistical Parameter | Input Type | Output Type |
---|---|---|---|---|
NORMALIZE(x) | Scale the inputs to the range [0, 1]. out = (x - x_min) / (x_max - x_min) | x_min, x_max | number | float64 |
STANDARDIZE(x) | Scale the inputs to z-scores by subtracting the mean and dividing by the standard deviation. out = (x - x_mean) / x_stddev | x_mean, x_stddev | number | float64 |
BUCKETIZE(x, num_buckets, boundaries) | Transform the numeric features into categorical ids using a set of thresholds. | boundaries | number | int64 |
HASH_BUCKET(x, hash_bucket_size) | Map the inputs into a finite number of buckets by hashing. out_id = Hash(x) % hash_bucket_size | hash_bucket_size | string, int32, int64 | int64 |
VOCABULARIZE(x) | Map the inputs to integer ids by looking them up in the vocabulary. | vocabulary_list | string, int32, int64 | int64 |
EMBEDDING(x, dimension) | Map the inputs to embedding vectors. | N/A | int32, int64 | float32 |
CROSS(x1, x2, ..., xn, hash_bucket_size) | Hash(cartesian product of features) % hash_bucket_size | N/A | string, number | int64 |
CONCAT(x1, x2, ..., xn) | Concatenate multiple tensors representing categorical ids into one tensor. | N/A | int32, int64 | int64 |
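
To make the semantics in the table concrete, here is a minimal pure-Python sketch of the scalar versions of several transforms. It is only an illustration, not the proposed implementation: the function names, the CRC32 hash, the `"_X_"` join separator, the out-of-vocabulary id, and the bucket boundary convention (each boundary belongs to the bucket on its right) are all assumptions that the table above does not specify.

```python
import bisect
import zlib


def normalize(x, x_min, x_max):
    """NORMALIZE: scale x into [0, 1] using precomputed x_min and x_max."""
    return (x - x_min) / (x_max - x_min)


def standardize(x, x_mean, x_stddev):
    """STANDARDIZE: convert x to a z-score using precomputed statistics."""
    return (x - x_mean) / x_stddev


def bucketize(x, boundaries):
    """BUCKETIZE: return the id of the bucket that x falls into.

    With sorted boundaries [b0, b1, ...], id 0 means x < b0,
    id 1 means b0 <= x < b1, and so on (assumed convention).
    """
    return bisect.bisect_right(boundaries, x)


def hash_bucket(x, hash_bucket_size):
    """HASH_BUCKET: map a string or integer to an id in [0, hash_bucket_size)."""
    # CRC32 is an assumption; it is used here because it is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(str(x).encode("utf-8")) % hash_bucket_size


def vocabularize(x, vocabulary_list):
    """VOCABULARIZE: return the position of x in the vocabulary.

    Unknown values map to len(vocabulary_list) (assumed out-of-vocabulary id).
    """
    lookup = {value: index for index, value in enumerate(vocabulary_list)}
    return lookup.get(x, len(vocabulary_list))


def cross(features, hash_bucket_size):
    """CROSS: hash one combination of feature values into a bucket id."""
    key = "_X_".join(str(f) for f in features)
    return hash_bucket(key, hash_bucket_size)


print(normalize(5.0, 0.0, 10.0))               # 0.5
print(standardize(12.0, 10.0, 4.0))            # 0.5
print(bucketize(30, [18, 30, 50]))             # 2
print(vocabularize("b", ["a", "b", "c"]))      # 1
print(vocabularize("z", ["a", "b", "c"]))      # 3
```

Note that the statistical parameters (`x_min`, `x_max`, `x_mean`, `x_stddev`, `boundaries`, `vocabulary_list`) are passed in rather than computed inside the transform, which mirrors the "Statistical Parameter" column: they would be gathered from the dataset in a separate analysis step.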
There are three options for the style of the generated transform code:
- Feature Column API. Integrate it with the model definition using tf.keras.layers.DenseFeatures.
- Customized Keras layers provided by ElasticDL. Their functionality should cover all the commonly used feature engineering operations above.
- Keras preprocessing layers. These will be ready in TF2.2.