Why do we need to concatenate IDs of multiple category features for embedding?
In #1721, we explained that we need to convert categorical feature values to integer IDs if we want to export the model as a TensorFlow SavedModel for TF Serving. For example, we convert:
| age | education | marital-status |
| --- | --------- | -------------- |
| 34  | Master    | Divorced       |
| 54  | Doctor    | Never-married  |
| 42  | Bachelor  | Never-married  |

to

| age | education | marital-status |
| --- | --------- | -------------- |
| 34  | 0         | 0              |
| 54  | 1         | 1              |
| 42  | 2         | 1              |
After converting category values to IDs, we generally use those IDs to perform a lookup in the embedding matrix.
The problem: a dataset sometimes has many categorical features. If we create a separate embedding table for each categorical feature, we need to create many embedding table variables. Besides the weights themselves, each variable has creation overhead. So the model may become very large and embedding lookup may be inefficient.
In order to reduce the number of embedding table variables, we can concatenate the categorical feature ID tensors into one big tensor and merge the embedding tables. However, equal IDs will then return the same embedding vectors when looking up the merged embedding table. In the following figure, we can see that the embedding vectors of "marital-status" are the same as those of "education".
So, we need to add an offset to the IDs of "marital-status" so that the "marital-status" feature can get its own embedding vectors from the merged embedding table.
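For illustration, here is a minimal sketch (our own example, not from the PRs) of how the offset keeps the two features apart in one merged table; the table sizes 3 and 5 match the hash bucket sizes used below, and `tf.nn.embedding_lookup` is the standard lookup op:

```python
import tensorflow as tf

# One merged table: 3 rows for "education" IDs plus 5 rows for "marital-status" IDs.
merged_table = tf.Variable(tf.random.uniform([3 + 5, 2]))

education_ids = tf.constant([0, 1, 2])
marital_ids = tf.constant([0, 1, 1])

# Without the offset, marital-status ID 0 would collide with education ID 0
# and both would get the same embedding vector.
marital_ids_with_offset = marital_ids + 3  # offset = number of education IDs

education_vectors = tf.nn.embedding_lookup(merged_table, education_ids)
marital_vectors = tf.nn.embedding_lookup(merged_table, marital_ids_with_offset)
```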
Solution: proposals to concatenate the IDs using different TensorFlow APIs.
#1721 lists 3 methods to convert category values to IDs, and each one needs a different way to concatenate the IDs.
1. Concatenate the IDs generated by categorical columns in `tf.feature_column`, such as `tf.feature_column.categorical_column_with_hash_bucket`.

The example is the 1st case shown in #1721. If we use categorical columns to convert category values to IDs, we must use `embedding_column` to embed those IDs, because the output of a categorical column is a sparse tensor which cannot be used directly in `DenseFeatures`. So, we need to concatenate the outputs of the categorical columns before `embedding_column` to reduce the number of embedding variables.
```python
import tensorflow as tf

education_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="education", hash_bucket_size=3
)  # the id is in [0, 3)
marital_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="marital-status", hash_bucket_size=5
)  # the id is in [0, 5)

# concat_column is the custom column proposed in PR #1719.
edu_marital_concat = concat_column([education_hash_column, marital_hash_column])
edu_marital_embedded_column = tf.feature_column.embedding_column(
    edu_marital_concat, dimension=2
)
```
The `concat_column` will concatenate the outputs of `education_hash_column` and `marital_hash_column`, adding an offset of 3 to the IDs of `marital_hash_column`. In this case, we need to implement a custom `concat_column`, as shown in PR #1719.
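For completeness, a hypothetical usage sketch (assuming `concat_column` from PR #1719 behaves like a standard categorical column, so `embedding_column` and `DenseFeatures` accept it) of feeding the merged embedding into a Keras model:

```python
inputs = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
# DenseFeatures looks up the single merged embedding table built by
# embedding_column, so only one embedding variable is created.
dense = tf.keras.layers.DenseFeatures([edu_marital_embedded_column])(inputs)
output = tf.keras.layers.Dense(1, activation="sigmoid")(dense)
model = tf.keras.Model(inputs=list(inputs.values()), outputs=output)
```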
2. Concatenate the IDs generated by `numeric_column` with a custom `transform_fn`.

The example is the 2nd case in #1721. The output of `numeric_column` is a dense tensor of IDs which we can use directly in `DenseFeatures`.
```python
import tensorflow as tf


def generate_hash_bucket_column(name, hash_bucket_size):
    def hash_bucket_id(x, hash_bucket_size):
        if x.dtype is not tf.string:
            x = tf.strings.as_string(x)
        return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

    transform_fn = lambda x, hash_bucket_size=hash_bucket_size: (
        hash_bucket_id(x, hash_bucket_size)
    )
    return tf.feature_column.numeric_column(
        name, dtype=tf.int32, normalizer_fn=transform_fn
    )


# DenseFeatures expects a dict mapping feature names to input tensors.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash = generate_hash_bucket_column(
    name="education", hash_bucket_size=3
)  # the id is in [0, 3)
marital_hash = generate_hash_bucket_column(
    name="marital-status", hash_bucket_size=5
)  # the id is in [0, 5)
education_hash_ids = tf.keras.layers.DenseFeatures([education_hash])(input_layers)
marital_hash_ids = tf.keras.layers.DenseFeatures([marital_hash])(input_layers)
```
Then, we can add an offset to the "marital-status" ID tensor and concatenate it with the "education" ID tensor like:

```python
marital_ids_with_offset = marital_hash_ids + 3  # 3 is the number of education IDs
edu_marital_concat = tf.keras.layers.Concatenate()(
    [education_hash_ids, marital_ids_with_offset]
)
```
In this case, we need to customize a `transform_fn` for `numeric_column`, but we don't need a custom `concat_column`.
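To finish the example, a minimal sketch (our own addition) of embedding the concatenated IDs with a single `tf.keras.layers.Embedding` table whose row count covers both offset ID ranges:

```python
# 3 + 5 = 8 rows cover both the education IDs and the offset marital-status IDs.
embedding = tf.keras.layers.Embedding(input_dim=3 + 5, output_dim=2)
edu_marital_vectors = embedding(edu_marital_concat)  # shape: (batch, 2, 2)
```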
3. Concatenate the IDs generated by custom transformation layers.

The example is the 3rd method in #1721. The output of the custom layer `HashBucket` is the same as that of `numeric_column` in the 2nd method, so we can add the offset to the "marital-status" IDs and concatenate in the same way.
```python
import tensorflow as tf


class HashBucket(tf.keras.layers.Layer):
    def __init__(self, hash_bucket_size):
        super(HashBucket, self).__init__()
        self.hash_bucket_size = hash_bucket_size

    def call(self, inputs):
        if inputs.dtype is not tf.string:
            inputs = tf.strings.as_string(inputs)
        bucket_id = tf.strings.to_hash_bucket_fast(
            inputs, self.hash_bucket_size
        )
        return tf.cast(bucket_id, tf.int64)


education_input = tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string)
marital_input = tf.keras.layers.Input(name="marital-status", shape=(1,), dtype=tf.string)
education_hash_ids = HashBucket(hash_bucket_size=3)(education_input)  # the id is in [0, 3)
marital_hash_ids = HashBucket(hash_bucket_size=5)(marital_input)  # the id is in [0, 5)
marital_ids_with_offset = marital_hash_ids + 3  # 3 is the number of education IDs
edu_marital_concat = tf.keras.layers.Concatenate()(
    [education_hash_ids, marital_ids_with_offset]
)
```
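As with the 2nd method, a minimal end-to-end sketch (our own addition) that embeds the concatenated IDs with one merged table and wraps everything into a Keras model:

```python
# One merged embedding table with 3 + 5 = 8 rows serves both features.
embedding = tf.keras.layers.Embedding(input_dim=3 + 5, output_dim=2)
edu_marital_vectors = embedding(edu_marital_concat)  # shape: (batch, 2, 2)

flat = tf.keras.layers.Flatten()(edu_marital_vectors)
output = tf.keras.layers.Dense(1, activation="sigmoid")(flat)
model = tf.keras.Model(inputs=[education_input, marital_input], outputs=output)
```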