Features in Training

This guide provides examples and usage patterns for querying the Offline Feature Store with OfflineClientV2 in Python (available in SDK version 0.5.61 and higher). It covers how to retrieve feature values for machine learning model training and analysis.

Prerequisites:

Before using these examples, ensure you have the following Python packages installed:

pip install pyathena pyarrow

APIs:

Get Feature Values

This API retrieves features from the offline feature store for one or more feature sets, given a population DataFrame. The result is the population DataFrame enriched with the requested feature values as of the point in time given in each row.

Arguments:

  • features: List[FeatureSetFeatures] - required

    A list of feature sets to fetch.

  • population: pd.DataFrame - required

    A DataFrame containing:

    • All keys of the requested feature sets.

    • A point in time column.

    • Optional enrichments, e.g., labels.

  • point_in_time_column_name: str - required

    The name of the point in time column in the population DataFrame.

Returns: pd.DataFrame
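The population requirements listed above can be checked locally before calling the API. The following is a minimal sketch in plain pandas; check_population is a hypothetical helper and not part of the frogml SDK, and the key column names are taken from the example on this page.

```python
from typing import List

import pandas as pd


def check_population(
    population: pd.DataFrame,
    key_columns: List[str],
    point_in_time_column_name: str,
) -> pd.DataFrame:
    """Hypothetical helper (not an SDK function): verify that the population
    DataFrame contains every feature set key and the point in time column."""
    required = [*key_columns, point_in_time_column_name]
    missing = [column for column in required if column not in population.columns]
    if missing:
        raise ValueError(f"population is missing required columns: {missing}")
    return population


population_df = pd.DataFrame(
    columns=["impression_id", "purchase_id", "timestamp", "label"],
    data=[["1", "100", "2021-01-02 17:00:00", 1], ["2", "200", "2021-01-01 12:00:00", 0]],
)

# Passes: both keys and the point in time column are present; 'label' is an optional enrichment.
checked_df = check_population(population_df, ["impression_id", "purchase_id"], "timestamp")
```

Failing early on a missing key or timestamp column keeps the error close to the code that built the population.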

Example call:

import pandas as pd
from frogml.feature_store.offline import OfflineClientV2
from frogml.core.feature_store.offline.feature_set_features import FeatureSetFeatures

offline_feature_store = OfflineClientV2()

user_impressions_features = FeatureSetFeatures(
    feature_set_name='impressions',
    feature_names=['number_of_impressions']
)
user_purchases_features = FeatureSetFeatures(
    feature_set_name='purchases',
    feature_names=['number_of_purchases', 'avg_purchase_amount']
)
features = [user_impressions_features, user_purchases_features]

population_df = pd.DataFrame(
    columns=['impression_id', 'purchase_id', 'timestamp', 'label'],
    data=[['1', '100', '2021-01-02 17:00:00', 1], ['2', '200', '2021-01-01 12:00:00', 0]]
)

train_df: pd.DataFrame = offline_feature_store.get_feature_values(
    features=features,
    population=population_df,
    point_in_time_column_name='timestamp'
)

print(train_df.head())

Example results:

# train_df
#    impression_id   purchase_id           timestamp           label   impressions.number_of_impressions   purchases.number_of_purchases   purchases.avg_purchase_amount
# 0       1               100       2021-01-02 17:00:00       1                   312                                         76                                4.796842
# 1       2               200       2021-01-01 12:00:00       0                    86                                          5                                1.548000

In this example, the label enriches the dataset; it is not used as a criterion for data selection. This approach is particularly useful when you already have a complete list of keys along with their respective timestamps. The API is designed for scenarios that combine data from multiple feature sets, and it returns no more than one record for each row in population_df. The JFrog ML feature store is time-series based: it stores each feature vector (key) within start_timestamp and end_timestamp bounds, so a single, most relevant result is retrieved for every unique key-timestamp combination.
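The point-in-time matching described above can be illustrated with plain pandas. This is a conceptual sketch only, not how the Offline Feature Store is implemented: the feature_rows table, the half-open window [start_timestamp, end_timestamp), and the choice to keep unmatched population rows with missing values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stored feature vectors, each valid within [start_timestamp, end_timestamp).
feature_rows = pd.DataFrame({
    "purchase_id": ["100", "100", "200"],
    "start_timestamp": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
    "end_timestamp": pd.to_datetime(["2021-01-02", "2021-01-03", "2021-01-02"]),
    "number_of_purchases": [5, 76, 5],
})

population = pd.DataFrame({
    "purchase_id": ["100", "200"],
    "timestamp": pd.to_datetime(["2021-01-02 17:00:00", "2021-01-01 12:00:00"]),
})


def point_in_time_lookup(population, feature_rows, key, ts_col):
    """Return one row per population row, enriched with the feature vector
    whose validity window contains that row's timestamp (assumed semantics)."""
    feature_cols = [
        c for c in feature_rows.columns
        if c not in (key, "start_timestamp", "end_timestamp")
    ]
    # Tag each population row so it can be matched back after filtering.
    pop = population.assign(_row=range(len(population)))
    candidates = pop.merge(feature_rows, on=key, how="left")
    in_window = (
        (candidates["start_timestamp"] <= candidates[ts_col])
        & (candidates[ts_col] < candidates["end_timestamp"])
    )
    matched = candidates.loc[in_window, ["_row", *feature_cols]]
    # Left join keeps population rows that have no matching window.
    return pop.merge(matched, on="_row", how="left").drop(columns="_row")


result = point_in_time_lookup(population, feature_rows, "purchase_id", "timestamp")
print(result)
```

Because the windows for a given key do not overlap in this sketch, each key-timestamp pair matches at most one stored row, which mirrors the "no more than one record per population row" behavior.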

Get Feature Range Values

This API retrieves features from a single offline feature set for a given time range. The resulting DataFrame contains all data points of that feature set within the time range. If population is provided, the result is filtered by the key values it contains.

Arguments:

  • features: FeatureSetFeatures - required

    The features to fetch from a single feature set.

  • start_date: datetime - required

    The lower time bound.

  • end_date: datetime - required

    The upper time bound.

  • population: pd.DataFrame - optional

    A DataFrame containing the following columns:

    • The key of the requested feature set (required).

    • Enrichments, e.g., labels (optional).

Returns: pd.DataFrame

Example call:

from datetime import datetime
import pandas as pd
from frogml.feature_store.offline import OfflineClientV2
from frogml.core.feature_store.offline.feature_set_features import FeatureSetFeatures

offline_feature_store = OfflineClientV2()

start_date = datetime(year=2021, month=1, day=1)
end_date = datetime(year=2021, month=1, day=3)
features = FeatureSetFeatures(
    feature_set_name='purchases',
    feature_names=['number_of_purchases', 'avg_purchase_amount']
)

train_df: pd.DataFrame = offline_feature_store.get_feature_range_values(
    features=features,
    start_date=start_date,
    end_date=end_date
)

print(train_df.head())

Example results:

# train_df
#      purchase_id           timestamp           purchases.number_of_purchases     purchases.avg_purchase_amount
# 0       1             2021-01-02 17:00:00               76                                4.796842
# 1       1             2021-01-01 12:00:00                5                                1.548000
# 2       2             2021-01-02 12:00:00                5                                5.548000
# 3       2             2021-01-01 18:00:00                5                                2.788000                         
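The range semantics described for this API (all data points within the time range, optionally filtered by the keys in population) can be sketched with plain pandas. This is an illustration only: filter_range is a hypothetical helper, the feature_rows table stands in for stored data, and treating both bounds as inclusive is an assumption that this page does not confirm.

```python
from datetime import datetime
from typing import Optional

import pandas as pd

# Hypothetical stored data points for the 'purchases' feature set.
feature_rows = pd.DataFrame({
    "purchase_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-01-02 17:00:00",
        "2021-01-01 12:00:00",
        "2021-01-02 12:00:00",
        "2021-01-01 18:00:00",
        "2021-01-05 09:00:00",  # outside the requested range
    ]),
    "purchases.number_of_purchases": [76, 5, 5, 5, 9],
})


def filter_range(
    feature_rows: pd.DataFrame,
    start_date: datetime,
    end_date: datetime,
    key: str,
    population: Optional[pd.DataFrame] = None,
    ts_col: str = "timestamp",
) -> pd.DataFrame:
    """Keep rows inside the time range (both bounds inclusive, by assumption);
    if a population is given, keep only the keys it contains."""
    result = feature_rows[feature_rows[ts_col].between(start_date, end_date)]
    if population is not None:
        result = result[result[key].isin(population[key])]
    return result.reset_index(drop=True)


start_date = datetime(year=2021, month=1, day=1)
end_date = datetime(year=2021, month=1, day=3)

all_rows = filter_range(feature_rows, start_date, end_date, key="purchase_id")
only_key_2 = filter_range(
    feature_rows, start_date, end_date, key="purchase_id",
    population=pd.DataFrame({"purchase_id": [2]}),
)
print(only_key_2)
```

Unlike the point-in-time lookup, a range query can return several rows per key, one for each data point that falls inside the range.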

Note

Current Limitations

The get_feature_range_values API is currently not available for Streaming Aggregations feature sets, and it cannot fetch (join) data from multiple feature sets in a single call.