Feldera: the Real-Time Compute Platform for Fraud Detection Teams
When it comes to fraud detection, training an accurate ML model is only the beginning. To ensure its effectiveness, you must supply the model with up-to-date features computed from the latest data. Building a real-time feature pipeline is a massive undertaking:
It requires many months of work from specialized platform teams.
Even then, the results may be less than ideal. Real-time feature pipelines frequently face issues like poor feature freshness and online/offline skew. This is exactly the type of weakness fraudsters will exploit to avoid or delay detection!
Feldera's unique incremental compute engine solves both challenges out of the box, making it an ideal compute platform for fraud detection teams:
Agility. With Feldera, even lean data teams can ship complex real-time feature pipelines within days.
Precision and speed. Feldera delivers perfect feature freshness and zero online/offline skew while processing millions of events per second.
Let's find out how this works!
Build feature pipelines in SQL
Feldera feature pipelines are written in SQL. Our powerful SQL engine supports all common types of feature queries, from simple transformations to complex enrichments, aggregates, window-based queries, and rolling aggregates. Complex features are built by composing simpler ones.
If you know SQL, you are ready to do real-time feature engineering with Feldera. Why not start right now? Just paste the following code into the Feldera online sandbox and hit RUN:
-- Credit card holders.
CREATE TABLE CUSTOMER (
    cc_num BIGINT NOT NULL PRIMARY KEY, -- Credit card number
    name VARCHAR,                       -- Customer name
    lat DOUBLE,                         -- Customer home address latitude
    long DOUBLE                         -- Customer home address longitude
) WITH (
    'materialized' = 'true',
    -- Configure the random data generator to generate 100000 customer records
    -- (see https://www.feldera.com/docs/connectors/sources/datagen).
    'connectors' = '[{
        "transport": {
            "name": "datagen",
            "config": {
                "plan": [{
                    "limit": 100000,
                    "fields": {
                        "name": { "strategy": "name" },
                        "cc_num": { "range": [ 100000000000000, 100000000100000 ] },
                        "lat": { "strategy": "uniform", "range": [ 25, 50 ] },
                        "long": { "strategy": "uniform", "range": [ -126, -67 ] }
                    }
                }]
            }
        }
    }]'
);
-- Credit card transactions.
CREATE TABLE TRANSACTION (
    ts TIMESTAMP LATENESS INTERVAL 10 MINUTES, -- Transaction time
    amt DOUBLE,                                -- Transaction amount
    cc_num BIGINT NOT NULL,                    -- Credit card number
    shipping_lat DOUBLE,                       -- Shipping address latitude
    shipping_long DOUBLE,                      -- Shipping address longitude
    FOREIGN KEY (cc_num) REFERENCES CUSTOMER(cc_num)
) WITH (
    'materialized' = 'true',
    -- Configure the random data generator to generate 1M transactions at a rate of 1000 transactions/s.
    'connectors' = '[{
        "transport": {
            "name": "datagen",
            "config": {
                "plan": [{
                    "limit": 1000000,
                    "rate": 1000,
                    "fields": {
                        "ts": { "strategy": "increment", "scale": 1000, "range": [1722063600000, 2226985200000] },
                        "amt": { "strategy": "zipf", "range": [ 1, 10000 ] },
                        "cc_num": { "strategy": "uniform", "range": [ 100000000000000, 100000000100000 ] },
                        "shipping_lat": { "strategy": "uniform", "range": [ 25, 50 ] },
                        "shipping_long": { "strategy": "uniform", "range": [ -126, -67 ] }
                    }
                }]
            }
        }
    }]'
);
-- Data enrichment query: left-join the TRANSACTION table with the CUSTOMER table to compute the
-- distance between the shipping address and the customer's home address for each transaction.
CREATE VIEW TRANSACTION_WITH_DISTANCE AS
SELECT
    t.*,
    ST_DISTANCE(ST_POINT(shipping_long, shipping_lat), ST_POINT(long, lat)) AS distance
FROM
    TRANSACTION AS t
    LEFT JOIN CUSTOMER AS c ON t.cc_num = c.cc_num;
-- Compute two rolling aggregates over a 1-day time window for each transaction:
-- 1. Average spend per transaction.
-- 2. The number of transactions whose shipping address is more than 50,000 meters away from
-- the card holder's home address.
CREATE VIEW TRANSACTION_WITH_AGGREGATES AS
SELECT
    *,
    AVG(amt) OVER window_1_day AS avg_1day,
    SUM(CASE WHEN distance > 50000 THEN 1 ELSE 0 END) OVER window_1_day AS count_1day
FROM
    TRANSACTION_WITH_DISTANCE
WINDOW window_1_day AS (PARTITION BY cc_num ORDER BY ts RANGE BETWEEN INTERVAL 1 DAYS PRECEDING AND CURRENT ROW);
Here we did three things:
1. Used CREATE TABLE statements to specify the schema of the two input data streams: credit card transactions and customers.
2. Connected an input data source (in this case, the random data generator) to each table.
3. Used CREATE VIEW statements to define several features over the input streams.
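To peek at the data flowing through the running pipeline, you can issue an ad hoc SQL query against one of the materialized tables. A minimal sketch, assuming your Feldera version exposes ad hoc queries in the console:

-- Sample five generated customer records from the running pipeline.
SELECT name, lat, long FROM CUSTOMER LIMIT 5;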
The queries in this example represent three important classes of ML features used in fraud detection. Let us take a closer look at them.
Data enrichment in SQL
In fraud detection, the raw stream of financial transactions does not usually contain enough information to identify fraud reliably. A stronger signal can be obtained by combining (or enriching) raw transaction data with data from other sources. In our example, each credit card transaction contains a shipping destination (expressed as latitude/longitude coordinates). To obtain a useful signal for identifying fraudulent transactions, we compute the distance from the shipping destination to the card holder's home address, extracted from the CUSTOMER table. To this end, we LEFT JOIN the TRANSACTION table with the CUSTOMER table:
CREATE VIEW TRANSACTION_WITH_DISTANCE AS
SELECT
    t.*,
    ST_DISTANCE(ST_POINT(shipping_long, shipping_lat), ST_POINT(long, lat)) AS distance
FROM
    TRANSACTION AS t
    LEFT JOIN CUSTOMER AS c ON t.cc_num = c.cc_num;
Thus, SQL joins provide a powerful (and familiar!) tool for expressing data enrichment queries.
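One detail worth noting: because this is a LEFT JOIN, a transaction whose cc_num has no match in CUSTOMER produces a NULL distance. Here is a minimal defensive sketch; the view name and the choice of 0 as a default are ours, not part of the running example:

-- Variant of the enrichment view that maps unknown customers to distance 0
-- instead of NULL, so that downstream aggregates never see NULLs.
CREATE VIEW TRANSACTION_WITH_DISTANCE_SAFE AS
SELECT
    t.*,
    COALESCE(ST_DISTANCE(ST_POINT(shipping_long, shipping_lat), ST_POINT(long, lat)), 0) AS distance
FROM
    TRANSACTION AS t
    LEFT JOIN CUSTOMER AS c ON t.cc_num = c.cc_num;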
Rolling aggregation
Rolling aggregates are arguably the most important class of features in fraud detection. For each event in a time series, they compute an aggregate (e.g., count, sum, average, standard deviation, etc.) over a fixed time frame (such as a day, an hour, or a month) preceding this event. This aggregate summarizes the entire recent transaction history into a single value. This summary, provided as a feature to the ML model, helps the model to detect anomalous behaviors.
Rolling aggregates can be expressed in ANSI SQL using the OVER clause:
CREATE VIEW TRANSACTION_WITH_AGGREGATES AS
SELECT
    *,
    AVG(amt) OVER window_1_day AS avg_1day,
    SUM(CASE WHEN distance > 50000 THEN 1 ELSE 0 END) OVER window_1_day AS count_1day
FROM
    TRANSACTION_WITH_DISTANCE
WINDOW window_1_day AS (PARTITION BY cc_num ORDER BY ts RANGE BETWEEN INTERVAL 1 DAYS PRECEDING AND CURRENT ROW);
However, modern streaming query engines either don't support rolling aggregates or make them prohibitively expensive, forcing users to employ workarounds such as replacing rolling windows with simpler hopping windows, thus sacrificing model accuracy.
Feldera is the only streaming analytics platform that supports rolling aggregates without compromising on either performance or accuracy. See our blog post on the topic for more details.
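Extending the feature set is a matter of declaring more windows. As an illustrative sketch (the view and column names here are ours, invented for the example), one could compute additional aggregates over a 7-day window on the same enriched stream:

-- Hypothetical additional rolling features over a 7-day window.
CREATE VIEW TRANSACTION_WITH_WEEKLY_AGGREGATES AS
SELECT
    *,
    AVG(amt) OVER window_7_day AS avg_7day, -- average spend over the past week
    MAX(amt) OVER window_7_day AS max_7day  -- largest transaction over the past week
FROM
    TRANSACTION_WITH_DISTANCE
WINDOW window_7_day AS (PARTITION BY cc_num ORDER BY ts RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW);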
Modular feature pipelines
Real-world feature pipelines compute hundreds of features, with complex features expressed as queries over simpler ones. Can we write such modular specifications in SQL? We can, using SQL views! For instance, the TRANSACTION_WITH_AGGREGATES view above is built on top of the TRANSACTION_WITH_DISTANCE view.
Feldera allows nesting views to arbitrary depth without sacrificing throughput or latency. This is in contrast with traditional SQL engines, where nested views are exceedingly expensive and are therefore considered an anti-pattern.
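To make this concrete, here is one more layer we could stack on top of the views defined earlier. The view name and thresholds are invented for illustration only:

-- Hypothetical third-level view: flag transactions that are both far from home
-- and well above the card holder's recent average spend.
CREATE VIEW SUSPICIOUS_TRANSACTIONS AS
SELECT *
FROM TRANSACTION_WITH_AGGREGATES
WHERE distance > 50000 AND amt > 5 * avg_1day;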
Unified online/offline processing
In feature engineering, feature queries must be evaluated in two different modes:
Offline (batch) mode: while training and testing the fraud detection model, inputs to the model are computed over historical datasets, loaded from a database or a data lake.
Online (streaming) mode: during real-time inference, feature vectors are computed over high-velocity streaming data.
The two modes traditionally require two separate implementations. It is the online version of the pipeline that tends to take months to build and suffers from loss of accuracy and poor performance.
Feldera eliminates the need for two implementations. The SQL queries we wrote above run equally well on batch and streaming inputs. Our SQL engine doesn't actually distinguish between the two. Conceptually, input to Feldera is a sequence of changes to SQL tables. It processes these changes, incrementally updating the results of the queries. Whether the changes arrive in a few large chunks (batch mode) or many small chunks (streaming mode) is immaterial.
Hence, as soon as you are done training and testing the model on batch data, your feature queries are ready to run on streaming inputs. You just need to connect new data sources to them (e.g., Kafka instead of S3).
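For instance, to point the TRANSACTION table at Kafka instead of the random data generator, only the connector configuration changes. A sketch, assuming a broker at broker.example.com:9092 and a topic named transactions (both placeholders; see the Feldera connector docs for the full set of options):

-- Same schema as before; only the connector differs.
CREATE TABLE TRANSACTION (
    ts TIMESTAMP LATENESS INTERVAL 10 MINUTES, -- Transaction time
    amt DOUBLE,                                -- Transaction amount
    cc_num BIGINT NOT NULL,                    -- Credit card number
    shipping_lat DOUBLE,                       -- Shipping address latitude
    shipping_long DOUBLE                       -- Shipping address longitude
) WITH (
    'materialized' = 'true',
    'connectors' = '[{
        "transport": {
            "name": "kafka_input",
            "config": {
                "bootstrap.servers": "broker.example.com:9092",
                "topics": ["transactions"],
                "auto.offset.reset": "earliest"
            }
        },
        "format": { "name": "json" }
    }]'
);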
Speed and precision
Feldera is not only easy to use, but also delivers unprecedented performance and accuracy on large-scale production workloads:
Throughput: Feldera can process millions of records per second on a single host while evaluating complex feature queries. A high-performance storage layer enables it to run larger-than-memory workloads without sacrificing speed.
Feature freshness: Upon ingesting new events, Feldera processes them and outputs new and updated feature vectors within milliseconds. As a result, ML inference can be performed on feature vectors derived from the latest inputs. This property, known as feature freshness, is particularly important in fraud detection, where the most recent history is the most useful for recognizing fraudulent activity.
No online/offline skew: Feldera not only runs the same queries in offline and online pipelines, but also guarantees identical outputs in both settings. More precisely, if we pause an online pipeline at any time and inspect its output views, their contents will be precisely the same as if we had run the same queries on the same input ingested as a single batch. Thus, Feldera completely eliminates online/offline skew, a major source of accuracy loss in real-time ML. We are not aware of another feature engineering platform that offers a similar guarantee.
Resources
Join our Slack community to discuss this article.
Use case: Real-time Feature Engineering for Credit Card Fraud Detection
We are working on a series of blog posts focusing on feature engineering with Feldera. Here are the posts published so far:
Real-time feature engineering is hard (but it doesn't need to be) - introduces real-time feature engineering and points out challenges when doing it with existing tools.
Real-time feature engineering with Feldera - a practical example implementing real-time credit card fraud detection with Feldera.
Real-time fraud detection using Feldera and Hopsworks - shows how Feldera integrates with a state-of-the-art real-time feature store.
Rolling Aggregates: the Most Important Feature in Your Real-Time Fraud Detection Pipeline - discusses rolling window aggregates, a class of feature queries that is essential for achieving perfect feature freshness and uniquely well supported by Feldera.