Feldera: the Real-Time Compute Platform for Fraud Detection Teams

When it comes to fraud detection, training an accurate ML model is only the beginning. To ensure its effectiveness, you must supply the model with up-to-date features computed from the latest data. Building a real-time feature pipeline is a massive undertaking:

  • It requires many months of work from specialized platform teams.

  • Even then, the results may be less than ideal. Real-time feature pipelines frequently face issues like poor feature freshness and online/offline skew. This is exactly the type of weakness fraudsters will exploit to avoid or delay detection!

Feldera's unique incremental compute engine solves both challenges out of the box, making it an ideal compute platform for fraud detection teams:

  • Agility. With Feldera, even lean data teams can ship complex real-time feature pipelines within days.

  • Precision and speed. Feldera delivers perfect feature freshness and zero online/offline skew while processing millions of events per second.

Let's find out how this works!

Build feature pipelines in SQL

Feldera feature pipelines are written in SQL. Our powerful SQL engine supports all common types of feature queries, ranging from simple transformations to complex enrichments, aggregates, window-based queries, and rolling aggregates. Complex features are built by composing simpler ones.

If you know SQL, you are ready to do real-time feature engineering with Feldera. Why not start right now? Just paste the following code into the Feldera online sandbox and hit RUN:

-- Credit card holders.
CREATE TABLE CUSTOMER (
    cc_num BIGINT NOT NULL PRIMARY KEY, -- Credit card number
    name VARCHAR,                       -- Customer name
    lat DOUBLE,                         -- Customer home address latitude
    long DOUBLE                         -- Customer home address longitude
) WITH (
    'materialized' = 'true',
    -- Configure the random data generator to generate 100,000 customer records
    -- (see https://www.feldera.com/docs/connectors/sources/datagen).
    'connectors' = '[{
        "transport": {
            "name": "datagen",
            "config": {
                "plan": [{
                    "limit": 100000,
                    "fields": {
                        "name": { "strategy": "name" },
                        "cc_num": { "range": [ 100000000000000, 100000000100000 ] },
                        "lat": { "strategy": "uniform", "range": [ 25, 50 ] },
                        "long": { "strategy": "uniform", "range": [ -126, -67 ] }
                    }
                }]
            }
        }
    }]'
);

-- Credit card transactions.
CREATE TABLE TRANSACTION (
    ts TIMESTAMP LATENESS INTERVAL 10 MINUTES, -- Transaction time
    amt DOUBLE,                                -- Transaction amount
    cc_num BIGINT NOT NULL,                    -- Credit card number
    shipping_lat DOUBLE,                       -- Shipping address latitude
    shipping_long DOUBLE,                      -- Shipping address longitude
    FOREIGN KEY (cc_num) REFERENCES CUSTOMER(cc_num)
) WITH (
    'materialized' = 'true',
    -- Configure the random data generator to generate 1M transactions at a rate of 1000 transactions/s.
    'connectors' = '[{
        "transport": {
            "name": "datagen",
            "config": {
                "plan": [{
                    "limit": 1000000,
                    "rate": 1000,
                    "fields": {
                        "ts": { "strategy": "increment", "scale": 1000, "range": [ 1722063600000, 2226985200000 ] },
                        "amt": { "strategy": "zipf", "range": [ 1, 10000 ] },
                        "cc_num": { "strategy": "uniform", "range": [ 100000000000000, 100000000100000 ] },
                        "shipping_lat": { "strategy": "uniform", "range": [ 25, 50 ] },
                        "shipping_long": { "strategy": "uniform", "range": [ -126, -67 ] }
                    }
                }]
            }
        }
    }]'
);

-- Data enrichment query: left-join the TRANSACTION table with the CUSTOMER table to compute the
-- distance between the shipping address and the customer's home address for each transaction.
CREATE VIEW TRANSACTION_WITH_DISTANCE AS
SELECT
    t.*,
    ST_DISTANCE(ST_POINT(shipping_long, shipping_lat), ST_POINT(long, lat)) AS distance
FROM
    TRANSACTION AS t
    LEFT JOIN CUSTOMER AS c
        ON t.cc_num = c.cc_num;

-- Compute two rolling aggregates over a 1-day time window for each transaction:
-- 1. Average spend per transaction.
-- 2. The number of transactions whose shipping address is more than 50,000 meters away from
-- the card holder's home address.
CREATE VIEW TRANSACTION_WITH_AGGREGATES AS
SELECT
    *,
    AVG(amt) OVER window_1_day AS avg_1day,
    SUM(CASE WHEN distance > 50000 THEN 1 ELSE 0 END) OVER window_1_day AS count_1day
FROM
    TRANSACTION_WITH_DISTANCE
WINDOW window_1_day AS (PARTITION BY cc_num ORDER BY ts RANGE BETWEEN INTERVAL 1 DAYS PRECEDING AND CURRENT ROW);

Here we did three things:

  1. Used CREATE TABLE SQL statements to specify the data format for two input data streams: credit card transactions and customers.

  2. Connected input data sources (in this case, the random data generator) to each table.

  3. Used CREATE VIEW SQL statements to define several features over the input streams.

The queries in this example represent three important classes of ML features used in fraud detection. Let us take a closer look at them.

Data enrichment in SQL

In fraud detection, the raw stream of financial transactions does not usually contain enough information to identify fraud reliably. A stronger signal can be obtained by combining (or enriching) raw transaction data with data from other sources. In our example, each credit card transaction contains a shipping destination (expressed as latitude/longitude coordinates). To obtain a useful signal for identifying fraudulent transactions, we compute the distance from the shipping destination to the card holder's home address stored in the CUSTOMER table. To this end, we LEFT JOIN the TRANSACTION table with the CUSTOMER table:

CREATE VIEW TRANSACTION_WITH_DISTANCE AS
SELECT
    t.*,
    ST_DISTANCE(ST_POINT(shipping_long, shipping_lat), ST_POINT(long, lat)) AS distance
FROM
    TRANSACTION AS t
    LEFT JOIN CUSTOMER AS c
        ON t.cc_num = c.cc_num;

Thus, SQL joins provide a powerful (and familiar!) tool for expressing data enrichment queries.
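
Once the pipeline is running, you can sanity-check the enriched stream interactively with an ad-hoc query. A minimal sketch, assuming TRANSACTION_WITH_DISTANCE is declared as a MATERIALIZED VIEW (Feldera's ad-hoc queries read from materialized tables and views):

-- Hypothetical ad-hoc check: the ten transactions shipped furthest from home.
SELECT cc_num, amt, distance
FROM TRANSACTION_WITH_DISTANCE
ORDER BY distance DESC
LIMIT 10;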

Rolling aggregation

Rolling aggregates are arguably the most important class of features in fraud detection. For each event in a time series, they compute an aggregate (e.g., count, sum, average, or standard deviation) over a fixed time frame (such as an hour, a day, or a month) preceding this event. This aggregate summarizes the entire recent transaction history into a single value. Provided as a feature to the ML model, this summary helps the model detect anomalous behavior.

Rolling aggregates can be expressed in ANSI SQL using the OVER clause:

CREATE VIEW TRANSACTION_WITH_AGGREGATES AS
SELECT
    *,
    AVG(amt) OVER window_1_day AS avg_1day,
    SUM(CASE WHEN distance > 50000 THEN 1 ELSE 0 END) OVER window_1_day AS count_1day
FROM
    TRANSACTION_WITH_DISTANCE
WINDOW window_1_day AS (PARTITION BY cc_num ORDER BY ts RANGE BETWEEN INTERVAL 1 DAYS PRECEDING AND CURRENT ROW);

However, modern streaming query engines either don't support rolling aggregates or make them prohibitively expensive, forcing users to employ workarounds such as replacing rolling windows with coarser hopping windows, sacrificing model accuracy.
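
For comparison, here is a sketch of that typical workaround: a tumbling daily aggregate (a special case of a hopping window) that assigns every transaction in a calendar day the same average, instead of an exact average over the 24 hours preceding each individual transaction. The FLOOR(ts TO DAY) syntax is Calcite-style SQL; the view name is hypothetical:

-- The coarse workaround: one average per card per calendar day.
-- All transactions in a day share the same value, losing per-event precision.
CREATE VIEW DAILY_AVG AS
SELECT
    cc_num,
    FLOOR(ts TO DAY) AS day,
    AVG(amt) AS avg_amt
FROM TRANSACTION
GROUP BY cc_num, FLOOR(ts TO DAY);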

Feldera is the only streaming analytics platform that supports rolling aggregates without compromising on either performance or accuracy. See our blog post on the topic for more details.

Modular feature pipelines

Real-world feature pipelines compute hundreds of features, with complex features expressed as queries over simpler ones. Can we write such modular specifications in SQL? We can, using SQL views! For instance, the TRANSACTION_WITH_AGGREGATES view above is built on top of the TRANSACTION_WITH_DISTANCE view, and we can keep stacking views, as the sketch below shows. Feldera allows nesting views to arbitrary depth without sacrificing throughput or latency. This is in contrast with traditional SQL engines, where nested views are exceedingly expensive and are therefore considered an anti-pattern.
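
As an illustration, here is a hypothetical FEATURES view stacked on top of TRANSACTION_WITH_AGGREGATES; the threshold values are made up for the example:

-- A third layer of the pipeline, composed from the views defined earlier.
CREATE VIEW FEATURES AS
SELECT
    *,
    (amt > 3 * avg_1day) AS unusually_large,  -- spend spike vs. the 1-day average
    (count_1day > 5) AS many_remote_shipments -- repeated far-from-home shipping
FROM TRANSACTION_WITH_AGGREGATES;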

Unified online/offline processing

In feature engineering, feature queries must be evaluated in two different modes:

  • Offline (batch) mode: while training and testing the fraud detection model, inputs to the model are computed over historical datasets, loaded from a database or a data lake.

  • Online (streaming) mode: during real-time inference, feature vectors are computed over high-velocity streaming data.

The two modes traditionally require two separate implementations. It is the online version of the pipeline that tends to take months to build and that suffers from accuracy loss and poor performance.

Feldera eliminates the need for two implementations. The SQL queries we wrote above run equally well on batch and streaming inputs. Our SQL engine doesn't actually distinguish between the two. Conceptually, input to Feldera is a sequence of changes to SQL tables. It processes these changes, incrementally updating the results of the queries. Whether the changes arrive in a few large chunks (batch mode) or many small chunks (streaming mode) is immaterial.
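
To make this concrete, the two snippets below deliver the same changes to the TRANSACTION table, first as one batch, then row by row; either way, Feldera incrementally updates the downstream views to identical contents. The values are made up, and in practice these changes would arrive through connectors rather than manual INSERTs:

-- Batch mode: one large chunk of changes.
INSERT INTO TRANSACTION (ts, amt, cc_num, shipping_lat, shipping_long)
VALUES
    ('2024-07-27 10:00:00', 49.99, 100000000000001, 37.77, -122.42),
    ('2024-07-27 10:05:00', 12.50, 100000000000002, 40.71, -74.01);

-- Streaming mode: the same changes, one event at a time.
INSERT INTO TRANSACTION (ts, amt, cc_num, shipping_lat, shipping_long)
VALUES ('2024-07-27 10:00:00', 49.99, 100000000000001, 37.77, -122.42);
INSERT INTO TRANSACTION (ts, amt, cc_num, shipping_lat, shipping_long)
VALUES ('2024-07-27 10:05:00', 12.50, 100000000000002, 40.71, -74.01);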

Hence, as soon as you are done training and testing the model on batch data, your feature queries are ready to run on streaming inputs. You just need to connect new data sources to them (e.g., Kafka instead of S3).
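
For example, switching the TRANSACTION table from the random data generator to a Kafka topic only changes the connector configuration; every query stays untouched. A sketch, assuming a topic named transactions carrying JSON records and a broker at localhost:9092 (consult the Feldera Kafka connector docs for the exact configuration fields):

CREATE TABLE TRANSACTION (
    ts TIMESTAMP LATENESS INTERVAL 10 MINUTES,
    amt DOUBLE,
    cc_num BIGINT NOT NULL,
    shipping_lat DOUBLE,
    shipping_long DOUBLE
) WITH (
    'materialized' = 'true',
    -- Read JSON-encoded transactions from a Kafka topic instead of datagen.
    'connectors' = '[{
        "transport": {
            "name": "kafka_input",
            "config": {
                "topics": ["transactions"],
                "bootstrap.servers": "localhost:9092",
                "auto.offset.reset": "earliest"
            }
        },
        "format": { "name": "json" }
    }]'
);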

Speed and precision

Feldera is not only easy to use, but also delivers unprecedented performance and accuracy on large-scale production workloads:

  • Throughput: Feldera can process millions of records per second on a single host while evaluating complex feature queries. A high-performance storage layer enables it to run larger-than-memory workloads without sacrificing speed.

  • Feature freshness: Upon ingesting new events, Feldera processes them and outputs new and updated feature vectors within milliseconds. As a result, ML inference can be performed on feature vectors derived from the latest inputs. This property, known as feature freshness, is particularly important in fraud detection, where the most recent history is the most useful for recognizing fraudulent activity.

  • No online/offline skew: Feldera not only runs the same queries in offline and online pipelines, but also guarantees identical outputs in both settings. More precisely, if we pause an online pipeline at any time and inspect its output views, their contents will be precisely the same as if we ran the same queries on the same input ingested as a single batch. Thus, Feldera completely eliminates online/offline skew, a major source of accuracy loss in real-time ML. We are not aware of another feature engineering platform that offers a similar guarantee.

Resources

Questions? Join our Slack community to discuss this article.