Streams

Streams are the central component in Streamfold. This document will walk you through the components that compose a stream and how they operate.


Introduction

A stream connects a source to a destination and allows you to filter or transform the data passing through it. Any event that traverses the entire stream is delivered to the destination. A typical setup consists of multiple streams connecting multiple sources and destinations. Together, the collection of all configured streams is known as a topology configuration. When you modify the collection of streams in your organization, a new topology configuration is pushed down to the worker nodes that compose the data plane. These nodes atomically deploy the new topology configuration, routing data from the sources to the new streams.
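
As a rough illustration only, a topology can be pictured as the set of streams, each tying one source to one destination. The names and shapes below are assumptions for the sketch, not Streamfold's actual configuration format:

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical shapes for illustration only; Streamfold's real
    # configuration objects and field names may differ.
    @dataclass
    class Stream:
        name: str
        source: str        # name of a configured source
        destination: str   # name of a configured destination
        functions: List[str] = field(default_factory=list)

    @dataclass
    class Topology:
        """The collection of all configured streams in an organization."""
        streams: List[Stream]

    # Two streams sharing one source; this whole collection is the topology
    # configuration that gets pushed down to the data plane workers.
    topology = Topology(streams=[
        Stream("logs-to-s3", source="datadog-agent", destination="s3-archive"),
        Stream("metrics-to-datadog", source="datadog-agent", destination="datadog"),
    ])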

Components

A stream is composed of several high-level components:

  • Source
  • Filters
  • Functions
  • Destination

The next several sections cover each of these components in more detail.

Sources

Sources are how you ingest data into Streamfold. A source can be push-based, which means that an external system writes data to a worker node, or it can be pull-based, whereby the worker node scrapes data from an external endpoint. Today Streamfold supports two push-based sources: Datadog Agent and HTTP.

Let us know!

We are always looking to expand the ways to ingest data, so we will likely add more sources in the future. If there are particular data sources you would like to see supported, please let us know!

Each source will ingest incoming data and emit one or more Events. The method by which an incoming request is transformed into an event is particular to a given source. However, once converted into an event, you can select and operate on the data using the same methods regardless of source. See the Data Model section for more detail on how events are composed and the Selector Syntax reference for more information on using selectors.
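
As a rough illustration, here is a Python sketch of what an event and a selector-style lookup might look like. The field layout and the select helper are assumptions for this sketch, not the actual event schema or Selector Syntax:

    # A sketch of an event as nested key/value data. The real event schema is
    # described in the Data Model section; this particular layout is assumed.
    event = {
        "source": "http",
        "attributes": {
            "environment": "staging",
            "host": "web-01",
        },
        "message": "GET /healthz 200",
    }

    def select(event: dict, path: str):
        """Toy selector: walk a dotted path such as 'attributes.environment'.
        The real Selector Syntax is covered in its own reference."""
        value = event
        for part in path.split("."):
            value = value[part]
        return value

    print(select(event, "attributes.environment"))  # -> "staging"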

When you create a new stream, you start by choosing a data source. The events emitted from that data source are then sent to that stream. A single source can be connected to multiple streams; in that case, each event is duplicated across every stream the source is connected to.

Filters

Filters are how you choose which events to include in a stream or which events a function should apply to. Filters are constructed using the Streamfold Expression Language (SEL). A filter composes a set of conditions by selecting fields from an event and evaluating them against a boolean expression. For example, you may want to operate only on events where the environment field is equal to "staging".
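
Conceptually, a filter is a boolean predicate over an event's fields. The Python sketch below is not SEL; it only illustrates the kind of condition a filter expresses, using an assumed field layout:

    def staging_only(event: dict) -> bool:
        # Include the event only when its environment field equals "staging".
        return event.get("attributes", {}).get("environment") == "staging"

    print(staging_only({"attributes": {"environment": "staging"}}))     # True
    print(staging_only({"attributes": {"environment": "production"}}))  # False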

There are two primary types of filters in a stream: Stream Filters and Function Filters. They are covered below in more detail.

Stream filters

For each stream you can optionally add a stream filter. A stream filter allows you to select which events from a source should be sent to that stream. It is often useful to select a subset of the events from a source for a given stream. For example, you may only want events that correspond to log lines to be included on a stream that archives to S3, excluding metric events.

A stream filter is an opt-in filter: only events matching it will be included. A stream with no filter includes all events emitted from the source.

If you know a stream should only operate on certain event types, it is best practice to use a stream filter that selects only those events. Other functions, like Drop, can further reduce the events that pass entirely through a stream, but it's better to start with a selective set of matching events when possible.
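
As a rough sketch of that best practice (the event fields here are assumed), a stream filter keeps only log events before any functions run, so the metric events never enter the S3 archive stream:

    # Assumed: each event carries a "type" field of "log" or "metric".
    source_events = [
        {"type": "log", "message": "disk almost full"},
        {"type": "metric", "name": "cpu.idle", "value": 12.5},
        {"type": "log", "message": "request served"},
    ]

    def stream_filter(event: dict) -> bool:
        # Opt-in: only log events enter the S3 archive stream.
        return event.get("type") == "log"

    s3_archive_stream = [e for e in source_events if stream_filter(e)]
    print(len(s3_archive_stream))  # 2 -- the metric event never enters the stream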

Function filters

Every function has either a required or an optional filter that can be set. A function filter restricts the events that a given function applies to. If a function has a filter and an event does not match it, processing of that event skips over the function and moves on to the next one. It is best practice to limit functions to only the events they apply to; this can improve processing performance and reduce errors from applying functions to events that don't match. The lack of a filter on a function means the function is applied to all events.

The Drop function has a required filter, so you must specify which events will match the filter and hence be dropped.
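
The sketch below illustrates the idea, not the actual implementation: a function filter decides whether a function runs at all for a given event, and Drop's required filter selects the events to discard:

    def apply_function(event, fn, event_filter=None):
        """Run fn only when the filter matches; otherwise skip the function
        and pass the event through unchanged. No filter means fn applies to
        every event."""
        if event_filter is not None and not event_filter(event):
            return event
        return fn(event)

    # Drop's required filter selects which events to discard; the function
    # itself simply emits nothing (represented here as None).
    is_debug = lambda e: e.get("level") == "debug"
    drop = lambda e: None

    print(apply_function({"level": "debug"}, drop, is_debug))  # None -- dropped
    print(apply_function({"level": "error"}, drop, is_debug))  # passes through unchanged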

Functions

Every stream supports adding a list of functions that operate on events as they flow through the stream. Functions allow you to further filter, transform, or enrich events in the stream. There are a number of built-in functions supported by Streamfold, and we will be adding to that list over time, including support for custom functions. Each built-in function is documented individually, including supported options and example uses.

Functions operate in sequential order on a stream. Events are processed in that order through the list of functions until an event has passed through each function or a function drops it. Functions emit zero or more events as their output; those emitted events are then processed through the next function, and so on. Events emitted from the last function in a stream are delivered to the stream's destination.
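
A minimal sketch of that processing loop, assuming each function takes one event and returns the list of events it emits (the real worker implementation is more involved):

    from typing import Callable, Iterable, List

    Event = dict
    Function = Callable[[Event], List[Event]]  # each function emits zero or more events

    def run_stream(events: Iterable[Event], functions: List[Function]) -> List[Event]:
        """Process each event through the functions in order; whatever the
        last function emits is what gets delivered to the destination."""
        delivered: List[Event] = []
        for event in events:
            batch = [event]
            for fn in functions:
                next_batch: List[Event] = []
                for e in batch:
                    next_batch.extend(fn(e))  # a function may emit 0..n events
                batch = next_batch
                if not batch:                 # the event was dropped; stop early
                    break
            delivered.extend(batch)
        return delivered

    # Example: drop debug events, then tag the survivors.
    drop_debug = lambda e: [] if e.get("level") == "debug" else [e]
    add_tag = lambda e: [{**e, "team": "platform"}]
    print(run_stream([{"level": "debug"}, {"level": "error"}], [drop_debug, add_tag]))
    # -> [{'level': 'error', 'team': 'platform'}]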

If a function has a filter applied to it and an event does not match the filter, event processing moves on to the next function.

Destinations

Destinations are the final egress for a stream and deliver events to external systems. Examples include sending observability data to your Datadog account or sending logs to Amazon S3 for archiving. Any event emitted from a stream after processing through each of the functions is delivered to the destination. A destination is the final endpoint for an event on a stream.

A single destination may be connected to multiple streams. This is common when you want a different set of processing functions for different data types but want to deliver the resulting events to the same location. For example, you may have a separate stream for processing log messages than you do for metrics, but you want to deliver them all to Datadog. In that scenario, both streams would use the Datadog destination.

Destinations may have certain requirements for the event types and event schemas they are able to support. These requirements will be documented as part of the specific destination.

Configuration

Streams can be in active or inactive mode. Streams start inactive, meaning that no events are sent to the destination configured on the stream. When a stream is inactive, you can make changes to the filter or functions without impacting any of your data streams.

You move a stream to active by enabling the "Write to Destination" option for the stream. When this is enabled, the stream configuration is pushed down to the data plane workers. If your data source is sending events, those events will start to be processed on the stream and written to the destination. While a stream is active, any time you save a modification to a stream filter or function property, the change is pushed immediately down to the workers.

You can move an active stream back to inactive by turning off the "Write to Destination" option. This will stop any events from processing on the stream and no events will be delivered to the destination from that stream.

Partitioning

At the data plane level, every stream has a given number of partitions. Partitions allow the data plane to process multiple events concurrently, letting it scale to higher event volumes. A partition enforces a logical ordering, so events are processed in order along a given partition and delivered in order to the destination. For example, this ensures that log messages from a file on a given host are processed in order, so functions and destinations see the same ordered set of logs you would see if you tailed the log file on that host.

Events are assigned to a partition by a partition key. The partition key for an event is assigned by the data source and represents the unit for which logical ordering of messages should be enforced. For example, this may be the file name or hostname that an event was generated from, ensuring ordering is maintained relative to it. At the moment the partition key cannot be set by the user, but we intend to provide more flexibility in the future.
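
As an illustration of the idea only (the actual assignment inside the data plane may differ), events with the same partition key always land on the same partition, which is what preserves their relative order:

    import hashlib

    NUM_PARTITIONS = 4  # set internally by the data plane in practice

    def partition_for(partition_key: str) -> int:
        """Stable hash of the key (e.g. a hostname or file name assigned by
        the source), so the same key always maps to the same partition."""
        digest = hashlib.sha256(partition_key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

    # All events from web-01's log file share a partition, so their order
    # is preserved through the stream and to the destination.
    print(partition_for("web-01:/var/log/app.log"))
    print(partition_for("web-02:/var/log/app.log"))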

To maximize throughput for high-volume streams, stream filters and functions are built with a shared-nothing architecture. This means each partition of a stream is processed concurrently, without locking across partitions.

The number of partitions for a stream is set internally based on the available resources and throughput of the data plane. It cannot currently be custom-configured.

Back pressure

A stream may experience back pressure if the incoming rate of events from a data source is greater than the rate the stream's functions or destination can handle. To limit data loss, Streamfold uses minimal buffering in a stream pipeline, so back pressure on the stream can be pushed back onto the source. Each source can handle back pressure differently, but some sources may respond with error responses if they cannot deliver incoming data to a stream. Once the back pressure is resolved, sources can once again push data onto the stream. If a source is connected to multiple streams, back pressure from a single stream can impact the source across all streams it is connected to.
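
As a rough sketch of that behavior, using a bounded in-memory buffer as a stand-in for a stream's minimal buffering (the queue size, status codes, and names are illustrative assumptions):

    import queue

    # A bounded queue stands in for a stream's minimal buffering; the size,
    # status codes, and names here are illustrative assumptions.
    stream_buffer: "queue.Queue[dict]" = queue.Queue(maxsize=100)

    def handle_push(event: dict) -> int:
        """Push-based source handler: accept the event if the stream can take
        it, otherwise signal back pressure with an error status."""
        try:
            stream_buffer.put_nowait(event)
            return 202  # accepted and handed to the stream
        except queue.Full:
            return 429  # stream is backed up: push the pressure onto the sender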

Future support

In the future we plan to add more support for handling back pressure internal to Streamfold. This may include options to buffer data in memory or on persistent disk in the case of unexpected back pressure.
