Working with PII data

Personally Identifiable Information, or PII for short, is any piece of information that can be used to identify a specific person. Examples include a name, an email address, a phone number, or an IP address. PII is important to protect because its exposure can erode the privacy of individuals. Many systems collect and emit PII, so it is critically important to limit the scope of this data. You can read more about PII in the NIST guide.

This guide shows how Streamfold can help you derive value from PII while still protecting users' privacy. This lets you connect critical data sources to vendors and storage systems without exposing PII.

In this guide we'll walk through several ways to protect PII:

  1. Masking PII fields
  2. Anonymizing PII
  3. Encrypting PII

In several of the examples below we will make use of the Bloblang Code function.

Masking PII fields

One way to hide PII is to hash it with a one-way hash function. This hides the original values, but still allows you to identify the same input value across events. For example, the following Bloblang example hashes the fullname field with the SHA1 algorithm, keyed with an HMAC shared secret, and hex-encodes the result.

Now when you compare multiple records, you can group those with the same fullname without revealing the user's real name. Using an HMAC secret key makes it more difficult for someone to reverse the hashes by computing SHA1 hashes of known input values.

root = this
root.fullname_hash = root.fullname.hash("hmac_sha1", "secret-key").encode("hex")
root.fullname = deleted()

Input:

signup: "2024-08-19 14:30:14+00:00"
fullname: "Jules Perry"
status: "OK"

Output:

signup: "2024-08-19 14:30:14+00:00"
fullname_hash: "3c640982274ea6428b56b80db35ccfca6cc870dc"
status: "OK"
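
If you later need to look up a known user in the hashed data, the same keyed hash can be computed outside Streamfold. The following is a minimal Python sketch, assuming that the hmac_sha1 hash corresponds to standard HMAC-SHA1 over the UTF-8 bytes of the value and the key. The function names mask_value and mask_field are hypothetical and not part of Streamfold.

```python
import hashlib
import hmac


def mask_value(value: str, secret: str) -> str:
    """Return the hex-encoded HMAC-SHA1 of value, keyed with secret."""
    digest = hmac.new(secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha1)
    return digest.hexdigest()


def mask_field(event: dict, field: str, secret: str) -> dict:
    """Return a copy of event with field replaced by a <field>_hash entry."""
    masked = {k: v for k, v in event.items() if k != field}
    masked[field + "_hash"] = mask_value(event[field], secret)
    return masked


if __name__ == "__main__":
    event = {
        "signup": "2024-08-19 14:30:14+00:00",
        "fullname": "Jules Perry",
        "status": "OK",
    }
    print(mask_field(event, "fullname", "secret-key"))
```

Because the hash is deterministic for a given key, hashing a known name with the same secret gives a value you can search for among the masked records.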

Anonymizing PII

We can anonymize PII by replacing it with something less specific to a single user. For IP addresses, this is often done by dropping the last bits of the address, leaving only the network address. For example, for IPv4 we can replace the last octet of the address with zero. In Streamfold we can do this using the Replace function.

Consider an input of the following:

method: "POST"
url: "/signup"
client_ipv4: "192.168.33.45"

With the following parameters to the Replace function:

  • Field name: client_ipv4
  • Expression: ^(?P<ipv4>[0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+$
  • Replacement value: ${ipv4}.0

We end up with the following output:

method: "POST"
url: "/signup"
client_ipv4: "192.168.33.0"
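
To check the expression against sample addresses before configuring the Replace function, the same substitution can be sketched in Python. This is only an illustration: the function name anonymize_ipv4 is hypothetical, and Python writes the named-group replacement as \g<ipv4> rather than ${ipv4}.

```python
import re

# Same expression as in the Replace function parameters above.
IPV4_PATTERN = re.compile(r"^(?P<ipv4>[0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+$")


def anonymize_ipv4(address: str) -> str:
    """Replace the last octet with 0; leave non-matching input unchanged."""
    return IPV4_PATTERN.sub(r"\g<ipv4>.0", address)


if __name__ == "__main__":
    print(anonymize_ipv4("192.168.33.45"))  # prints 192.168.33.0
```

Values that do not match the expression pass through unchanged, so it is worth testing with any malformed addresses you expect in your data.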

Encrypting PII

You may want to keep PII fields intact, but encrypt them so that intermediary systems cannot access the real value. Take, for example, a Streamfold stream that archives raw log data to an S3 bucket. You may want to prevent those with access to the files in S3 from reading the real values, while still keeping the PII in case you need it for later forensics.

We can again use our Bloblang Code function to encrypt event field values:

let key = "a3dbf7c55d212552ae820c5e0f8bc153".decode("hex")
let vector = "d90fc275dcf93de85ae79f0984d4abb1".decode("hex")
root = this
root.ssn_encrypted = root.ssn.encrypt_aes("ctr", $key, $vector).encode("hex")
root.ssn = deleted()

Input:

ssn: "555-90-1234"

Output:

ssn_encrypted: "750f80368313af5416e22c"
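
The key and vector in the mapping above are each 32 hexadecimal characters, which decode to 16 bytes. As a rough sketch, values of that shape could be generated with Python's standard secrets module. The helper name generate_hex is hypothetical, and how you store and distribute the key is outside the scope of this example.

```python
import secrets


def generate_hex(num_bytes: int = 16) -> str:
    """Return num_bytes of cryptographically strong random data as hex."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # 16 bytes each, matching the length of the example key and vector.
    print("key:   ", generate_hex())
    print("vector:", generate_hex())
```

Note that in CTR mode, values encrypted with the same key and vector that share a prefix produce ciphertexts that share a prefix, so treat the fixed vector in the example as illustrative.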

Later, if we decided to replay this traffic through Streamfold we could decrypt the values back to their originals:

let key = "a3dbf7c55d212552ae820c5e0f8bc153".decode("hex")
let vector = "d90fc275dcf93de85ae79f0984d4abb1".decode("hex")
root = this

root.ssn = root.ssn_encrypted.decode("hex").decrypt_aes("ctr", $key, $vector).string()
root.ssn_encrypted = deleted()

Wrap up

You should now be comfortable with several approaches to handling PII data in Streamfold. Give it a try and let us know what you come up with!
