How to Collect Behavioral Data? A Guide for Data Engineers and Analysts

So your company is launching a new product and you’ve been tasked with setting up the behavioral data infrastructure? Or maybe you need to revamp the existing infra using modern tools? 

There are a few different technologies (CDI, CDP, ELT) that can be used to collect behavioral data, and at the same time, there are many tools (Segment, RudderStack, Airbyte, etc.) with capabilities that span multiple technologies.

I know that navigating this maze and making an informed decision is daunting and time-consuming. Lucky for you though, I spent a great deal of time doing this during my time as the Head of Growth at Integromat and continue to keep track of all new data tools and technologies as I build astorik, a place to explore data tools.

My goal with this guide is to walk through the various behavioral data collection technologies as well as popular tools under each technology. But first, I’d like to shed some light on why behavioral data is important and where it comes from.

Why collect behavioral data?

Behavioral data is generated when users perform actions (events) while interacting with a product, and is therefore also referred to as event data or product-usage data.

Behavioral data serves two main purposes for teams — understanding how the product is being used or not used (user behavior) and building personalized customer experiences across various touchpoints to influence user behavior.

Understanding product usage requires prior instrumentation of the features whose usage you’d like to measure — tracking the events a user performs and sending those events to third-party tools for analysis. Similarly, you need to track events based on which you’d like to trigger campaigns and experiences via downstream activation tools.

Launching new features without instrumenting them beforehand is a classic mistake — it takes away the opportunity to analyze how those features are used (if at all) and to trigger in-app experiences or messages when relevant events take place (or don’t).
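To make instrumentation concrete, here is a minimal sketch of what tracking an event at the point of use might look like. The `Tracker` class, the `Event` fields, and the `export_report` feature are all hypothetical stand-ins for illustration; a real data collection SDK will have its own method names and event schema.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class Event:
    """One user action; the field names are illustrative, not a vendor schema."""
    name: str
    user_id: str
    properties: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


class Tracker:
    """Hypothetical stand-in for a data collection SDK: buffers events in memory."""

    def __init__(self):
        self.buffer = []

    def track(self, name, user_id, **properties):
        event = Event(name=name, user_id=user_id, properties=properties)
        self.buffer.append(event)
        return event

    def flush(self):
        """Serialize buffered events to JSON lines and clear the buffer."""
        lines = [json.dumps(asdict(e)) for e in self.buffer]
        self.buffer.clear()
        return lines


tracker = Tracker()


def export_report(user_id, report_format):
    # Instrument the feature at the point where the action happens.
    tracker.track("Report Exported", user_id, format=report_format)
    return f"report.{report_format}"
```

Calling `export_report("u1", "csv")` and then `tracker.flush()` yields one JSON line describing the event, which is roughly what gets shipped to an analysis or activation tool.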

Where does behavioral data come from? 

Although the events I’m referring to take place within your product, the actual source of behavioral data can be an external tool or service that is embedded within your product. For the love of simplicity, I like to categorize the data sources as primary and secondary.

Primary data sources

Your core product (a web app, mobile apps, a smart device, or a combination), powered by proprietary code, is a primary or first-party behavioral data source.

If your product is built using no-code tools, you won’t have a primary source for your behavioral data — you’d rely on the no-code tools to make behavioral data available to you (either via webhooks or integrations with data collection tools).  

To collect data from your primary sources, you can use the client and server-side SDKs or the APIs provided by data collection tools. 

Secondary data sources

Secondary data sources include all external or third-party tools that your customers interact with directly or indirectly — tools used for authentication, payments, in-app experiences, support, feedback, engagement, and advertising.

Customers interact with third-party tools indirectly or unknowingly when they are embedded within your core product experiences. Examples include Auth0 for authentication, Stripe for payments, and AppCues for in-app experiences — from a user’s point of view, they are using your product even when interacting with these external tools.

Customers also interact with external tools that are evidently not part of the core product experience but are integral touchpoints. Opening a support ticket via Zendesk, leaving feedback via Typeform, opening an email sent via Intercom, or engaging with an ad on Facebook — these are all interactions that help understand the customer journey. 

It’s also helpful to keep in mind that third-party tools generate a lot of data but not all of it is event data. What exactly you can collect in terms of events and objects depends on the integrations offered by the data collection tool you use.

To collect data from secondary sources, you can either use source integrations offered by data collection tools or write your own code.
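If you go the write-your-own-code route, the work usually boils down to mapping each tool's payload to a common event shape. The sketch below assumes a webhook body with `type`, `customer`, `created`, and `data` keys; those keys are made up for illustration, and every real tool documents its own webhook format.

```python
import json


def normalize_webhook(source, raw_body):
    """Map a third-party webhook payload to a common event shape.

    The payload keys ("type", "customer", "created", "data") are hypothetical;
    each real tool documents its own webhook format.
    """
    payload = json.loads(raw_body)
    return {
        "source": source,
        "name": payload["type"],
        "user_id": payload.get("customer"),
        "timestamp": payload.get("created"),
        "properties": payload.get("data", {}),
    }


# A made-up payload from a made-up support tool.
body = json.dumps({
    "type": "ticket.opened",
    "customer": "u1",
    "created": 1700000000,
    "data": {"priority": "high"},
})
event = normalize_webhook("support_tool", body)
```

Multiply this by every tool and every payload version you need to support, and the appeal of ready-made source integrations becomes clear.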

Technologies and tools to collect behavioral data

Just like all the layers of the modern data landscape, the data collection layer has experienced a lot of activity in the last couple of years, with the launch of several open source products that have become popular very quickly.

The overlap between products is also increasing as core capabilities are being extended to cover adjacent use cases.

The difference between CDI and CDP

CDI or Customer Data Infrastructure is a less common term that’s often confused with CDP or Customer Data Platform. 

A platform cannot exist without infrastructure — CDP is essentially a layer on top of CDI that offers a set of capabilities to do some cool stuff with the data using a visual interface. 

CDI is a standalone solution that can exist without a CDP, whereas a CDP is sold as an add-on by some CDI vendors.

Key aspects of a CDI are as follows:

  1. CDI (Customer Data Infrastructure) is purpose-built to collect behavioral data from primary or first-party data sources but some solutions also support a handful of secondary data sources (third-party tools).
  2. Data is typically synced to a cloud data warehouse like Snowflake, BigQuery, or Redshift, but most CDI solutions have the ability to sync data to third-party tools as well.
  3. All CDI vendors offer a variety of data collection SDKs and APIs.
  4. Some CDI solutions store a copy of the data, some make it optional, and some don’t. 
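The aspects above can be sketched as a tiny event router. This is only a mental model, not how any particular vendor implements it: `EventRouter`, `retain_copy`, and the list-based destinations are hypothetical names standing in for real SDK endpoints, warehouses, and third-party tools.

```python
class EventRouter:
    """Hypothetical CDI-style router: receives events and forwards them to destinations."""

    def __init__(self, destinations, retain_copy=False):
        self.destinations = destinations   # name -> callable that accepts an event dict
        self.retain_copy = retain_copy     # storing a copy of the data is optional
        self.stored = []

    def ingest(self, event):
        if self.retain_copy:
            self.stored.append(dict(event))
        delivered = []
        for name, send in self.destinations.items():
            send(event)
            delivered.append(name)
        return delivered


warehouse_rows = []   # stand-in for a cloud data warehouse table
tool_inbox = []       # stand-in for a third-party tool

router = EventRouter(
    {"warehouse": warehouse_rows.append, "email_tool": tool_inbox.append},
    retain_copy=True,
)
delivered = router.ingest({"name": "Signed Up", "user_id": "u1"})
```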

The core capabilities of a CDP include identity resolution and the ability for users to build and sync audiences to external tools using a drag and drop UI (without writing SQL). 
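Here is a rough sketch of what those two capabilities do under the hood, assuming a very simplified model: identity resolution as a union-find over identifiers, and an audience as a filter over resolved profiles. The identifier formats and the `build_audience` helper are hypothetical, and real CDPs apply far more elaborate matching rules.

```python
class IdentityGraph:
    """Union-find over identifiers; a simplified stand-in for identity resolution."""

    def __init__(self):
        self.parent = {}

    def find(self, identifier):
        self.parent.setdefault(identifier, identifier)
        root = identifier
        while self.parent[root] != root:
            root = self.parent[root]
        self.parent[identifier] = root  # shorten the path for later lookups
        return root

    def link(self, a, b):
        """Record that two identifiers belong to the same person."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def build_audience(events, graph, event_name, min_count):
    """Return resolved profiles that performed event_name at least min_count times."""
    counts = {}
    for e in events:
        if e["name"] == event_name:
            profile = graph.find(e["id"])
            counts[profile] = counts.get(profile, 0) + 1
    return {profile for profile, n in counts.items() if n >= min_count}


graph = IdentityGraph()
graph.link("user:42", "anon:abc")   # e.g. an anonymous visitor later logs in

events = [
    {"id": "anon:abc", "name": "Viewed Pricing"},
    {"id": "user:42", "name": "Viewed Pricing"},
    {"id": "anon:xyz", "name": "Viewed Pricing"},
]
audience = build_audience(events, graph, "Viewed Pricing", min_count=2)
```

Without the `link` call, no single identifier reaches two pricing views and the audience is empty; with it, the two views are attributed to the same person. A drag and drop audience builder is essentially a UI over this kind of filter.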

CDI and CDP tools

Connections and Personas are Segment’s CDI and CDP products respectively. mParticle takes a slightly different approach: it offers CDI capabilities along with identity resolution in its Standard edition, whereas audience building is available on the Premium plan. Both Segment and mParticle support data warehouses and a host of third-party tools as destinations, and both store a copy of your data that can be accessed later if needed.

RudderStack Event Stream and Jitsu are open-source CDI solutions positioned as alternatives to Segment Connections. Both products support warehouses and third-party tools but RudderStack offers a more extensive catalog of destinations.

Snowplow is the only CDI solution that literally calls itself a behavioral data platform. It is also open-source and, unlike the others, Snowplow doesn’t support third-party tools as destinations; it is focused on warehouses and a few open-source projects as destinations.

Other CDI solutions worth looking into are Freshpaint, which offers codeless or implicit tracking, and MetaRouter, a server-side CDI that only runs in a private cloud instance.

ELT tools

ELT tools are purpose-built to extract all types of data from a large number of third-party tools (secondary sources) and load the data into cloud data warehouses. That said, not all integrations offered by ELT tools support behavioral data or event data. 

ELT tools don’t store any data and don’t support third-party tools as destinations. 
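To illustrate the extract-load-transform order, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for a cloud data warehouse. The `fake_tickets_api` function and the `raw_tickets` table are hypothetical; a real ELT tool handles authentication, pagination, incremental syncs, and schema changes for you.

```python
import sqlite3


def fake_tickets_api():
    # Stand-in for a third-party tool's API response.
    return [
        {"id": 1, "status": "open", "requester": "u1"},
        {"id": 2, "status": "solved", "requester": "u2"},
    ]


def extract(records_api):
    """Extract: pull raw records from a source (a hypothetical callable here)."""
    return list(records_api())


def load(conn, records):
    """Load: write the records into the warehouse without reshaping them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_tickets (id INTEGER, status TEXT, requester TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_tickets VALUES (:id, :status, :requester)", records
    )
    conn.commit()


conn = sqlite3.connect(":memory:")   # stand-in for a cloud data warehouse
load(conn, extract(fake_tickets_api))

# Transform: happens after loading, inside the warehouse, using SQL.
status_counts = dict(
    conn.execute("SELECT status, COUNT(*) FROM raw_tickets GROUP BY status").fetchall()
)
```

Note that nothing is kept by the pipeline itself; once the rows land in the warehouse, the data lives only there.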

Airbyte is an open-source ELT tool with a growing library of connectors and a thriving community of contributors. Airbyte offers source connectors for 150+ tools like Zendesk, Intercom, Stripe, Typeform, and Facebook Ads, many of which generate event data. Airbyte also offers a Connector Development Kit (CDK) that you can use to build integrations that are maintained by Airbyte’s community members.

Other ELT vendors include Fivetran, Stitch, and Meltano (also open-source). 

As mentioned earlier, CDI solutions also offer source integrations with a few third-party tools but those are not as comprehensive and deep as the integrations offered by ELT tools.

When contemplating whether to use an ELT tool or a source integration of a CDI tool to extract data from a third-party tool, consider the following:

  • CDI is best-in-class for collecting behavioral data from primary or first-party data sources: web apps, mobile apps, and IoT devices.
  • ELT is best-in-class for collecting all types of data, including behavioral data, from secondary data sources: third-party tools that power various customer experiences.

Product analytics tools 

Amplitude, Mixpanel, Indicative, Heap, and PostHog (open-source) are purpose-built tools for behavioral data analysis. At the same time though, all of these offer SDKs and APIs to collect data from your primary (first-party) data sources. 

Product analytics tools by nature store a copy of your data and allow you to export the data (usually for an additional fee). You can also use Airbyte’s integrations with Amplitude, Mixpanel, or PostHog to export data from those tools to a destination supported by Airbyte. 

However, it’s important to keep in mind that beyond analysis, there are plenty of activation use cases for behavioral data. 

As a best practice, companies must set up a data warehouse to store a copy of all data they collect — using purpose-built data collection tools (CDI and ELT) is more efficient, prevents vendor lock-in, and just makes more sense.

Custom tracking solutions

If readymade solutions are not for you, you can always build a custom tracking service that collects data from your apps and syncs it to your warehouse and downstream applications. That said, having first-hand experience with such a solution, I can tell you that maintenance and troubleshooting are not trivial and the frustration is real.

More importantly, with so many different flavors of CDI and ELT solutions available, building one’s own is just not the best use of engineering resources. In fact, engineers generally hate building integrations — you’re probably one so let me know if I’m wrong. 

Conclusion

CDI tools are purpose-built for behavioral data and I would highly recommend adopting one to collect data from primary or first-party data sources, and sticking to your ELT tool to collect data from secondary or third-party sources.

Now that you have a better picture of the tools needed to collect behavioral data for analysis and activation, don’t forget to collaborate with stakeholders from various teams when it comes to deciding which events to track and what data to send to which destination. 
