Refining Intellum's Data Processing Systems
Andres Bravo
About Andres Bravo Gorgonio
Andres Bravo Gorgonio is a Spain-based software engineer at Intellum who focuses on helping his data team extract, transform and migrate customer data. He has built shops and huge CMS products from scratch throughout his career, playing multiple roles like developer, scrum master, and UI/UX consultant. Before joining Intellum in 2017, he worked at growing startups like Qustodian, Redbooth, Ztory, and ProFinda.
In a conversation with Andres, we learn how Intellum has revolutionized the online learning industry to change how people consume educational content. With high-profile clients like Google, Facebook, and Amazon, his team needs reliable software that can consolidate and process large databases of information quickly and efficiently. He discusses how the search for a feasible solution led his squad to Airbyte and its flexible, open-source model.
About Intellum
Intellum, the pioneer of online learning, has transformed virtual education with its easy-to-use platform for on-demand learning. We help our clients educate their customers, partners, and employees, including some of the world's fastest-growing companies. Companies can present, manage, track, and improve learning events. My team has created a new market space for online learning platforms, attracting the attention and interest of different competitors.
What was the problem?
Unifying online content into one platform
In recent years, the shift to distance learning has led to an increase in global content creators, and more users have started taking courses online. To consolidate the content generated by our partners and customers, we re-platformed our solution into a single learning platform called "Evolve". With Evolve, users can access multiple e-learning courses simultaneously without opening separate tabs on their mobile devices. The result is a much richer, highly interactive learning experience, plus easy tracking of e-learning metrics. However, gathering multiple data sources into a single platform proved a challenge.
"Learners completed Evolve content a staggering 90.3% of the time, indicating its effectiveness in encouraging students to participate and complete courses."
Meeting local compliance regulations
Our data teams treat the customer data stored within our service as highly confidential. Using third-party services that are proprietary, uncertified, or running outside the allowed geographic boundaries would raise our business risk. To meet local compliance requirements, we must therefore keep our data strictly within the geographical service boundary.
Our old architecture was not good enough to meet our requirements because:
Unreliable under heavy data volumes
For data ingestion into BigQuery, the legacy data pipeline architecture relied heavily on home-grown custom scripts built by our DevOps teams. Singer was used to connect Salesforce and Zendesk. A significant amount of data had to be lifted and shifted from multiple databases, each with over 1 TB of data. The process was inefficient and difficult to maintain. The home-grown scripts were also resource-intensive, sometimes causing overloaded machines to crash. We wanted more predictability in our business, and the lack of reliability was not helping.
"We have large databases with 20 years' worth of data, and this is a lot to process. The manual system we were using was not effective, so we're looking for a reliable system that splits up the ETL process easily and handles things more incrementally."
Performance pitfalls in the script
Apart from reliability, the home-grown scripts had several performance pitfalls. For example, they could process only one table at a time, and each table took a long time to complete. Table data also needed to be kept in sync, and storing incremental state was complex. With so many data sources to export, the scripts were also difficult to configure and tune.
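To illustrate why storing incremental state matters, here is a minimal Python sketch of cursor-based incremental extraction. It is not our actual script: the `SyncState` class, the `incremental_extract` function, and the `courses` table with its `updated_at` column are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SyncState:
    """Remembers, per table, the highest cursor value already exported."""
    cursors: Dict[str, Any] = field(default_factory=dict)


def incremental_extract(
    table: str, rows: List[Dict[str, Any]], cursor_field: str, state: SyncState
) -> List[Dict[str, Any]]:
    """Return only rows newer than the saved cursor, then advance the cursor."""
    last_seen = state.cursors.get(table)
    new_rows = [
        row for row in rows
        if last_seen is None or row[cursor_field] > last_seen
    ]
    if new_rows:
        state.cursors[table] = max(row[cursor_field] for row in new_rows)
    return new_rows


# Hypothetical sample data: ISO dates compare correctly as strings.
state = SyncState()
rows = [
    {"id": 1, "updated_at": "2022-01-01"},
    {"id": 2, "updated_at": "2022-01-02"},
]
first_run = incremental_extract("courses", rows, "updated_at", state)

rows.append({"id": 3, "updated_at": "2022-01-03"})
second_run = incremental_extract("courses", rows, "updated_at", state)
```

The first run exports every row because no cursor has been saved yet; the second run exports only the row added since. Keeping one cursor per table is what lets a pipeline resume after a crash without re-exporting everything.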
How did we discover Airbyte?
As our platform grew exponentially, my team needed to solve the performance and data integration problems that plagued our old architecture. We needed a tool that could accelerate the ETL process for data analysis. In our search for an effective solution to integrate into our data processing systems, we discovered Airbyte.
How was the problem solved?
How did Airbyte solve our problem?
Break down tasks to handle large amounts of data
As part of daily operations, 30 TB of data is moved to BigQuery from more than 17 databases, including 15 or so tables that each contain more than 500 million rows. One of the schemas also has up to 300 tables! Our code integrates Airbyte APIs into different Docker instances to help manage these large data pipelines. As a result, we could select from a pool of available connectors, configure the necessary dbt transformations on the data, and load it seamlessly into BigQuery. In addition, using a container-based deployment approach allowed us to restart jobs easily and refresh or upgrade to different connectors on demand.
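As a rough illustration of what driving Airbyte from code can look like, the Python sketch below builds one sync request per connection. It is a sketch under assumptions, not our production code: it assumes the open-source Airbyte Config API endpoint `POST /api/v1/connections/sync`, which may differ between Airbyte versions, and the host and connection IDs are placeholders.

```python
import json
import urllib.request
from typing import List

# Placeholder values: replace with your own deployment's host and connection IDs.
AIRBYTE_URL = "http://localhost:8000"
CONNECTION_IDS = [
    "00000000-0000-0000-0000-000000000001",
    "00000000-0000-0000-0000-000000000002",
]


def build_sync_request(base_url: str, connection_id: str) -> urllib.request.Request:
    """Build (but do not send) a request asking Airbyte to sync one connection.

    Assumes the Config API endpoint POST /api/v1/connections/sync; check the
    API reference for the Airbyte version you run.
    """
    body = json.dumps({"connectionId": connection_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/connections/sync",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One small request per connection keeps each sync independently restartable.
sync_requests: List[urllib.request.Request] = [
    build_sync_request(AIRBYTE_URL, connection_id)
    for connection_id in CONNECTION_IDS
]

# To actually trigger a sync against a running instance:
#     urllib.request.urlopen(sync_requests[0])
```

Building the requests separately from sending them makes the orchestration easy to test without a running Airbyte instance.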
"We were looking for a reliable system to break down the ETL process into smaller steps which Airbyte does effectively, and the process is further made easier with the docker instances, connectors, and networking."
Fast incremental data syncing and increased scalability
Airbyte identifies and tracks changes to data and replicates them in the BigQuery data warehouse. Airbyte's Change Data Capture (CDC) allows incremental data changes across our platform to be captured easily and pushed to the data warehouse in near real time. Since Airbyte is open-source, we can look at the code under the hood and make modifications as needed. Additionally, with Airbyte's support for Kubernetes, our new architecture can scale horizontally to sync large amounts of data while keeping our server resources in check.
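The idea behind CDC can be sketched in a few lines of Python: instead of copying whole tables, the destination applies a stream of insert, update, and delete events. The event shape used here (`op`, `id`, `data`) is a hypothetical simplification for illustration and is not Airbyte's actual record format.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(replica: Dict[int, Row], changes: Iterable[Row]) -> Dict[int, Row]:
    """Apply change events, in order, to a replica keyed by primary key."""
    for change in changes:
        key = change["id"]
        if change["op"] == "delete":
            # Remove the row if present; ignore deletes for unknown keys.
            replica.pop(key, None)
        else:
            # Inserts and updates both overwrite the row with its latest version.
            replica[key] = change["data"]
    return replica


# Hypothetical replica and change stream.
replica: Dict[int, Row] = {1: {"id": 1, "title": "Intro"}}
changes = [
    {"op": "update", "id": 1, "data": {"id": 1, "title": "Intro v2"}},
    {"op": "insert", "id": 2, "data": {"id": 2, "title": "Advanced"}},
    {"op": "delete", "id": 1},
]
apply_changes(replica, changes)
```

Because only the changed rows travel to the warehouse, the cost of each sync depends on how much data changed rather than on the total table size.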
"With the quick and easy Airbyte setup process and with a few answers from the Airbyte community in Slack, Intellum was able to launch the new platform quickly and efficiently within two weeks. Airbyte worked out-of-the-box, and there was no long setup protocol like other tools."
With Airbyte's extensive set of APIs and real-time data synchronization abilities, we can capture all customer data and consolidate it in a single platform. In addition, by leveraging Airbyte's custom-built connectors, we can successfully track data and ensure everything runs smoothly within our technical architecture.
How do we feel about using Airbyte?
Today, Airbyte is an essential component of our data processing systems. In the future, as we grow and scale our use cases, our team is confident in Airbyte's capabilities and the support that we can get from experts in the community.
"We are extremely impressed by engagement in the Slack community, and the team makes it a point to respond to every question and issue on the Slack channel, to customers big and small."