Hey there, it's Simon, Data Engineer and Technical Writer at Airbyte. I'm excited to share the first edition of our new data-focused newsletter, dataNews.filter(). The name is a nod to how you filter in pandas or Python 🙃, something every data engineer does. It also signals that the newsletter is curated and filtered for you, saving you the time of scanning the latest news on socials or Substacks.
In this edition, we cover the following topics:
- Entity-Centric Data Modeling (ECM) and its benefits
- Latest updates in data engineering tools and techniques
- Inspiring reads and discussions
- Trending topics on social media platforms
Read on for more details and estimated reading times for each section.
But first, why did we create another newsletter? The goal is to keep up with the hypergrowth of the data ecosystem by offering a reliable stream of relevant content and, hopefully, some lesser-known inspirations here and there. That is what we try to achieve with dataNews.filter(). The format will be minimal and succinct, with added thought leadership, providing you with the highest value in the shortest time.
In this newsletter, we, as a company, curate a hand-picked list of the latest data engineering content to keep you up to speed with minimal effort.
Let's start with two topics that have been highly discussed over the past weeks: data modeling, with a new approach called entity-centric data modeling that Max introduced, and the never-ending question of whether orchestration is dead or alive, on which Stephen released a series as part of his Symposium. Let's dive in.
dataNews.select() - Entity-Centric Data Modeling (ECM) [3 minutes]
In this section, we pick one piece of recent news and elaborate on it. This time it is data modeling, and a new technique called entity-centric data modeling that emerged recently and is worth analyzing.
So what is data modeling, and why is it in vogue again? The TL;DR: there is a growing need for solid data architecture and efficient data-driven solutions. Data modeling involves creating a structured representation of an organization's data, which aids in the design of efficient data systems. It is essential for successful data projects, as it helps align business and performance requirements and enables a better understanding of the business from a data perspective. You can read my latest article on An Introduction to Data Modeling.
What is interesting is that, besides the many existing modeling techniques, such as dimensional modeling, data vault modeling, anchor modeling, bitemporal modeling, and many more, Max introduced a brand new one based on dimensional modeling and feature engineering, an ML practice. Essentially, the entity-centric model pushes the boundaries of data modeling by enriching dimensions with metrics and data structures.
Max has always been a great inspiration for me. He proposes combining dimensional modeling with feature engineering to address the multi-factual analysis of entities by adding metrics to dimensions. This new approach, called entity-centric data modeling (ECM), puts the core idea of an "entity" (e.g., user, customer, product, business unit, ad campaign, etc.) at the forefront. ECM addresses the shortcomings of traditional methods like Slowly Changing Dimensions by using snapshotting, which allows for easier management, point-in-time querying, and time-series analysis. It aligns with people's mental models of data and tabular datasets, making it a more user-friendly and efficient method for analysis.
Key takeaways from the article defining entity-centric modeling:
- Anchoring on entities and bringing metrics into dimensions
- Simplifying complex queries for segmentation, cohort creation, and difficult classification
- Techniques such as time-bound metrics, dimensional snapshots, and complex data structures
- Addressing challenges like circular dependencies in Directed Acyclic Graphs (DAGs) and wide tables through vertical partitioning, logical vertical partitioning, and using views
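To make the first and third takeaways more concrete, here is a minimal sketch in plain Python of bringing time-bound metrics into a dimension snapshot. It is not code from Max's article: the `orders` and `users` data, the `build_user_snapshot` helper, and the `orders_7d` / `revenue_28d` column names are all hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta

# Hypothetical fact rows: (user_id, order_date, amount)
orders = [
    ("u1", date(2023, 4, 1), 20.0),
    ("u1", date(2023, 4, 10), 35.0),
    ("u2", date(2023, 3, 1), 50.0),
    ("u2", date(2023, 4, 12), 15.0),
    ("u3", date(2023, 2, 15), 80.0),
]

# Hypothetical user dimension before enrichment
users = {
    "u1": {"country": "CH"},
    "u2": {"country": "US"},
    "u3": {"country": "DE"},
}


def build_user_snapshot(users, orders, as_of, windows=(7, 28)):
    """Return one enriched dimension row per user for the snapshot date `as_of`."""
    snapshot = []
    for user_id, attrs in users.items():
        row = {"user_id": user_id, "snapshot_date": as_of, **attrs}
        for days in windows:
            start = as_of - timedelta(days=days)
            # Amounts of this user's orders that fall inside the trailing window
            in_window = [
                amount
                for uid, order_date, amount in orders
                if uid == user_id and start < order_date <= as_of
            ]
            # Time-bound metrics stored directly on the dimension row
            row[f"orders_{days}d"] = len(in_window)
            row[f"revenue_{days}d"] = sum(in_window)
        snapshot.append(row)
    return snapshot


dim_user = build_user_snapshot(users, orders, as_of=date(2023, 4, 14))

# Segmentation becomes a plain filter on the dimension, with no join to the facts
active_recent = [row["user_id"] for row in dim_user if row["orders_7d"] > 0]
print(active_recent)
```

Because each row carries its `snapshot_date`, running the same function for several dates would give the point-in-time history that the snapshotting technique relies on.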
Max's entity-centric data modeling approach is a breath of fresh air that can make a significant difference in your data projects, improving query performance and providing actionable insights for data-driven decision-making. It's definitely worth diving into if you're interested in data modeling or looking for ways to optimize your existing data architectures. You can read the full article, Entity-Centric Data Modeling: A New Approach, or discuss it on Twitter or LinkedIn.
dataNews.update() - Latest Updates in Data Engineering Tools and Techniques [2 minutes]
This section covers the most recent releases and updates in data engineering tools, libraries, and frameworks that can help you enhance your data engineering skills.
Some notable updates in the data engineering world include:
- The next big step forward for analytics engineering: Tristan argues that the future of analytics engineering requires more mature processes, teams, and tools to handle increasing complexity. Currently, dbt lacks certain features needed for successful re-organization and scaling. The upcoming dbt Core v1.5 release aims to address these issues by introducing access control, contracts, and versioning, enabling better collaboration and management of multiple dbt projects. This will allow individual teams to own their data, improving decision-making, quality, and reliability, while reducing costs and increasing scalability.
- Parquet is more than just "Turbo CSV." According to the article, Parquet is approximately 7.5 times faster to read and ten times faster to write than CSV, and takes up only a fifth of the size on disk. Impressive, right? Check out the article and the accompanying Hacker News discussion.
- Datafold discusses the challenges of testing data pipelines and balancing testing efforts with productivity. Ari walks through the modern data stack's multiple layers and tools: storage, orchestration, integration, transformation, visualization, and activation.
- Cube Blog presents an AI-powered conversational interface for the semantic layer. It lets you ask questions in Slack through Delphi, an AI-powered conversational interface that uses Cube's semantic layer to connect with all data sources. It sounds like a dream, doesn't it?
- Pandas 2.0 is out! You can follow the extensive Twitter thread by Marc Garcia or check out our blog post on Pandas 2.0 and its ecosystem (Arrow, Polars, DuckDB). More resources can be found on Marc Garcia's LinkedIn post. On a related note, Apache Arrow announced the release of nanoarrow 0.1, which aims to simplify Arrow-based interface implementation for handling tabular data.
dataNews.inspire() - Inspiring Reads and Discussions [2 minutes]
In this section, we share some thought-provoking reads and conversations from the data engineering community that can help broaden your perspective.
Inspiring reads we gathered along the way:
- Kayla's recent blog post, All You Need Is Data and Functions, discusses how we tend towards complexity as engineers and the importance of simplicity in programming languages. Join the Hacker News discussion for more insights.
- Rill Data, a new command-line-first BI tool, explains Why We Built Rill with DuckDB. DuckDB's high performance for analytical queries and its lightweight, embeddable nature make it an attractive choice, as it enables fast data profiling and interactive dashboard experiences for developers, Michael says.
- Casual data engineering, or a poor man's Data Lake in the cloud: The post explores the fundamentals of modern data lakes and demonstrates how to construct a serverless, near-realtime data pipeline using AWS services and DuckDB, specifically for a web analytics application.
- Twitter has open-sourced its data stack for full transparency. Steve Nouri shares some tricks where he decodes the code for you, allowing you to "hack" Twitter's algorithm.
- Jordan Tigani discusses the death of Big Data in Software Defined Talk episode 410. He also mentions what DuckDB got right, making it an enjoyable listen.
dataNews.observe() - Trending Topics on Social Media Platforms [1 minute]
In this section, we keep you updated on the most talked-about topics in the data engineering community on social media platforms.
Trending on Twitter, Hacker News, LinkedIn, and Reddit:
- Exporting to Excel is always a people pleaser... The image in the post truly captures the sentiment! See Image 1.
- If I have to run this data pipeline one more time, I'm going to lose my mind. A feeling every data engineer can relate to!
- What is the hottest tech stack in the Data Engineering world now? Orchestration is one of the heavily discussed topics.
- Dagster introduced a declarative scheduling system that looks very different from other orchestrators. Instead of DAGs, Flows, or Jobs, you specify how up-to-date each data asset should be, and Dagster takes it from there. Check out this tweet by Sandy Ryza for more information.
- Building data engineering projects is a great way to show off your skills on your resume (as well as learn new ones) by Ben Rogojan.
- Data materialization is a convergence problem. This interesting LinkedIn post highlights the importance of naming and solving problems in the data world.
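The declarative scheduling idea from the Dagster item above, where you state how up-to-date each asset should be instead of wiring up a schedule, can be illustrated with a toy sketch. This is plain Python and not Dagster's API: the `assets` catalog, the `max_lag` field, and the `stale_assets` helper are assumptions made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical asset catalog: upstream dependencies and the maximum allowed staleness
assets = {
    "raw_events": {"deps": [], "max_lag": timedelta(hours=1)},
    "sessions": {"deps": ["raw_events"], "max_lag": timedelta(hours=6)},
    "weekly_report": {"deps": ["sessions"], "max_lag": timedelta(days=1)},
}


def stale_assets(assets, last_materialized, now):
    """Return assets older than their allowed lag, upstream assets first."""
    ordered, seen = [], set()

    def visit(name):
        # Depth-first walk so every dependency is listed before its dependents
        if name in seen:
            return
        seen.add(name)
        for dep in assets[name]["deps"]:
            visit(dep)
        ordered.append(name)

    for name in assets:
        visit(name)

    return [
        name
        for name in ordered
        if now - last_materialized[name] > assets[name]["max_lag"]
    ]


now = datetime(2023, 4, 14, 12, 0)
last_materialized = {
    "raw_events": datetime(2023, 4, 14, 11, 30),    # 30 minutes old, within 1 hour
    "sessions": datetime(2023, 4, 14, 4, 0),        # 8 hours old, exceeds 6 hours
    "weekly_report": datetime(2023, 4, 12, 12, 0),  # 2 days old, exceeds 1 day
}

to_run = stale_assets(assets, last_materialized, now)
print(to_run)
```

The point of the sketch is the shift in what you declare: the schedule falls out of the freshness targets rather than being written by hand.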
And that wraps up our first edition of dataNews.filter(). I hope you enjoyed this curated selection of data engineering news, insights, and discussions.
Feel free to reach out on Slack or anywhere on socials if you have any feedback or suggestions for future editions. Also, let us know how often you'd like to receive this newsletter; we are thinking bi-weekly.
Until next time, happy data engineering :)
Simon & the Airbyte Team
Image 1 - Excel is a people pleaser: