The data ecosystem has been evolving rapidly in the last year. A few things we saw:
- Declarative approaches appearing everywhere (from Kubernetes and infrastructure as code to orchestration as code and even integration as code)
- Rise of the Semantic Layer
- Rust becoming the future of performance-intensive applications in data (potentially replacing Spark eventually)
- Databases built for small data, such as DuckDB, along with newer vector databases supporting the AI wave behind the scenes, such as Pinecone and Qdrant
- Data modeling coming back with the explosion of the modern data stack
- AI and generative AI with ChatGPT, still in its early days
- Some early indications of stack bundling, with startup acquisitions (Transform, acquired by dbt Labs) and layoffs
The State of Data research is a way for us to take a step back and understand what the community is using and feeling excited about. A way to see the signal through the noise in the modern data stack.
The survey was conducted in late 2022 and had 886 participants, which makes it the largest in the data engineering ecosystem. The report first details the demographics of the survey participants. It's a great opportunity to find hard-to-get information, such as compensation based on experience. It then goes through the data stack, followed by blogs, podcasts, newsletters, and more.
The best insights are usually discovered when using the filters at your disposal, per company size and per experience, so with this interactive report you are able to drill down on the information that matters most to you.
Let’s discover some useful insights. But first, let’s see how representative of the industry the survey is.
Demographics of the research
State of Data covers every major continent and therefore offers a global perspective.
It also shows a wide range of experience, with most participants between 3 and 10 years into their careers, which seems representative of our industry.
In terms of company size, it makes sense to see fewer participants from companies with fewer than 10 employees. The 501-1,000 employee segment, which could be considered the mid-market segment, is less represented than SMBs (small- and medium-sized businesses) and enterprises. It still counts 80 participants, which is enough to give some good insights.
50% of the participants are either data engineers or engineers. Data analysts and scientists represent 100 (about 10%) of the participants. This skew reflects the fact that the survey was published mostly in data engineering communities.
Let’s see which insights we can draw from this research.
The State of Data gives some indication of compensation trends by experience, company size, and geography.
Airbyte and Fivetran are the clear leaders in data ingestion, with twice as many people wanting to try Airbyte. When drilling down by company size, you can see Airbyte is becoming dominant in the small/medium-sized segment, but has not been adopted as widely in the mid-market segment (501-1,000 employees). In the enterprise segment (1,000+ employees), however, Airbyte is already making headway. This might show a propensity for enterprises to adopt an open-source, self-hosted platform (Airbyte Open Source being the dominant solution there).
The survey goes deeper by prompting participants to share what they care most about in data ingestion: correctness, stability and then performance.
dbt has the most positive sentiment for Data Transformation, but Pandas is actually the most used. This is even more noticeable in the enterprise segment, where both Spark and Pandas are more widely used than dbt. Still, dbt shows the most “want to try” interest in that segment.
Snowflake and BigQuery are clearly at the top for Data Warehouses, and there is still a lot of positive sentiment for Databricks. Azure Synapse seems to be lagging behind. In the enterprise segment, Databricks is on par with BigQuery and Snowflake in usage and positive sentiment (Snowflake ranks first and Databricks second). Redshift is also very present but shows the lowest “want to try” interest.
Most people are still using self-hosted Airflow, especially in the enterprise segment. This may again (as in Data Ingestion) indicate that enterprises prefer self-hosted solutions. It should be noted that Dagster is climbing the ranks, with the highest number of “want to try” responses.
The giants Looker and Tableau are still ruling the roost, but there is also significant churn from Tableau to newer solutions.
Great Expectations and Monte Carlo are leading the pack. Most participants have either not heard of the other tools or not considered them. This may evolve as the industry continues to mature.
Hightouch and Census are neck-and-neck, but the vast majority of the market is still up for grabs. It’s early in the market for this technology, as the high number of “Don’t know” responses indicates.
DataHub and Atlan are leading for now, but the vast majority of the market is also up for grabs.
The State of Data went beyond the technology stack and also looked at the resources and communities that data practitioners rely on. Here is the list of the most followed newsletters, podcasts, and YouTube channels.
The State of Data holds a lot more insights with the additional filters based on company size, location, and years of experience, in addition to quotes from data influencers.
This is a first iteration. We will repeat the survey every year and aim to grow the number of participants each time, so that it reflects the industry more and more accurately. We also hope to involve other data communities in the next survey. Stay tuned!