SQL vs Python for Data Analysis

A decade ago, the data industry struggled with scaling data warehouses. The solution that emerged was to take data out of the warehouse and transform it in memory with programming languages like Python and Java, using tools like Hadoop, Spark, and Dask. This was accompanied by a move from data warehouses to data lakes.

But modern data warehouses have largely removed the scalability problem, and we’re now seeing a reverse trend: a move back to transforming data within warehouses. This change has been driven in large part by dbt (data build tool), which fixes a number of important limitations of SQL (Structured Query Language) and is seeing strong adoption. dbt enables data analysts to write modular SQL code, forces the use of Git for version control, and enables model testing. This takes away many of the traditional arguments against SQL in the SQL vs Python debate.

It seems like the clean division of labor between SQL (data querying and consolidation) and Python (complex data transformation) is fading. For example, tools like dask-sql allow you to both query and transform data using a mix of SQL operations and Python code. This article compares SQL and Python performance, functionality, and developer experience.

Benn Stancil on Twitter

SQL vs Python: Performance

Running SQL code on data warehouses is generally faster than Python for querying data and doing basic aggregations. This is mainly because the data has a schema applied and the computation happens close to the data. When processing data with Python or other programming languages, you first need to extract the data out of the database and load it into local memory which is likely to introduce latency. Technically, databases have to load data into memory too, but the cost of that operation is much lower. 

That said, it is difficult to say anything concretely meaningful about the performance comparison between SQL and Python without specifying things like the SQL execution engine (SQLite, PostgreSQL, Snowflake, …) and the Python libraries you are using to process the data (pandas, NumPy with Numba, …). 

For example, parallel computing solutions like Dask and others scale Python code to larger-than-memory datasets and can significantly lower processing times. Managing the underlying infrastructure for running computations on remote clusters has often been a barrier to adoption for such tools. Enterprise solutions like Coiled are removing this barrier by automating the launching and management of remote clusters (disclaimer: I work there).

Nevertheless, using SQL to transform data inside the warehouse is generally a lot faster for many basic queries and aggregations as it moves code to data rather than data to code. And while a full-fledged technical benchmark is beyond the scope of this article, this comparison shows SQLite can be significantly faster than pandas for basic “select” and “filter” queries. For more context, however, you might want to check out the Databricks and Snowflake benchmarking discussions here and here.
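
As a rough illustration rather than a rigorous benchmark, the sketch below times the same filter in an in-memory SQLite database and in pandas. The table and column names are made up, and results will vary widely with data size, indexing, and whether the data already lives in memory.

# rough illustration, not a benchmark: time the same filter in SQLite and pandas
import sqlite3
import time

import numpy as np
import pandas as pd

# synthetic data (table and column names are illustrative)
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": np.random.rand(1_000_000),
})

# load the same data into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
df.to_sql("measurements", conn, index=False)

start = time.perf_counter()
rows = conn.execute("SELECT id FROM measurements WHERE value > 0.9").fetchall()
print(f"SQLite: {time.perf_counter() - start:.3f}s ({len(rows)} rows)")

start = time.perf_counter()
subset = df.loc[df["value"] > 0.9, "id"]
print(f"pandas: {time.perf_counter() - start:.3f}s ({len(subset)} rows)")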

Modern analytic databases are also equipped with powerful optimizers. SQL users can write queries that describe the desired transformations but leave the actual execution plan to the warehouse itself. Because Python is a general-purpose programming language, users need to be far more explicit about every step taken. Tools like PySpark do provide optimizers that address this issue.
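
For instance, here is a small PySpark sketch (the table and column names are hypothetical) where you describe the result you want and let the Catalyst optimizer choose the execution plan, which you can inspect with .explain():

# declare the transformation; PySpark's Catalyst optimizer plans the execution
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3://my-bucket/orders/")  # hypothetical path

summary = (
    orders
    .filter(F.col("status") == "shipped")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
)

# print the optimized physical plan (filter pushdown, aggregation strategy, etc.)
summary.explain()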

So if running queries on data warehouses with SQL is faster...why isn’t every data scientist, engineer, and analyst moving from Python to SQL?

SQL vs Python: Functionality 

While SQL may often be faster than Python for basic queries and aggregations, it does not have the same range of functionality. As Furcy Pin writes, SQL’s greatest strength is also its weakness: simplicity. For example, writing SQL code to perform iterative exploratory data analysis, data science, or machine learning tasks can quickly get lengthy and hard to read.

Decrypting complex SQL on Medium
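
To make the contrast concrete, here is the kind of quick, iterative exploration that stays short in pandas but tends to balloon into nested SQL. The dataset and column names are hypothetical.

# quick iterative exploration in pandas (hypothetical file and columns)
import pandas as pd

orders = pd.read_parquet("orders.parquet")

orders["amount"].describe()                      # distribution of order values
orders["status"].value_counts(normalize=True)    # share of each order status
(orders
    .assign(month=orders["order_date"].dt.to_period("M"))
    .groupby("month")["amount"]
    .agg(["count", "mean", "sum"]))              # monthly order statistics
orders[["amount", "discount"]].corr()            # correlation between two columns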

Another important limitation of SQL is that there is no SQL package manager that allows you to import functions, so most teams end up writing their SQL queries from scratch. dbt has tried to address this problem with macros and the dbt package hub, but the hub counts just over a hundred packages today. Compare that to PyPI, which hosts over 300,000 packages, and it’s clear that Python’s ecosystem of third-party libraries is far more vibrant and mature.

Instead of a unified platform that is easily extensible through open-source third-party packages, we are seeing an explosion of SQL dialects as companies try to commercialize open-source SQL and provide additional functionality. This means decreased compatibility and potential vendor lock-in for SQL users. 

SQL vs Python: Developer Experience

But performance and functionality are not everything. The SQL vs Python divide also has a lot to do with the developer experience the two languages offer. Let’s look at three specific components of developer experiences: testing, debugging, and code version control.

Erik Bernhardsson on Twitter

Testing

Running unit tests is crucial to any data pipeline that will run in production. As a general-purpose programming language, Python lets you write unit tests for any part of your data processing pipeline: from data queries to machine learning models to complex mathematical functions. To my knowledge, this is not possible with SQL. dbt mitigates this to some extent, but its testing functionality applies to entire SQL models and does not offer the fine-grained unit testing you get with a programming language like Python.

SQL testing libraries limit themselves to testing the data, not the code. These database testing tools are most often run in production as a last resort to break the data pipeline if the data is incorrect. Python unit tests, on the other hand, are easy to run in CI to ensure that the code you merge is correct. For example, you can use the chispa and beavis libraries to test PySpark and Dask code, respectively:


# test column equality in PySpark with chispa
from chispa.column_comparer import assert_column_equality
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local").appName("chispa").getOrCreate()

def remove_non_word_characters(col):
    return F.regexp_replace(col, r"[^\w\s]+", "")

# define unit test function
def test_remove_non_word_characters_nice_error():
    # the first row deliberately does not match, so chispa fails with a readable error report
    data = [
        ("matt7", "matt"),
        ("bill&", "bill"),
        ("isabela*", "isabela"),
        (None, None)
    ]
    df = (spark.createDataFrame(data, ["name", "expected_name"])
        .withColumn("clean_name", remove_non_word_characters(F.col("name"))))

    assert_column_equality(df, "clean_name", "expected_name")

# test dataframe equality in Dask/pandas with beavis
import beavis
import dask.dataframe as dd
import pandas as pd

# create dask dataframes from pandas dataframes
df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)

# assert equality (the first values differ, so beavis raises a readable assertion error)
beavis.assert_dd_equality(ddf1, ddf2)

Debugging

Debugging SQL is harder because you can’t set a breakpoint inside a statement and drop into an interactive console the way you can in a Python script; a SQL statement executes as a whole or not at all. Using CTEs and splitting dbt models into multiple files makes it easier to inspect intermediate models, but that is still not as powerful as setting a breakpoint anywhere in your code.
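
As a minimal sketch (the data and column names are made up), you can pause a Python transformation mid-function with the built-in breakpoint() and inspect the intermediate state interactively:

# drop into the debugger in the middle of a transformation
import pandas as pd

def clean_orders(df):
    df = df.dropna(subset=["customer_id"])
    breakpoint()  # execution pauses here; inspect df, run expressions, then continue with "c"
    return df.assign(amount=df["amount"].round(2))

clean_orders(pd.DataFrame({"customer_id": [1, None], "amount": [10.123, 5.5]}))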

Code Versioning

Version control has traditionally been one of the main arguments in Python’s favor. dbt is changing the game here by requiring data analysts to take the SQL queries they used to run directly against the data warehouse and instead store them in a Git repository that follows dbt’s project structure.

Still, if you have ever written a long, deeply nested SQL query and then tried to modify it, you know that the resulting Git diff is harder to read than one from a Python codebase where the code is split into variables, functions, and classes.

So what should you use?

The great news is that the two universes are not entirely isolated from each other anymore. Tools are emerging that recognize the advantages of each language and bridge the gap between them.

For example, it’s common to query data lakes with SQL using tools like AWS Athena that allow you to query data in an S3 bucket with SQL. Open data formats like Parquet and Arrow that support schemas have contributed to this trend. And on the other side of the spectrum, data warehouses like Snowflake have begun to add support for querying data with DataFrame-like APIs, through tools like Snowpark.
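
As a small sketch of why schema-aware open formats help both sides, the snippet below reads a (hypothetical) Parquet file with pyarrow: the schema travels with the data, and the reader can prune columns and push down filters before anything reaches pandas.

# read a Parquet file with pyarrow (file path and column names are hypothetical)
import pyarrow.parquet as pq

table = pq.read_table(
    "orders.parquet",
    columns=["customer_id", "amount"],        # column pruning
    filters=[("status", "=", "shipped")],     # predicate pushdown
)
print(table.schema)       # the schema is stored in the file itself
df = table.to_pandas()    # hand off to pandas for further analysis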

A traditional bottleneck for Python has been getting data out of the data warehouse quickly. This has become considerably faster with tools like dask-snowflake and dask-mongo that let you write SQL queries from inside a Python session and support distributed fetches that read and write in parallel. These tools hit a sweet spot: use SQL for what it’s good at (querying, aggregating, and extracting data efficiently) and Python for its computational power and flexibility (iterative exploratory analysis, machine learning, complex math).


import dask_snowflake
import snowflake.connector

# open a Snowflake connection, then read the query results into a
# Dask DataFrame, fetching the result partitions in parallel
with snowflake.connector.connect(...) as conn:
    ddf = dask_snowflake.from_snowflake(
        query="""
        SELECT * FROM TableA JOIN TableB ON ...
        """,
        conn=conn,
    )


Check this article for a complete notebook that loads data from Snowflake into a Python session, trains an XGBoost model on the data, and then writes the results back to Snowflake.

Conclusion

While it may be tempting to frame the debate between SQL and Python as a stand-off, the two languages in fact excel at different parts of the data-processing pipeline. Traditionally, there was a large gap between the two in terms of performance, functionality, and developer experience, which meant data analysts had to choose a side and defend their territory aggressively. With tools like dbt, Snowpark, and dask-snowflake, however, the industry seems to be moving towards recognizing the strengths of each language and lowering the barrier to integration between them.

One possible rule of thumb: use SQL for simple queries that need to run fast on a data warehouse, dbt for organizing more complex SQL models, and Python with distributed computing libraries like Dask for free-form exploratory analysis, machine learning, and any code that needs to be reliably unit tested.
