Data Insights
Article

SQL vs Python for Data Analysis

Richard Pelgrim
March 14, 2022
10 min read
Limitless data movement with free Alpha and Beta connectors
Limitless data movement with free Alpha and Beta connectors
Introducing: our Free Connector Program ->

SQL vs Python: Developer Experience

But performance and functionality are not everything. The SQL vs Python divide also has a lot to do with the developer experience the two languages offer. Let’s look at three specific components of developer experiences: testing, debugging, and code version control.

Erik Benhardsson on Twitter

Testing

Running unit tests is crucial to any data pipeline that will run in production. As a general-purpose programming language, Python lets you write unit tests for any part of your data processing pipeline: from data queries to machine learning models to complex mathematical functions. To my knowledge, this is not possible with SQL. dbt mitigates this to some extent, but their testing functionality only applies to testing entire SQL models and does not offer the complex unit testing functionality of a programming language like Python.

SQL testing libraries limit themselves to testing the data but not the code. These database testing libraries most often get executed on production as a last resort to break the data pipeline if the data is incorrect. On the other hand, it’s easier to execute Python unit tests on your CI to assure that the code you merge is correct. For example, you can use the chispa and beavis libraries to test PySpark and Dask code, respectively: 


# test column equality in PySpark with chispa
from chispa.column_comparer import assert_column_equality
import pyspark.sql.functions as F

def remove_non_word_characters(col):
    return F.regexp_replace(col, "[^\w\s]+", "")

# define unit test function
def test_remove_non_word_characters_nice_error():
    data = [
        ("matt7", "matt"),
        ("bill&", "bill"),
        ("isabela*", "isabela"),
        (None, None)
    ]
    df = (spark.createDataFrame(data, ["name", "expected_name"])
        .withColumn("clean_name", remove_non_word_characters(F.col("name"))))

    assert_column_equality(df, "clean_name", "expected_name")

# test dataframe equality in Dask/pandas with beavis
import beavis

# create dask dataframes from pandas dataframes
df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)

# assert equality
beavis.assert_dd_equality(ddf1, ddf2)

Debugging

Debugging SQL code is harder as you can’t set up a breakpoint like in a Python script to halt execution within a statement to get into an interactive console. With SQL you can only execute a complete statement at once. Using CTEs and splitting dbt models into multiple files makes it easier to debug intermediary models but still not as powerful as setting a breakpoint anywhere within your code.

Code Versioning

Version control has traditionally been one of the main arguments in Python’s favor. dbt is changing the game here by forcing the data analyst to take the SQL queries that they used to run directly in the data warehouse, and instead store them in a Git repository following dbt’s project structure.

Still, if you have written a long enough nested SQL query and then tried to modify it, the Git difference will be harder to read than a codebase written in Python where the code is split into variables, functions, and classes.

So what should you use?

The great news is that the two universes are not entirely isolated from each other anymore. Tools are emerging that recognize the advantages of each language and bridge the gap between them.

For example, it’s common to query data lakes with SQL using tools like AWS Athena that allow you to query data in an S3 bucket with SQL. Open data formats like Parquet and Arrow that support schemas have contributed to this trend. And on the other side of the spectrum, data warehouses like Snowflake have begun to add support for querying data with DataFrame-like APIs, through tools like Snowpark.

A traditional bottleneck for Python has been getting data out of the data warehouse quickly. This has become considerably faster with tools like dask-snowflake and dask-mongo that allow you to write SQL queries from inside a Python session and support distributed fetch to read and write in parallel. These tools bridge the gap to hit that sweet spot: use SQL for what it’s good at (querying, aggregating, and extracting data efficiently) and Python for its computational power and flexibility (iterative exploratory analysis, machine learning, complex math). 


import dask_snowflake
import snowflake

with snowflake.connector.connect(...) as conn:
    ddf = dask_snowflake.from_snowflake(
       query="""
       SELECT * FROM TableA JOIN TableB ON ...
       """,
       conn=conn,
    )


Check this article for a complete notebook that loads data from Snowflake into a Python session, trains an XGBoost model on the data, and then writes the results back to Snowflake.

Conclusion

While it may be tempting to frame the debate between SQL and Python as a stand-off, the two languages in fact excel at different parts of the data-processing pipeline. Traditionally, there was a large gap between the two languages in terms of performance, functionality, and developer experience. This meant data analysts had to choose a side – and defend their territory aggressively. With tools like dbt, Snowpark, and dask-snowflake, however, the industry seems to be moving towards recognizing the value of each language and providing value to data professionals by lowering the barrier to integration between them.

One potential rule-of-thumb to take from this is to use SQL for simple queries that need to run fast on a data warehouse, dbt for organizing more complex SQL models, and Python with distributed computing libraries like Dask for free-form exploratory analysis and machine learning code and/or code that needs to be reliably unit tested. 

The data movement infrastructure for the modern data teams.
Try a 14-day free trial