Azure Databricks Best Practice Guide

Azure Databricks (ADB) has the power to process terabytes of data, while simultaneously running heavy data science workloads. Over time, as data input and workloads increase, job performance decreases. As an ADB developer, optimizing your platform enables you to work faster and save hours of effort for you and your team. Below are the 18 best practices you need to optimize your ADB environment.

  1. Delete temporary tables after notebook execution

     Delete temporary tables that were created as intermediate tables during notebook execution. Deleting these tables saves storage, especially if the notebook is scheduled daily.
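
     For example, a notebook's final cell can drop its staging tables (a minimal sketch; the table name is illustrative):

        // Drop a temporary/staging table created earlier in the notebook.
        // "tmp_sales_stage" is an illustrative name, not from the guide.
        spark.sql("DROP TABLE IF EXISTS tmp_sales_stage")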

  2. Use dbutils.fs.rm() to permanently delete temporary table metadata

     ADB clusters retain table metadata and underlying files even after you delete tables with DROP statements. Before creating temporary tables, use dbutils.fs.rm() to permanently delete the leftover metadata. If you skip this step, an error message will appear stating that the table already exists. To avoid this error in daily refreshes, you must use dbutils.fs.rm().
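
     A minimal sketch, assuming an illustrative warehouse path and table name:

        // Permanently remove leftover files/metadata for the temporary table
        // before recreating it; the second argument deletes recursively.
        dbutils.fs.rm("dbfs:/user/hive/warehouse/tmp_sales_stage", true)
        spark.sql("CREATE TABLE tmp_sales_stage (id INT, amount DOUBLE)")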

  3. Use Lower() or Upper() when comparing strings to avoid losing data

     String comparisons in ADB are case-sensitive, so values that differ only in casing do not match. To avoid losing data, normalize both sides of the comparison with Lower() or Upper(). Example:

    SELECT 'MAQSoftware' = 'maqsoftware' AS WithOutLowerOrUpper
    ,LOWER('MAQSoftware') = 'maqsoftware' AS WithLower
    ,UPPER('MAQSoftware') = 'MAQSOFTWARE' AS WithUpper

  4. Use Scala for data processing

     ADB’s data processing is based on Apache Spark, which is written in Scala. As a result, Scala performs better than SQL in ADB. Note: Python should still be used for machine learning functions in ADB.
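
     For example, a transformation written with the Scala DataFrame API (a minimal sketch; table and column names are illustrative):

        import org.apache.spark.sql.functions.{col, sum}

        // Filter and aggregate an existing table using the Scala DataFrame API,
        // then persist the result as a Delta table.
        val dailySales = spark.table("sales")
          .filter(col("amount") > 0)
          .groupBy(col("order_date"))
          .agg(sum("amount").as("total_amount"))

        dailySales.write.format("delta").mode("overwrite").saveAsTable("daily_sales")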

  5. Use custom functions to simplify complex calculations

     If your calculation requires multiple steps, you can save time by creating a one-step custom function. ADB offers a variety of built-in SQL functions; however, to create custom functions, known as user-defined functions (UDFs), use Scala. Once you have a custom function, you can call it every time you need to perform that specific calculation.
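
     A minimal sketch of registering a Scala UDF (the function, columns, and table are illustrative):

        // Register a multi-step calculation as a single reusable UDF.
        spark.udf.register("net_price", (price: Double, discountPct: Double, taxPct: Double) =>
          price * (1 - discountPct / 100) * (1 + taxPct / 100))

        // Call the UDF wherever the calculation is needed, including from SQL.
        spark.sql("SELECT net_price(price, discount_pct, tax_pct) AS net_price FROM orders").show()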

  6. Use Delta tables for DML commands

     In ADB, Hive tables do not support UPDATE and MERGE statements or NOT NULL and CHECK constraints. Delta tables do support these commands; however, heavy DML over large volumes of data can decrease query performance, because Delta stores a version of the table for each change.
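
     A minimal sketch of a MERGE into a Delta table (table and column names are illustrative):

        // Upsert incoming changes into a Delta table; Hive tables would reject this MERGE.
        spark.sql("""
          MERGE INTO customers AS target
          USING customer_updates AS source
          ON target.customer_id = source.customer_id
          WHEN MATCHED THEN UPDATE SET target.email = source.email
          WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (source.customer_id, source.email)
        """)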

  7. Use views when creating intermediate tables

     If you need to create intermediate tables, use views to minimize storage usage and save costs. Temporary views are session-scoped and are cleaned up automatically when the session ends, so no intermediate data is persisted to storage. For optimal query performance, do not use joins or subqueries in views.
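
     A minimal sketch (view and source table names are illustrative):

        // Create a session-scoped temporary view instead of persisting an intermediate table.
        spark.sql("""
          CREATE OR REPLACE TEMPORARY VIEW active_customers AS
          SELECT customer_id, email FROM customers WHERE is_active = true
        """)

        // Downstream queries in the same session can use the view like a table.
        spark.sql("SELECT COUNT(*) FROM active_customers").show()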

  8. Enable adaptive query execution (AQE)

     AQE improves large query performance. By default, AQE is disabled in ADB. To enable it, run:

        SET spark.sql.adaptive.enabled = true;

  9. Partition by columns

     Delta tables in ADB support partitioning, which enhances performance. Partition by a column only if you expect the data in each partition to be at least 1 GB. If a column's cardinality is high, do not use it for partitioning; for example, if you partition by user ID and there are 1M distinct user IDs, partitioning would increase table load time. Syntax example:

    CREATE TABLE events (
      date DATE
      ,eventId STRING
      ,eventType STRING
      ,data STRING
    )
    USING DELTA
    PARTITIONED BY (date)

  10. Use key vault credentials when creating mount points

      When creating mount points to Azure Data Lake Storage (ADLS), use a client ID and client secret retrieved from Azure Key Vault rather than hard-coded credentials to enhance security.
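
      A minimal sketch for an ADLS Gen2 mount, assuming a Key Vault-backed secret scope named kv-scope; the secret names, storage account, container, and tenant ID placeholders are illustrative:

        // Read the service principal credentials from a Key Vault-backed secret scope
        // instead of hard-coding them in the notebook.
        val clientId     = dbutils.secrets.get(scope = "kv-scope", key = "adls-client-id")
        val clientSecret = dbutils.secrets.get(scope = "kv-scope", key = "adls-client-secret")

        val configs = Map(
          "fs.azure.account.auth.type" -> "OAuth",
          "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id" -> clientId,
          "fs.azure.account.oauth2.client.secret" -> clientSecret,
          "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
        )

        // Mount the container so notebooks can reference /mnt/datalake paths.
        dbutils.fs.mount(
          source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
          mountPoint = "/mnt/datalake",
          extraConfigs = configs
        )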

  11. Query directly on parquet files from ADLS

      If you need data that lives in parquet files, do not extract it into ADB as intermediate tables. Instead, query the parquet files directly to save time and storage. Example:
    SELECT ColumnName FROM parquet.`Location of the file`

  12. Specify distribution when publishing data to Azure Data Warehouse (ADW)

      Use hash distribution for fact tables and other large tables, round-robin distribution for dimension tables, and replicated distribution for small dimension tables. With the Azure Synapse (SQL DW) connector, the distribution can be specified through the tableOptions option. Example:
    df.write \
      .format("com.databricks.spark.sqldw") \
      .option("url", "jdbc:sqlserver://") \
      .option("forwardSparkAzureStorageCredentials", "true") \
      .option("dbTable", "my_table_in_dw_copy") \
      .option("tableOptions", "table_options") \
      .save()

  13. Customize cluster termination time

      Terminating inactive clusters saves costs. ADB automatically terminates clusters after a default period of inactivity. As different projects have different needs, customize this auto-termination time to avoid premature or delayed termination. For example, set a longer inactivity window for development environments, where work is continuous.

  14. Enable cluster autoscaling

      ADB offers cluster autoscaling, which is disabled by default. Enable this feature to enhance job performance. Instead of providing a fixed number of worker nodes during cluster creation, you should provide a minimum and maximum. ADB then automatically reallocates the worker nodes based on job characteristics.

  15. Use Azure Data Factory (ADF) to run ADB notebook jobs

      If you run numerous notebooks daily, the ADB job scheduler becomes inefficient. The ADB job scheduler cannot define notebook dependencies, so you would have to combine all notebooks into one master notebook, which is difficult to debug. Instead, schedule the jobs through Azure Data Factory, which lets you set dependencies and debug easily if anything fails.

  16. Use the retry feature in ADF when scheduling jobs

      Processing notebooks in ADB through ADF can overload the cluster, causing notebooks to fail. A single failure should not stop the entire job. To continue work from the point of failure, configure ADF to retry two to three times at five-minute intervals. As a result, processing resumes after the retry interval, saving you time and effort.

  17. Implement failure checkpoints while publishing data

      With ADB, you can publish data to multiple resources such as ADW or ADLS. Publishing numerous tables to another resource takes time, and if publishing fails partway through, you should not have to restart the entire process. Implement checkpoints so that a rerun resumes from the point of failure, as in the sketch below.
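
      A minimal sketch of a table-level checkpoint, assuming an illustrative checkpoint directory, table list, and a JDBC URL stored in a hypothetical secret scope:

        // Record a marker file per published table; a rerun skips tables that already have one.
        val checkpointDir   = "/mnt/checkpoints/publish_to_adw"
        val tablesToPublish = Seq("dim_customer", "dim_product", "fact_sales")
        val adwJdbcUrl      = dbutils.secrets.get(scope = "kv-scope", key = "adw-jdbc-url")

        tablesToPublish.foreach { table =>
          val marker = s"$checkpointDir/$table"
          val alreadyPublished = scala.util.Try(dbutils.fs.ls(marker)).isSuccess

          if (!alreadyPublished) {
            spark.table(table).write
              .format("com.databricks.spark.sqldw")
              .option("url", adwJdbcUrl)
              .option("forwardSparkAzureStorageCredentials", "true")
              .option("dbTable", table)
              .save()
            dbutils.fs.put(marker, "done")   // checkpoint: mark this table as published
          }
        }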

  18. Consider upgrading to ADB Premium

      Your business’s data has never been more valuable. Additional security is a worthwhile investment. ADB Premium includes 5-level access control. For more features, check out: Premium vs. Standard.
