Azure Databricks Best Practice Guide
Azure Databricks (ADB) has the power to process terabytes of data, while simultaneously running heavy data science workloads. Over time, as data input and workloads increase, job performance decreases. As an ADB developer, optimizing your platform enables you to work faster and save hours of effort for you and your team. Below are the 18 best practices you need to optimize your ADB environment.
- Delete temporary tables after notebook execution
- Use Lower() or Upper() when comparing strings to avoid losing data
- Use Scala for data processing
- Use custom functions to simplify complex calculations
- Use Delta tables for DML commands
- Use views when creating intermediate tables
- Enable adaptive query execution (AQE)
- Partition by columns
- Use key vault credentials when creating mount points
- Query directly on parquet files from ADLS
- Specify distribution when publishing data to Azure Data Warehouse (ADW)
- Customize cluster termination time
- Enable cluster autoscaling
- Use Azure Data Factory (ADF) to run ADB notebook jobs
- Use the retry feature in ADF when scheduling jobs
- Implement failure checkpoints while publishing data
- Consider upgrading to ADB Premium
Delete temporary tables that were created as intermediate tables during notebook execution. Deleting tables saves storage, especially if the notebook is scheduled daily.
ADB clusters retain table metadata even after you run a DROP statement. Before recreating a temporary table, use
dbutils.fs.rm() to permanently delete its files and metadata; otherwise, an error message will appear stating that the table already exists. Build this cleanup step into daily refreshes to avoid the error.
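A minimal sketch of this cleanup, assuming a temporary table named temp_sales_staging stored under the default warehouse path (both names are illustrative placeholders):

```scala
// Drop the temporary table if it exists, then remove its files and metadata.
// The table name and warehouse path are illustrative placeholders.
spark.sql("DROP TABLE IF EXISTS temp_sales_staging")
dbutils.fs.rm("dbfs:/user/hive/warehouse/temp_sales_staging", recurse = true)
```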
String comparisons in ADB are case-sensitive, so values that differ only in casing will not match. To avoid losing data, normalize case with the Lower() or Upper() functions. Example:
SELECT 'MAQSoftware' = 'maqsoftware' AS WithOutLowerOrUpper
,LOWER('MAQSoftware') = 'maqsoftware' AS WithLower
,UPPER('MAQSoftware') = 'MAQSOFTWARE' AS WithUpper
ADB’s data processing is based on Apache Spark, which is written in Scala. As a result, Scala performs better than SQL in ADB. Note: Python should still be used for machine learning functions in ADB.
If a calculation requires multiple steps, you can save time by wrapping it in a single custom function. ADB offers a variety of built-in SQL functions; however, to create custom functions, known as user-defined functions (UDFs), use Scala. Once defined, you can call the custom function every time you need that specific calculation.
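For example, a hypothetical net-price calculation that collapses a discount step and a tax step into one reusable UDF (the DataFrame and column names are assumptions):

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical two-step calculation collapsed into one reusable UDF.
val netPrice = udf((price: Double, discountPct: Double, taxPct: Double) =>
  price * (1 - discountPct / 100) * (1 + taxPct / 100))

val withNet = ordersDf.withColumn("net_price",
  netPrice($"price", $"discount_pct", $"tax_pct"))
```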
In ADB, Hive tables do not support UPDATE and MERGE statements or NOT NULL and CHECK constraints. Delta tables do support these commands; however, heavy DML activity on Delta tables accumulates many table versions, which decreases query performance. To keep performance steady, periodically clean up old table versions.
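As an illustration (table names are placeholders), a Delta table accepts MERGE directly, and VACUUM removes the files behind old table versions; the 168-hour retention window shown here is Delta's default, written out explicitly:

```scala
// MERGE is supported on Delta tables; it is not supported on plain Hive tables.
spark.sql("""
  MERGE INTO sales AS target
  USING sales_updates AS source
  ON target.id = source.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

// Clean up files from old table versions to limit the performance cost.
spark.sql("VACUUM sales RETAIN 168 HOURS")
```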
If you need intermediate tables, use views instead to minimize storage usage and save costs. Views are session-scoped and are not persisted to storage, so there is nothing to remove after query execution. For optimal query performance, avoid joins and subqueries in views.
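A temporary view, for example, lives only for the current session and writes nothing to storage (table and column names are placeholders):

```scala
// Session-scoped view: no files are written, so there is nothing to delete later.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW daily_totals AS
  SELECT order_date, SUM(amount) AS total
  FROM sales
  GROUP BY order_date
""")
```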
AQE improves large query performance. Depending on your runtime version, AQE may be disabled by default. To enable it, use:
set spark.sql.adaptive.enabled = true;
Delta tables in ADB support partitioning, which enhances performance. You can partition by a column if you expect data in that partition to be at least 1 GB. If column cardinality is high, do not use that column for partitioning. For example, if you partition by user ID and there are 1M distinct user IDs, partitioning would increase table load time. Syntax example:
CREATE TABLE events (
  date DATE, -- illustrative columns; the original example elides the schema
  eventId STRING
) USING delta PARTITIONED BY (date)
When creating mount points to Azure Data Lake Storage (ADLS), use a key vault client ID and client secret to enhance security.
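A sketch of such a mount using OAuth, with the client ID and secret read from a Key Vault-backed secret scope (the scope, secret names, tenant ID, container, and storage account names are all placeholders):

```scala
// Credentials come from a Key Vault-backed secret scope, never from notebook text.
val clientId     = dbutils.secrets.get(scope = "kv-scope", key = "adls-client-id")
val clientSecret = dbutils.secrets.get(scope = "kv-scope", key = "adls-client-secret")

val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" ->
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> clientId,
  "fs.azure.account.oauth2.client.secret" -> clientSecret,
  "fs.azure.account.oauth2.client.endpoint" ->
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

dbutils.fs.mount(
  source = "abfss://container@storageaccount.dfs.core.windows.net/",
  mountPoint = "/mnt/datalake",
  extraConfigs = configs)
```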
If you need data from parquet files, do not extract it into ADB as an intermediate table. Instead, query the parquet files directly to save time and storage. Example:
SELECT ColumnName FROM parquet.`Location of the file`
Use hash distribution for fact tables or other large tables, round-robin for dimension tables, and replicated for small dimension tables. The distribution clause goes in the connector's tableOptions option; a sketch (the DataFrame, JDBC URL, and distribution column are placeholders):
df.write.format("com.databricks.spark.sqldw") \
    .option("url", jdbc_url) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "my_table_in_dw_copy") \
    .option("tableOptions", "DISTRIBUTION = HASH(id)") \
    .save()
Terminating inactive clusters saves costs. ADB automatically terminates clusters after a default period of inactivity. As different projects have different needs, customize this idle timeout to avoid premature or delayed termination. For example, set a longer timeout for development environments, where work is continuous.
ADB offers cluster autoscaling, which is disabled by default. Enable this feature to enhance job performance. Instead of providing a fixed number of worker nodes during cluster creation, you should provide a minimum and maximum. ADB then automatically reallocates the worker nodes based on job characteristics.
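Both settings surface as fields in the Databricks Clusters API; a sketch of the relevant fragment (the cluster name, worker counts, and timeout are illustrative):

```json
{
  "cluster_name": "dev-cluster",
  "autotermination_minutes": 30,
  "autoscale": { "min_workers": 2, "max_workers": 8 }
}
```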
If you run numerous notebooks daily, the ADB job scheduler becomes inefficient. The ADB job scheduler cannot set dependencies between notebooks, so you would have to chain everything in one master notebook, which is difficult to debug. Instead, schedule jobs through Azure Data Factory, which lets you set dependencies and easily debug anything that fails.
Processing notebooks in ADB through ADF can overload the cluster, causing notebooks to fail. A single failure should not stop the entire job. Configure the ADF activity to retry two to three times at five-minute intervals, so processing resumes from the point of failure instead of requiring a manual rerun, saving you time and effort.
With ADB, you can dump data into multiple resources like ADW or ADLS. Publishing numerous tables to another resource takes time. If publishing fails, do not restart the entire process. Implement checkpoints, so that you can restart from the point of failure.
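A minimal checkpoint sketch, assuming a hypothetical publishTable helper and a plain-text checkpoint file that lists the tables already published (all names are placeholders):

```scala
// Hypothetical names throughout; the checkpoint is a simple list of finished tables.
val checkpointPath  = "dbfs:/checkpoints/publish_run.txt"
val tablesToPublish = Seq("dim_customer", "dim_product", "fact_sales")

// Load the set of tables already published by a previous, failed run.
val alreadyDone: Set[String] =
  try dbutils.fs.head(checkpointPath).split("\n").filter(_.nonEmpty).toSet
  catch { case _: Exception => Set.empty }

var done = alreadyDone
for (table <- tablesToPublish if !done.contains(table)) {
  publishTable(table)  // hypothetical publish step (e.g., write to ADW)
  done += table
  dbutils.fs.put(checkpointPath, done.mkString("\n"), true)  // overwrite checkpoint
}
```

On restart, the loop skips every table recorded in the checkpoint, so publishing resumes from the point of failure rather than starting over.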
Your business’s data has never been more valuable, and additional security is a worthwhile investment. ADB Premium includes five-level access control. For more features, check out: Premium vs. Standard.