Azure Databricks Best Practices Guide

January 13, 2021 · 10 minutes to read

Azure Databricks (ADB) has the power to process terabytes of data while simultaneously running heavy data science workloads. Over time, as data volumes and workloads increase, job performance decreases. As an ADB developer, optimizing your platform enables you to work faster and saves hours of effort for you and your team. Below are the 18 best practices you need to optimize your ADB environment.

  1. Delete temporary tables after notebook execution
    • Delete any temporary tables that were created as intermediate tables during notebook execution. Deleting tables saves storage, especially if the notebook is scheduled daily.
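    • Example (a minimal sketch; the table name is hypothetical):
      # Drop the intermediate table once the notebook no longer needs it
      spark.sql("DROP TABLE IF EXISTS tmp_customer_staging")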
  2. Use dbutils.fs.rm() to permanently delete temporary table metadata
    • ADB clusters retain table metadata and underlying files even after you run DROP statements. Before creating temporary tables, use dbutils.fs.rm() to permanently delete the old metadata. If you skip this step, an error stating that the table already exists will appear. To avoid this error in daily refreshes, you must use dbutils.fs.rm().
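    • Example (a minimal sketch; the DBFS path is hypothetical and assumes the default warehouse location):
      # Remove leftover files/metadata from a previous run before recreating the table;
      # the second argument (recurse) deletes the directory and its contents
      dbutils.fs.rm("dbfs:/user/hive/warehouse/tmp_customer_staging", True)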
  3. Use Lower() or Upper() when comparing strings to avoid losing data
    • String comparisons in ADB are case-sensitive, so values with different casing will not match and data can be lost. To avoid losing data, normalize casing with LOWER() or UPPER() before comparing.
    • Example:
      SELECT 'MAQSoftware' = 'maqsoftware' AS WithOutLowerOrUpper
      ,LOWER('MAQSoftware') = 'maqsoftware' AS WithLower
      ,UPPER('MAQSoftware') = 'MAQSOFTWARE' AS WithUpper
  4. Use Scala to improve data processing performance
    • ADB’s data processing is based on Apache Spark, which is written in Scala. As a result, Scala performs better than SQL in ADB. Note: Python should still be used for machine learning functions in ADB.
  5. Use custom functions to simplify complex calculations
    • If your calculation requires multiple steps, you can save time by creating a one-step custom function. ADB offers a variety of built-in SQL functions; however, to create custom functions, known as user-defined functions (UDFs), use Scala. Once you have a custom function, you can call it every time you need to perform that specific calculation.
  6. Use Delta tables to enhance DML command performance
    • In ADB, Hive tables do not support UPDATE and MERGE statements or NOT NULL and CHECK constraints. Delta tables do support these commands; however, running heavy DML over large amounts of data in Delta tables can decrease query performance, so manage the table versions that Delta stores to keep performance from degrading.
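    • Example (a minimal sketch; table and column names are hypothetical):
      # Upsert incoming changes into a Delta table with MERGE
      spark.sql("""
          MERGE INTO sales_delta AS target
          USING sales_updates AS source
          ON target.orderId = source.orderId
          WHEN MATCHED THEN UPDATE SET target.amount = source.amount
          WHEN NOT MATCHED THEN INSERT *
      """)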
  7. Use views to minimize storage usage and save costs
    • If you need to create intermediate tables, use views instead to minimize storage usage and save costs. Views are session-oriented and are removed automatically after query execution, so no intermediate data remains in storage. For optimal query performance, do not use joins or subqueries in views.
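    • Example (a minimal sketch; the source table and view name are hypothetical):
      # A temporary view keeps the intermediate result for the current session only,
      # so nothing is written to storage or needs to be cleaned up afterwards
      spark.sql("""
          CREATE OR REPLACE TEMPORARY VIEW daily_orders_vw AS
          SELECT customerId, SUM(amount) AS totalAmount
          FROM orders
          WHERE orderDate = current_date()
          GROUP BY customerId
      """)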
  8. Enable adaptive query execution (AQE)
    • AQE improves large query performance. By default, AQE is disabled in ADB. To enable it, see below.
    • Example:
      set spark.sql.adaptive.enabled = true;
  9. Partition by columns to enhance performance
    • Delta tables in ADB support partitioning, which enhances performance. You can partition by a column if you expect data in that partition to be at least 1 GB. If column cardinality is high, do not use that column for partitioning. For example, if you partition by user ID and there are 1M distinct user IDs, partitioning would increase table load time.
    • Syntax Example:
      CREATE TABLE events (
      DATE DATE
      ,eventId STRING
      ,eventType STRING
      ,data STRING
      ) USING delta PARTITIONED BY (DATE)
  10. Use key vault credentials when creating mount points
    • When creating mount points to Azure Data Lake Storage (ADLS), retrieve the service principal client ID and client secret from an Azure Key Vault-backed secret scope instead of hard-coding them, to enhance security.
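    • Example (a minimal sketch; the secret scope, secret names, container, and storage account are hypothetical placeholders):
      # Read the service principal credentials from a Key Vault-backed secret scope
      client_id = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
      client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
      tenant_id = dbutils.secrets.get(scope="kv-scope", key="sp-tenant-id")

      configs = {
          "fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id": client_id,
          "fs.azure.account.oauth2.client.secret": client_secret,
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token",
      }

      # Mount the ADLS Gen2 container without exposing credentials in the notebook
      dbutils.fs.mount(
          source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
          mount_point="/mnt/datalake",
          extra_configs=configs,
      )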
  11. Query directly on parquet files from ADLS
    • If you need data that is stored in Parquet files, do not extract it into ADB as intermediate tables. Instead, query the Parquet files directly to save time and storage.
    • Example:
      SELECT ColumnName FROM parquet.`Location of the file`
  12. Specify distribution when publishing data to Azure Data Warehouse (ADW)
    • Use hash distribution for fact tables and other large tables, round-robin for dimension tables, and replicated for small dimension tables. For more information, check out: Distributed tables design guidance - Azure Synapse Analytics
    • Example:
      df.write \
      .format("com.databricks.spark.sqldw") \
      .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>
      ") \
      .option("forwardSparkAzureStorageCredentials", "true") \
      .option("dbTable", "my_table_in_dw_copy") \
      .option("tableOptions", "table_options") \
      .save()
  13. Customize cluster termination time
    • Terminating inactive clusters saves costs. ADB automatically terminates clusters after a default period of inactivity. Because different projects have different needs, customize this idle time to avoid premature or delayed termination. For example, set a longer idle time for development environments, where work is continuous.
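    • Example (illustrative only; the field comes from the Databricks Clusters API and the value is arbitrary):
      # Part of a cluster create/edit request: terminate after 30 idle minutes
      cluster_settings = {
          "autotermination_minutes": 30,
      }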
  14. Enable cluster autoscaling
    • ADB offers cluster autoscaling, which is disabled by default. Enable this feature to enhance job performance. Instead of providing a fixed number of worker nodes during cluster creation, provide a minimum and maximum; ADB then automatically scales the number of worker nodes up or down based on the workload.
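    • Example (illustrative only; the autoscale block comes from the Databricks Clusters API and the worker counts are arbitrary):
      # Part of a cluster create/edit request: give a worker range instead of a fixed size
      cluster_settings = {
          "autoscale": {"min_workers": 2, "max_workers": 8},
      }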
  15. Use Azure Data Factory (ADF) to run ADB notebook jobs
    • If you run numerous notebooks daily, the ADB job scheduler will not be efficient. The ADB job scheduler cannot set dependencies between notebooks, so you would have to chain every notebook from a single master notebook, which is difficult to debug. Instead, schedule jobs through Azure Data Factory, which enables you to set dependencies and easily debug if anything fails.
  16. Use the retry feature in ADF when scheduling jobs
    • Processing notebooks in ADB through ADF can overload the cluster, causing notebooks to fail. If a failure occurs, the entire job should not stop. To continue from the point of failure, set ADF to retry two to three times with five-minute intervals. As a result, processing resumes after the retry interval instead of the whole job stopping, saving you time and effort.
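    • Example (illustrative only; these are the retry fields of an ADF activity policy, shown here as Python key/value pairs mirroring the JSON):
      # On the ADF notebook activity, set the retry policy
      activity_policy = {
          "retry": 3,                     # retry the activity up to three times
          "retryIntervalInSeconds": 300,  # wait five minutes between attempts
      }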
  17. Implement failure checkpoints while publishing data
    • With ADB, you can publish data to multiple resources, such as ADW or ADLS. Publishing numerous tables to another resource takes time, and if publishing fails partway through, you should not have to restart the entire process. Implement checkpoints so that you can restart from the point of failure.
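    • Example (a minimal sketch; the table list, checkpoint path, and output location are hypothetical):
      tables_to_publish = ["dim_customer", "dim_product", "fact_sales"]
      checkpoint_path = "/mnt/datalake/checkpoints/published_tables"

      # Load the names of tables that already published successfully in earlier runs
      try:
          published = {row.value for row in spark.read.text(checkpoint_path).collect()}
      except Exception:
          published = set()  # no checkpoint yet: this is the first run

      for table in tables_to_publish:
          if table in published:
              continue  # skip tables already published; resume at the point of failure
          spark.table(table).write.mode("overwrite").parquet(f"/mnt/datalake/published/{table}")
          # Record the table name only after a successful write
          spark.createDataFrame([(table,)], "value string").write.mode("append").text(checkpoint_path)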
  18. Consider using ADB Premium for increased security and privacy compliance
    • Your business’s data has never been more valuable. Additional security is a worthwhile investment. ADB Premium includes 5-level access control. For more features, check out: Premium vs Standard.

References

Microsoft offers additional documents that provide a high-level framework for best practices. We encourage you to review the following: