Azure Databricks Best Practice Guide

Azure Databricks (ADB) has the power to process terabytes of data, while simultaneously running heavy data science workloads. Over time, as data input and workloads increase, job performance decreases. As an ADB developer, optimizing your platform enables you to work faster and save hours of effort for you and your team. Below are the 18 best practices you need to optimize your ADB environment.

  1. Delete temporary tables after notebook execution

     Delete temporary tables that were created as intermediate tables during notebook execution. Deleting these tables saves storage, especially if the notebook is scheduled daily.
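
     For example, a notebook's final cell can drop its staging tables (a minimal sketch; the table name is illustrative):

        // Drop a temporary/staging table created earlier in the notebook.
        // "tmp_sales_stage" is an illustrative name, not from the guide.
        spark.sql("DROP TABLE IF EXISTS tmp_sales_stage")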

  2. Use dbutils.fs.rm() to permanently delete temporary table metadata

     ADB clusters retain table metadata and underlying files even after you delete tables with DROP statements. Before creating temporary tables, use dbutils.fs.rm() to permanently delete the leftover metadata. If you skip this step, an error message will appear stating that the table already exists. To avoid this error in daily refreshes, you must use dbutils.fs.rm().
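
     A minimal sketch, assuming an illustrative warehouse path and table name:

        // Permanently remove leftover files/metadata for the temporary table
        // before recreating it; the second argument deletes recursively.
        dbutils.fs.rm("dbfs:/user/hive/warehouse/tmp_sales_stage", true)
        spark.sql("CREATE TABLE tmp_sales_stage (id INT, amount DOUBLE)")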

  3. Use Lower() or Upper() when comparing strings to avoid losing data

     String comparisons in ADB are case-sensitive, so values that differ only in casing do not match. To avoid losing data, normalize both sides of the comparison with Lower() or Upper(). Example:

    SELECT 'MAQSoftware' = 'maqsoftware' AS WithOutLowerOrUpper
    ,LOWER('MAQSoftware') = 'maqsoftware' AS WithLower
    ,UPPER('MAQSoftware') = 'MAQSOFTWARE' AS WithUpper

  4. Use Scala for data processing

     ADB’s data processing is based on Apache Spark, which is written in Scala. As a result, Scala performs better than SQL in ADB. Note: Python should still be used for machine learning functions in ADB.
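
     For example, a transformation written with the Scala DataFrame API (a minimal sketch; table and column names are illustrative):

        import org.apache.spark.sql.functions.{col, sum}

        // Filter and aggregate an existing table using the Scala DataFrame API,
        // then persist the result as a Delta table.
        val dailySales = spark.table("sales")
          .filter(col("amount") > 0)
          .groupBy(col("order_date"))
          .agg(sum("amount").as("total_amount"))

        dailySales.write.format("delta").mode("overwrite").saveAsTable("daily_sales")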

  5. Use custom functions to simplify complex calculations

     If your calculation requires multiple steps, you can save time by creating a one-step custom function. ADB offers a variety of built-in SQL functions; however, to create custom functions, known as user-defined functions (UDFs), use Scala. Once you have a custom function, you can call it every time you need to perform that specific calculation.
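
     A minimal sketch of registering a Scala UDF (the function, columns, and table are illustrative):

        // Register a multi-step calculation as a single reusable UDF.
        spark.udf.register("net_price", (price: Double, discountPct: Double, taxPct: Double) =>
          price * (1 - discountPct / 100) * (1 + taxPct / 100))

        // Call the UDF wherever the calculation is needed, including from SQL.
        spark.sql("SELECT net_price(price, discount_pct, tax_pct) AS net_price FROM orders").show()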

  6. Use Delta tables for DML commands

     In ADB, Hive tables do not support UPDATE and MERGE statements or NOT NULL and CHECK constraints. Delta tables do support these commands; however, heavy DML over large volumes of data can decrease query performance, because Delta stores a version of the table for each change.
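
     A minimal sketch of a MERGE into a Delta table (table and column names are illustrative):

        // Upsert incoming changes into a Delta table; Hive tables would reject this MERGE.
        spark.sql("""
          MERGE INTO customers AS target
          USING customer_updates AS source
          ON target.customer_id = source.customer_id
          WHEN MATCHED THEN UPDATE SET target.email = source.email
          WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (source.customer_id, source.email)
        """)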

  7. Use views when creating intermediate tables

     If you need to create intermediate tables, use views to minimize storage usage and save costs. Temporary views are session-scoped and are cleaned up automatically when the session ends, so no intermediate data is persisted to storage. For optimal query performance, do not use joins or subqueries in views.
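
     A minimal sketch (view and source table names are illustrative):

        // Create a session-scoped temporary view instead of persisting an intermediate table.
        spark.sql("""
          CREATE OR REPLACE TEMPORARY VIEW active_customers AS
          SELECT customer_id, email FROM customers WHERE is_active = true
        """)

        // Downstream queries in the same session can use the view like a table.
        spark.sql("SELECT COUNT(*) FROM active_customers").show()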

  8. Enable adaptive query execution (AQE)

     AQE improves large query performance. By default, AQE is disabled in ADB. To enable it, run:

        SET spark.sql.adaptive.enabled = true;

  9. Partition by columns

     Delta tables in ADB support partitioning, which enhances performance. Partition by a column only if you expect the data in each partition to be at least 1 GB. If a column's cardinality is high, do not use it for partitioning; for example, if you partition by user ID and there are 1M distinct user IDs, partitioning would increase table load time. Syntax example:

    CREATE TABLE events (
      date DATE
      ,eventId STRING
      ,eventType STRING
      ,data STRING
    )
    USING DELTA
    PARTITIONED BY (date)

  10. Use key vault credentials when creating mount points

      When creating mount points to Azure Data Lake Storage (ADLS), use a client ID and client secret retrieved from Azure Key Vault rather than hard-coded credentials to enhance security.
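
      A minimal sketch for an ADLS Gen2 mount, assuming a Key Vault-backed secret scope named kv-scope; the secret names, storage account, container, and tenant ID placeholders are illustrative:

        // Read the service principal credentials from a Key Vault-backed secret scope
        // instead of hard-coding them in the notebook.
        val clientId     = dbutils.secrets.get(scope = "kv-scope", key = "adls-client-id")
        val clientSecret = dbutils.secrets.get(scope = "kv-scope", key = "adls-client-secret")

        val configs = Map(
          "fs.azure.account.auth.type" -> "OAuth",
          "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id" -> clientId,
          "fs.azure.account.oauth2.client.secret" -> clientSecret,
          "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
        )

        // Mount the container so notebooks can reference /mnt/datalake paths.
        dbutils.fs.mount(
          source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
          mountPoint = "/mnt/datalake",
          extraConfigs = configs
        )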

  11. Query directly on parquet files from ADLS

      If you need data that lives in parquet files, do not extract it into ADB as intermediate tables. Instead, query the parquet files directly to save time and storage. Example:
    SELECT ColumnName FROM parquet.`Location of the file`

  12. Specify distribution when publishing data to Azure Data Warehouse (ADW)

      Use hash distribution for fact tables and other large tables, round-robin distribution for dimension tables, and replicated distribution for small dimension tables. With the Azure Synapse (SQL DW) connector, the distribution can be specified through the tableOptions option. Example:
    df.write \
      .format("com.databricks.spark.sqldw") \
      .option("url", "jdbc:sqlserver://") \
      .option("forwardSparkAzureStorageCredentials", "true") \
      .option("dbTable", "my_table_in_dw_copy") \
      .option("tableOptions", "table_options") \
      .save()

  13. Customize cluster termination time

      Terminating inactive clusters saves costs. ADB automatically terminates clusters after a default period of inactivity. As different projects have different needs, customize this auto-termination time to avoid premature or delayed termination. For example, set a longer inactivity window for development environments, where work is continuous.

  14. Enable cluster autoscaling

      ADB offers cluster autoscaling, which is disabled by default. Enable this feature to enhance job performance. Instead of providing a fixed number of worker nodes during cluster creation, you should provide a minimum and maximum. ADB then automatically reallocates the worker nodes based on job characteristics.

  15. Use Azure Data Factory (ADF) to run ADB notebook jobs

      If you run numerous notebooks daily, the ADB job scheduler becomes inefficient. The ADB job scheduler cannot define notebook dependencies, so you would have to combine all notebooks into one master notebook, which is difficult to debug. Instead, schedule the jobs through Azure Data Factory, which lets you set dependencies and debug easily if anything fails.

  16. Use the retry feature in ADF when scheduling jobs

      Processing notebooks in ADB through ADF can overload the cluster, causing notebooks to fail. A single failure should not stop the entire job. To continue work from the point of failure, configure ADF to retry two to three times at five-minute intervals. As a result, processing resumes after the retry interval, saving you time and effort.

  17. Implement failure checkpoints while publishing data

      With ADB, you can publish data to multiple resources such as ADW or ADLS. Publishing numerous tables to another resource takes time, and if publishing fails partway through, you should not have to restart the entire process. Implement checkpoints so that a rerun resumes from the point of failure, as in the sketch below.
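
      A minimal sketch of a table-level checkpoint, assuming an illustrative checkpoint directory, table list, and a JDBC URL stored in a hypothetical secret scope:

        // Record a marker file per published table; a rerun skips tables that already have one.
        val checkpointDir   = "/mnt/checkpoints/publish_to_adw"
        val tablesToPublish = Seq("dim_customer", "dim_product", "fact_sales")
        val adwJdbcUrl      = dbutils.secrets.get(scope = "kv-scope", key = "adw-jdbc-url")

        tablesToPublish.foreach { table =>
          val marker = s"$checkpointDir/$table"
          val alreadyPublished = scala.util.Try(dbutils.fs.ls(marker)).isSuccess

          if (!alreadyPublished) {
            spark.table(table).write
              .format("com.databricks.spark.sqldw")
              .option("url", adwJdbcUrl)
              .option("forwardSparkAzureStorageCredentials", "true")
              .option("dbTable", table)
              .save()
            dbutils.fs.put(marker, "done")   // checkpoint: mark this table as published
          }
        }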

  18. Consider upgrading to ADB Premium

      Your business’s data has never been more valuable. Additional security is a worthwhile investment. ADB Premium includes 5-level access control. For more features, check out: Premium vs. Standard.
