ADB Optimization Best Practice Guide

Azure Databricks (ADB) can process terabytes of data while running heavy data science workloads at the same time. As data volumes and workloads grow, job performance tends to degrade. Optimizing your platform lets you and your team work faster and saves hours of effort. The best practices below cover cost, performance, governance, and operations.

Cost Optimization

  1. Use Serverless Compute
  2. Adopt serverless SQL warehouses for interactive SQL workloads to eliminate infrastructure management overhead and control costs through consumption-based billing. Serverless compute typically starts within seconds and scales automatically.

  3. Leverage Cluster Policies
  4. Implement compute policies to enforce cost-effective configurations across all workspaces. Restrict instance types, enforce auto-termination settings, and ensure tagging compliance to prevent cost overruns.

  5. Customize Cluster Termination
  6. Terminating inactive clusters saves costs. Customize the auto-termination time of all-purpose clusters based on the environment (e.g., shorter for production, longer for development) to avoid paying for idle resources. Job clusters terminate on their own when the job finishes.

  7. Enable Cluster Autoscaling
  8. Enable autoscaling to allow clusters to resize based on workload. Provide a minimum and maximum number of worker nodes so ADB can automatically reallocate resources as needed.

  9. Use Spot Instances
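  10. For interruptible workloads such as development or testing, use Azure Spot VMs to save up to 90% on compute costs. Spot VMs can be evicted when Azure needs the capacity back, so avoid them for jobs that cannot tolerate a restart.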
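
The cost settings above (policies, auto-termination, autoscaling, spot capacity) can be sketched as plain Python dictionaries. This is a minimal sketch, assuming payloads shaped like the Databricks Clusters API and cluster policy definitions; the helper names, runtime label, VM size, and tag key are hypothetical placeholders, not values from this guide.

```python
def build_cluster_spec(name, min_workers, max_workers, idle_minutes,
                       use_spot=False, tags=None):
    """Return a dict shaped like a cluster-create payload (illustrative only)."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    spec = {
        "cluster_name": name,
        "spark_version": "15.4.x-scala2.12",  # assumed runtime label
        "node_type_id": "Standard_D4ds_v5",   # assumed VM size
        # Autoscaling: the cluster resizes between these bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Auto-termination: shut down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
        "custom_tags": dict(tags or {}),
    }
    if use_spot:
        spec["azure_attributes"] = {
            "first_on_demand": 1,  # keep the first node (the driver) on demand
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            "spot_bid_max_price": -1,  # -1: never evict on price alone
        }
    return spec


def build_cost_policy(allowed_node_types, max_idle_minutes, cost_center):
    """Return a policy definition that restricts VM sizes, idle time, and tags."""
    return {
        "node_type_id": {"type": "allowlist", "values": list(allowed_node_types)},
        "autotermination_minutes": {
            "type": "range",
            "maxValue": max_idle_minutes,
            "defaultValue": min(30, max_idle_minutes),
        },
        # Hypothetical tag key; "fixed" forces the same value on every cluster.
        "custom_tags.cost_center": {"type": "fixed", "value": cost_center},
    }
```

For example, `build_cluster_spec("dev-etl", 1, 4, 30, use_spot=True)` describes a development cluster that scales between one and four workers, stops after 30 idle minutes, and runs workers on spot capacity with an on-demand fallback.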

Performance Tuning

  1. Enable Photon Engine
  2. Use the Photon engine, a native vectorized query engine, to accelerate SQL queries and DataFrame API calls. It provides significant performance improvements for data ingestion, ETL, and interactive queries.

  3. Optimize Delta Tables
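  4. Regularly run OPTIMIZE and VACUUM commands. OPTIMIZE compacts small files into larger ones to improve read performance, while VACUUM deletes data files that the table no longer references and that are older than the retention threshold (7 days by default), which saves storage costs.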

  5. Use Liquid Clustering
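  6. For new tables, consider Liquid Clustering in place of traditional partitioning and Z-Ordering. It organizes data by clustering keys that you can change later without rewriting existing data, and with automatic liquid clustering Databricks can choose the keys based on observed query patterns. This reduces the manual layout tuning that partitioning and Z-Ordering require and helps avoid the "small files" problem caused by over-partitioning.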

  7. Adaptive Query Execution (AQE)
  8. Ensure Adaptive Query Execution (AQE) is enabled (default in newer runtimes). AQE optimizes query plans at runtime based on actual data statistics, handling data skew and join strategies dynamically.

  9. Cache Frequently Accessed Data
  10. Use the disk cache (formerly called the Delta cache) to accelerate data reads. It keeps copies of remote files on the local storage (NVMe SSDs) of the worker nodes, so repeated reads avoid a round trip to cloud storage.
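
The table maintenance and session settings above can be sketched as a small helper that builds the SQL statements to run. This is a minimal sketch, not a definitive maintenance job: the function name, the identifier check, and the refusal to go below the default 168-hour retention are assumptions made for illustration. The two configuration keys are the standard switches for AQE and the disk cache.

```python
import re

# Accept one- to three-part names such as catalog.schema.table (assumed rule).
_TABLE = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*){0,2}$")
_COLUMN = re.compile(r"^[A-Za-z_]\w*$")

# Session settings for AQE and the disk cache.
SESSION_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.databricks.io.cache.enabled": "true",
}


def maintenance_statements(table, cluster_by=None, retain_hours=168):
    """Build the SQL statements for one maintenance pass over a Delta table."""
    if not _TABLE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if retain_hours < 168:
        # Going below the default retention can break time travel and
        # concurrent readers, so this sketch refuses to do it.
        raise ValueError("retain_hours must be at least 168")
    statements = []
    if cluster_by:
        for column in cluster_by:
            if not _COLUMN.match(column):
                raise ValueError(f"unexpected column name: {column!r}")
        # Liquid Clustering keys; OPTIMIZE then clusters data by these keys.
        statements.append(
            f"ALTER TABLE {table} CLUSTER BY ({', '.join(cluster_by)})"
        )
    statements.append(f"OPTIMIZE {table}")
    statements.append(f"VACUUM {table} RETAIN {retain_hours} HOURS")
    return statements
```

In a notebook, each statement would be passed to `spark.sql(...)` and each entry of `SESSION_CONF` to `spark.conf.set(key, value)`; scheduling the pass as a job keeps it off interactive clusters.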

Governance & Security

  1. Implement Unity Catalog
  2. Use Unity Catalog for centralized access control, auditing, and data discovery across all Databricks workspaces. It provides a unified governance model for files, tables, and ML models.

  3. Secure Secrets Management
  4. Never hardcode credentials. Use Azure Key Vault-backed secret scopes to securely manage and access secrets, keys, and tokens within notebooks and jobs.

  5. Network Security
  6. Deploy Databricks in your own virtual network (VNet injection) to enable secure connectivity to other Azure services through service endpoints or Azure Private Link.
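
The secrets and access-control practices above can be sketched as follows. This is a minimal sketch, assuming a `secrets` object that behaves like `dbutils.secrets` inside a notebook; the scope name, key names, and helper names are hypothetical, and the grant helper only builds a Unity Catalog `GRANT` statement for you to run.

```python
def jdbc_options(secrets, scope, url):
    """Build JDBC reader options with credentials pulled from a secret scope.

    `secrets` is expected to expose get(scope=..., key=...), as
    dbutils.secrets does; nothing sensitive is hardcoded here.
    """
    return {
        "url": url,
        "user": secrets.get(scope=scope, key="sql-user"),          # assumed key
        "password": secrets.get(scope=scope, key="sql-password"),  # assumed key
    }


def grant_select(table, principal):
    """Return a Unity Catalog statement granting read access on one table."""
    if "`" in principal:
        raise ValueError("principal must not contain backticks")
    return f"GRANT SELECT ON TABLE {table} TO `{principal}`"
```

In a notebook the call would look like `jdbc_options(dbutils.secrets, "kv-scope", url)`, where `kv-scope` stands for a Key Vault-backed scope created by an administrator. Granting to groups rather than individual users keeps the permission model easier to audit.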

Operational Excellence

  1. Orchestrate with Azure Data Factory
  2. Use Azure Data Factory (ADF) or Azure Synapse Pipelines to orchestrate complex workflows. This allows for better dependency management, retries, and monitoring across different Azure services.

  3. CI/CD & Version Control
  4. Integrate with Git (Azure DevOps or GitHub) for version control. Use Databricks Asset Bundles (DABs) or Terraform for Infrastructure as Code (IaC) to automate deployments across environments.

  5. Clean Up Temporary Data
  6. After execution, remove temporary files with dbutils.fs.rm() and drop intermediate tables with DROP TABLE to maintain a clean environment and reduce storage costs.
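
The cleanup step above can be sketched as a single function called at the end of a job. This is a minimal sketch under stated assumptions: `fs` behaves like `dbutils.fs`, `run_sql` behaves like `spark.sql`, and the rule that only paths under a scratch prefix may be deleted is a hypothetical safety guard, not something this guide prescribes.

```python
def cleanup(fs, run_sql, temp_paths, temp_tables, allowed_prefix="/tmp/"):
    """Delete temporary paths and drop intermediate tables.

    fs       -- object with rm(path, recurse), e.g. dbutils.fs
    run_sql  -- callable that executes one SQL statement, e.g. spark.sql
    """
    # Validate everything first so a bad path stops the run before any delete.
    for path in temp_paths:
        if not path.startswith(allowed_prefix):
            raise ValueError(f"refusing to delete outside {allowed_prefix}: {path}")
    for path in temp_paths:
        fs.rm(path, True)  # second argument: recurse into directories
    for table in temp_tables:
        run_sql(f"DROP TABLE IF EXISTS {table}")
    return len(temp_paths), len(temp_tables)
```

In a notebook this would be invoked as `cleanup(dbutils.fs, spark.sql, paths, tables)`. Passing the two handles in as arguments keeps the function easy to exercise with stand-ins outside a cluster.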
