ADB Optimization Best Practice Guide
Azure Databricks (ADB) can process terabytes of data while simultaneously running heavy data science workloads. As data volumes and workloads grow, job performance tends to degrade unless the platform is tuned. Optimizing your ADB environment lets you work faster and saves hours of effort for you and your team. The best practices below cover cost, performance, governance, and operations.
Cost Optimization
- Use Serverless Compute
- Leverage Cluster Policies
- Customize Cluster Termination
- Enable Cluster Autoscaling
- Use Spot Instances
Adopt Serverless SQL Warehouses for interactive SQL workloads to eliminate infrastructure management overhead and optimize costs through consumption-based billing. Serverless compute typically starts within seconds and scales automatically with query load.
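As a rough illustration, the request body for creating a serverless warehouse through the SQL Warehouses REST API might look like the sketch below. The warehouse name, size, and limits are placeholder assumptions, not recommendations; check the field names against the API version you use.

```python
import json

# Sketch of a SQL warehouse definition. All values are illustrative.
serverless_warehouse = {
    "name": "analytics-serverless",      # hypothetical warehouse name
    "cluster_size": "Small",
    "warehouse_type": "PRO",             # serverless warehouses use the Pro type
    "enable_serverless_compute": True,
    "auto_stop_mins": 10,                # stop after 10 idle minutes
    "min_num_clusters": 1,
    "max_num_clusters": 2,               # upper bound for automatic scale-out
}

print(json.dumps(serverless_warehouse, indent=2))
```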
Implement compute policies to enforce cost-effective configurations across all workspaces. Restrict instance types, enforce auto-termination settings, and ensure tagging compliance to prevent cost overruns.
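A policy of this kind is written as a JSON definition. The sketch below builds one in Python; the VM sizes, timeout range, and tag value are assumptions chosen for illustration only.

```python
import json

# Sketch of a compute policy definition. Values are illustrative.
policy_definition = {
    # Restrict instance types to an approved list.
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
        "defaultValue": "Standard_DS3_v2",
    },
    # Enforce auto-termination between 10 and 60 idle minutes.
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 30,
    },
    # Force a cost-tracking tag onto every cluster (hypothetical tag).
    "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
}

policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```

The resulting JSON can be pasted into the policy editor or submitted through the Cluster Policies API.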
Terminating inactive clusters saves costs. Customize the auto-termination time based on the environment (e.g., shorter for production jobs, longer for development) to avoid paying for idle resources.
Enable autoscaling to allow clusters to resize based on workload. Provide a minimum and maximum number of worker nodes so ADB can automatically reallocate resources as needed.
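Auto-termination and autoscaling are both plain fields in a cluster definition. The following sketch builds such a definition per environment; the timeout values, cluster name, runtime version, and VM size are hypothetical placeholders.

```python
# Hypothetical idle timeouts per environment, in minutes.
IDLE_MINUTES = {"production": 15, "development": 60}


def build_cluster_spec(environment: str, min_workers: int, max_workers: int) -> dict:
    """Return a cluster definition with autoscaling and auto-termination set."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": f"etl-{environment}",   # placeholder name
        "spark_version": "15.4.x-scala2.12",    # placeholder runtime
        "node_type_id": "Standard_DS3_v2",      # placeholder VM size
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": IDLE_MINUTES[environment],
    }


print(build_cluster_spec("development", min_workers=1, max_workers=4))
```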
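For interruptible workloads like development or testing, use Azure Spot VMs to save up to 90% on compute costs compared with pay-as-you-go prices. Azure can evict Spot VMs when it needs the capacity back, so avoid them for workloads that cannot tolerate interruption.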
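In a cluster definition, Spot usage is controlled through the `azure_attributes` block. The sketch below adds it to an existing definition; the base cluster shown is a made-up example.

```python
import copy


def with_spot_workers(cluster_spec: dict) -> dict:
    """Return a copy of the cluster definition that runs workers on Spot VMs."""
    spec = copy.deepcopy(cluster_spec)
    spec["azure_attributes"] = {
        # Keep the first node (the driver) on an on-demand VM.
        "first_on_demand": 1,
        # Fall back to on-demand VMs when Spot capacity is unavailable.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # -1 means the VM is not evicted on the basis of price.
        "spot_bid_max_price": -1,
    }
    return spec


base = {"cluster_name": "dev-sandbox", "num_workers": 4}  # hypothetical cluster
spot_spec = with_spot_workers(base)
print(spot_spec)
```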
Performance Tuning
- Enable Photon Engine
- Optimize Delta Tables
- Use Liquid Clustering
- Adaptive Query Execution (AQE)
- Cache Frequently Accessed Data
Use the Photon engine, a native vectorized query engine, to accelerate SQL queries and DataFrame API calls. It provides significant performance improvements for data ingestion, ETL, and interactive queries.
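On classic compute, Photon is selected with the `runtime_engine` field of the cluster definition. A minimal sketch, with placeholder name, runtime, and VM size (check which instance types support Photon in your region):

```python
# Sketch of a Photon-enabled cluster definition. Values are illustrative.
photon_cluster = {
    "cluster_name": "etl-photon",          # hypothetical name
    "spark_version": "15.4.x-scala2.12",   # placeholder runtime
    "node_type_id": "Standard_E8ds_v4",    # placeholder VM size
    "num_workers": 4,
    "runtime_engine": "PHOTON",            # "STANDARD" disables Photon
}

print(photon_cluster["runtime_engine"])
```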
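Regularly run OPTIMIZE and VACUUM commands. OPTIMIZE compacts small files into larger ones to improve read performance, while VACUUM deletes data files that the table no longer references and that are older than the retention threshold (7 days by default), which reduces storage costs.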
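A scheduled maintenance job could generate these commands as in the sketch below. The table name is a hypothetical example, and the guard against short retention periods is an illustrative safety choice rather than a platform requirement.

```python
DEFAULT_RETENTION_HOURS = 168  # 7 days, the default VACUUM retention threshold


def maintenance_statements(table: str, retain_hours: int = DEFAULT_RETENTION_HOURS) -> list:
    """Build the OPTIMIZE and VACUUM statements for one Delta table."""
    if retain_hours < DEFAULT_RETENTION_HOURS:
        # Shorter retention can break time travel and long-running readers.
        raise ValueError("retain_hours below the 7-day default is not allowed here")
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


statements = maintenance_statements("main.sales.orders")  # hypothetical table
for statement in statements:
    print(statement)
```

In a Databricks notebook, each statement would be passed to `spark.sql(statement)`.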
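Replace traditional partitioning and Z-Ordering with Liquid Clustering on Delta tables. You declare clustering keys and can change them later without rewriting existing data, which avoids the "small files" problem that over-partitioning creates and improves query performance with far less manual tuning. With automatic liquid clustering, Databricks can also choose the keys for you based on observed query patterns.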
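Liquid Clustering is declared with a `CLUSTER BY` clause. The sketch below generates the statements for creating a clustered table and for changing its keys later; the table, columns, and keys are hypothetical.

```python
def create_clustered_table(table: str, columns_ddl: str, cluster_keys: list) -> str:
    """Build a CREATE TABLE statement that enables liquid clustering."""
    keys = ", ".join(cluster_keys)
    return f"CREATE TABLE IF NOT EXISTS {table} ({columns_ddl}) CLUSTER BY ({keys})"


def change_cluster_keys(table: str, cluster_keys: list) -> str:
    """Build an ALTER TABLE statement that redefines the clustering keys."""
    keys = ", ".join(cluster_keys)
    return f"ALTER TABLE {table} CLUSTER BY ({keys})"


create_sql = create_clustered_table(
    "main.sales.orders",                          # hypothetical table
    "order_id BIGINT, customer_id BIGINT, order_date DATE",
    ["order_date"],
)
alter_sql = change_cluster_keys("main.sales.orders", ["customer_id", "order_date"])
print(create_sql)
print(alter_sql)
```

Running `OPTIMIZE` on the table afterwards clusters data according to the current keys.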
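Ensure Adaptive Query Execution (AQE) is enabled (it is the default in current runtimes and is controlled by spark.sql.adaptive.enabled). AQE re-optimizes query plans at runtime based on actual data statistics: it coalesces small shuffle partitions, splits skewed partitions in joins, and switches join strategies dynamically.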
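The relevant Spark settings can be applied explicitly when you want to be sure they have not been overridden. In this sketch the setter is passed in as a function so the code also runs outside Databricks; in a notebook you would pass `spark.conf.set`.

```python
# Standard Spark SQL configuration keys for AQE.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def apply_settings(set_conf, settings=AQE_SETTINGS):
    """Apply each setting through a setter such as spark.conf.set."""
    for key, value in settings.items():
        set_conf(key, value)


# Stand-in for spark.conf.set: record the settings in a plain dict.
applied = {}
apply_settings(applied.__setitem__)
print(applied)
```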
Use the disk cache (formerly called the Delta cache) to accelerate data reads. It keeps copies of remote data files on the local storage (NVMe SSDs) of the worker nodes, so repeated reads of the same data avoid a round trip to cloud storage.
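The disk cache is switched on with a Spark configuration flag, which can be set in the cluster definition. A minimal sketch, assuming a storage-optimized VM size with local NVMe disks; the cluster name, runtime, and size are placeholders.

```python
# Sketch of a cluster definition with the disk cache enabled explicitly.
cached_cluster = {
    "cluster_name": "bi-cache",            # hypothetical name
    "spark_version": "15.4.x-scala2.12",   # placeholder runtime
    "node_type_id": "Standard_L8s_v3",     # placeholder storage-optimized VM size
    "num_workers": 2,
    "spark_conf": {
        "spark.databricks.io.cache.enabled": "true",
    },
}

print(cached_cluster["spark_conf"])
```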
Governance & Security
- Implement Unity Catalog
- Secure Secrets Management
- Network Security
Use Unity Catalog for centralized access control, auditing, and data discovery across all Databricks workspaces. It provides a unified governance model for files, tables, and ML models.
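Access in Unity Catalog is granted with SQL at the catalog, schema, and object level. The sketch below generates the three grants a group needs to read a single table; the catalog, schema, table, and group names are hypothetical.

```python
def read_only_grants(catalog: str, schema: str, table: str, principal: str) -> list:
    """Build the GRANT statements that let a principal read one table."""
    quoted = f"`{principal}`"  # backticks allow names with spaces or dashes
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {quoted}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {quoted}",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {quoted}",
    ]


grants = read_only_grants("main", "sales", "orders", "data-analysts")
for grant in grants:
    print(grant)
```

Granting to groups rather than individual users keeps the permission model easier to audit.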
Never hardcode credentials. Use Azure Key Vault backed secret scopes to securely manage and access secrets, keys, and tokens within notebooks and jobs.
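In a notebook, secrets are read with `dbutils.secrets.get(scope=..., key=...)`. The sketch below wraps that call so connection options never contain literal credentials. The scope, key names, and JDBC URL are hypothetical, and `FakeSecrets` is only a stand-in so the example runs outside Databricks.

```python
def jdbc_options(secrets, scope: str, url: str) -> dict:
    """Build JDBC options, reading credentials from a secret scope."""
    return {
        "url": url,
        "user": secrets.get(scope=scope, key="sql-user"),          # hypothetical key
        "password": secrets.get(scope=scope, key="sql-password"),  # hypothetical key
    }


class FakeSecrets:
    """Local stand-in with the same get(scope, key) shape as dbutils.secrets."""

    def __init__(self, values: dict):
        self._values = values

    def get(self, scope: str, key: str) -> str:
        return self._values[(scope, key)]


fake = FakeSecrets({
    ("kv-backed-scope", "sql-user"): "etl_reader",
    ("kv-backed-scope", "sql-password"): "not-a-real-password",
})
options = jdbc_options(fake, "kv-backed-scope", "jdbc:sqlserver://example.invalid:1433")
print(sorted(options))  # print the option names only, never the secret values
```

In a Databricks notebook, pass `dbutils.secrets` in place of `fake`.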
Deploy Databricks in your own virtual network (VNet injection) to enable secure connectivity to other Azure services through service endpoints or Azure Private Link.
Operational Excellence
- Orchestrate with Azure Data Factory
- CI/CD & Version Control
- Clean Up Temporary Data
Use Azure Data Factory (ADF) or Azure Synapse Pipelines to orchestrate complex workflows. This allows for better dependency management, retries, and monitoring across different Azure services.
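In an ADF pipeline, a notebook run is a `DatabricksNotebook` activity with its own retry policy and dependencies. The sketch below builds one such activity definition; the activity names, linked service, notebook path, and parameter are hypothetical.

```python
import json

# Sketch of one ADF pipeline activity. Names and paths are illustrative.
notebook_activity = {
    "name": "TransformOrders",
    "type": "DatabricksNotebook",
    # Run only after the upstream copy activity has succeeded.
    "dependsOn": [
        {"activity": "CopyRawOrders", "dependencyConditions": ["Succeeded"]}
    ],
    # Retry twice, one minute apart, with a one-hour timeout (d.hh:mm:ss).
    "policy": {"retry": 2, "retryIntervalInSeconds": 60, "timeout": "0.01:00:00"},
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "notebookPath": "/Shared/etl/transform_orders",
        "baseParameters": {"run_date": "2024-01-01"},
    },
}

print(json.dumps(notebook_activity, indent=2))
```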
Integrate with Git (Azure DevOps or GitHub) for version control. Use Databricks Asset Bundles (DABs) or Terraform for Infrastructure as Code (IaC) to automate deployments across environments.
Use dbutils.fs.rm() to remove temporary files and drop intermediate tables after execution to maintain a clean environment and reduce storage costs.
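A cleanup step at the end of a job might look like the sketch below. The paths and table names are hypothetical, and the two recorder objects are stand-ins for `dbutils.fs` and `spark.sql` so the example runs outside Databricks.

```python
def cleanup(fs, run_sql, temp_paths: list, temp_tables: list) -> None:
    """Delete temporary directories and drop intermediate tables."""
    for path in temp_paths:
        fs.rm(path, True)  # second argument: delete recursively
    for table in temp_tables:
        run_sql(f"DROP TABLE IF EXISTS {table}")


class RecordingFs:
    """Local stand-in for dbutils.fs that records rm calls."""

    def __init__(self):
        self.removed = []

    def rm(self, path, recurse=False):
        self.removed.append((path, recurse))
        return True


fs = RecordingFs()
executed_sql = []
cleanup(
    fs,
    executed_sql.append,
    temp_paths=["dbfs:/tmp/orders_staging"],        # hypothetical path
    temp_tables=["main.sales.orders_intermediate"], # hypothetical table
)
print(fs.removed)
print(executed_sql)
```

In a Databricks notebook, call `cleanup(dbutils.fs, spark.sql, ...)` instead.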