Best practices in Databricks
Optimizing Performance, Collaboration, and Security
Databricks is a unified analytics platform, providing cloud-based services for big data and machine learning. This guide outlines best practices to optimize Databricks performance, enhance collaboration, and ensure security.
Performance & Cost Optimization
- Use Automatic Liquid Clustering
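- Replace traditional partitioning and Z-Ordering with Liquid Clustering. With automatic liquid clustering, Databricks selects and adjusts clustering keys based on observed query patterns, which avoids the over-partitioning that produces many small files and improves query performance without manual layout tuning. Automatic key selection applies to Unity Catalog managed tables with Predictive Optimization enabled.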
- Adopt Serverless Compute
- Use Serverless SQL Warehouses and Serverless Jobs to reduce idle time and remove infrastructure management. Serverless compute starts and scales quickly, and you pay for the compute consumed while workloads run rather than for idle clusters.
- Enable Predictive Optimization
- For Unity Catalog managed tables, enable Predictive Optimization to automatically run maintenance operations like OPTIMIZE and VACUUM at the optimal time.
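As a minimal sketch of the clustering and maintenance items above, the helpers below build the two SQL statements involved: `ALTER TABLE ... CLUSTER BY AUTO` and `ALTER SCHEMA ... ENABLE PREDICTIVE OPTIMIZATION`. The object names (`main.sales.orders`, `main.sales`) and the helper functions are hypothetical, and the sketch assumes Unity Catalog managed tables. In a notebook, each returned string would typically be passed to `spark.sql(...)`.

```python
import re

# Accept one to three dot-separated identifiers (table, schema.table,
# or catalog.schema.table). This is a deliberately strict assumption.
_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){0,2}$")


def _checked(name: str) -> str:
    """Reject anything that is not a plain dotted identifier."""
    if not _NAME.match(name):
        raise ValueError(f"unexpected object name: {name!r}")
    return name


def auto_clustering_sql(table: str) -> str:
    """Statement that lets Databricks choose clustering keys for a table."""
    return f"ALTER TABLE {_checked(table)} CLUSTER BY AUTO"


def predictive_optimization_sql(schema: str) -> str:
    """Statement that enables Predictive Optimization for a schema."""
    return f"ALTER SCHEMA {_checked(schema)} ENABLE PREDICTIVE OPTIMIZATION"


if __name__ == "__main__":
    # Hypothetical names; replace with your own catalog, schema, and table.
    print(auto_clustering_sql("main.sales.orders"))
    print(predictive_optimization_sql("main.sales"))
```

Validating the names before formatting keeps the helpers safe to call with values read from configuration.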
Data Engineering & Development
- Databricks Asset Bundles (DABs)
- Use Databricks Asset Bundles for Infrastructure as Code (IaC) and CI/CD. DABs allow you to define your jobs, pipelines, and infrastructure in YAML and deploy them consistently across environments (Dev, Staging, Prod).
- Lakeflow Declarative Pipelines
- Use Lakeflow Declarative Pipelines (formerly Delta Live Tables) to build reliable ETL pipelines. Define your data transformations declaratively, and let Databricks handle orchestration, error handling, and auto-scaling.
- Version Control with Git
- Integrate Databricks Git folders (formerly Repos) with your Git provider, such as GitHub or Azure DevOps, to enable branch-based development, code reviews, and version history for notebooks and code.
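To make the Asset Bundles item above more concrete, here is a rough sketch that renders a minimal `databricks.yml` with one job and two targets. The top-level keys (`bundle`, `resources`, `targets`) follow the bundle configuration layout, but every value is a placeholder: the job name, notebook path, and workspace hosts are hypothetical, and the job assumes the workspace can supply compute for a task that does not declare a cluster. Python is used only to keep the example self-contained; the YAML text is the part that matters.

```python
from pathlib import Path
from string import Template

# Minimal bundle definition. All names, paths, and hosts are placeholders.
BUNDLE_YAML = Template("""\
bundle:
  name: $bundle_name

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest.py

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: $dev_host
  prod:
    workspace:
      host: $prod_host
""")


def render_bundle(bundle_name: str, dev_host: str, prod_host: str) -> str:
    """Fill the placeholders and return the databricks.yml text."""
    return BUNDLE_YAML.substitute(
        bundle_name=bundle_name, dev_host=dev_host, prod_host=prod_host
    )


def write_bundle(directory: Path, text: str) -> Path:
    """Write the rendered text to <directory>/databricks.yml."""
    path = directory / "databricks.yml"
    path.write_text(text, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical workspace URLs.
    print(render_bundle(
        "demo_bundle",
        "https://dev.example.cloud.databricks.com",
        "https://prod.example.cloud.databricks.com",
    ))
```

With the Databricks CLI installed, a file like this would usually be checked with `databricks bundle validate` and deployed with `databricks bundle deploy -t dev`, using the same definition for each target.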
Governance & Security
- Unity Catalog
- Centralize access control, auditing, and data discovery with Unity Catalog. It provides a unified governance layer for all your data and AI assets (tables, files, models) across workspaces.
- Attribute-Based Access Control (ABAC)
- Implement ABAC to create scalable access policies based on tags (e.g., "PII", "Confidential") rather than managing permissions for individual users and tables.
- Databricks Clean Rooms
- Use Clean Rooms for secure collaboration with external partners. Share data and run joint analyses without exposing the underlying raw data or moving it out of your environment.
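The idea behind tag-based policies in the ABAC item above can be illustrated with a small conceptual model. This is plain Python, not the Unity Catalog API: `TagPolicy`, `readable_columns`, the group names, and the column names are all hypothetical. It assumes one simple rule, namely that a column is readable only if the user belongs to an allowed group for every governed tag on that column.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TagPolicy:
    """Grants read access to columns carrying `tag` to members of `groups`."""
    tag: str
    groups: frozenset[str]


def readable_columns(
    user_groups: set[str],
    column_tags: dict[str, set[str]],
    policies: list[TagPolicy],
) -> list[str]:
    """Return the columns a user may read under the tag-based policies."""
    allowed_by_tag = {p.tag: p.groups for p in policies}
    visible = []
    for column, tags in column_tags.items():
        # Only tags that have a policy restrict access.
        governed = tags & set(allowed_by_tag)
        if all(user_groups & allowed_by_tag[tag] for tag in governed):
            visible.append(column)
    return visible


if __name__ == "__main__":
    columns = {
        "order_id": set(),
        "email": {"PII"},
        "margin": {"Confidential"},
    }
    policies = [
        TagPolicy("PII", frozenset({"privacy_team"})),
        TagPolicy("Confidential", frozenset({"finance"})),
    ]
    print(readable_columns({"analysts"}, columns, policies))
    print(readable_columns({"finance"}, columns, policies))
```

Two policies cover every tagged column here. Tagging a new column "PII" changes who can read it without anyone writing a new grant, which is the scaling benefit the item describes.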
Generative AI
- Mosaic AI Agent Framework
- Build and deploy production-grade GenAI agents using the Mosaic AI Agent Framework. It provides tools for evaluation, tracing, and deployment of LLM applications.
- Vector Search
- Use Mosaic AI Vector Search to build RAG (Retrieval Augmented Generation) applications. It automatically syncs your Delta tables to a vector index for fast semantic search.
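The retrieval step that a vector index performs in a RAG application can be sketched in a few lines. This is a conceptual illustration in plain Python, not the Mosaic AI Vector Search API: the document IDs and three-dimensional "embeddings" are made up, and a real index would use model-generated embeddings and approximate nearest-neighbor search rather than a full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k documents most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy embeddings; hypothetical document IDs.
    index = {
        "refund-policy": [0.9, 0.1, 0.0],
        "shipping-times": [0.1, 0.9, 0.1],
        "api-limits": [0.0, 0.2, 0.9],
    }
    query = [0.8, 0.2, 0.0]
    print(top_k(query, index))  # ['refund-policy', 'shipping-times']
```

In a RAG flow, the text of the returned documents would then be added to the LLM prompt as context.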
Last updated: December 1, 2025