January 13, 2021
Azure Databricks (ADB) has the power to process terabytes of data while simultaneously running heavy data science workloads. Over time, as data input and workloads increase, job performance decreases. As an ADB developer, optimizing your platform enables you to work faster and save hours of effort for you and your team. Below are the 18 best practices you need to optimize your ADB environment.
Use dbutils.fs.rm() to permanently delete temporary table metadata. If you don't use this statement, an error message will appear on the next run stating that the table already exists. To avoid this error in daily refreshes, you must call dbutils.fs.rm() as part of your cleanup.
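As a minimal sketch (the table name tmp_sales and the path /tmp/tmp_sales are hypothetical placeholders), the cleanup might look like this:

# Drop the temporary table, then remove its underlying files so the
# next scheduled run can recreate it without a "table already exists" error.
spark.sql("DROP TABLE IF EXISTS tmp_sales")
dbutils.fs.rm("/tmp/tmp_sales", True)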
String comparison in Spark SQL is case-sensitive by default, so normalize both sides with LOWER() or UPPER() before comparing. The following query shows the difference:

SELECT 'MAQSoftware' = 'maqsoftware' AS WithOutLowerOrUpper
,LOWER('MAQSoftware') = 'maqsoftware' AS WithLower
,UPPER('MAQSoftware') = 'MAQSOFTWARE' AS WithUpper
Enable Adaptive Query Execution (AQE), which re-optimizes query plans at runtime based on statistics collected during execution (available in Spark 3.0+ / Databricks Runtime 7.x and later):

SET spark.sql.adaptive.enabled = true;
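If you prefer to set this from a Python notebook cell, a minimal equivalent for the current SparkSession is:

# Enable AQE for this session; equivalent to the SQL SET statement above.
spark.conf.set("spark.sql.adaptive.enabled", "true")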
Partition Delta tables on a column that queries frequently filter on, such as the event date, so reads can skip irrelevant partitions:

CREATE TABLE events (
  date DATE
  ,eventId STRING
  ,eventType STRING
  ,data STRING
) USING delta PARTITIONED BY (date)
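The same partitioning can be applied when writing a DataFrame to Delta; a minimal sketch, where events_df and the output path are hypothetical:

# Write the DataFrame as a Delta table partitioned by date.
events_df.write \
    .format("delta") \
    .partitionBy("date") \
    .mode("overwrite") \
    .save("/mnt/delta/events")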
To explore raw Parquet files without first registering them as a table, query the file path directly:

SELECT ColumnName FROM parquet.`Location of the file`
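The DataFrame API offers the same shortcut; a minimal sketch with a hypothetical path:

# Read Parquet files directly into a DataFrame without registering a table.
df = spark.read.parquet("/mnt/raw/events")
df.select("ColumnName").show()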
When loading large volumes of data into Azure Synapse Analytics (formerly SQL Data Warehouse), use the dedicated connector, which bulk-loads through a storage staging area rather than inserting row by row over JDBC:

df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "my_table_in_dw_copy") \
  .option("tableOptions", "table_options") \
  .option("tempDir", "wasbs://<container>@<storage-account>.blob.core.windows.net/<dir>") \
  .save()

Note that the connector requires tempDir, a staging location in Azure storage that both Databricks and Synapse can access; the placeholder values above are illustrative.
Microsoft offers additional documents that provide a high-level framework for best practices. We encourage you to review the following: