
How to become a Databricks Certified Data Engineer Professional?

  • Writer: Kamila Woldan
  • Nov 29, 2022
  • 2 min read

Updated: Dec 1, 2022

Below you will find some advice on how to prepare for the exam. #databricks #dataengineer #certification




Looking back on my exam preparation, I would strongly advise planning out the areas of knowledge that need to be covered in a specific timeframe.


If you have less than a year of day-to-day project experience with the technology, I would advise revising the basics of the architecture and the Delta format.


As a data engineer, you are required to know the principles of the architecture and understand the different layers of a data lakehouse, as well as data processing and performance optimization techniques.

I would like to underline a few compulsory subjects or materials to work through:

  • the basics of Apache Spark and the Databricks architecture,

  • the Delta format and its advantages over Parquet,

  • the notebooks available on the Databricks Academy repositories.

In addition, I would suggest learning about the various performance issues.


The basics of Apache Spark and Databricks architecture


I can recommend the following links for a review of your knowledge:


Exploring delta tables


A great baseline for the Delta format is the Unpacking the Transaction Log blog post from the Databricks Diving Into Delta Lake series.


Test how the Delta format works and what happens to the transaction log when you make changes to a specific table. Check the table history, the transaction log files, the statistics, and how many files were created or made inactive after the change.
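At its core, the transaction log is just an ordered set of JSON commit files under the table's `_delta_log/` directory, each listing `add` and `remove` file actions. The sketch below uses hand-written, heavily simplified commit entries (real commits carry many more fields) to show how replaying the log yields the set of active files — which is exactly what you should see change after an UPDATE:

```python
import json

# Simplified commit entries in the style of Delta's _delta_log JSON files
# (hand-written for illustration; real commits carry many more fields).
commits = [
    '{"add": {"path": "part-0000.parquet", "dataChange": true}}',
    '{"add": {"path": "part-0001.parquet", "dataChange": true}}',
    # A later commit rewrites part-0000 (e.g. after an UPDATE):
    '{"remove": {"path": "part-0000.parquet", "dataChange": true}}',
    '{"add": {"path": "part-0002.parquet", "dataChange": true}}',
]

# Replay the log: "add" makes a file active, "remove" makes it inactive.
active = set()
for line in commits:
    action = json.loads(line)
    if "add" in action:
        active.add(action["add"]["path"])
    elif "remove" in action:
        active.discard(action["remove"]["path"])

print(sorted(active))  # ['part-0001.parquet', 'part-0002.parquet']
```

Note how `part-0000.parquet` is never deleted from storage by the commit — it is only marked removed, which is what makes time travel and VACUUM behave the way they do.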


Databricks Academy repository


Import the notebooks from the Databricks Academy Git repository into your Databricks workspace.


Go through each notebook, as their scope covers the particular requirements listed in the exam details section for the certification.


Read the comments in the notebooks and look up any term you are not familiar with.


Performance issues


Databricks provides a Spark UI simulator (see the Spark UI documentation), where you can analyze ready-made examples of performance issues.


I would recommend focusing on the following performance issues:

  • spill

  • skew

  • shuffle

  • storage

  • serialization
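Of these, skew is perhaps the easiest to reason about without a cluster. A common mitigation is key salting: appending a random suffix to a hot key so it spreads across partitions (the other join side must then be replicated across the same salt values). The pure-Python sketch below only mimics hash partitioning — the dataset, key names, and partition count are all made up for illustration:

```python
import random

def partition_of(key: str, n_partitions: int) -> int:
    """Hash-partition a key, loosely mimicking shuffle partitioning."""
    return hash(key) % n_partitions

# A skewed dataset: one hot key dominates.
records = [("hot", i) for i in range(900)] + [(f"k{i}", i) for i in range(100)]

# Without salting, every record with the hot key lands in one partition.
plain = {partition_of(k, 8) for k, _ in records if k == "hot"}

# With salting, a random suffix spreads the hot key across partitions.
salted = {
    partition_of(f"{k}#{random.randrange(8)}", 8)
    for k, _ in records if k == "hot"
}

print(f"partitions holding 'hot' without salting: {len(plain)}")
print(f"partitions holding 'hot' with salting:    {len(salted)}")
```

In practice, recent Spark runtimes can also handle many skewed joins automatically via AQE's skew-join optimization, so know both the manual technique and the automatic one.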


Most of these topics are covered in the Apache Spark Core – Practical Optimization session given by Daniel Tomes at Spark + AI Summit 2019. It gives you a baseline and a good understanding of performance basics and introduces more complex issues. Also worth watching is the Fine Tuning and Enhancing Performance of Apache Spark Jobs session from Spark + AI Summit 2020.


Be familiar with Adaptive Query Execution (AQE), Dynamic Partition Pruning (DPP), and cluster configuration.
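As a quick reference, these are the Spark SQL configuration keys behind those two features (enabled by default in recent Spark 3.x runtimes — verify the defaults against your cluster's version):

```
spark.sql.adaptive.enabled                           true
spark.sql.adaptive.coalescePartitions.enabled        true
spark.sql.adaptive.skewJoin.enabled                  true
spark.sql.optimizer.dynamicPartitionPruning.enabled  true
```

Knowing which key toggles which behavior helps both on the exam and when reading a Spark UI query plan that mentions AQE re-optimization or a dynamic pruning filter.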

If you have an opportunity to take part in a workshop organized by Databricks, go for it. It lets you systematize your knowledge for the exam in a short time and gives you the chance to ask any questions you may have. The trainers have amazing knowledge and are willing to share it with you.


Some of the topics covered are not trivial, so it may happen that after 2 hours of intensive reading and testing, you find you have had enough of exploring, e.g., performance issues for that day. This is why I recommend spreading your studies out over time so they do not interfere with your work and private life.


Good luck! :)


Kamila


The badge graphic comes from the official Databricks website. Source: https://www.databricks.com/learn/certification/data-engineer-professional


