databricks CERTIFIED DATA ENGINEER PROFESSIONAL Exam Questions

Questions for the CERTIFIED DATA ENGINEER PROFESSIONAL were updated on : Jan 11 ,2025

Page 1 out of 11. Viewing questions 1-10 out of 110

Question 1

Which distribution does Databricks support for installing custom Python code packages?

  • A. sbt
  • B. CRANC. npm
  • D. Wheels
  • E. jars
Answer:

d

User Votes:
A
50%
B
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
D
E
0 / 1000

Question 2

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.
The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

  • A. The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
  • B. Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
  • C. Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.
  • D. Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
  • E. Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
Answer:

d

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000

Question 3

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.

Original query:



Proposed query:



Proposed query:

.start(/item_agg)

Which step must also be completed to put the proposed query into production?

  • A. Specify a new checkpointLocation
  • B. Increase the shuffle partitions to account for additional aggregates
  • C. Run REFRESH TABLE delta.'/item_agg'
  • D. Register the data in the "/item_agg" directory to the Hive metastore
  • E. Remove .option(mergeSchema, true) from the streaming write
Answer:

a

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000

Question 4

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represent all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

  • A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.
  • B. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
  • C. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_iogin by user_id write a merge statement to update or insert the most recent value for each user_id.
  • D. Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
  • E. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
Answer:

c

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000

Question 5

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

  • A. Set the configuration delta.deduplicate = true.
  • B. VACUUM the Delta table after each batch completes.
  • C. Perform an insert-only merge with a matching condition on a unique key.
  • D. Perform a full outer join on a unique key and overwrite existing data.
  • E. Rely on Delta Lake schema enforcement to prevent duplicate records.
Answer:

d

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000

Question 6

What statement is true regarding the retention of job run history?

  • A. It is retained until you export or delete job run logs
  • B. It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3
  • C. It is retained for 60 days, during which you can export notebook run results to HTML
  • D. It is retained for 60 days, after which logs are archived
  • E. It is retained for 90 days or until the run-id is re-used through custom run configuration
Answer:

b

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000

Question 7

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?

  • A. Create a separate history table for each pk_id resolve the current state of the table by running a union all filtering the history tables for the most recent state.
  • B. Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
  • C. Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.
  • D. Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
  • E. Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.
Answer:

e

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000

Question 8

The data engineering team maintains the following code:



Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

  • A. The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.
  • B. A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.
  • C. The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.
  • D. An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.
  • E. An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.
Answer:

e

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000

Question 9

A Delta Lake table was created with the below query:



Consider the following query:


DROP TABLE prod.sales_by_store
If this statement is executed by a workspace admin, which result will occur?

  • A. Nothing will occur until a COMMIT command is executed.
  • B. The table will be removed from the catalog but the data will remain in storage.
  • C. The table will be removed from the catalog and the data will be deleted.
  • D. An error will occur because Delta Lake prevents the deletion of production data.
  • E. Data will be marked as deleted but still recoverable with Time Travel.
Answer:

d

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000

Question 10

A Delta Lake table was created with the below query:

Realizing that the original query had a typographical error, the below code was executed:
ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store
Which result will occur after running the second command?

  • A. The table reference in the metastore is updated and no data is changed.
  • B. The table name change is recorded in the Delta transaction log.
  • C. All related files and metadata are dropped and recreated in a single ACID transaction.
  • D. The table reference in the metastore is updated and all data files are moved.
  • E. A new Delta transaction log Is created for the renamed table.
Answer:

a

User Votes:
A
50%
B
50%
C
50%
D
50%
E
50%
Discussions
vote your answer:
A
B
C
D
E
0 / 1000
To page 2