Databricks CERTIFIED ASSOCIATE DEVELOPER FOR APACHE SPARK Exam Questions

Questions for the CERTIFIED ASSOCIATE DEVELOPER FOR APACHE SPARK were updated on: Jan 11, 2025

Page 1 out of 11. Viewing questions 1-10 out of 102

Question 1

Which of the following operations can be used to return a DataFrame with no duplicate rows? Please select the most complete answer.

  • A. DataFrame.distinct()
  • B. DataFrame.dropDuplicates() and DataFrame.distinct()
  • C. DataFrame.dropDuplicates()
  • D. DataFrame.drop_duplicates()
  • E. DataFrame.dropDuplicates(), DataFrame.distinct() and DataFrame.drop_duplicates()
Answer:

e

Question 2

The code block shown below contains an error. The code block is intended to create a single-column DataFrame from the Scala List years, which is made up of integers. Identify the error.

Code block:

spark.createDataset(years)

  • A. The years list should be wrapped in another list like List(years) to make clear that it is a column rather than a row.
  • B. The data type is not specified; the second argument to createDataset should be IntegerType.
  • C. There is no operation createDataset; the createDataFrame operation should be used instead.
  • D. The result of the above is a Dataset rather than a DataFrame; the toDF operation must be called at the end.
  • E. The column name must be specified as the second argument to createDataset.
Answer:

d

Question 3

A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcasted and why?

  • A. Either DataFrame can be broadcasted. The joins will be identical in result and efficiency.
  • B. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
  • C. DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of DataFrame B.
  • D. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A.
  • E. DataFrame A should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
Answer:

d
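
The reasoning behind this answer can be illustrated with a minimal pure-Python sketch of a broadcast hash join. It is a simplified model, not Spark's implementation: the function name broadcast_hash_join, the toy rows, and the list-of-lists stand-in for partitions are all assumptions made for illustration. The small side is turned into a lookup table that every partition of the large side can consult, so rows of the large side never leave their partition.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a full copy of the small side."""
    # Build the lookup table once from the small side. In a cluster, this
    # table is what would be copied ("broadcast") to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Rows of the large side are joined where they already live:
        # no row moves to another partition, so the large side is not shuffled.
        joined = [
            {**left, **right}
            for left in partition
            for right in lookup.get(left[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical toy data: two partitions of the large side, one small side.
employees_partitions = [
    [{"storeId": 1, "employee": "ana"}, {"storeId": 2, "employee": "bo"}],
    [{"storeId": 1, "employee": "cy"}, {"storeId": 3, "employee": "di"}],
]
stores = [{"storeId": 1, "city": "Lyon"}, {"storeId": 2, "city": "Oslo"}]

result = broadcast_hash_join(employees_partitions, stores, "storeId")
```

Swapping the roles would mean copying the large side to every executor, which is why the smaller DataFrame is the one to broadcast.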

Question 4

Which of the following Spark properties is used to configure whether DataFrame partitions that do not meet a minimum size threshold are automatically coalesced into larger partitions during a shuffle?

  • A. spark.sql.shuffle.partitions
  • B. spark.sql.autoBroadcastJoinThreshold
  • C. spark.sql.adaptive.skewJoin.enabled
  • D. spark.sql.inMemoryColumnarStorage.batchSize
  • E. spark.sql.adaptive.coalescePartitions.enabled
Answer:

e
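
To make "coalescing small partitions" concrete, here is a rough pure-Python sketch of the idea. It is only an illustrative model under stated assumptions: the function coalesce_small_partitions, the target_size parameter, and the sample sizes are hypothetical, and the greedy merge of adjacent partitions is a simplification rather than Spark's actual algorithm. The property in the answer only switches this kind of behavior on or off.

```python
def coalesce_small_partitions(partition_sizes, target_size):
    """Greedily merge adjacent partitions without letting a merged group exceed target_size.

    A single partition that is already larger than target_size is kept on its own.
    """
    groups = []
    current = []
    current_size = 0
    for size in partition_sizes:
        # Close the current group if adding this partition would overshoot the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical shuffle partition sizes in MB.
sizes_mb = [5, 10, 3, 60, 2, 2, 40, 30]
merged = coalesce_small_partitions(sizes_mb, target_size=64)
```

With these sample sizes, the eight input partitions collapse into four groups, and no row data is lost or reordered across groups.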

Question 5

The code block shown below contains an error. The code block is intended to create a Python UDF assessPerformanceUDF() using the integer-returning Python function assessPerformance() and apply it to column customerSatisfaction in DataFrame storesDF. Identify the error.
Code block:
assessPerformanceUDF = udf(assessPerformance)
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))

  • A. The assessPerformance() operation is not properly registered as a UDF.
  • B. The withColumn() operation is not appropriate here; UDFs should be applied by iterating over rows instead.
  • C. UDFs can only be applied via SQL and not through the DataFrame API.
  • D. The return type of the assessPerformanceUDF() is not specified in the udf() operation.
  • E. The assessPerformance() operation should be used on column customerSatisfaction rather than the assessPerformanceUDF() operation.
Answer:

d

Question 6

Which of the following code blocks writes DataFrame storesDF to file path filePath as JSON?

  • A. storesDF.write.option("json").path(filePath)
  • B. storesDF.write.json(filePath)
  • C. storesDF.write.path(filePath)
  • D. storesDF.write(filePath)
  • E. storesDF.write().json(filePath)
Answer:

b

Question 7

The code block below contains a logical error resulting in inefficiency. The code block is intended to efficiently perform a broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF using key column storeId. Identify the logical error.
Code block:
storesDF.join(broadcast(employeesDF), "storeId")

  • A. The larger DataFrame employeesDF is being broadcasted rather than the smaller DataFrame storesDF.
  • B. There is never a need to call the broadcast() operation in Apache Spark 3.
  • C. The entire line of code should be wrapped in broadcast() rather than just DataFrame employeesDF.
  • D. The broadcast() operation will only perform a broadcast join if the Spark property spark.sql.autoBroadcastJoinThreshold is manually set.
  • E. Only one of the DataFrames is being broadcasted rather than both of the DataFrames.
Answer:

a

Question 8

Which of the following operations can be used to create a new DataFrame that has 12 partitions from an original DataFrame df that has 8 partitions?

  • A. df.repartition(12)
  • B. df.cache()
  • C. df.partitionBy(1.5)
  • D. df.coalesce(12)
  • E. df.partitionBy(12)
Answer:

a
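
The difference between the two candidate operations can be sketched in plain Python. This is a toy model under stated assumptions, not Spark code: partitions are lists of rows, and the round-robin placement used here is a stand-in for a real shuffle. What it is meant to show is that a repartition redistributes all rows and can reach any partition count, while a coalesce only merges existing partitions and therefore cannot go from 8 to 12.

```python
def repartition(partitions, n):
    """Model of a full shuffle: redistribute every row into exactly n partitions."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        # Round-robin placement is an assumption made for this sketch.
        out[i % n].append(row)
    return out


def coalesce(partitions, n):
    """Model of a shuffle-free merge: the partition count can stay the same or shrink."""
    if n >= len(partitions):
        # Asking for more partitions than exist leaves the layout unchanged.
        return [list(part) for part in partitions]
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Whole partitions are appended to a target; rows are never split up.
        out[i % n].extend(part)
    return out


# Hypothetical DataFrame stand-in: 8 partitions holding 24 rows in total.
df_partitions = [[i * 3 + j for j in range(3)] for i in range(8)]

print(len(repartition(df_partitions, 12)))  # partition count after the modeled shuffle
print(len(coalesce(df_partitions, 12)))     # partition count after the modeled merge
print(len(coalesce(df_partitions, 4)))      # coalesce can still reduce the count
```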

Question 9

Which of the following code blocks returns the first 3 rows of DataFrame storesDF?

  • A. storesDF.top_n(3)
  • B. storesDF.n(3)
  • C. storesDF.take(3)
  • D. storesDF.head(3)
  • E. storesDF.collect(3)
Answer:

c

Question 10

Which of the following statements about the Spark driver is true?

  • A. Spark driver is horizontally scaled to increase overall processing throughput.
  • B. Spark driver is the most coarse level of the Spark execution hierarchy.
  • C. Spark driver is fault tolerant; if it fails, it will recover the entire Spark application.
  • D. Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode.
  • E. Spark driver is only compatible with its included cluster manager.
Answer:

d
