Questions for the PROFESSIONAL DATA ENGINEER were updated on: Jan 11, 2025
Your neural network model is taking days to train. You want to increase the training speed. What can you do?
D
Explanation:
Reference: https://towardsdatascience.com/how-to-increase-the-accuracy-of-a-neural-network-9f5d1c6f407d
Your company is using WILDCARD tables to query data across multiple tables with similar names. The SQL statement is
currently failing with the following error:
Which table name will make the SQL statement work correctly?
D
Explanation:
Reference: https://cloud.google.com/bigquery/docs/wildcard-tables
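For context, a minimal sketch of how a wildcard table query is usually written in BigQuery standard SQL: the wildcard table name is enclosed in backticks, and the portion matched by `*` is available through the `_TABLE_SUFFIX` pseudo-column. The original statement and error are not reproduced above, so the project, dataset, and table prefix below are hypothetical, and the helper only builds the query text.

```python
from datetime import date


def wildcard_query(project: str, dataset: str, prefix: str,
                   start: date, end: date) -> str:
    """Build a standard SQL query over tables named <prefix>YYYYMMDD."""
    # The wildcard table name must be wrapped in backticks; the part matched
    # by * is exposed through the _TABLE_SUFFIX pseudo-column.
    table = f"`{project}.{dataset}.{prefix}*`"
    return (
        f"SELECT * FROM {table} "
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )


# Hypothetical names, used only for illustration.
query = wildcard_query("my-project", "my_dataset", "events_",
                       date(2024, 1, 1), date(2024, 1, 31))
print(query)
```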
You work for a shipping company that has distribution centers where packages move on delivery lines to route them
properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in
transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real
time while the packages are in transit. Which solution should you choose?
A
You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a
separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard
functions to generate daily and monthly reports for all time ranges. Recently, you discovered that some queries that cover
long date ranges are exceeding the limit of 1,000 tables and failing. How can you resolve this issue?
A
Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud.
Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about
a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?
D
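One way to picture the deduplication problem, as a sketch under stated assumptions rather than the method behind any particular answer choice: a re-transmission carries the same payload with a different transmission timestamp, so a fingerprint computed over the payload alone can identify repeats. The record layout (`sent_at`, `payload`) and the in-memory set of seen hashes are assumptions made for illustration.

```python
import hashlib
import json


def payload_key(record: dict) -> str:
    """Hash the payload only; the transmission timestamp is ignored."""
    # sort_keys gives the same JSON text regardless of field order.
    canonical = json.dumps(record["payload"], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first transmission of each distinct payload."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = payload_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data: the second record re-transmits the first.
records = [
    {"sent_at": "2025-01-11T00:00:00Z", "payload": {"sku": "A1", "qty": 5}},
    {"sent_at": "2025-01-11T00:02:00Z", "payload": {"qty": 5, "sku": "A1"}},
    {"sent_at": "2025-01-11T06:00:00Z", "payload": {"sku": "A1", "qty": 7}},
]
unique = deduplicate(records)
```

A payload-only hash cannot tell a re-transmission apart from a later reading that happens to be identical, so a production design would likely also store metadata or a per-entry identifier alongside the hash.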
You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery.
The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to
BigQuery for analysis. Which job type and transforms should this pipeline use?
A
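Because the reference data fits in memory on a single worker, enrichment can be a per-element lookup against a table that every worker holds in full; in Apache Beam this kind of broadcast lookup is typically expressed as a side input. The sketch below is plain Python, not Beam code, and only illustrates the lookup pattern. The field names (`product_id`, `category`) are hypothetical.

```python
def enrich(events, reference):
    """Yield each event with fields looked up from the reference table."""
    for event in events:
        # Dictionary lookup per element; unknown keys get no extra fields.
        extra = reference.get(event["product_id"], {})
        yield {**event, **extra}


# Hypothetical reference table and incoming messages.
reference = {"p1": {"category": "toys"}, "p2": {"category": "books"}}
events = [{"product_id": "p1", "qty": 2}, {"product_id": "p9", "qty": 1}]
enriched = list(enrich(events, reference))
```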
You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work
in progress on your clusters. What should you do?
D
Explanation:
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/flex
You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting
messages from IoT devices globally. Because large parts of the globe have poor internet connectivity, messages sometimes
batch at the edge, come in all at once, and cause a spike in load on your Kafka cluster. This is becoming difficult to manage
and prohibitively expensive. What is the Google-recommended cloud native architecture for this scenario?
C
You want to rebuild your batch pipeline for structured data on Google Cloud. You are using PySpark to conduct data
transformations at scale, but your pipelines are taking over twelve hours to run. To expedite development and pipeline run
time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage. How
should you build the pipeline on Google Cloud while meeting speed and processing requirements?
D
You want to analyze hundreds of thousands of social media posts daily at the lowest cost and with the fewest steps.
You have the following requirements:
You will batch-load the posts once per day and run them through the Cloud Natural Language API.
You will extract topics and sentiment from the posts.
You must store the raw posts for archiving and reprocessing.
You will create dashboards to be shared with people both inside and outside your organization.
You need to store both the data extracted from the API to perform analysis as well as the raw social media posts for
historical archiving. What should you do?
D
You are developing an application on Google Cloud that will automatically generate subject labels for users' blog posts. You
are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your
team has experience with machine learning. What should you do?
A
Your company is selecting a system to centralize data ingestion and delivery. You are considering messaging and data
integration systems to address the requirements. The key requirements are:
The ability to seek to a particular offset in a topic, possibly back to the start of all data ever captured
Support for publish/subscribe semantics on hundreds of topics
Retain per-key ordering
Which system should you choose?
A
You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of
MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are
computationally expensive. Then you discover that a sensor calibration step has been omitted. How should you change your
ETL process to carry out sensor calibration systematically in the future?
A
You're training a model to predict housing prices based on an available dataset with real estate properties. Your plan is to
train a fully connected neural net, and you've discovered that the dataset contains latitude and longitude of the property.
Real estate professionals have told you that the location of the property is highly influential on price, so you'd like to engineer
a feature that incorporates this physical dependency.
What should you do?
B
Explanation:
Reference: https://cloud.google.com/bigquery/docs/gis-data
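A common way to encode this kind of physical dependency is to bucketize latitude and longitude and cross the two buckets into a single categorical feature, so that each grid cell can carry its own price effect. The sketch below shows that idea only; the function names and the 1-degree default bucket width are assumptions chosen for illustration, and the width would need tuning on real data.

```python
import math


def bucket(value: float, low: float, width: float) -> int:
    """Index of the fixed-width bucket that contains value."""
    return math.floor((value - low) / width)


def lat_lon_cross(lat: float, lon: float, width_deg: float = 1.0) -> str:
    """Cross bucketized latitude and longitude into one categorical key."""
    lat_bucket = bucket(lat, -90.0, width_deg)
    lon_bucket = bucket(lon, -180.0, width_deg)
    # One key per grid cell; the key would be one-hot or embedding encoded
    # before being fed to the network.
    return f"{lat_bucket}x{lon_bucket}"


cell_a = lat_lon_cross(37.7, -122.4)
cell_b = lat_lon_cross(37.71, -122.41)  # nearby point, same grid cell
cell_c = lat_lon_cross(40.7, -74.0)     # distant point, different cell
```

Used separately, latitude and longitude each describe only a band across the map; the cross identifies a specific area, which is what a location-driven price signal needs.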
You need to set access to BigQuery for different departments within your company. Your solution should comply with the
following requirements:
Each department should have access only to their data.
Each department will have one or more leads who need to be able to create and update tables and provide them to their team.
Each department has data analysts who need to be able to query but not modify data.
How should you set access to the data in BigQuery?
D