In this article I share my experience studying for and passing the Google Cloud Certified – Professional Data Engineer exam.
Intro and Exam Summary
In the feedback to the Cloud Architect article, you asked for more technical detail, so I have adapted the style of this article to include more focused guidance. To keep the content concise, I have used bullet points where possible, in a ‘top 5’ style for each topic.
The Data Engineer certification is one of the toughest exams, but also one of the most enjoyable learning experiences. To help bridge the gaps in my Big Data experience, I signed up with Qwiklabs and found the scenarios to be really helpful. Codelabs offer similar challenges that can be used in conjunction with your free tier account as a cost-free alternative.
Case studies didn’t feature as heavily as in the Cloud Architect exam. Machine Learning was the most heavily featured topic, so pay particular attention to ML in your preparation.
Other exam topics to be aware of include the Apache Hadoop ecosystem: make sure you’re familiar with Hive, Pig, Spark and MapReduce, and with how to migrate from HDFS to Google Cloud (Cloud Storage). Dataproc is the managed service for the Hadoop ecosystem, so map scenarios involving existing Hadoop workloads to Dataproc.
Don’t forget exam strategy – narrow down to the least wrong answer for questions you’re unsure of and mark items for review rather than spending too much time contemplating. Time wasn’t too much of an issue though and I had 10 minutes on the clock when I hit the submit button.
Key scenario-to-service mappings to remember:
- Existing workloads
- Map solutions to Pub/Sub > Dataproc (choose Dataproc where there is an existing Hadoop solution)
- Storage = Cloud Storage, Bigtable, BigQuery
- Greenfield / no existing infra (Google recommends Dataflow over Dataproc for greenfield data processing)
- Long-term data storage in Cloud Storage or BigQuery (depending on the question e.g. analytics)
- Grant access to data = IAM roles
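As a memory aid, the mappings above can be sketched as a simple lookup. The scenario labels here are my own shorthand, not official exam terminology:

```python
# A rough scenario-to-service cheat sheet distilled from the points above.
# The keys are my own shorthand labels, not exam wording.
SCENARIO_TO_SERVICE = {
    "existing Hadoop workloads": "Cloud Dataproc",
    "greenfield data processing": "Cloud Dataflow (Apache Beam)",
    "global messaging / ingest buffer": "Cloud Pub/Sub",
    "object storage / HDFS migration": "Cloud Storage",
    "analytics data warehouse": "BigQuery",
    "granting access to data": "IAM roles",
}

def recommend(scenario: str) -> str:
    """Return the service mapped to a scenario label, if known."""
    return SCENARIO_TO_SERVICE.get(scenario, "no mapping - re-read the question")

print(recommend("existing Hadoop workloads"))  # Cloud Dataproc
```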
Data Engineering Concepts
Think of data as a tangible object to be collected, stored, and processed, with a life cycle that runs from initial collection through to final visualisation. You need to be familiar with the steps, which GCP services should be used at each stage, and how they link together.
Recommended reading: https://cloud.google.com/solutions/data-lifecycle-cloud-platform
Exam expectations for database types
- Understand the descriptions of the different database types
- Know which database version matches which description
- Example: Need database with high throughput, ACID compliance not necessary, choose three possible options
What is streaming data?
- ‘Unbounded’ data
- Batch data is ‘Bounded’ data
- These terms are likely to be on the exam
- Always flowing, never completes, infinite
- Examples: Traffic or weather sensors
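One way to internalise the bounded vs unbounded distinction is with Python iterators: a batch source is a finite collection, while a streaming source behaves like a generator that never terminates. The sensor function below is a made-up stand-in, not a real GCP API:

```python
import itertools

# Bounded (batch) data: a finite dataset with a known end.
batch_readings = [21.5, 22.0, 21.8]  # e.g. yesterday's sensor export

def sensor_stream():
    """Unbounded (streaming) data: an infinite generator that never
    completes - a stand-in for a live traffic or weather sensor feed."""
    reading = 0
    while True:
        reading += 1
        yield 20.0 + (reading % 5) * 0.5

# A batch job can consume everything; a streaming job can only ever take
# a slice (a window) of the infinite source.
first_four = list(itertools.islice(sensor_stream(), 4))
print(len(batch_readings), first_four)
```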
Machine Learning
Machine learning is the process of combining inputs to produce useful predictions on never-before-seen data. Essentially, it is the process of a machine learning from one or more datasets to make predictions on future data. TensorFlow is an open source library for numerical computation that makes machine learning faster and easier.
Machine learning and TensorFlow are featured heavily on the exam and I recommend reading up on these topics. Make sure you understand the difference between supervised (regression and classification) and unsupervised (clustering) learning, hyperparameter tuning, feature engineering, underfitting and overfitting. I recommend the Linux Academy course and Qwiklabs for conceptual and hands-on experience respectively.
Recommended read: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76
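To make the supervised vs unsupervised distinction concrete, here is a tiny, dependency-free sketch of unsupervised learning: a one-dimensional k-means clustering pass. The data points and starting centroids are made up for illustration:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two obvious groups; no labels are provided - the algorithm discovers
# the structure on its own (that is what makes it unsupervised).
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_1d(data, centroids=[0.0, 5.0]))  # converges to [2.0, 11.0]
```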
IAM
IAM comes up in relation to a variety of services on the exam. Research the use of service accounts as well as the predefined roles for each service.
Resources that helped me prepare for the exam
Online self-paced training:
- Linux Academy – Matthew Ulasien’s course is great
- Qwiklabs – An absolute must for me. Without the hands-on experience from running through these labs, I wouldn’t have passed the exam
Product Specific Guidance
Cloud SQL
- Managed MySQL/PostgreSQL database service (not NoOps)
- Runs on top of Compute Engine
- Read replicas are restricted to the same region
- Disk size limited to 10TB
Cloud Spanner
- Fully managed, highly scalable/available, relational database
- Google describes Spanner as NewSQL (there are adaptations compared to other relational databases)
- Consider Spanner for workloads beyond what Cloud SQL can support, bearing those adaptations in mind
- Horizontal scale with strong replication consistency
- Research interleaved tables
Cloud Datastore
- Fully managed NoOps NoSQL database
- Highly scalable, with automatic scaling, sharding, and multi-region capability
- ACID transactions
- Single Datastore database per project
- Research exploding indexes
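‘Exploding indexes’ refers to composite indexes over multiple list (array) properties: the index needs one entry per combination of values, so entries grow multiplicatively per entity. A quick sketch of the combinatorics in pure Python (no Datastore API involved; the property values are invented):

```python
from itertools import product

# An entity with two list properties. A composite index on (tag, colour)
# must write one index entry per combination - the cross product.
tags = ["sale", "new", "featured"]
colours = ["red", "blue"]

index_entries = list(product(tags, colours))
print(len(index_entries))  # 3 tags x 2 colours = 6 entries for ONE entity

# Add a third list property and entries multiply again: 3 x 2 x 4 = 24.
sizes = ["s", "m", "l", "xl"]
print(len(list(product(tags, colours, sizes))))  # 24
```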
Cloud Bigtable
- High performance, massively scalable NoSQL database
- Recommended minimum data size of 1TB (for cost effectiveness)
- Not a NoOps solution (instance management is a factor)
- High throughput analytics and huge datasets
- Research row key and column structure and how to avoid hotspotting
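The classic hotspotting mistake is a row key that starts with a monotonically increasing value such as a timestamp, which funnels all writes to a single node. A common fix is field promotion: lead with a well-distributed field like a device ID. The key format below is illustrative, not a Bigtable requirement:

```python
# Anti-pattern: keys led by a monotonically increasing timestamp all sort
# to the end of the keyspace, so sequential writes hit a single node.
def bad_row_key(timestamp: int, sensor_id: str) -> str:
    return f"{timestamp}#{sensor_id}"

# Field promotion: lead with a well-distributed field (the sensor ID) so
# writes spread across the keyspace; the timestamp suffix still allows
# efficient range scans within a single sensor's data.
def good_row_key(sensor_id: str, timestamp: int) -> str:
    return f"{sensor_id}#{timestamp}"

writes = [(1500000000 + i, f"sensor-{i % 3}") for i in range(6)]
bad = sorted(bad_row_key(ts, s) for ts, s in writes)
good = sorted(good_row_key(s, ts) for ts, s in writes)

# With the bad key, sorted order equals arrival order: a write hotspot.
print(bad == [bad_row_key(ts, s) for ts, s in writes])  # True
# With the good key, rows group under per-sensor prefixes instead.
print(good[0], good[-1])
```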
BigQuery
- Fully managed (NoOps) data warehouse (think analytics when thinking BigQuery)
- Autoscaling to petabyte range datasets
- Use case: store and analyse
- Query with standard and legacy SQL
- Extremely fast read performance, poor write (update) performance
Remember the colours used in BigQuery’s query performance data for the exam (you may get a question that doesn’t spell out, for example, that purple means read).
BQ featured heavily on the exam, as did query optimisations such as partitioned tables, views and caching.
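Partition pruning is worth understanding concretely: filtering on the partition column means BigQuery scans (and bills for) only the matching partitions rather than the whole table. A sketch of a standard SQL query against a hypothetical ingestion-time partitioned table (the project/dataset/table names are made up):

```python
# Standard SQL against a hypothetical ingestion-time partitioned table.
# The WHERE clause on _PARTITIONTIME prunes partitions so only the
# matching days are scanned, not the full table.
table = "my_project.my_dataset.events"  # hypothetical table name

query = f"""
SELECT user_id, COUNT(*) AS actions
FROM `{table}`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2018-01-01')
                         AND TIMESTAMP('2018-01-07')
GROUP BY user_id
"""
print(query)
```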
Cloud Pub/Sub
- Global-scale messaging buffer/decoupler, comparable to Apache Kafka
- Guaranteed at-least-once delivery
- Pub/Sub does not guarantee messages will be delivered in order
- Push subscribers must be Webhook endpoints that accept POST over HTTPS – default is pull
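Because delivery is at-least-once and unordered, subscribers should be idempotent. Here is a toy in-memory simulation of de-duplicating redelivered messages by ID (this is not the Pub/Sub client library, just the concept):

```python
# Simulated deliveries: message 'm2' is redelivered and ordering is not
# preserved - both are allowed under Pub/Sub's at-least-once model.
deliveries = [
    {"id": "m2", "data": "second"},
    {"id": "m1", "data": "first"},
    {"id": "m2", "data": "second"},  # duplicate redelivery
    {"id": "m3", "data": "third"},
]

def process(messages):
    """Idempotent subscriber: track seen IDs so duplicates are no-ops."""
    seen, handled = set(), []
    for msg in messages:
        if msg["id"] in seen:
            continue  # already processed - just acknowledge and drop
        seen.add(msg["id"])
        handled.append(msg["data"])
    return handled

print(process(deliveries))  # ['second', 'first', 'third']
```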
Cloud Dataproc
- Managed service for the Hadoop ecosystem (customers migrating existing workloads)
- Managed, but not NoOps: you configure the cluster, and there is no auto-scaling
- HDFS migrate to Cloud Storage
- You can only change the number of workers/preemptible instances (you cannot change the instance type of an existing cluster)
- Use preemptible VMs for cost savings
Cloud Dataflow
- Google’s recommended solution (over Dataproc) for greenfield data processing workloads; built on Apache Beam
- Truly NoOps data processing for streaming and batch workloads
- Use with Cloud ML for machine learning (not Spark ML which would map to Dataproc)
- Know windows, watermarks, triggers, max workers, and PCollections
- When to use windows (global, fixed, sliding) tripped me up on exam day
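The window types can be demystified with plain timestamp arithmetic: a fixed window assigns each event to exactly one interval, while a sliding window can assign it to several overlapping intervals. This mirrors Beam’s semantics conceptually, not its API (sizes are in seconds):

```python
def fixed_window(ts: int, size: int) -> int:
    """Start of the single fixed window of length `size` that contains
    an event with timestamp `ts` (seconds)."""
    return (ts // size) * size

def sliding_windows(ts: int, size: int, period: int) -> list:
    """Starts of every sliding window (length `size`, a new one starting
    every `period` seconds) that contains an event at `ts`."""
    lo = (ts - size) // period + 1   # first window that still covers ts
    hi = ts // period                # last window starting at or before ts
    return [k * period for k in range(max(lo, 0), hi + 1)]

# A global window is simply one window over the entire (unbounded) input.
print(fixed_window(65, 60))             # 60 -> the [60, 120) window
print(sliding_windows(65, 60, 30))      # [30, 60] -> two overlapping windows
```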
Cloud ML Engine
- Fully managed TensorFlow platform
- Scales to multiple CPU/GPU/TPU workloads
- Automate the platform elements of machine learning
- Currently only runs TensorFlow
Cloud Dataprep
- Partnered with Trifacta for data cleaning/processing service
- Supported file types:
- Input – CSV, JSON (including nested), Plain text, Excel, LOG, TSV, and Avro
- Output – CSV, JSON, Avro, BigQuery table
- CSV/JSON can be compressed or uncompressed
Data Studio
- Easy-to-use data visualization and dashboards
- Part of G Suite, not Google Cloud
- Two types of cache – query and pre-fetch
- Expect exam questions around cached data causing issues when pulling data from BigQuery, etc.
- Free to use!
Cloud Datalab
- Interactive tool for exploring and visualizing data in Notebook format
- Built on Jupyter (formerly iPython)
- Visual analysis of data in BigQuery, ML Engine, Compute Engine, Cloud Storage, and Stackdriver
- Underpinned by Compute Engine (Datalab launches on an instance)
I wish you good luck with the exam. I hope this write-up helps with your preparations. As always, get in touch if you would like any more specific advice or to talk tech in general!