GCP Professional Data Engineer Exam Topics
Section 1: Designing data processing systems
1.1 Selecting the appropriate storage technologies. Considerations include:
● Mapping storage systems to business requirements
● Data modeling
● Trade-offs involving latency, throughput, transactions
● Distributed systems
● Schema design
1.2 Designing data pipelines. Considerations include:
● Data publishing and visualization (e.g., BigQuery)
● Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka); see the sketch after this list
● Online (interactive) vs. batch predictions
● Job automation and orchestration (e.g., Cloud Composer)
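To make the batch-versus-streaming bullet concrete, here is a minimal Apache Beam sketch. The bucket, topic, and project names are assumptions for illustration; the structural difference is the bounded versus unbounded source and the explicit windowing the unbounded case needs.

```python
# Minimal Apache Beam sketch contrasting a batch (bounded) and a
# streaming (unbounded) pipeline. All resource names are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows


def run_batch():
    # Batch: a bounded source (files in Cloud Storage), processed once.
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://example-bucket/events/*.json")
         | "Count" >> beam.combiners.Count.Globally()
         | "Write" >> beam.io.WriteToText("gs://example-bucket/output/counts"))


def run_streaming():
    # Streaming: an unbounded source (Pub/Sub) windowed into fixed intervals.
    opts = PipelineOptions()
    opts.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=opts) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
             topic="projects/example-project/topics/events")
         | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second windows
         | "Count" >> beam.CombineGlobally(
             beam.combiners.CountCombineFn()).without_defaults()
         | "Print" >> beam.Map(print))
```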
1.3 Designing a data processing solution. Considerations include:
● Choice of infrastructure
● System availability and fault tolerance
● Use of distributed systems
● Capacity planning
● Hybrid cloud and edge computing
● Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
● Event processing guarantees (e.g., at-least-once, in-order, and exactly-once processing); see the sketch below
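The event-processing bullet is easiest to see in code: given at-least-once delivery, deduplicating on a message ID and writing idempotently yields effectively-exactly-once results. A minimal sketch, assuming each event carries a unique id; a real system would keep the seen-ID set in a durable store rather than in memory.

```python
# Sketch: at-least-once delivery + deduplication + idempotent writes
# approximates exactly-once processing. Event shape is an assumption.

processed_ids = set()  # in practice: a durable store (e.g., Bigtable, Redis)
results = {}           # idempotent sink: keyed writes overwrite, never append


def handle(event: dict) -> None:
    if event["id"] in processed_ids:   # redelivered duplicate: safe to drop
        return
    # Idempotent write: re-running this line yields the same final state.
    results[event["key"]] = event["value"]
    processed_ids.add(event["id"])     # record only after the write succeeds


# At-least-once delivery may repeat messages; the outcome is unchanged.
for msg in [{"id": 1, "key": "a", "value": 10},
            {"id": 1, "key": "a", "value": 10},  # duplicate delivery
            {"id": 2, "key": "b", "value": 20}]:
    handle(msg)

assert results == {"a": 10, "b": 20}
```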
1.4 Migrating data warehousing and data processing. Considerations include:
● Awareness of current state and how to migrate a design to a future state
● Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
● Validating a migration (see the sketch below)
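A common first validation step is reconciling row counts (and, more thoroughly, per-column aggregates or checksums) between the source warehouse and the migrated target. A minimal sketch using the BigQuery client library; the table name and the source-side count are illustrative assumptions.

```python
# Sketch: validate a migration by comparing source and target row counts.
from google.cloud import bigquery

bq = bigquery.Client()


def bq_row_count(table: str) -> int:
    rows = bq.query(f"SELECT COUNT(*) AS n FROM `{table}`").result()
    return list(rows)[0].n


source_count = 1_234_567  # e.g., obtained from the legacy warehouse's client
target_count = bq_row_count("example-project.warehouse.orders")  # hypothetical

if source_count != target_count:
    raise RuntimeError(
        f"Row count mismatch: source={source_count}, target={target_count}")
```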
Section 2: Building and operationalizing data processing systems
2.1 Building and operationalizing storage systems. Considerations include:
● Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
● Storage costs and performance
● Life cycle management of data (see the sketch below)
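Life cycle management is typically configured on the bucket itself. A minimal sketch with the google-cloud-storage client; the bucket name and the 30-day/365-day thresholds are assumptions for illustration.

```python
# Sketch: data life cycle rules on a Cloud Storage bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-data-lake")  # hypothetical bucket

# Move objects to a colder class after 30 days, delete them after 365.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the updated lifecycle configuration
```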
2.2 Building and operationalizing pipelines. Considerations include:
● Data cleansing (see the sketch after this list)
● Batch and streaming
● Transformation
● Data acquisition and import
● Integrating with new data sources
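Cleansing and transformation usually reduce to a small, testable function applied per record, of the kind you would wrap in a Beam ParDo or a Dataflow step. A sketch under an assumed record shape; the field names are illustrative.

```python
# Sketch: a per-record cleansing step. Record shape is an assumption.
from datetime import datetime
from typing import Optional


def cleanse(record: dict) -> Optional[dict]:
    """Return a normalized record, or None if it is unusable."""
    try:
        return {
            "user_id": str(record["user_id"]).strip(),
            "amount": float(record["amount"]),           # coerce types
            "ts": datetime.fromisoformat(record["ts"]),  # validate timestamp
        }
    except (KeyError, ValueError, TypeError):
        return None  # route to a dead-letter output in a real pipeline


raw = [{"user_id": " 42 ", "amount": "9.99", "ts": "2024-01-01T00:00:00"},
       {"user_id": "43"}]  # malformed: missing fields
clean = [c for c in (cleanse(r) for r in raw) if c is not None]
```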
2.3 Building and operationalizing processing infrastructure. Considerations include:
● Provisioning resources
● Monitoring pipelines (see the sketch after this list)
● Adjusting pipelines
● Testing and quality control
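Monitoring can be automated by querying Cloud Monitoring time series. A sketch with the google-cloud-monitoring client; the project ID and metric filter are assumptions, so substitute whatever metrics your pipelines actually emit.

```python
# Sketch: read a Dataflow job metric from Cloud Monitoring (last hour).
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}})

series = client.list_time_series(
    request={
        "name": "projects/example-project",  # hypothetical project
        "filter": 'metric.type = "dataflow.googleapis.com/job/element_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    })
for ts in series:
    print(ts.resource.labels, len(ts.points), "points")
```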
Section 3: Operationalizing machine learning models
3.1 Leveraging pre-built ML models as a service. Considerations include:
● ML APIs (e.g., Vision API, Speech API); see the sketch after this list
● Customizing ML APIs (e.g., AutoML Vision, AutoML Natural Language)
● Conversational experiences (e.g., Dialogflow)
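Pre-built ML APIs require no training step; you call them directly. A minimal label-detection sketch against the Vision API; the image URI is an illustrative assumption.

```python
# Sketch: label detection with the pre-built Vision API.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image()
image.source.image_uri = "gs://example-bucket/photo.jpg"  # hypothetical image

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```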
3.2 Deploying an ML pipeline. Considerations include:
● Ingesting appropriate data
● Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML); see the sketch below
● Continuous evaluation
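BigQuery ML keeps training, retraining, and batch prediction inside the warehouse: CREATE OR REPLACE MODEL retrains in place on fresh data, and ML.PREDICT serves predictions. A sketch with assumed dataset, table, and column names.

```python
# Sketch: train/retrain and predict with BigQuery ML. Names are assumptions.
from google.cloud import bigquery

bq = bigquery.Client()

# CREATE OR REPLACE MODEL retrains the model in place on current data.
bq.query("""
    CREATE OR REPLACE MODEL `example_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg',
             input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, churned
    FROM `example_dataset.customers`
""").result()

# Batch prediction with ML.PREDICT.
for row in bq.query("""
    SELECT * FROM ML.PREDICT(
        MODEL `example_dataset.churn_model`,
        (SELECT tenure_months, monthly_spend FROM `example_dataset.customers`))
""").result():
    print(row)
```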
3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
● Distributed vs. single machine
● Use of edge compute
● Hardware accelerators (e.g., GPU, TPU)
3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
● Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics); see the metrics sketch below
● Impact of dependencies of machine learning models
● Common sources of error (e.g., assumptions about data)
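Evaluation metrics are worth being able to compute on a toy example. A small sketch using scikit-learn (an assumption; any metrics library works): with 3 true positives, 1 false positive, and 1 false negative, precision and recall both come out to 0.75.

```python
# Sketch: core classification metrics on a toy label set.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

print("accuracy: ", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print("precision:", precision_score(y_true, y_pred))  # 3/(3+1) = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 3/(3+1) = 0.75
```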
Section 4: Ensuring solution quality
4.1 Designing for security and compliance. Considerations include:
● Identity and access management (e.g., Cloud IAM)
● Data security (encryption, key management)
● Ensuring privacy (e.g., Data Loss Prevention API); see the sketch below
● Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children's Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
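The Data Loss Prevention API can scan content for sensitive info types before it lands in analytics systems. A minimal inspection sketch; the project ID and sample text are illustrative assumptions.

```python
# Sketch: inspect text for sensitive data with the Cloud DLP API.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

response = dlp.inspect_content(
    request={
        "parent": "projects/example-project",  # hypothetical project
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "include_quote": True,
        },
        "item": {"value": "Contact jane.doe@example.com or 555-0100."},
    })

for finding in response.result.findings:
    print(finding.info_type.name, finding.quote, finding.likelihood)
```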
4.2 Ensuring scalability and efficiency. Considerations include:
● Building and running test suites
● Pipeline monitoring (e.g., Cloud Monitoring)
● Assessing, troubleshooting, and improving data representations and data processing infrastructure
● Resizing and autoscaling resources (see the sketch below)
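For Dataflow specifically, autoscaling is a matter of pipeline options rather than code. A sketch of the relevant Beam flags; the project, region, and bucket values are assumptions.

```python
# Sketch: enabling Dataflow autoscaling through Beam pipeline options.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=example-project",            # hypothetical project
    "--region=us-central1",
    "--temp_location=gs://example-bucket/tmp",
    "--autoscaling_algorithm=THROUGHPUT_BASED",  # scale with throughput/backlog
    "--max_num_workers=10",                      # upper bound on workers
])
```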
4.3 Ensuring reliability and fidelity. Considerations include:
● Performing data preparation and quality control (e.g., Dataprep)
● Verification and monitoring
● Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
● Choosing between ACID, idempotent, eventually consistent requirements
4.4 Ensuring flexibility and portability. Considerations include:
● Mapping to current and future business requirements
● Designing for data and application portability (e.g., multicloud, data residency requirements)
● Data staging, cataloging, and discovery