Saturday, December 31, 2022

GCP data eng

 

How to prepare for the GCP Professional Data Engineer certification

Hi!

I recently passed the GCP Professional Data Engineer certification exam and some people asked me for tips for the exam and study materials, so I decided to write this post to explain the path I took.

This is certainly not the only possible way to prepare for this test, but it was the one that worked for me. So be open-minded as each person has different ways of studying and learning.

Expectations alignment

Unlike some other certifications, the Professional Data Engineer (PDE) is definitely not a simple exam that you can pass after just 8 hours of study. Google writes its questions in a way that only someone with hands-on experience and a real understanding of its services can answer them.

It is important to note that a certification is the validation of the knowledge you gain. The goal is not for you to memorize questions, but to actually understand the services that the cloud offers to be able to apply them in your day to day work. Keep this in mind as you prepare.

For this exam, Google recommends the following background: 3+ years of industry experience, including 1+ years designing and managing solutions using Google Cloud. At the time I took the test I had 3 years of experience in data engineering, but no background with data-oriented cloud services. I did already have some cloud knowledge, as I had used AWS for 6 months in my previous job, but in a DevOps context rather than data engineering. Either way, I think it's important to have basic knowledge of cloud architecture before attempting a specific exam like the GCP PDE. In my case, I needed 4 months of intense study to feel minimally confident to take the exam.

The exam itself

Before starting your preparation, it is worth exploring the official website. It contains all up-to-date information about the test and its content. It is worth mentioning that on the official website the cost of the exam is 200 dollars, but because I live in Brazil, I only paid 120 dollars. Google provides a section of the site, called Exam Guide, to explain what is on the test. So it's worth reading before starting your study and also taking a look when you feel you've studied a considerable amount of material, to see where you stand on the requirements.

If your company is a Google Cloud partner, it is very likely that it has the benefit of vouchers for its employees. These vouchers are usually available to anyone who completes a knowledge path made available by Google on the Qwiklabs platform.

The exam has a total of 50 questions and is 2 hours long. I consider this enough time to answer all the questions and even review the ones I was unsure about.

One last comment about the test: due to the COVID-19 pandemic, it is possible to take the exam online. However, this option is a bit of a hassle because Google requires you to have a well-prepared environment for the test. Personally, I prefer to take it in person at an accredited center, as it eliminates the risk of computer and internet problems at home and allows a greater level of concentration.

Preparation

Courses and training

There are many online and face-to-face courses that prepare you for the GCP Professional Data Engineer exam. I will list here the ones I used and whose quality I can attest to.

A Cloud Guru

This was the main course I used to prepare for the exam. This is probably the main study platform for Cloud certifications. This course pretty much covers everything you need to pass the exam and is very detailed. Although more expensive, this platform also includes labs to practice what is taught in class.

Udemy Course by Dan Sullivan

This was another course I took and I really enjoyed it. The instructor links the services to one another in a very didactic way. It is also worth mentioning that Udemy frequently puts courses on sale. This one in particular I managed to buy for 25 reais. The course structure is constantly evolving along with the test content.

Qwiklabs

This is the platform where Google makes its own training available. Companies that are Google Cloud partners typically have access to this platform. I took most of the courses on the data engineering track, but I consider them to be of average quality. The best thing about this platform is the large number of labs to practice with. A Cloud Guru also has labs, but Qwiklabs has a lot more.

Practice experience

As I mentioned at the outset, practical experience is very important when studying for the certification. While Google recommends a year of Google Cloud experience before taking the exam, you can gain practice in other ways. The first and most recommended is to create an account in the so-called Free Tier. This account will allow you to use Google Cloud Platform resources to practice what you learn in class. All the courses mentioned teach how to create this account. But if you want to check it out, you can take a look at this video tutorial.

The other way to practice is through platforms that offer labs. As I mentioned in the last section, the two platforms I've used that offer this feature are A Cloud Guru and Qwiklabs, with Qwiklabs having a larger number of labs to practice with.

To be clear, I prefer the approach of creating a Free Tier account, as it allows for a much more complete learning experience than working in a ready-to-use environment.

Extra study resources

In addition to the courses, other resources were essential for passing the exam.

Professional Data Engineer Study Guide Book by Dan Sullivan

This excellent book written by Dan Sullivan (Yes, the same author as the second course listed) contains almost all the information needed to pass the exam. It has well-organized chapters on GCP services and provides many key points to remember to solve the exam. At the end of each section, there are tests on the presented content. To tell you the truth, after reading this book, I was able to improve my knowledge by about 40-50%, and my results in the mock test improved significantly.

The Cloud Girl and Visualizing Google Cloud

The YouTube channel The Cloud Girl and the book Visualizing Google Cloud are produced by Priyanka Vergadia, a Google Cloud Developer Advocate. She manages to explain GCP's products and services through beautifully crafted illustrations. Both the book and the channel are great teaching resources and helped me a lot in understanding important concepts for the exam.

Google documentation

I consider reading the documentation of each GCP service and product very important not only for the test but also for playing the role of GCP Data Engineer in the day to day work. I always recommend consulting the documentation when any point of doubt about Google Cloud services arises, that is, it is a resource that you should use from the initial moment of preparation for the exam until the end.

Data Engineer Practice Exam

An extremely important step in your preparation is taking practice exams. This is the classic case of the more, the better. See which areas of study you are getting wrong most often and read the corresponding documentation again. Prepare yourself for many business scenario questions asking which technology to use, and for questions with multiple correct alternatives. In my exam, I found several questions that asked me to select two alternatives. Below I have listed the platforms I used throughout my preparation. An important detail is that on some of the platforms it is possible to pause the simulations, so you don't have to do all 50 questions at once. Even so, I recommend that you do at least three practice exams without interruptions.

I also watched the videos on the AwesomeGCP YouTube channel by Sathish VJ. It is an excellent source of study because in addition to tips and content itself, it also has several videos with commented questions.

Final tips

These are the services that came up most often on my exam, that is, those with two or more questions each:

  • ML concepts + AI at GCP (Vertex AI, Vision, NLP etc APIs).
  • BigQuery (some questions covering details about the tool like backup and SQL structures like window functions).
  • Bigtable (the main detail is the design of tables and row keys to avoid hot spots).
  • Cloud SQL.
  • Dataflow.
  • Dataproc (some questions addressing Dataproc + HDFS).
  • Pub/Sub (especially used for scenarios where there needs to be decoupling between systems).
  • Cloud Storage (main topic was storage classes).
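
To make the Bigtable point above more concrete, here is a small sketch of my own (not taken from any exam material). Bigtable keeps rows sorted by row key, so a key that starts with a timestamp sends every new write to the same end of the table. Moving an identifier to the front of the key spreads the writes out. The device names, timestamps and function names below are all made up for illustration.

```python
def timestamp_first_key(device_id: str, epoch_seconds: int) -> str:
    """Anti-pattern: every new row sorts after all existing rows."""
    return f"{epoch_seconds}#{device_id}"


def device_first_key(device_id: str, epoch_seconds: int) -> str:
    """Put the device ID first so writes are spread across the key space."""
    return f"{device_id}#{epoch_seconds}"


# Hypothetical readings arriving in time order from three devices.
readings = [
    ("sensor-a", 1672444800),
    ("sensor-b", 1672444801),
    ("sensor-c", 1672444802),
    ("sensor-a", 1672444803),
]

# Rows are stored sorted by key, so sorting shows where each write would land.
print(sorted(timestamp_first_key(d, t) for d, t in readings))
print(sorted(device_first_key(d, t) for d, t in readings))
```

With the first scheme the newest reading is always the last key; with the second, each device's readings sit together and new writes land in different parts of the table.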

Other services appeared with specific questions, with a maximum of two questions each:

  • Cloud Spanner
  • Composer
  • Compute Engine
  • Data Loss Prevention
  • Dataprep
  • Firestore

It is also very important to be familiar with cloud concepts such as IAM, which pertains to permissioning resources in the Cloud.

Conclusion

In this post I presented the steps I followed to prepare for the GCP Professional Data Engineer exam. As explained at the beginning, there is no “silver bullet” to succeed in the test. Some things work better with one person than another, and I'm sure there are other (and better) ways to prepare.

Honestly, even after studying very intensively for 4 months, after the third question on the test my feeling was that I only knew a little bit about everything and that it wouldn't be enough to pass, since several questions asked about small details of the tools. At several moments in the test I was sure I wouldn't pass, and in those moments anxiety and despair took over. I wasted at least 5 minutes of exam time staring at nothing and regretting that I wasn't doing well with the answers. But I tried to stay calm and concentrate. At the end of the exam I received the result of "Pass". I was extremely happy and relieved to know that all the effort and sacrifices had paid off. After 3 days Google finally sent me the result confirmation and the GCP Professional Data Engineer certificate.

If you have any questions about the preparation, feel free to message me. I hope I was able to convey useful information to you, and I wish you good luck in the exam!



Here, you can find a list of links to help you achieve these certifications.

OCI:

How to study for 1z0-1105-22 - Oracle Cloud Data Management 2022 Foundations Associate


How to study for 1z0-1085-22 - Oracle Cloud Infrastructure 2022 Certified Foundations Associate

http://alexzaballa.blogspot.com/2022/06/how-to-study-for-1z0-1085-22-oracle.html

How to study for 1z0-1067-22 - Oracle Cloud Infrastructure 2022 Cloud Operations Professional

http://alexzaballa.blogspot.com/2022/07/how-to-study-for-1z0-1067-22-oracle.html

How to study for 1Z0-1104-22 - Oracle Cloud Infrastructure 2022 Security Professional

http://alexzaballa.blogspot.com/2022/07/how-to-study-for-1z0-1104-22-oracle.html

How to study for 1Z0-1084-22 - Oracle Cloud Infrastructure 2022 Certified Developer Professional

http://alexzaballa.blogspot.com/2022/07/how-to-study-for-1z0-1084-22-oracle.html

How to study for 1Z0-1072-22 - Oracle Cloud Infrastructure 2022 Architect Associate

http://alexzaballa.blogspot.com/2022/07/how-to-study-for-1z0-1072-22-oracle.html

How to study for 1z0-1094-22 - Oracle Cloud Database Migration and Integration 2022 Professional

http://alexzaballa.blogspot.com/2022/08/how-to-study-for-1z0-1094-22-oracle.html

How to study for 1z0-1093-22 - Oracle Cloud Database Services 2022 Professional

http://alexzaballa.blogspot.com/2022/09/how-to-study-for-1z0-1093-22-oracle.html

How to study for 1z0-931-22 - Oracle Autonomous Database Cloud 2022 Professional

http://alexzaballa.blogspot.com/2022/09/how-to-study-for-1z0-931-22-oracle.html

How to study for 1Z0-997-22 - Oracle Cloud Infrastructure 2022 Architect Professional

http://alexzaballa.blogspot.com/2022/09/how-to-study-for-1z0-997-22-oracle.html


Oracle Database:

How to study for 1z0-116 - Oracle Database Security Administration

http://alexzaballa.blogspot.com/2022/05/how-to-study-for-1z0-116-oracle.html

How to study for 1z0-084 - Oracle Database 19c: Performance Management and Tuning

http://alexzaballa.blogspot.com/2022/06/how-to-study-for-1z0-084-oracle.html

How to study for 1z0-149 - Oracle Database PL/SQL Developer Certified Professional

http://alexzaballa.blogspot.com/2022/09/how-to-study-for-1z0-149-oracle.html

How to study for 1Z0-076 Oracle Certified Professional, Oracle Database 19c: Data Guard Administrator

http://alexzaballa.blogspot.com/2022/12/how-to-study-for-1z0-076-oracle.html

How to study for 1z0-078 - Oracle Certified Professional, Oracle Database 19c: RAC, ASM, and Grid Infrastructure Administrator

http://alexzaballa.blogspot.com/2022/12/how-to-study-for-1z0-078-oracle.html


Others:

How to study for 1Z0-902 - Oracle Exadata Database Machine X9M Implementation Essentials

http://alexzaballa.blogspot.com/2022/06/how-to-study-for-1z0-902-oracle-exadata.html

How to study for 1z0-106 - Oracle Linux 8 Advanced System Administration Exam

http://alexzaballa.blogspot.com/2022/12/how-to-study-for-1z0-106-oracle-linux-8.html


AWS (Courses from https://learn.acloud.guru):

AWS Certified Database – Specialty

AWS Certified Solutions Architect – Associate

AWS Certified Cloud Practitioner



GCP (Renew - Courses from https://go.qwiklabs.com):

Google Cloud Certified Professional Cloud Architect

Google Cloud Certified Professional Data Engineer

Google Cloud Certified Professional Database Engineer

Friday, December 30, 2022

Setting Up Apache Airflow with Docker-Compose in 5 Minutes

 

Create a development environment and start building DAGs

Although I am pretty late to the party (Airflow became an Apache Top-Level Project in 2019), I still had trouble finding an easy-to-understand, up-to-date, and lightweight way to install Airflow.

Today, we’re about to change all that.

In the following sections, we will create a lightweight, standalone, and easily deployed Apache Airflow development environment in just a few minutes.

Docker-Compose will be our close companion, allowing us to create a smooth development workflow with quick iteration cycles. Simply spin up a few docker containers and we can start to create our own workflows.

Note: The following setup will not be suitable for any production purposes and is intended to be used in a development environment only.

Why Airflow?

Apache Airflow is a batch-oriented framework that allows us to easily build scheduled data pipelines in Python. Think of “workflow as code” capable of executing any operation we can implement in Python.

Airflow is not a data processing tool itself. It is orchestration software. We can imagine Airflow as a spider in a web, sitting in the middle, pulling all the strings and coordinating the workload of our data pipelines.

A data pipeline typically consists of several tasks or actions that need to be executed in a specific order. Apache Airflow models such a pipeline as a DAG (directed acyclic graph): a graph in which the tasks are nodes, the dependencies between them are directed edges, and there are no loops or cycles.

This approach allows us to run independent tasks in parallel, saving time and money. Moreover, we can split a data pipeline into several smaller tasks. If a task fails, we can rerun only the failed task and its downstream tasks, instead of executing the complete workflow all over again.
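
To make the DAG idea a bit more tangible, here is a minimal sketch in plain Python using the standard library's graphlib module (Python 3.9+). It is not how Airflow is implemented. It only illustrates the two benefits mentioned above: grouping independent tasks into batches that could run in parallel, and working out which tasks need a rerun after a failure. The pipeline and its task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "fetch_sales": set(),
    "fetch_weather": set(),
    "clean_sales": {"fetch_sales"},
    "clean_weather": {"fetch_weather"},
    "join_datasets": {"clean_sales", "clean_weather"},
    "train_model": {"join_datasets"},
}


def parallel_batches(graph):
    """Group tasks into batches whose members can run at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def tasks_to_rerun(graph, failed):
    """Return the failed task plus every task downstream of it."""
    rerun = {failed}
    changed = True
    while changed:
        changed = False
        for task, upstream in graph.items():
            if task not in rerun and upstream & rerun:
                rerun.add(task)
                changed = True
    return rerun


print(parallel_batches(pipeline))
print(sorted(tasks_to_rerun(pipeline, "clean_weather")))
```

In this toy pipeline, a failure in clean_weather only requires rerunning clean_weather, join_datasets and train_model. The sales branch is left untouched.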

Airflow is composed of three main components:

Step-By-Step Installation

Now that we have briefly introduced Apache Airflow, it’s time to get started.

Step 0: Prerequisites

Since we will use docker-compose to get Airflow up and running, we have to install Docker first. Simply head over to the official Docker site and download the appropriate installation file for your OS.

Step 1: Create a new folder

We start nice and slow by simply creating a new folder for Airflow.

Just navigate via your preferred terminal to a directory, create a new folder, and change into it by running:

Step 2: Create a docker-compose file

Next, we need to get our hands on a docker-compose file that specifies the required services or docker containers.

Via the terminal, we can run the following command inside the newly created Airflow folder

or simply create a new file named docker-compose.yml and copy the below content.

The above docker-compose file simply specifies the required services we need to get Airflow up and running: most importantly the scheduler, the webserver, the metadata database (PostgreSQL), and the airflow-init job that initializes the database.

At the top of the file, we make use of some local variables that are commonly used in every docker container or service.

Step 3: Environment variables

We successfully created a docker-compose file with the mandatory services inside. However, to complete the installation process and configure Airflow properly, we need to provide some environment variables.

Still inside your Airflow folder, create a .env file with the following content:

The above variables set the database credentials, the airflow user, and some further configurations.

Most importantly, they set the kind of executor Airflow will use. In our case, we make use of the LocalExecutor.

Note: More information on the different kinds of executors can be found here.

Step 4: Run docker-compose

And this is already it!

Just head over to the terminal and spin up all the necessary containers by running

After a short period of time, we can check the results and the Airflow Web UI by visiting http://localhost:8080. Once we sign in with our credentials (airflow: airflow) we gain access to the user interface.
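
If the page does not load right away, the containers may still be starting. As a rough sketch, assuming an Airflow 2.x webserver (which exposes a JSON health report at /health), we could check readiness from Python instead of refreshing the browser. The function names and the list of required components are my own choices for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: the Airflow 2.x webserver serves a health report at this URL.
HEALTH_URL = "http://localhost:8080/health"

# Components this setup relies on; the report may list others as well.
REQUIRED_COMPONENTS = ("metadatabase", "scheduler")


def fetch_health(url=HEALTH_URL, timeout=5):
    """Download and decode the JSON health report from the webserver."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def unhealthy_components(report, required=REQUIRED_COMPONENTS):
    """Return the required components whose status is not 'healthy'."""
    return [
        name
        for name in required
        if report.get(name, {}).get("status") != "healthy"
    ]
```

Calling unhealthy_components(fetch_health()) once the containers are running should return an empty list when both the metadata database and the scheduler are ready.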

A Quick Test

With a working Airflow environment, we can now create a simple DAG for testing purposes.

First of all, make sure to run pip install apache-airflow to install the required Python modules.

Now, inside your Airflow folder, navigate to dags and create a new file called sample_dag.py.

We define a new DAG and some pretty simple tasks.

The EmptyOperator serves no real purpose other than to create a mockup task inside the Web UI. By utilizing the BashOperator, we create a somewhat creative output of “HelloWorld!”. This allows us to visually confirm a properly running Airflow setup.

Save the file and head over to the Web UI. We can now start the DAG by manually triggering it.

Note: It may take a while before your DAG appears in the UI. We can speed things up by running the following command in our terminal: docker exec -it --user airflow airflow-scheduler bash -c "airflow dags list"

Running the DAG shouldn’t take any longer than a couple of seconds.

Once finished, we can navigate to XComs and inspect the output.

And this is it!

We successfully installed Airflow with docker-compose and gave it a quick test ride.

Note: We can stop the running containers by simply executing docker compose down.

Conclusion

Airflow is a batch-oriented framework that allows us to create complex data pipelines in Python.

In this article, we created a simple and easy-to-use environment to quickly iterate and develop new workflows in Apache Airflow. By leveraging docker-compose we can get straight to work and code new workflows.

However, such an environment should only be used for development purposes and is not suitable for any production environment that requires a more sophisticated and distributed setup of Apache Airflow.

You can find the full code here on my GitHub.

Must Watch YouTube Videos for Databricks Platform Administrators

  While written word is clearly the medium of choice for this platform, sometimes a picture or a video can be worth 1,000 words. Below are  ...