Saturday, March 30, 2024

The future of the data engineer — Part I

 

Introduction

In this post we explore the shifting focus of Data Engineering from Data Infrastructure and Data Integration to Accessible Analytics, at Meta and in the industry at large. However, the tools that we use are still largely focused on the former two, which presents a major challenge for the function. We will showcase the challenges this mismatch created during the evolution of Data Engineering at Meta and formulate the properties of a next-generation data platform. In subsequent posts we will review how Data Engineering at Meta is evolving to address these challenges.

But first, let’s define what we mean by Accessible Analytics. We are not using the term only as it is commonly understood, i.e., the act of making data accessible for extracting insights through tools. We are also drawing attention to the fact that for data to be deemed truly accessible, it has to be self-describing to the extent that drawing meaningful insights from it doesn’t require specialized skills.

The evolving role of the Data Engineer

Let’s take a look at what Functions modern analytics teams serve, and also what business Outcomes they drive via their Work Product.

As the analytics organization matures, the focus of data engineering shifts from the bottom layers of the pyramid to the top. Often the first data engineers in the company spend most of their time setting up the data infrastructure. As the company grows, requirements for analytics also grow, which is typically addressed by investing more engineering rigor into creating data assets. At first, this primarily involves writing ETL pipelines to build foundational datasets (i.e., data integration). Eventually, however, the focus shifts towards creating abstractions for consistent analysis and enabling analytic applications on top of them.

What is an analytic application? An analytic application is a software program or platform that processes, manages, and presents data in a meaningful way to support decision-making, analysis, or automation. To give an example, consider a “what if analysis” spreadsheet that calculates revenue and growth rate for a startup based on sales and marketing projections, something many of you will be familiar with. It is an analytic application built for decision making and used to generate insights. Creating a spreadsheet each time someone wants to look at the metrics is inconvenient, so people typically prefer to create a dashboard instead, which is a more advanced kind of analytic application.

In a mature analytics team, data engineers spend an increasing amount of time on the following activities:

  1. Making sure that the data foundation consistently represents the underlying business, so that consumers (be it data analysts querying a table or business users looking at a dashboard) can draw accurate conclusions without spending a lot of effort on trying to understand the data.
  2. Building high-quality analytic applications, for example:
  • Interactive data exploration (dashboards) — Enabling end users to access data without having knowledge of SQL or the underlying nuances of each dataset.
  • Common analysis workflow automation — Automating routine tasks such as alerts to stakeholders upon significant metric movement or A/B testing results.
  • Reverse ETL — Sharing relevant metrics with operational systems (e.g. CRM, ERP) derived from the data warehouse.

The main goal of these activities is to make the process of drawing insights from data more accessible to both specialist and non-specialist users. Achieving consistently modeled, high-quality data at scale necessitates a shared representation of business concepts, in other words shared schema and business logic. Consequently, the data development workflow has to adopt software engineering best practices, such as reusable abstractions to share the concepts and automated testing to ensure quality. Consistent abstraction-based data modeling makes data more accessible not only to humans but also to systems, unlocking a variety of automation opportunities.

The diagram below shows how the focus areas of data engineering shift as the analytics organization evolves.

Based upon this illustration, we can observe three distinct focus areas for the role:

  • Data Infrastructure: One example of a problem being solved here might be setting up a Spark cluster for users to issue HQL queries against data on S3.
  • Data Integration: An example task would be creating a dataset via SQL query, joining tens of other datasets, and then scheduling the query to run daily using the orchestration framework.
  • Data Accessibility: An example could be enabling end-users to analyze significant metrics movements in a self-serve manner.

In the upcoming sections we’ll explore the journey towards Accessible Analytics at Meta and how we are evolving our tooling and development workflow to solve these challenges.

The evolution of Data Engineering at Meta

The evolution of the data engineer at Meta from Data Integration to Accessible Analytics has primarily been influenced by ever-increasing scale, analytics complexity and privacy. The underlying data infrastructure and the data development lifecycle correspondingly have continued to evolve throughout.

We have been working toward evolving the role of Data Engineering from merely building data pipelines that integrate disunited and heterogeneous data processing stacks (across logging, batch, real time, ML, etc.) to writing richer, smarter and modular analytic applications on unified data processing stacks (i.e., unified metadata, language, lineage, etc.). Disunited refers to the siloed and disconnected nature of the various development workflows (e.g., authoring, consuming, operating) that arise when data processing is spread across databases, compute engines, metadata and storage repositories.

Data Integration: Data at Scale

In the early years at Meta, Data Engineering was small and centralized. There was a big appetite for data and insights into the performance of a rapidly growing product ecosystem. The objective was to integrate the data at scale and build curated datasets, primarily for exploration, reporting and experimentation. The purpose of data integration was to unify metrics & dimensions produced in siloed ETL systems and scale common analytical patterns, as the picture below depicts. Specifically, we built frameworks and tools to automate data processing by abstracting away compute and storage. In a nutshell, we wrote code — what some might call Functional ETL — to generate SQL queries along with a DAG of tasks to be executed given a data processing specification.

Of course, we have since moved away from templatizing formatted SQL strings and toward dataframe-type constructs to ensure correctness of the generated SQL. For the same reason, we are also moving away from configuration-driven business logic that is generated at run time (i.e., dynamic) and toward logic that is generated at authoring time (i.e., static).
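To make the pattern concrete, here is a minimal Python sketch of the Functional ETL idea: a declarative specification is turned into a SQL query and an ordered list of tasks. It is not Meta's framework. RollupSpec, render_sql and build_dag are hypothetical names, and the table and column names are made up for illustration. The query is assembled from structured parts rather than from one formatted string, which is the spirit of the move described above.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class RollupSpec:
    """Hypothetical declarative spec for one aggregated dataset."""
    name: str
    source: str
    dimensions: tuple
    metrics: dict            # output column -> aggregate expression
    depends_on: tuple = ()   # names of specs that must run first


def render_sql(spec):
    """Assemble the query from structured parts and validate the spec first."""
    if not spec.dimensions or not spec.metrics:
        raise ValueError(f"{spec.name}: need at least one dimension and one metric")
    select_items = list(spec.dimensions) + [
        f"{expr} AS {alias}" for alias, expr in spec.metrics.items()
    ]
    return (
        f"SELECT {', '.join(select_items)} "
        f"FROM {spec.source} "
        f"GROUP BY {', '.join(spec.dimensions)}"
    )


def build_dag(specs):
    """Return task names in an order that respects depends_on."""
    graph = {spec.name: set(spec.depends_on) for spec in specs}
    return list(TopologicalSorter(graph).static_order())


daily = RollupSpec(
    name="daily_actions",
    source="some_action_event",
    dimensions=("ds", "country"),
    metrics={"actions": "COUNT(id)"},
)
weekly = RollupSpec(
    name="weekly_actions",
    source="daily_actions",
    dimensions=("country",),
    metrics={"actions": "SUM(actions)"},
    depends_on=("daily_actions",),
)

print(render_sql(daily))
print(build_dag([weekly, daily]))
```

Because the specification is plain data, it can be validated when it is authored instead of failing when the generated SQL finally runs.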

Regardless, in the early years, these patterns were essential in enabling us to scale the function of Data Engineering!

Product Measurement: Multiple Data Processing Stacks

Over the years, Meta has grown to become a family of apps, exponentially increasing the number of users and interactions. Needless to say, this resulted in enormous growth in both data and data engineers. The objective of our function was to influence product strategy through consistent product measurement and insights. This led to the analytics function being embedded within product teams and, consequently, data engineering becoming decentralized. Diversity in product ecosystems led to a corresponding diversity in analytics use cases and data ecosystems. The demand for insights led to an explosion of data processing. There were a multitude of analytics use cases (such as understanding performance across product workflows, predictive analytics and so on) across the entire data processing landscape. Our data infrastructure scaled accordingly, as the picture below depicts.

Development Lifecycle: End-to-End Workflows

In time, logging evolved to become more consistent with reusable event schema specifications, type-safety, and validation. Pipeline authoring continued to become more modular in nature, driven by shared business logic. Meanwhile, the data engineering development lifecycle across the various data processing stacks was becoming increasingly interconnected. Consequently, we made tremendous progress in advancing the end-to-end authoring and operational life cycle, as seen in the picture below. This was critical in enabling Data Engineers at Meta to continue to build high quality datasets. These datasets continue to power critical dashboards, core experiments and strategic analyses in the company.

Analytics Complexity & Privacy: Higher Level Abstractions

Meanwhile, the product ecosystem was becoming more connected and unified, bringing with it an ever-increasing set of challenging analytics scenarios and, at the same time, heightened privacy requirements.

On the one hand, the cross-product experience needed to be consistently measured (i.e., data-modeled and definitionally aligned) across siloed data sources and data processing stacks. We built a semantic layer (i.e., metric abstractions) to unify the business representation of data at the end of all the data processing to enable a consistent consumption experience. These days, unsurprisingly, we are seeing a lot of excitement around similar versions of semantic layers. However, we needed semantic metadata to be reusable and governed across the entire data processing dependency graph, not just at the end of all the processing a metric undergoes from its origin as a business process event. As seen below, a hypothetical metric that counts actions (some_action.id) cannot be consistently consumed without a shared data model closer to the event source (some_action_event) in addition to the lineage of definitional transformations the metric undergoes across every stage of data processing.

On the other hand, policy-driven use of personal data needed to be enforced in every data processing instance across the same gigantic dependency graph. As should be evident from the picture below, we needed to apply a global lens to institute privacy-forward policies that respect user consent for data use. To ensure compliance, we needed unified privacy and security metadata across the entire data lineage.

Unsurprisingly, both scenarios (analytics and privacy) require understanding the meaning and purpose of data across a vast data lineage.

In simplest terms:

  • What does the data in the columns mean?
  • Where does the data in the columns flow?
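The second question can be pictured as a walk over column-level lineage. The Python sketch below is only an illustration under assumed names: the LINEAGE mapping and every dataset in it other than some_action_event are hypothetical, and a real lineage graph would be derived from metadata rather than written by hand.

```python
from collections import deque

# Hypothetical column-level lineage: each column maps to the columns it feeds.
LINEAGE = {
    "some_action_event.id": ["staging_actions.action_id"],
    "staging_actions.action_id": ["daily_actions.action_count"],
    "daily_actions.action_count": ["dashboard.actions"],
}


def downstream(column, lineage):
    """Breadth-first walk returning every column reachable from `column`."""
    seen = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


print(downstream("some_action_event.id", LINEAGE))
```

Answering "where does this column flow?" then becomes a lookup, provided the lineage is recorded at every stage of processing.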

Challenges with Disunited Workflows

Nowhere in the data engineering workflow are we explicitly encoding meaning and purpose. Meaning and purpose are logical concepts that require semantically richer constructs (i.e., logical datasets with richer, more descriptive column types). Meanwhile, all production and consumption (for over a decade!) has implicitly happened on physical constructs (i.e., relationship-unaware physical tables with primitive column types). In the long run, stitching together a consistent and purpose-aware data flow through a semantically poor set of disunited stacks becomes a monumental task. This makes consistent product measurement under privacy constraints challenging across a vastly siloed data lineage.

Inconsistent Representation

Specifically, consider this hypothetical data table (some_user_table) in the picture below. A table like this would most likely have been produced in a batch pipeline, several levels downstream of some event in the online world (i.e., way upstream outside of the data warehouse):

One can observe that the schema, which is defined through primitive types, doesn’t convey the meaning of these sensitive columns. In fact, the meaning was never propagated throughout the lineage of transformations, starting from its origin. Dynamic (i.e., MAPs) and unstructured (i.e., JSON) schemas complicate this even further. In this hypothetical example, one cannot answer questions about this data such as:

  • Where did that user_id column come from?
  • What is in the country column? Is that country name or code? Are the values consistent with ISO standards?
  • What are the privacy policies on the data? For which purposes can age be used, and for which can it not?

However trustworthy the data might be, any analytics on such data must inevitably be limited in the long run.
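As a rough illustration of what a semantically richer schema could look like, the Python sketch below attaches a meaning, an origin and a set of permitted purposes to each column of the hypothetical some_user_table. All class names, descriptions and purposes are assumptions made for this example and do not describe Meta's actual metadata model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SemanticType:
    """Hypothetical logical type: a physical type plus meaning and policy."""
    name: str
    physical_type: str
    description: str
    allowed_purposes: frozenset = frozenset()


@dataclass(frozen=True)
class Column:
    name: str
    semantic_type: SemanticType
    origin: Optional[str] = None  # upstream "dataset.column", if known


# Every name, description and purpose below is an illustrative assumption.
USER_ID = SemanticType(
    "UserId", "BIGINT", "Identifier of a user account",
    frozenset({"analytics", "integrity"}),
)
ISO_ALPHA2_COUNTRY_CODE = SemanticType(
    "ISOAlpha2CountryCode", "VARCHAR", "ISO 3166-1 alpha-2 country code",
    frozenset({"analytics"}),
)
AGE_IN_YEARS = SemanticType(
    "AgeInYears", "INTEGER", "Age in whole years",
    frozenset({"analytics"}),
)

SOME_USER_TABLE = (
    Column("user_id", USER_ID, origin="some_action_event.user_id"),
    Column("country", ISO_ALPHA2_COUNTRY_CODE),
    Column("age", AGE_IN_YEARS),
)


def can_use(column, purpose):
    """True if the column's semantic type permits the given purpose."""
    return purpose in column.semantic_type.allowed_purposes


def describe(columns):
    """One self-describing line per column."""
    return [
        f"{c.name}: {c.semantic_type.name} ({c.semantic_type.description})"
        for c in columns
    ]


for line in describe(SOME_USER_TABLE):
    print(line)
```

With types like these, the three questions above become lookups on the schema instead of investigations across the lineage.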

Inconsistent Business Logic

To make matters worse, a data engineer has to be adept at accounting for differences between SQL dialects, not just when defining schema (e.g., STRING in Spark vs. VARCHAR in Presto) but also when writing common business logic. Consider a simple function that brackets age. We should be able to define this canonical logic in one place and reuse it across all data processing stacks (batch or real time) without duplication. Instead, it continues to be repeated to suit the context of each data application.
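Here is a minimal sketch of what "define once, reuse everywhere" could look like for age bracketing. The bracket boundaries and labels are invented for illustration, and age_bracket_sql is a hypothetical helper. The point is only that the Python function and the generated SQL CASE expression are derived from the same single definition.

```python
# Hypothetical canonical definition: (exclusive upper bound, label), ascending.
AGE_BRACKETS = (
    (18, "under_18"),
    (25, "18_24"),
    (35, "25_34"),
    (55, "35_54"),
)
OLDEST_LABEL = "55_plus"


def age_bracket(age):
    """Bracket an age in Python, e.g. for a real-time processing path."""
    if age is None or age < 0:
        return None
    for upper, label in AGE_BRACKETS:
        if age < upper:
            return label
    return OLDEST_LABEL


def age_bracket_sql(column):
    """Render the same logic as a SQL CASE expression for batch queries."""
    whens = " ".join(
        f"WHEN {column} < {upper} THEN '{label}'" for upper, label in AGE_BRACKETS
    )
    return (
        f"CASE WHEN {column} IS NULL OR {column} < 0 THEN NULL "
        f"{whens} ELSE '{OLDEST_LABEL}' END"
    )


print(age_bracket(30))
print(age_bracket_sql("age"))
```

Changing a boundary in AGE_BRACKETS changes both code paths at once, which is the property the duplicated SQL versions lack.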

In reality, the business logic is usually much more complex, with different flavors of SQL constructs (e.g., UDFs, Broadcast JOINs, etc.) involving attributes from different entities (e.g., user, content). Needless to say, the more complex the business logic, the higher the risk of inconsistency in it.

Beyond the limitations of physical data assets, we could continue examining many more challenges across the entire development lifecycle — discovering, authoring, governing, change managing, maintaining, and so on. Instead, we will end this note and leave you to ponder the following questions as you reflect upon your data development lifecycle.

  • Do you know which of your dataset attributes are dimensions and which are metrics?
  • Are you able to define, discover and reuse schema (e.g., ISOAlpha2CountryCode) & business logic (e.g., age_bracket)?
  • Do you know which datasets contain canonical values and attributes for a given dimension?
  • Are you able to automatically enrich (e.g., age_bracket(..)) your dimensional data in a standardized way?
  • Is the source of truth for your schema governed by a database (e.g., Hive Metastore), by a configuration system, by hard-to-govern naming conventions or wikis, or by code?
  • Can you distinguish your tables based on their function, type and granularity (i.e., staging/private/public, anonymized, fact/dimension/rollup, user/aggregate)?
  • How do you ensure correctness of all the SQL that is dynamically generated from templatized string constructs?
  • Are you able to validate, change-manage and investigate a new version of data, schema, pipeline or business logic across a sub-graph of the data lineage?

We will dive into how we are answering these questions in the rest of the blog series.

Tuesday, March 12, 2024

AWS Architecture in Motion: Creating Animated GIFs

Several days ago, I saw a great article on LinkedIn by Ankit Jodhani about one of his projects, and his architecture diagram impressed me. One question kept coming up in the comments on his post: “How did you do it?”. Everyone was referring to the fact that his diagram was not static; it had animations.

In this post, I’ll try to provide a step-by-step guide on how to make your own animated AWS architecture diagrams using PowerPoint and GIFs.

GIFs make the diagrams more interactive and help visualize how infrastructure behaves. I hope this will encourage you to learn and develop more skills.

Some key benefits of documenting architecture with diagrams are:

  • Visualize infrastructure in a clear and organized way.
  • Identify how components communicate with each other.
  • Detect failure points or bottlenecks in the design.
  • Explain architecture to new team members.
  • Make changes and updates to infrastructure.

Let’s get started!

Prerequisites

  • Basic knowledge of AWS core services (EC2, VPC, S3, etc.)
  • PowerPoint installed on your computer
  • AWS Icons plugin for PowerPoint (link to download)
  • An Imgur account to host GIFs (or an account on any other image hosting site)
  • Basic PowerPoint skills:
    • Inserting shapes and icons
    • Aligning and organizing objects
  • Uploading files
  • Copying image URLs
  • Using the <img> tag
  • Text editor or blog platform to create the post

Having these skills and accounts set up will ensure you can easily follow along with the tutorial. The goal is for someone with a beginner-level understanding of AWS and PowerPoint to be able to create animated architecture diagrams by the end.

Bringing AWS Diagrams to Life with PowerPoint Animations

        Click on the AWS share link to download the file.

        Open the downloaded file in your PowerPoint:

        When you open the file in PowerPoint, you will have access to all the instructions and all the icons with which you can make your diagrams:

        The file also includes some examples. I will use the first one (Git to S3 Webhooks) for this tutorial:

        The next step is to copy and paste the necessary icons for your diagram according to the project you will be working on:

        Then, on the “Insert” tab, add a shape. In this example we will use a circle, which you will find under “Shapes”, in the “Flowchart” group.

        Once selected, you assign a color to it in “Shape Fill” and in “Shape Effects” you add an effect, in my case “Illuminated”:

        Then copy and paste your shape to each position where you want it to move.

        Open the Animation panel and add an exit animation to each shape, so that the shapes disappear once the movement has ended.

        For consistency, I also choose “Fade” for the exit animation. Make sure to pick the red “Fade”, since red marks exit effects while green marks entrance effects. After that, arrange your animations in the Animation panel according to their sequence.

        Select “Add Animation” for each point. Then, select the “Custom Path” option under Motion Paths. Check the Animation panel to make sure these paths are in the right order.

        You need to select each icon in turn in order to add its animation path.

        When multiple animations need to start at the same time, select them all and select “Start with Previous”. The animations should also be arranged in the following order: first green (entry animation), then blue (motion animation), and finally red (exit animation).

        With all the animations set up, play them back to make sure everything looks as expected. If necessary, adjust the order of the animations. Once you’re happy with your animation, export the slide as a GIF.

        You can take the saved GIF file and upload it to an online service like Imgur.

        Then copy the GIF URL and use it in an HTML image tag:

        <img src="gif-url" />
        

        If you have access to edit the page code directly, you can embed the image by adding an <img> tag pointing to the image URL, for example:

        <img
          src="https://i.imgur.com/1kTuEtc.png"
          alt="AWS Diagram"
          title="AWS Architecture in Motion"
          width="500"
          height="300"
        />
        

        If the page has a content editor like WordPress, you can upload the image to the media library and insert it into the page when writing the post. The editor will automatically add the <img> code.

        You can also upload the image to a hosting service like Imgur, copy the file URL, and then add it in the <img> tag or through the page’s content editor.

        Another option is to upload the file to a storage service like Google Drive or Dropbox and generate a public URL for the file to include in the <img> tag’s src attribute.

        In summary, you need the public URL of the image to create the <img> tag pointing to that file. Then you include that code in the page, either by editing the HTML directly or through a content editor.
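        If you generate your posts with a script, the same tag can be built programmatically. The Python sketch below is only an illustration: img_tag is a hypothetical helper, not part of any blog platform. It escapes the attribute values so that quotes in the alt text cannot break the HTML.

```python
from html import escape


def img_tag(url, alt, width=None, height=None):
    """Build an <img> tag for a publicly hosted image or GIF."""
    attrs = {"src": url, "alt": alt}
    if width is not None:
        attrs["width"] = str(width)
    if height is not None:
        attrs["height"] = str(height)
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<img {rendered} />"


print(img_tag("https://i.imgur.com/1kTuEtc.png", "AWS Diagram", 500, 300))
```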

        The steps above will guide you through the process of creating your own animated AWS architecture diagram. Once again, I am very grateful to Ankit Jodhani for inspiring me and so many others on our journey to the cloud.


Must Watch YouTube Videos for Databricks Platform Administrators

  While written word is clearly the medium of choice for this platform, sometimes a picture or a video can be worth 1,000 words. Below are  ...