Saturday, February 14, 2026

Architecting with Databricks Lakeflow — Scaling Ingestion and Transformation with Lakeflow

 The “Modern Data Stack” is undergoing a massive consolidation. The era of “fragmented best-of-breed” tools is being replaced by unified Data Intelligence Platforms. At the heart of this shift for Databricks is Lakeflow — a native, vertically integrated suite for the full data engineering lifecycle.

In this deep dive, we’ll explore the technical architecture of Lakeflow’s three core pillars: Connect, Pipelines, and Jobs.

Architecture diagram showing Databricks Lakeflow Connect, Pipelines, and Jobs integration


1. Lakeflow Connect: Intelligent Ingestion

Lakeflow Connect moves beyond traditional connectors. It integrates native Change Data Capture (CDC) and incremental processing directly into the Lakehouse.

  • Native CDC Integration: Unlike traditional ETL that relies on expensive full-table scans, Lakeflow Connect leverages database logs (e.g., Binlog for MySQL, Transaction Logs for SQL Server) to identify row-level changes.
  • Point-and-Click Scaling: It utilizes Serverless VFS (Virtual File System) to scale ingestion workers independently of your transformation clusters.
  • Automatic Schema Evolution: As source systems change (e.g., adding a column in Salesforce), Lakeflow Connect detects the drift and updates the destination Delta tables without breaking downstream dependencies.

The goal here is “Zero-Code Ingestion.” You define the source and the frequency; Lakeflow manages the state, checkpoints, and backfills.

No code — All Configurable UI in Databricks

2. Lakeflow Pipelines: The Declarative Engine

Lakeflow Pipelines is the evolution of Delta Live Tables (DLT). It shifts the focus from imperative coding (telling the computer “how” to move data) to declarative modeling (telling the computer “what” the final state should look like).

  • Streaming Tables: These are managed tables that support incremental data processing. They process only new data that has arrived since the last refresh, drastically reducing compute costs for high-volume logs or IoT data.
  • Materialized Views (MVs): Unlike standard SQL views, MVs in Lakeflow pre-compute results. They use Incremental Refresh logic — the engine identifies exactly which input rows changed and updates only the affected parts of the view.
  • The Flow Graph: When you deploy a pipeline, Lakeflow constructs a Directed Acyclic Graph (DAG) of your transformations. It automatically handles the “Gold-from-Silver-from-Bronze” dependencies, ensuring data integrity across the Medallion Architecture.
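
To make the declarative model concrete, here is a minimal Python sketch in the DLT-style syntax that Lakeflow Pipelines inherits from Delta Live Tables. The table names, source path, and expectation are illustrative assumptions, not a production pipeline.

```python
import dlt
from pyspark.sql.functions import col

# Streaming table: processes only files that have arrived since the last update.
# Note: `spark` is provided by the pipeline runtime in DLT/Lakeflow Pipelines.
@dlt.table(comment="Bronze: raw events landed in cloud storage")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader incremental file source
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/raw/events")          # illustrative source path
    )

# Silver is declared *in terms of* Bronze; the engine derives the dependency DAG.
@dlt.table(comment="Silver: validated events")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")  # data-quality expectation
def silver_events():
    return dlt.read_stream("bronze_events").where(col("event_type").isNotNull())
```

Nothing in this code schedules compute or orders execution; the engine infers the Bronze-to-Silver dependency and decides how each table is refreshed.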

3. Lakeflow Jobs: Unified Orchestration

Orchestration is often the “weakest link” in a data stack. Lakeflow Jobs solves the visibility gap by making orchestration “data-aware.”

Advanced Features:

  • Control Flow Logic: It supports complex branching (If/Else), For-Each loops, and nested tasks. This allows for sophisticated recovery patterns — for example, triggering a data quality cleanup job only if a specific expectation fails.
  • Unified Observability: Since Jobs is integrated with Unity Catalog, you get a “single pane of glass.” You can see the health of an ingestion task, the status of a transformation pipeline, and the refresh of a PowerBI/Tableau dashboard in one lineage view.
  • Triggering Mechanisms: Beyond simple CRON schedules, Lakeflow Jobs supports File Arrival Triggers and Continuous Execution for near real-time requirements.
Databricks Job View — With all controls
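
For teams that prefer code over the UI shown above, the same kind of job can be defined programmatically. The sketch below uses the Databricks SDK for Python; the job name, pipeline ID, and notebook path are placeholder assumptions, and field names may differ slightly across SDK versions.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads credentials from the environment or a config profile

# Hypothetical two-task job: refresh a Lakeflow Pipeline, then publish a report.
job = w.jobs.create(
    name="daily-medallion-refresh",
    tasks=[
        jobs.Task(
            task_key="refresh_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="<your-pipeline-id>"),
        ),
        jobs.Task(
            task_key="publish_report",
            depends_on=[jobs.TaskDependency(task_key="refresh_pipeline")],
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/analytics/publish_report"),
        ),
    ],
)
print(f"Created job {job.job_id}")
```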

The Integrated Advantage: Unity Catalog

The “connective tissue” of Lakeflow is Unity Catalog.

  • Security: Every stage of the Lakeflow process inherits the security policies defined in Unity Catalog (UC).
  • Lineage: UC captures lineage at the column level. If a transformation in a Lakeflow Pipeline changes a calculation, you can immediately see which downstream Jobs and Dashboards are affected.
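
As a rough illustration of that impact analysis, the lineage UC captures can be queried directly. This assumes the lineage system tables are enabled in your workspace; the table and column names below follow the system.access schema and should be verified against your environment.

```python
# Hypothetical impact analysis: which downstream columns are fed by main.silver.orders.order_total?
# Assumes Unity Catalog lineage system tables are enabled; names are illustrative.
downstream = spark.sql("""
    SELECT target_table_full_name, target_column_name, event_time
    FROM system.access.column_lineage
    WHERE source_table_full_name = 'main.silver.orders'
      AND source_column_name     = 'order_total'
    ORDER BY event_time DESC
""")
downstream.show(truncate=False)
```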

The Strategic Choice: Integrated vs. Fragmented

When deciding on your stack, the trade-off usually comes down to control versus convenience. Below is how Lakeflow stacks up against the traditional “Modern Data Stack.”

Comparison: Lakeflow vs. the traditional “Modern Data Stack”

Which one should you choose?

  • Choose Lakeflow if: Your organization is already “All-In” on Databricks. You value speed-to-market, unified governance, and want to eliminate the “maintenance tax” of managing multiple vendor contracts and integrations. It is ideal for teams that need to mix real-time streaming with heavy-duty batch processing.
  • Choose the DIY Stack if: You have a heterogeneous environment (e.g., your data is spread across Snowflake, BigQuery, and on-prem). If your team is purely SQL-focused and you require the absolute maximum number of niche SaaS connectors that only a specialized tool like Fivetran provides, the DIY route remains a strong contender.

Conclusion:

Lakeflow eliminates “tool sprawl.” By keeping Ingest, Transform, and Orchestration within a single, serverless environment governed by Unity Catalog, you reduce:

  1. Network Latency: Data doesn’t hop between different vendors’ clouds.
  2. Configuration Overhead: One IAM role, one set of credentials, one security model.
  3. The “Maintenance Tax”: No more updating API keys across four different SaaS tools.

The result is a more resilient, scalable, and observable data platform.

 

Saturday, July 19, 2025

How We Fix Misspelled Multilingual Queries with LLMs

 The rise of Large Language Models (LLMs) has generated significant excitement in the tech community. However, the real value lies in turning this innovation into practical, high-impact applications — especially in fast-paced, user-facing environments like quick commerce.

At Z, where a majority of our business is driven by in-app search, every query matters. A misspelled or poorly understood query can mean the difference between a successful purchase and a dropped session. In a landscape where users type in a mix of language & script variants — often phonetically or on the go — getting query understanding right is not just a nice-to-have; it’s mission-critical.

This blog dives into one such deceptively simple but deeply impactful problem: detecting and correcting misspelled multilingual queries, particularly those written in English. We will walk through how we built a robust spell correction system using LLMs — tailored to the nuances of our user base and designed to meaningfully improve search relevance, user experience, and ultimately, conversion.

With query correction, relevance skyrocketed — from 1 in 4 results being eggs to 4 out of 4.

The Core Challenge

The task of detecting and correcting multilingual misspellings is complicated by the fact that many of our users input queries in vernacular languages using English lettering (e.g., typing “kothimbir” for “coriander”, or “paal”, which means “milk” in Tamil). Most existing language models are built around native scripts, making it difficult to accurately understand and correct such queries.

Our Approach: Building an MVP with Llama3

We began by prototyping a multilingual spell corrector using API-based LLM playgrounds. After testing multiple models, we selected Meta’s Llama 3 8B model, which provided a good balance of performance and accuracy for our use case.

The next step was to scale this solution. Relying on external APIs posed two major challenges: cost and reliability. To mitigate these, we hosted the Llama 3 model on our in-house Databricks model serving layer, which allowed us to integrate it with Spark jobs, ensuring high throughput and parallel access.
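
For illustration, here is a minimal sketch of sending a single query to such a hosted endpoint. The endpoint name is an assumption, and the request/response shape assumes an OpenAI-compatible chat serving endpoint; in practice we batch these calls from Spark jobs.

```python
import os
import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]
ENDPOINT_NAME = "llama3-8b-spell-corrector"        # illustrative endpoint name

def correct_query(query: str) -> str:
    """Send one search query to the hosted model and return its correction."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/serving-endpoints/{ENDPOINT_NAME}/invocations",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "messages": [
                {"role": "system",
                 "content": "You are a spell corrector that understands multilingual inputs."},
                {"role": "user", "content": query},
            ],
            "max_tokens": 64,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(correct_query("kothimbir"))  # expected (illustrative): "coriander leaves"
```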

Fine-Tuning for Accuracy

After deploying the model, we turned to instruct fine-tuning — a lightweight yet powerful alternative to full model fine-tuning. Instead of modifying model weights, we focused on prompt engineering and system instruction design to teach the model specific behaviors like spelling correction, vernacular normalization, and disambiguation of mixed-language inputs.

We experimented with:

  • Role-specific system messages: We explicitly instructed the LLM to behave as a multilingual spell corrector, helping it narrow its focus and reduce hallucinations. For instance, we defined its role as “You are a spell corrector that understands multilingual inputs.”
  • Few-shot examples: We included a diverse set of input-output pairs across English and vernacular queries (e.g., “skool bag” → “school bag”, “kothimbir” → “coriander leaves”) to demonstrate expected behavior, enabling the model to generalize better to unseen inputs.
  • Stepwise prompting: Instead of asking the model to directly return the final corrected query, we broke down the process into intermediate steps like: (1) detect incorrect words, (2) correct them, and (3) translate if needed. This decomposition improved the model’s accuracy and transparency.
  • Structured JSON outputs: To make downstream integration seamless, we enforced a consistent output schema — e.g., { “original”: “kele chips”, “corrected”: “banana chips” }. This reduced parsing errors and ensured clean handoff to our ranking and retrieval systems.
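
Putting those ideas together, the assembled prompt looked roughly like the sketch below; the wording and few-shot pairs are illustrative rather than our exact production prompt.

```python
import json

SYSTEM_PROMPT = (
    "You are a spell corrector that understands multilingual inputs. "
    "Work step by step: (1) detect incorrect words, (2) correct them, "
    "(3) translate to English if needed. "
    'Respond only with JSON of the form {"original": ..., "corrected": ...}.'
)

# Few-shot pairs spanning English and vernacular queries.
FEW_SHOT = [
    {"role": "user", "content": "skool bag"},
    {"role": "assistant", "content": json.dumps({"original": "skool bag", "corrected": "school bag"})},
    {"role": "user", "content": "kothimbir"},
    {"role": "assistant", "content": json.dumps({"original": "kothimbir", "corrected": "coriander leaves"})},
]

def build_messages(user_query: str) -> list[dict]:
    """Assemble the chat messages for one spell-correction request."""
    return [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
            {"role": "user", "content": user_query}]

# The model is expected to return: {"original": "kele chips", "corrected": "banana chips"}
print(build_messages("kele chips"))
```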

This iterative instruct-tuning significantly improved task accuracy without increasing inference cost. It also allowed us to evolve the model quickly across use cases without retraining.

However, one issue that surfaced was the treatment of brand names. The model would often flag brand names as misspelled words. Initially, we included a list of brands in the prompt itself, but this increased the context length and inference time, reducing efficiency.

Enter RAG: Retrieval Augmented Generation

To overcome the limitations of static prompts and reduce dependency on long context windows, we implemented Retrieval Augmented Generation (RAG) — a paradigm that enhances LLM performance by grounding it with relevant, real-time contextual information from an external knowledge base.

Our Architecture

We designed a two-stage architecture:

  1. Semantic Retrieval Layer
    Each incoming user query is converted into an embedding using a multilingual embedding model. This embedding is matched against product embeddings stored in a Vector DB using Approximate Nearest Neighbor (ANN) search.
  2. Contextual Prompt Construction
    The retrieved product results — including titles, brand names, and spelling variants (autosuggestion corpora) — are filtered, deduplicated, and compiled into a dynamically constructed prompt, which is passed to the LLM for correction.
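
A minimal sketch of these two stages is shown below, using a multilingual sentence-transformer and FAISS as stand-ins for our embedding model and Vector DB; the model name and catalog entries are illustrative assumptions.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Stage 1: semantic retrieval layer (FAISS as a stand-in vector DB).
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative model

catalog = ["Banana Chips | Haldiram", "Coriander Leaves", "Mint Leaves (Pudina)",
           "Kellogg's Corn Flakes"]                                      # titles + brands
catalog_vecs = embedder.encode(catalog, normalize_embeddings=True)

index = faiss.IndexFlatIP(catalog_vecs.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(catalog_vecs, dtype="float32"))

def retrieve(query: str, k: int = 5) -> list[str]:
    """ANN lookup of the top-k catalog entries for a (possibly misspelled) query."""
    k = min(k, index.ntotal)
    q = embedder.encode([query], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q, dtype="float32"), k)
    return [catalog[i] for i in idx[0]]

# Stage 2: contextual prompt construction from the retrieved entries.
def build_prompt(query: str) -> str:
    context = "\n".join(dict.fromkeys(retrieve(query)))  # dedupe, preserve order
    return (f"Relevant catalog entries:\n{context}\n\n"
            f"Correct the query using this context: {query}")

print(build_prompt("balekayi cheeps"))
```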

Why This Matters

Each component of this setup was built to address real-world challenges specific to a multilingual, quick commerce setting:

1. Robustness to Noisy Inputs
The retrieval engine operates in semantic space, so even severely misspelled or phonetically written queries return meaningful results.

  • Example:
    Query: “balekayi cheeps”
    → Retrieved: { “title”: “Banana Chips”, “brand”: “Haldiram” }
    → Corrected: “banana chips”
  • Example:
    Query: “kottimbeer pudina”
    → Retrieved: [“kothimbir (coriander)”, “pudina (mint)”]
    → Corrected: “coriander and mint leaves”

This helps us bypass brittle rule-based or edit-distance based approaches, which often fail in multilingual spelling variants.

2. Brand Awareness
Earlier, we tried stuffing the prompt with a hardcoded list of brands to prevent the model from “correcting” them. This bloated the prompt and degraded latency. With RAG, we inject only query-relevant brand terms.

  • Example:
    Query: “kellogs cornflex”
    → Retrieved: { “brand”: “Kellogg’s”, “product”: “Corn Flakes” }
    → Model output: “Kellogg’s Corn Flakes”
  • Example:
    Query: “valentaz”
    → Retrieved: { “brand”: “Valentas”, “category”: “cardiac medication” }
    → Model confidently corrects: “Valentas”

3. Prompt Efficiency & Latency Reduction
By including only the top-k retrieved entries (e.g., top 5 product variants), we keep the token count low while injecting high-precision signals. This reduced prompt size by 30–40% compared to static dictionary-based prompts and lowered average inference latency by ~18s for a batch of 1,000 queries in our benchmarks.

  • Before RAG:
    Prompt token count ~4,500+ tokens (with entire catalog dictionary & brand list)
  • After RAG:
    Prompt token count ~1,200–1,400 tokens (only contextually relevant)

4. Dynamic Learning Capability
RAG enables rapid adaptation as the catalog evolves. If a new product or brand enters the catalog, it’s automatically included in the recall without needing to retrain or reconfigure the prompt.

  • Example:
    When we introduced “Sundrop Superlite Advanced Oil”, users began typing “sundrap oil” or “sundorp advanced”. These were dynamically grounded using vector retrieval without any manual intervention.

Outcome

This RAG-enhanced setup transforms our LLM into a domain-aware, multilingual spell corrector that:

  • Understands noisy, phonetically typed inputs
  • Differentiates brands from regular words
  • Adapts to evolving product catalogs
  • Delivers corrections with low latency and high precision

This system is now a core enabler of our vernacular query understanding pipeline, and a blueprint we’re extending to adjacent use cases like voice-to-text query correction and personalized product explanations.

Leveraging User Reformulations as Implicit Feedback

In parallel to the LLM-driven correction pipeline, we also incorporate user behavior signals to identify likely spelling corrections. Specifically, we monitor query reformulations within a short time window — typically within a few seconds of the initial search.

If a user quickly reformulates a query (e.g., from “banan chips” to “banana chips”) and the reformulated query shows a higher conversion rate, we treat it as a strong implicit signal that the original query was misspelled. These reformulation pairs help us:

  • Auto-learn new misspelling variants,
  • Improve prompt coverage and examples,
  • Enrich training datasets for future supervised fine-tuning,
  • And reinforce the spell corrector’s decision-making in production.

This behavior-driven feedback loop operates in the background and complements our LLM + RAG architecture with real-world user corrections — grounding the model in observed search behavior.
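
As a rough sketch, reformulation pairs of this kind can be mined from search logs with PySpark; the table name, column names, ten-second window, and count threshold below are assumptions for illustration.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
logs = spark.table("search_logs")  # assumed columns: user_id, query, ts, converted

# Pair each query with the next query the same user issued.
w = Window.partitionBy("user_id").orderBy("ts")
pairs = (
    logs.withColumn("next_query", F.lead("query").over(w))
        .withColumn("next_ts", F.lead("ts").over(w))
        .withColumn("next_converted", F.lead("converted").over(w))
        # Keep only quick reformulations (within ~10 seconds) that changed the query text.
        .where((F.col("next_ts").cast("long") - F.col("ts").cast("long") <= 10)
               & (F.col("query") != F.col("next_query")))
)

# Pairs where the reformulated query converts better are candidate misspelling corrections.
candidates = (
    pairs.groupBy("query", "next_query")
         .agg(F.avg("converted").alias("orig_cvr"),
              F.avg("next_converted").alias("reform_cvr"),
              F.count("*").alias("n"))
         .where((F.col("reform_cvr") > F.col("orig_cvr")) & (F.col("n") >= 20))
)
candidates.show(truncate=False)
```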

We have been building several other iterations to continuously improve the model, which we plan to cover in a future sequel to this blog.

Results: Speed, Accuracy, and Scalability

By fine-tuning our prompts and using RAG, we achieved a solution that was both fast and accurate. Hosting the model on Databricks ensured scalability, while instruct fine-tuning and RAG minimized costs and maintained high performance.

Impact

The implementation of this multilingual spell corrector had a significant positive impact on user experience and business metrics. By addressing incorrect, multilingual, and mixed grammar queries, we observed a 7.5% increase in conversion rates for the impacted queries. This improvement highlights the critical role that accurate search query understanding plays in driving user engagement and overall performance in a quick-commerce setting. The project not only enhanced the search experience but also contributed to the bottom line by helping users find the right products more efficiently, leading to increased sales and customer satisfaction.

Key Takeaways for Building with LLMs

  1. API Access vs. Self-Hosting: While API-based LLM services are great for prototyping, hosting models internally can provide better scalability and cost control in production.
  2. Instruct Fine-Tuning: Fine-tuning via instruct prompts, rather than the entire model, can save resources and improve task specificity.
  3. RAG for Efficiency: Retrieval Augmented Generation helps reduce context size and improve efficiency by providing only the necessary information to the model.
  4. Multi-Task Learning: LLMs can handle multiple tasks with careful prompt engineering, breaking down complex tasks into smaller, more manageable steps.

Credits
