Saturday, July 19, 2025

How We Fix Misspelled Multilingual Queries with LLMs

The rise of Large Language Models (LLMs) has generated significant excitement in the tech community. However, the real value lies in turning this innovation into practical, high-impact applications, especially in fast-paced, user-facing environments like quick commerce.

At Z, where a majority of our business is driven by in-app search, every query matters. A misspelled or poorly understood query can mean the difference between a successful purchase and a dropped session. In a landscape where users type in a mix of language & script variants — often phonetically or on the go — getting query understanding right is not just a nice-to-have; it’s mission-critical.

This blog dives into one such deceptively simple but deeply impactful problem: detecting and correcting misspelled multilingual queries, particularly those written in English. We will walk through how we built a robust spell correction system using LLMs — tailored to the nuances of our user base and designed to meaningfully improve search relevance, user experience, and ultimately, conversion.

For one example query, spell correction lifted relevance from 1 in 4 results being eggs to 4 out of 4.

The Core Challenge

The task of detecting and correcting multilingual misspellings is complicated by the fact that many of our users type queries in vernacular languages using English lettering (e.g., typing “kothimbir” for “coriander”, or “paal”, which means “milk” in Tamil). Most existing language models are built around native scripts, making it difficult to accurately understand and correct such queries.

Our Approach: Building an MVP with Llama3

We began by prototyping a multilingual spell corrector using API-based LLM playgrounds. After testing multiple models, we selected Meta’s Llama3-8B model, which provided a good balance of performance and accuracy for our use case.

The next step was to scale this solution. Relying on external APIs posed two major challenges: cost and reliability. To mitigate these, we hosted the Llama3 model in-house on Databricks’ model serving layer, which allowed us to integrate it with Spark jobs and ensured high throughput and parallel access.
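
As an illustration, a hosted endpoint like this can be queried from a Spark job through a pandas UDF. The following is a minimal sketch only: the endpoint name, URL, token handling, and response schema are assumptions for illustration, not our production setup.

# Minimal sketch: calling a hosted Llama3 serving endpoint from a Spark job.
# The endpoint URL, token handling, and response schema below are illustrative assumptions.
import os
import requests
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

ENDPOINT_URL = "https://<workspace-host>/serving-endpoints/llama3-spell-corrector/invocations"  # hypothetical
TOKEN = os.environ["DATABRICKS_TOKEN"]

def correct_one(query: str) -> str:
    # Send a single query to the serving endpoint and return its correction.
    resp = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"inputs": [query]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["predictions"][0]

@pandas_udf(StringType())
def correct_queries(queries: pd.Series) -> pd.Series:
    # Each Spark partition issues its own HTTP calls, giving parallel access to the endpoint.
    return queries.apply(correct_one)

spark = SparkSession.builder.getOrCreate()
queries_df = spark.createDataFrame([("kothimbir",), ("skool bag",)], ["raw_query"])
queries_df.withColumn("corrected_query", correct_queries("raw_query")).show(truncate=False)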

Fine-Tuning for Accuracy

After deploying the model, we turned to instruct fine-tuning — a lightweight yet powerful alternative to full model fine-tuning. Instead of modifying model weights, we focused on prompt engineering and system instruction design to teach the model specific behaviors like spelling correction, vernacular normalization, and disambiguation of mixed-language inputs.

We experimented with the following (a minimal prompt sketch follows this list):

  • Role-specific system messages: We explicitly instructed the LLM to behave as a multilingual spell corrector, helping it narrow its focus and reduce hallucinations. For instance, we defined its role as “You are a spell corrector that understands multilingual inputs.”
  • Few-shot examples: We included a diverse set of input-output pairs across English and vernacular queries (e.g., “skool bag” → “school bag”, “kothimbir” → “coriander leaves”) to demonstrate expected behavior, enabling the model to generalize better to unseen inputs.
  • Stepwise prompting: Instead of asking the model to directly return the final corrected query, we broke down the process into intermediate steps like: (1) detect incorrect words, (2) correct them, and (3) translate if needed. This decomposition improved the model’s accuracy and transparency.
  • Structured JSON outputs: To make downstream integration seamless, we enforced a consistent output schema, e.g., { "original": "kele chips", "corrected": "banana chips" }. This reduced parsing errors and ensured clean handoff to our ranking and retrieval systems.
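
Putting these pieces together, here is a minimal sketch of how such a prompt could be assembled. The wording, few-shot pairs, and helper names are illustrative, not our production prompt.

import json

# Illustrative prompt assembly combining the role message, stepwise instructions,
# few-shot examples, and the JSON output schema. Wording and examples are assumptions.
SYSTEM_MESSAGE = "You are a spell corrector that understands multilingual inputs."

INSTRUCTIONS = (
    "Follow these steps:\n"
    "1. Detect words that are misspelled or written phonetically in a vernacular language.\n"
    "2. Correct each detected word.\n"
    "3. Translate to English if needed.\n"
    'Return only JSON of the form {"original": ..., "corrected": ...}.'
)

FEW_SHOT_EXAMPLES = [
    {"original": "skool bag", "corrected": "school bag"},
    {"original": "kothimbir", "corrected": "coriander leaves"},
    {"original": "kele chips", "corrected": "banana chips"},
]

def build_messages(user_query: str) -> list:
    # Assemble chat messages: system role + instructions, then few-shot pairs, then the query.
    messages = [{"role": "system", "content": f"{SYSTEM_MESSAGE}\n\n{INSTRUCTIONS}"}]
    for example in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example["original"]})
        messages.append({"role": "assistant", "content": json.dumps(example)})
    messages.append({"role": "user", "content": user_query})
    return messages

print(json.dumps(build_messages("balekayi cheeps"), indent=2))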

This iterative instruct-tuning significantly improved task accuracy without increasing inference cost. It also allowed us to evolve the model quickly across use cases without retraining.

However, one issue that surfaced was the treatment of brand names. The model would often flag brand names as misspelled words. Initially, we included a list of brands in the prompt itself, but this increased the context length and inference time, reducing efficiency.

Enter RAG: Retrieval Augmented Generation

To overcome the limitations of static prompts and reduce dependency on long context windows, we implemented Retrieval Augmented Generation (RAG), a paradigm that enhances LLM performance by grounding it with relevant, real-time contextual information from an external knowledge base.

Our Architecture

We designed a two-stage architecture (a minimal code sketch follows the list):

  1. Semantic Retrieval Layer
    Each incoming user query is converted into an embedding using a multilingual embedding model. This embedding is matched against product embeddings stored in a Vector DB using Approximate Nearest Neighbor (ANN) search.
  2. Contextual Prompt Construction
    The retrieved product results — including titles, brand names, and spelling variants (autosuggestion corpora) — are filtered, deduplicated, and compiled into a dynamically constructed prompt, which is passed to the LLM for correction.

Why This Matters

Each component of this setup was built to address real-world challenges specific to a multilingual, quick commerce setting:

1. Robustness to Noisy Inputs
The retrieval engine operates in semantic space, so even severely misspelled or phonetically written queries return meaningful results.

  • Example:
    Query: “balekayi cheeps”
    → Retrieved: { "title": "Banana Chips", "brand": "Haldiram" }
    → Corrected: “banana chips”
  • Example:
    Query: “kottimbeer pudina”
    → Retrieved: ["kothimbir (coriander)", "pudina (mint)"]
    → Corrected: “coriander and mint leaves”

This helps us bypass brittle rule-based or edit-distance-based approaches, which often fail on multilingual spelling variants.

2. Brand Awareness
Earlier, we tried stuffing the prompt with a hardcoded list of brands to prevent the model from “correcting” them. This bloated the prompt and degraded latency. With RAG, we inject only query-relevant brand terms.

  • Example:
    Query: “kellogs cornflex”
    → Retrieved: { "brand": "Kellogg's", "product": "Corn Flakes" }
    → Model output: “Kellogg’s Corn Flakes”
  • Example:
    Query: “valentaz”
    → Retrieved: { "brand": "Valentas", "category": "cardiac medication" }
    → Model confidently corrects: “Valentas”

3. Prompt Efficiency & Latency Reduction
By including only the top-k retrieved entries (e.g., top 5 product variants), we keep the token count low while injecting high-precision signals. This reduced prompt size by 30–40% compared to static dictionary-based prompts and lowered average inference latency by ~18s for a batch of 1,000 queries in our benchmarks.

  • Before RAG:
    Prompt token count ~4,500+ tokens (with entire catalog dictionary & brand list)
  • After RAG:
    Prompt token count ~1,200–1,400 tokens (only contextually relevant)

4. Dynamic Learning Capability
RAG enables rapid adaptation as the catalog evolves. If a new product or brand enters the catalog, it’s automatically included in the recall without needing to retrain or reconfigure the prompt.

  • Example:
    When we introduced “Sundrop Superlite Advanced Oil”, users began typing “sundrap oil” or “sundorp advanced”. These were dynamically grounded using vector retrieval without any manual intervention.

Outcome

This RAG-enhanced setup transforms our LLM into a domain-aware, multilingual spell corrector that:

  • Understands noisy, phonetically typed inputs
  • Differentiates brands from regular words
  • Adapts to evolving product catalogs
  • Delivers corrections with low latency and high precision

This system is now a core enabler of our vernacular query understanding pipeline, and a blueprint we’re extending to adjacent use cases like voice-to-text query correction and personalized product explanations.

Leveraging User Reformulations as Implicit Feedback

In parallel to the LLM-driven correction pipeline, we also incorporate user behavior signals to identify likely spelling corrections. Specifically, we monitor query reformulations within a short time window — typically within a few seconds of the initial search.

If a user quickly reformulates a query (e.g., from “banan chips” to “banana chips”) and the reformulated query shows a higher conversion rate, we treat it as a strong implicit signal that the original query was misspelled. These reformulation pairs help us:

  • Auto-learn new misspelling variants
  • Improve prompt coverage and examples
  • Enrich training datasets for future supervised fine-tuning
  • Reinforce the spell corrector’s decision-making in production

This behavior-driven feedback loop operates in the background and complements our LLM + RAG architecture with real-world user corrections — grounding the model in observed search behavior.
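
As a rough illustration, such reformulation pairs could be mined from search logs with a window function. The sketch below assumes a hypothetical search_logs table with session_id, query, searched_at, and a 0/1 converted flag; these names are not our actual schema.

# Minimal sketch of mining reformulation pairs from search logs.
# The search_logs table and its columns (session_id, query, searched_at, converted) are hypothetical.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
logs = spark.table("search_logs")  # hypothetical log table

w = Window.partitionBy("session_id").orderBy("searched_at")

reformulation_pairs = (
    logs
    .withColumn("next_query", F.lead("query").over(w))
    .withColumn("next_searched_at", F.lead("searched_at").over(w))
    .withColumn("next_converted", F.lead("converted").over(w))
    # Keep quick reformulations (within a few seconds) where the retyped query converted better.
    .where(
        F.col("next_query").isNotNull()
        & (F.col("next_query") != F.col("query"))
        & ((F.unix_timestamp("next_searched_at") - F.unix_timestamp("searched_at")) <= 10)
        & (F.col("next_converted") > F.col("converted"))
    )
    .groupBy("query", "next_query")
    .count()
    .orderBy(F.desc("count"))
)

reformulation_pairs.show(truncate=False)  # e.g. a row like ("banan chips", "banana chips", <count>)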

We have been building several other iterations to continuously improve the model, which we plan to cover in a follow-up post.

Results: Speed, Accuracy, and Scalability

By fine-tuning our prompts and using RAG, we achieved a solution that was both fast and accurate. Hosting the model on Databricks ensured scalability, while instruct fine-tuning and RAG minimized costs and maintained high performance.

Impact

The implementation of this multilingual spell corrector had a significant positive impact on user experience and business metrics. By addressing incorrect, multilingual, and mixed grammar queries, we observed a 7.5% increase in conversion rates for the impacted queries. This improvement highlights the critical role that accurate search query understanding plays in driving user engagement and overall performance in a quick-commerce setting. The project not only enhanced the search experience but also contributed to the bottom line by helping users find the right products more efficiently, leading to increased sales and customer satisfaction.

Key Takeaways for Building with LLMs

  1. API Access vs. Self-Hosting: While API-based LLM services are great for prototyping, hosting models internally can provide better scalability and cost control in production.
  2. Instruct Fine-Tuning: Fine-tuning via instruct prompts, rather than the entire model, can save resources and improve task specificity.
  3. RAG for Efficiency: Retrieval Augmented Generation helps reduce context size and improve efficiency by providing only the necessary information to the model.
  4. Multi-Task Learning: LLMs can handle multiple tasks with careful prompt engineering, breaking down complex tasks into smaller, more manageable steps.

Credits

Monday, June 16, 2025

Microsoft Fabric: Dynamic Data Masking

 

Mastering Dynamic Data Masking in Microsoft Fabric: A Comprehensive Guide

In the realm of data security and privacy, Dynamic Data Masking (DDM) stands as a pivotal feature in Microsoft Fabric. This article delves into DDM, its significance, and its practical implementation in a Microsoft Fabric Warehouse.

Understanding Dynamic Data Masking

Dynamic Data Masking is a feature designed to mask parts of data within warehouse tables. Its primary purpose is to limit the exposure of sensitive data to individuals who do not require access to unredacted information. For instance, an email address like ‘ruicarvalho@xyz.com’ can be masked so that only the first letter and the ‘.com’ suffix remain visible, such as ‘rXXX@XXXX.com’, for users without unmasking permissions.

Fabric Scenario

In Fabric, we are looking at a Warehouse Users table with information such as username, password hash, email, and date of birth.

As admins, we want to mask some of this data from other users viewing the table.

Warehouse Users table

Role-Based Data Access

A crucial aspect of DDM in Microsoft Fabric is role-based data access. In this scenario, I’ve set up two users: an admin and a viewer. It’s important to note that the admin, member, and contributor roles can view unmasked data, while the viewer role cannot.

Workspace Manage Access

Implementing Masking Rules

As the admin user, who has full access to the warehouse, we need to mask sensitive data in the ‘Users’ table from the viewer.

Types of Masks and Their Application

Default Masking Rule: This rule is versatile and can be applied to various field types including text (like VARCHAR fields), numeric (such as INT, BIGINT, or FLOAT), and date fields (DATE or DATETIME). The default masking alters the data based on the field type. For example, in the example table Users, the PasswordHash column (a VARCHAR field) was masked with 'X's, making it impossible to see the password of each individual in the dataset. Similarly, the birth date column was masked to show a uniform date of January 1st, 1900, instead of the actual birth dates.

--Default DDM
ALTER TABLE [DW_WWI].[dbo].[Users]
ALTER COLUMN PasswordHash ADD MASKED WITH (FUNCTION = 'default()')

ALTER TABLE [DW_WWI].[dbo].[Users]
ALTER COLUMN DateOfBirth ADD MASKED WITH (FUNCTION = 'default()')
Columns Default Masked

Email Mask: Tailored specifically for email addresses, this mask transforms the email field such that only the first letter and the domain suffix (like .com) are visible.

ALTER TABLE [DW_WWI].[dbo].[Users]
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()')
Email Masked

Random Mask: Ideal for numeric fields where confidentiality is key, like salaries or income. The random mask generates a number within a specified range. In this example, we apply this DDM function to the Revenue field, where the actual revenue figures were replaced with random numbers between a defined range (100000–200000), thus concealing the real income figures.

--Random DDM
ALTER TABLE [DW_WWI].[dbo].[Users]
ALTER COLUMN Revenue ADD MASKED WITH (FUNCTION = 'random(100000, 200000)')
Revenue Masked

Custom String Mask: This mask allows for more tailored masking, where specific parts of a string can be exposed while the rest is masked. In the Users table, we can apply this to the Contact column, where we will keep visible the first 3 characters of the user's phone number and mask the rest with X's.

--Custom DDM
ALTER TABLE [DW_WWI].[dbo].[Users]
ALTER COLUMN Contact ADD MASKED WITH (FUNCTION = 'partial(3,"XXX-XXXX",0)')
Contact Masked

Check DDM Rules

There's a system view called sys.masked_columns that holds information about the columns that have a DDM rule applied.

SELECT c.name, tbl.name as table_name, c.is_masked, c.masking_function
FROM sys.masked_columns AS c
JOIN sys.tables AS tbl
ON c.[object_id] = tbl.[object_id]
WHERE is_masked = 1;
Current DDM Rules

Drop DDM Rules

If you want to remove any of the DDM rules you applied, it's very simple:

--DROP MASK
ALTER TABLE [DW_WWI].[dbo].[Users]
ALTER COLUMN Email DROP MASKED;
Email Unmasked

Conclusion

Dynamic Data Masking in Microsoft Fabric is a powerful tool for data security and privacy. By understanding and implementing DDM, organizations can ensure that sensitive data is adequately protected while still being accessible for necessary business operations. This step-by-step guide provides a practical introduction to DDM, making it a useful resource for data engineers and security professionals.

