Subrat's Technical Blog: How to do case-insensitive and accent-insensitive search in Oracle Database

Most applications store a wide variety of text such as names, addresses, and descriptions. When searching on these values, often you want letters in the values to match the search pattern disregarding the case.

For example, the Olympic data set stores athlete names in this format:

FAMILY NAME, Given Names

To find all the people who have Barry for any of their names, you can (upper- or) lowercase the column and search string:

select athlete_name 
from   olym_athletes
where  upper ( athlete_name ) like upper ( '%barry%' )
order  by athlete_name;

ATHLETE_NAME                                                                                                                                                                                                                                                   
BARRY, Kevin
BARRY, William Louis
DAGGER, Barry Edward
DANCER, Barry
DAVIS, Barry Alan
DEMET-BARRY, Deirdre
KELLY, Barry
MAGEE, Arthur Barry
MAISTER, Barry John
WEITZENBERG, Charles Barry

While simple, it's easy to forget to standardize the case and makes it harder for the optimizer to use indexes.

Plus it only solves some of the text searching challenges. What if you want accent insensitive searches too? Or to ignore case when sorting the values? Or take language-specific cases into consideration, like German ‘ß’ matching ‘SS’?

From Oracle Database 12.2 these are all easy. In this post you'll see how to do:

If you want to try the examples yourself, you can get the scripts on Live SQL.

Photo by Andrea Piacquadio from Pexels

How to do case-insensitive searches

Oracle Database 12.2 added the collate operator. This allows you to specify the search and sort semantics for text in a query. This comes after the expression you want to apply these to.

The format for this is:

<expression> collate <collation name>

The collation defines the comparison rules; binary_ci does a case-insensitive comparison using the numeric value of the characters. So to find all the athletes with "barry" in their name in any case, write:

select athlete_name 
from   olym_athletes
where  athlete_name like '%barry%' 
         collate binary_ci
order  by athlete_name;

ATHLETE_NAME                 
BARRY, Kevin                  
BARRY, William Louis          
DAGGER, Barry Edward 
...

This applies the collation to the literal %barry%. This takes priority over any object or session collation settings.

You can use collations anywhere you compare character values (varchar2, char, nvarchar2, and nchar). This includes order by, group by, set operators, and many character functions. Refer to the documentation for a complete list.

They also apply to the standard comparison conditions: =, <, >=, in, between, etc. So to find how many athletes whose family name starts with the letter "n" or later in the alphabet, use:

select count (*)
from   olym_athletes
where  athlete_name >= 'n' 
         collate binary_ci;
         
COUNT(*)   
       8203

Without the collate clause this returns:

select count (*)
from   olym_athletes
where  athlete_name >= 'n';

COUNT(*)   
         12

Lowercase comes after uppercase in a binary collation and all the names in the table start with uppercase letters. So who are the athletes with lowercase family names?

select athlete_name
from   olym_athletes
where  athlete_name >= 'n';

ATHLETE_NAME                
ÖRSTED, Hans-Henrik          
ÖRTEGREN, Ruben              
ÖRVIG, Erik                  
ÖRVIG, Olav                  
ÖRVIG, Thor                  
ÖSTENSEN, Östen              
ÖSTERVOLD, Henrik            
ÖSTERVOLD, Jan Olsen         
ÖSTERVOLD, Kristian Olsen    
ÖSTERVOLD, Ole Olsen         
ÖSTMO, Ole                   
ÖSTRAND, Per-Olof

Interesting.

Their names are in uppercase, but start with "Ö" (O-umlaut). Characters with diacritics (acute, cedilla, etc.) are after all other letters in a binary collation. So O < n, but Ö > n.

This raises a couple of questions: how can we do accent-insensitive searches and how can we sort diacritics to the correct location in an alphabet?

How to do accent-insensitive searches

To ignore diacritics in comparisons, use an accent-insensitive collation. Oracle Database has a wide range of collations. These are of the form:

<collation name>{_(CI|AI)}

The suffix determines the comparison rules:

CI = case-insensitive, accent-sensitive
AI = accent- and case-insensitive

If you omit the suffix, the collation is case and accent sensitive. There are no case-sensitive but accent insensitive collations.

So to find all athletes with a name like "Helene" including the accented variations (é, etc.), use:

select substr ( 
         athlete_name, 
         instr ( athlete_name, ', ' ) + 2 
       ) given_names,
       athlete_name
from   olym_athletes
where  athlete_name like '%helene%' 
         collate binary_ai
order  by given_names;

GIVEN_NAMES      ATHLETE_NAME             
Helene           CORTIN, Helene            
Helene           JUNKER, Helene            
Helene           MAYER, Helene             
Helene           MADISON, Helene           
Helene "Leni"    SCHMIDT, Helene "Leni"    
Hélène           PREVOST, Hélène           
Marie-Helene     PREMONT, Marie-Helene

Note that Hélène appears last in this list of Helenes; Helene "Leni" appears above it.

You can change Hélène's position in the results by defining a collation for the sort.

Image by Michael Schwarzenberger from Pixabay

How to do case- or accent-insensitive sorting

You can state which collation to use when sorting data by either:

Including the collation in the order by
Defining the collation in the select list, alias the expression, reference the alias in the order by

For example:

select substr ( 
         athlete_name, instr ( athlete_name, ',' ) + 2 
       ) given_names,
       athlete_name
from   olym_athletes
where  athlete_name like '%helene%' 
         collate binary_ai
order  by given_names
  collate binary_ai;
  
GIVEN_NAMES      ATHLETE_NAME             
Helene           CORTIN, Helene            
Helene           JUNKER, Helene            
Helene           MAYER, Helene             
Helene           MADISON, Helene           
Hélène           PREVOST, Hélène           
Helene "Leni"    SCHMIDT, Helene "Leni"    
Marie-Helene     PREMONT, Marie-Helene

-- this returns the same output as above
select substr ( 
         athlete_name, instr ( athlete_name, ',' ) + 2 
       ) collate binary_ai given_names,
       athlete_name
from   olym_athletes
where  athlete_name like '%helene%' 
         collate binary_ai
order  by given_names;

With the collation binary_ai, Helene has the same value as Hélène. So the output of these queries are non-deterministic. Hélène could appear anywhere in the list of Helenes.

In many languages, some accented characters have a specific position in their alphabet. To ensure these letters appear in the correct location for that language, you can use a linguistic collation. In general these match the name of the target language, for example:

French (_ci, _ai)
German (_ci, _ai)
Czech (_ci, _ai)

Using linguistic accent-sensitive collations you can change where Hélène appears. This shows the difference between French and Czech ordering:

select substr ( 
         athlete_name, instr ( athlete_name, ',' ) + 2 
       ) collate czech as given_names,
       athlete_name
from   olym_athletes
where  athlete_name like '%, helene%' 
         collate binary_ai
order  by given_names;

GIVEN_NAMES      ATHLETE_NAME             
Helene           CORTIN, Helene            
Helene           JUNKER, Helene            
Helene           MAYER, Helene             
Helene           MADISON, Helene           
Helene "Leni"    SCHMIDT, Helene "Leni"    
Hélène           PREVOST, Hélène 

select substr ( 
         athlete_name, instr ( athlete_name, ',' ) + 2 
       ) collate french as given_names,
       athlete_name
from   olym_athletes
where  athlete_name like '%, helene%' 
         collate binary_ai
order  by given_names;

GIVEN_NAMES      ATHLETE_NAME             
Helene           CORTIN, Helene            
Helene           JUNKER, Helene            
Helene           MAYER, Helene             
Helene           MADISON, Helene           
Hélène           PREVOST, Hélène           
Helene "Leni"    SCHMIDT, Helene "Leni"

In French, Hélène is above Helene "Leni"; in Czech the opposite is true.

If you're unsure which order diacritic characters appear in a language, you can check their sort key using nlssort. Just pass the characters to this with your desired collation:

select substr ( 
         athlete_name, instr ( athlete_name, ',' ) + 2 
       ) collate french as given,
       nlssort ( 
         substr ( 
           athlete_name, instr ( athlete_name, ',' ) + 2 
         ) collate french 
       ) as french_sort_value,
       nlssort ( 
         substr ( 
           athlete_name, instr ( athlete_name, ',' ) + 2 
         ) collate czech
       ) as czech_sort_value
from   olym_athletes
where  athlete_name like '%, helene%' 
         collate binary_ai
order  by given;

GIVEN   FRENCH_SORT_VALUE             CZECH_SORT_VALUE                                           
Helene  37284B2855280001020202020200  37284B2855280002010101010100                                
Helene  37284B2855280001020202020200  37284B2855280002010101010100                                
Helene  37284B2855280001020202020200  37284B2855280002010101010100                                
Helene  37284B2855280001020202020200  37284B2855280002010101010100                                
Hélène  37284B2855280001040206020200  37284B55280002030100E8010100                                
...

The collate operator gives you lots of flexibility to search and sort text exactly as you want. But it suffers from the same issues as using upper or lower. Namely you have to remember to do it and it can limit the optimizer's ability to use indexes.

To overcome these you can push collation definitions down to the column itself.

Make case-insensitive the default with column-level collation

When creating tables you can specify a collation for each text column and a default for the whole table:

create table olym_athletes (	
  id number 
    not null, 
  athlete_name varchar2(255) 
    collate binary_ai 
    not null, 
  athlete_gender varchar2(10) 
    collate binary_ci 
    not null
) default collation binary_ai;

NOTE: To define collations at the column, table or schema level you need to enable extended types. You must also set compatible to 12.2 or higher.

Any columns with unspecified collations inherit this from the table default. If you omit this, the database takes the user's default collation.

You can check the default collation for a table and any column-level overrides with this query:

select table_name,  
       default_collation,
       column_name, 
       collation
from   user_tables
join   user_tab_cols
using  ( table_name )
where  table_name = 'OLYM_ATHLETES';

TABLE_NAME     DEFAULT_COLLATION  COLUMN_NAME     COLLATION   
OLYM_ATHLETES  BINARY_AI          ID                     
OLYM_ATHLETES  BINARY_AI          ATHLETE_NAME    BINARY_AI    
OLYM_ATHLETES  BINARY_AI          ATHLETE_GENDER  BINARY_CI

You can also change these at any time with alter table:

alter table olym_athletes
  modify athlete_name collate french_ci;

These are metadata changes, so are instant no matter how many rows are in the table.

NOTE: You must drop any indexes on the target column before you can change its collation and recreate them afterwards. Creating an index on a large table can take a long time. There are also several other restrictions for when you can change the collation of a column.

With column-level collation in place, comparisons use the specified collation by default:

select athlete_name from olym_athletes
where  athlete_name like '%barry%';

ATHLETE_NAME                 
WEITZENBERG, Charles Barry    
DAGGER, Barry Edward          
DANCER, Barry                 
DAVIS, Barry Alan             
DEMET-BARRY, Deirdre          
KELLY, Barry                  
MAGEE, Arthur Barry           
MAISTER, Barry John           
BARRY, Kevin                  
BARRY, William Louis

When you change the default for a table or schema, existing columns keep their current collation. Only new columns use the new default. So if you want to make all the character columns in an existing schema case-insensitive, you should:

Change the default for the user
Change the default for all the tables
Drop indexes on all character columns
Change the collation for every column
Re-create the indexes

It is rare you'll want to do this for an entire schema. While it's handy to default people's names, addresses, product descriptions, etc. to a case-insensitive collation, applications typically store many other types of text. In some cases, these need to be case-sensitive, for example, Oracle usernames.

A better default collation is binary. This is the fastest collation for comparisons and can use standard indexes. It also makes comparisons of column values immune to session settings, which are used in the legacy method for case-insensitive search. This is often critical for correct results.

If you've never used the collate keyword in any create or alter statements, all columns have the collation using_nls_comp, which means the columns use the legacy method.

You can also define collations in virtual columns and views. So if you often search a column with different language or sensitivity rules, you could add these for each common collation you'll work with:

alter table olym_athletes
  add (
    athlete_name_french as 
      ( athlete_name collate french ),
    athlete_name_czech as
      ( athlete_name collate czech ),
    athlete_name_binary_ci as
      ( athlete_name collate binary_ci ) 
  );

NOTE: You must omit the data type of the virtual column when defining a collation. If you include the data type, the virtual column will use the table's default collation.

So this is all great if you're on 12.2 or newer, but what if you're stuck on an ancient version of Oracle Database?

How to do accent- and case-insensitive search on Oracle Database 12.1 and earlier

At this point you may be wondering: how do I do all this on vintage versions of Oracle Database I have?

We've already seen the most common workaround – converting all the characters to a standard case. You can also do case-insensitive comparisons using regexp_like.

But what if you want this to happen transparently, like with column-level collation? Or need accent-insensitive searches?

You can do this by setting the nls_sort and nls_comp parameters. Check their current values with this query:

select *
from   nls_session_parameters
where  parameter in ( 'NLS_COMP', 'NLS_SORT' );

BINARY
BINARY

To use a different collation for comparisons, set nls_comp to linguistic and nls_sort to the collation you want to use:

alter session set nls_sort = binary_ai;
alter session set nls_comp = linguistic;

You can continue to use these parameters to control searching and sorting from 12.2 onwards. If you do, the collate operator and any column-level collations you've specified take precedence over these parameters.

Note that setting nls_comp and nls_sort parameters affect all comparisons in a query. For complex queries, with many comparisons and joins, this may be a big performance problem. The collate operator gives you the flexibility to apply more expansive linguistic processing only where it is really needed.

If you've defined column-level collations and want to revert back to using the NLS parameters, change the column to use using_nls_comp:

alter table olym_athletes
  modify athlete_name collate using_nls_comp;

Whichever method you use to do insensitive queries, you'll want the database to use an index to make them fast. The question is: how?

How to index case-insensitive queries

If you're relying on NLS settings to do case-insensitive search, create a function-based index using nlssort:

create index alth_name_nls
  on olym_athletes ( 
    nlssort ( athlete_name, 'nls_sort = binary_ai') 
  );

select * 
from   olym_athletes 
where  athlete_name = 'latynina, larisa';

ID      ATHLETE_NAME       ATHLETE_GENDER   
  10549 LATYNINA, Larisa   Women    

-------------------------------------------------------------               
| Id  | Operation                           | Name          |               
-------------------------------------------------------------               
|   0 | SELECT STATEMENT                    |               |               
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| OLYM_ATHLETES |               
|   2 |   INDEX RANGE SCAN                  | ALTH_NAME_NLS |               
-------------------------------------------------------------

Queries using the collate operator can also range scan these indexes. So if you have an existing application with lots of nlssort indexes you can start using collate and the optimizer will still be able to use these indexes.

But once you're on 12.2 it's easier and more obvious to use collate within the index itself! For example:

alter session set nls_comp = binary;
alter session set nls_sort = binary;

select * 
from   olym_athletes 
where  athlete_name = 'latynina, larisa' 
         collate binary_ai;

-------------------------------------------------------------        
| Id  | Operation                           | Name          |        
-------------------------------------------------------------        
|   0 | SELECT STATEMENT                    |               |        
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| OLYM_ATHLETES |        
|   2 |   INDEX RANGE SCAN                  | ALTH_NAME_NLS |        
-------------------------------------------------------------
  
create index athlete_name_ci
  on olym_athletes ( 
    athlete_name 
      collate binary_ci 
  );
  
select * 
from   olym_athletes 
where  athlete_name = 'latynina, larisa' 
         collate binary_ci;
         
---------------------------------------------------------------      
| Id  | Operation                           | Name            |      
---------------------------------------------------------------      
|   0 | SELECT STATEMENT                    |                 |      
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| OLYM_ATHLETES   |      
|   2 |   INDEX RANGE SCAN                  | ATHLETE_NAME_CI |      
---------------------------------------------------------------

So however you choose to do your case-insensitive searches, you can index them for fast access.

There is one final thing to be aware of - using collate on long strings can lead to unexpected outcomes.

Insensitive searches on long text strings

Collation keys are raw values. This presents a couple of problems:

The maximum size of raw is 2,000 bytes if max_string_size = standard and 32,767 bytes if it's extended
Collation keys are typically much longer than the source text

So when comparing long strings with the collate operator, there's a good chance the collated values will go beyond the raw limit. What happens in these cases depends on whether you have extended data types enabled or not.

If these aren't (the default), then the collate operator will only convert characters at the start of the text. Trailing parts of it are ignored.

This can lead to surprising results where strings with different endings are considered equal:

select * 
from   olym_athletes 
where  rpad ( athlete_name, 2000, 'trailing chars' ) = 
         rpad ( 'LATYNINA, Larisa', 2000, 'trailing chars' ) 
         || 'extra chars'; 
         
no rows selected

select * 
from   olym_athletes 
where  rpad ( athlete_name, 2000, 'trailing chars' ) = 
         rpad ( 'LATYNINA, Larisa', 2000, 'trailing chars' ) 
         || 'extra chars'
         collate binary_ai;
         
ID      ATHLETE_NAME       ATHLETE_GENDER   
  10549 LATYNINA, Larisa   Women

As collations apply to all comparisons, this goes beyond basic equality checks. You can find that group by and order by queries give incorrect results.

Remember this happens silently. Beware!

If you are using extended data types, the behaviour changes. When the sort key overflows the (much larger) raw limit, you'll get an ORA-12742 error:

select * 
from   olym_athletes 
where  rpad ( athlete_name, 30000, 'trailing chars' ) = 
         rpad ( 'LATYNINA, Larisa', 30000, 'trailing chars' ) 
         || 'extra chars'
         collate french_ai;
         
ORA-12742: unable to create the collation key

This can lead to a poor user experience, but there's a bigger problem lurking. Hackers may be able to exploit this to carry out DoS attacks! To avoid this, ensure that the maximum length of values you collate are:

21,844 for the collation binary_ci
4,094 bytes for monolingual or multilingual collations
1,560 bytes for a UCA collation

In practice, this means you need to take one or more of these defensive actions:

Only use collate on columns storing short strings such as names
Design your applications to ensure one user can't insert values that causes another's to overflow
Verify only safe strings are inserted into the table

You can do the final check in code or by adding a check constraint to the table that rejects values that could lead to an error, for example:

alter table olym_athletes
  add constraint max_collation_c
  check ( 
    vsize ( 
      nlssort ( athlete_name collate UCA1210_DUCET ) 
    ) != -1
  );

The comparison in the above condition is irrelevant and is always true. However, as nlssort is evaluated when you add rows, the insert fails with etc.ORA-12742 instead of the later queries.

Note that the above issue only affects operations that use materialized collation keys behind the scene (implicitly add calls to nlssort into expression evaluation) and does not affect many other operations such as regular expression matching, select distinct, max, min, replace, instr, etc.

For more details on this issue, read the documentation on avoiding ORA-12742 errors.

Summary

So there you have it. Collations are an easy way to find and sort text according to the case and accent rules you want to use. Using the collate operator gives you fine-grained control over the rules you use in any textual comparison (searching, sorting, grouping, etc.).

To simplify your code, you can push collation definitions to the columns themselves. Generally you'll want to define columns using the collation binary. Reserve linguistic collations for the few columns where all searches need to be accent- or case-insensitive. If you regularly search data with different language settings, you can add virtual columns for each you use.

If you're on Oracle Database 12.1 or earlier, you can use the session parameters nls_comp and nls_sort to enable case-insensitivity. These - along with their nlssort indexes - will continue to work when you upgrade. So you can gradually migrate your code to use collate.

Remember that however you do linguistic comparisons, they're only safe on short strings. When used on long text, you can get wrong results. Ensure you write your application to defend against this!

We hope you enjoy the new power these options give you; we'd love to hear how you're using these.

Do you work with text in many languages? Have you used column-level collations yet? Let us know the comments!

Get the code for this post on Live SQL

Looking for more SQL tips and tricks? Take Databases for Developers: Next Level to improve your SQL skills.

Subrat's Technical Blog

Thursday, April 1, 2021

How to do case-insensitive and accent-insensitive search in Oracle Database