How to do case-insensitive and accent-insensitive search in Oracle Database
Most applications store a wide variety of text such as names, addresses, and descriptions. When searching on these values, often you want letters in the values to match the search pattern disregarding the case.
For example, the Olympic data set stores athlete names in this format:
FAMILY NAME, Given Names
To find all the people who have Barry for any of their names, you can (upper- or) lowercase the column and search string:
select athlete_name from olym_athletes where upper ( athlete_name ) like upper ( '%barry%' ) order by athlete_name; ATHLETE_NAME BARRY, Kevin BARRY, William Louis DAGGER, Barry Edward DANCER, Barry DAVIS, Barry Alan DEMET-BARRY, Deirdre KELLY, Barry MAGEE, Arthur Barry MAISTER, Barry John WEITZENBERG, Charles Barry
While simple, it's easy to forget to standardize the case and makes it harder for the optimizer to use indexes.
Plus it only solves some of the text searching challenges. What if you want accent insensitive searches too? Or to ignore case when sorting the values? Or take language-specific cases into consideration, like German ‘ß’ matching ‘SS’?
From Oracle Database 12.2 these are all easy. In this post you'll see how to do:
- Case-insensitive search
- Accent-insensitive search
- Sort rows regardless of case or accents
- Define search and sorting rules for a column
- Ignore case or diacritics before Oracle Database 12.2
- Create indexes for fast searches
- A word of caution when collating long text
If you want to try the examples yourself, you can get the scripts on Live SQL.
How to do case-insensitive searches
Oracle Database 12.2 added the collate
operator. This allows you to specify the search and sort semantics for text in a query. This comes after the expression you want to apply these to.
The format for this is:
<expression> collate <collation name>
The collation defines the comparison rules; binary_ci
does a case-insensitive comparison using the numeric value of the characters. So to find all the athletes with "barry" in their name in any case, write:
select athlete_name from olym_athletes where athlete_name like '%barry%' collate binary_ci order by athlete_name; ATHLETE_NAME BARRY, Kevin BARRY, William Louis DAGGER, Barry Edward ...
This applies the collation to the literal %barry%
. This takes priority over any object or session collation settings.
You can use collations anywhere you compare character values (varchar2
, char
, nvarchar2
, and nchar
). This includes order by
, group by
, set operators, and many character functions. Refer to the documentation for a complete list.
They also apply to the standard comparison conditions: =, <, >=, in, between
, etc. So to find how many athletes whose family name starts with the letter "n" or later in the alphabet, use:
select count (*) from olym_athletes where athlete_name >= 'n' collate binary_ci; COUNT(*) 8203
Without the collate clause this returns:
select count (*) from olym_athletes where athlete_name >= 'n'; COUNT(*) 12
Lowercase comes after uppercase in a binary collation and all the names in the table start with uppercase letters. So who are the athletes with lowercase family names?
select athlete_name from olym_athletes where athlete_name >= 'n'; ATHLETE_NAME ÖRSTED, Hans-Henrik ÖRTEGREN, Ruben ÖRVIG, Erik ÖRVIG, Olav ÖRVIG, Thor ÖSTENSEN, Östen ÖSTERVOLD, Henrik ÖSTERVOLD, Jan Olsen ÖSTERVOLD, Kristian Olsen ÖSTERVOLD, Ole Olsen ÖSTMO, Ole ÖSTRAND, Per-Olof
Interesting.
Their names are in uppercase, but start with "Ö" (O-umlaut). Characters with diacritics (acute, cedilla, etc.) are after all other letters in a binary collation. So O < n, but Ö > n.
This raises a couple of questions: how can we do accent-insensitive searches and how can we sort diacritics to the correct location in an alphabet?
How to do accent-insensitive searches
To ignore diacritics in comparisons, use an accent-insensitive collation. Oracle Database has a wide range of collations. These are of the form:
<collation name>{_(CI|AI)}
The suffix determines the comparison rules:
- CI = case-insensitive, accent-sensitive
- AI = accent- and case-insensitive
If you omit the suffix, the collation is case and accent sensitive. There are no case-sensitive but accent insensitive collations.
So to find all athletes with a name like "Helene" including the accented variations (é, etc.), use:
select substr ( athlete_name, instr ( athlete_name, ', ' ) + 2 ) given_names, athlete_name from olym_athletes where athlete_name like '%helene%' collate binary_ai order by given_names; GIVEN_NAMES ATHLETE_NAME Helene CORTIN, Helene Helene JUNKER, Helene Helene MAYER, Helene Helene MADISON, Helene Helene "Leni" SCHMIDT, Helene "Leni" Hélène PREVOST, Hélène Marie-Helene PREMONT, Marie-Helene
Note that Hélène appears last in this list of Helenes; Helene "Leni" appears above it.
You can change Hélène's position in the results by defining a collation for the sort.
How to do case- or accent-insensitive sorting
You can state which collation to use when sorting data by either:
- Including the collation in the
order by
- Defining the collation in the
select
list, alias the expression, reference the alias in theorder by
For example:
select substr ( athlete_name, instr ( athlete_name, ',' ) + 2 ) given_names, athlete_name from olym_athletes where athlete_name like '%helene%' collate binary_ai order by given_names collate binary_ai; GIVEN_NAMES ATHLETE_NAME Helene CORTIN, Helene Helene JUNKER, Helene Helene MAYER, Helene Helene MADISON, Helene Hélène PREVOST, Hélène Helene "Leni" SCHMIDT, Helene "Leni" Marie-Helene PREMONT, Marie-Helene -- this returns the same output as above select substr ( athlete_name, instr ( athlete_name, ',' ) + 2 ) collate binary_ai given_names, athlete_name from olym_athletes where athlete_name like '%helene%' collate binary_ai order by given_names;
With the collation binary_ai
, Helene has the same value as Hélène. So the output of these queries are non-deterministic. Hélène could appear anywhere in the list of Helenes.
In many languages, some accented characters have a specific position in their alphabet. To ensure these letters appear in the correct location for that language, you can use a linguistic collation. In general these match the name of the target language, for example:
- French (_ci, _ai)
- German (_ci, _ai)
- Czech (_ci, _ai)
Using linguistic accent-sensitive collations you can change where Hélène appears. This shows the difference between French and Czech ordering:
select substr ( athlete_name, instr ( athlete_name, ',' ) + 2 ) collate czech as given_names, athlete_name from olym_athletes where athlete_name like '%, helene%' collate binary_ai order by given_names; GIVEN_NAMES ATHLETE_NAME Helene CORTIN, Helene Helene JUNKER, Helene Helene MAYER, Helene Helene MADISON, Helene Helene "Leni" SCHMIDT, Helene "Leni" Hélène PREVOST, Hélène select substr ( athlete_name, instr ( athlete_name, ',' ) + 2 ) collate french as given_names, athlete_name from olym_athletes where athlete_name like '%, helene%' collate binary_ai order by given_names; GIVEN_NAMES ATHLETE_NAME Helene CORTIN, Helene Helene JUNKER, Helene Helene MAYER, Helene Helene MADISON, Helene Hélène PREVOST, Hélène Helene "Leni" SCHMIDT, Helene "Leni"
In French, Hélène is above Helene "Leni"; in Czech the opposite is true.
If you're unsure which order diacritic characters appear in a language, you can check their sort key using nlssort
. Just pass the characters to this with your desired collation:
select substr ( athlete_name, instr ( athlete_name, ',' ) + 2 ) collate french as given, nlssort ( substr ( athlete_name, instr ( athlete_name, ',' ) + 2 ) collate french ) as french_sort_value, nlssort ( substr ( athlete_name, instr ( athlete_name, ',' ) + 2 ) collate czech ) as czech_sort_value from olym_athletes where athlete_name like '%, helene%' collate binary_ai order by given; GIVEN FRENCH_SORT_VALUE CZECH_SORT_VALUE Helene 37284B2855280001020202020200 37284B2855280002010101010100 Helene 37284B2855280001020202020200 37284B2855280002010101010100 Helene 37284B2855280001020202020200 37284B2855280002010101010100 Helene 37284B2855280001020202020200 37284B2855280002010101010100 Hélène 37284B2855280001040206020200 37284B55280002030100E8010100 ...
The collate
operator gives you lots of flexibility to search and sort text exactly as you want. But it suffers from the same issues as using upper
or lower
. Namely you have to remember to do it and it can limit the optimizer's ability to use indexes.
To overcome these you can push collation definitions down to the column itself.
Make case-insensitive the default with column-level collation
When creating tables you can specify a collation for each text column and a default for the whole table:
create table olym_athletes ( id number not null, athlete_name varchar2(255) collate binary_ai not null, athlete_gender varchar2(10) collate binary_ci not null ) default collation binary_ai;
NOTE: To define collations at the column, table or schema level you need to enable extended types. You must also set compatible to 12.2 or higher.
Any columns with unspecified collations inherit this from the table default. If you omit this, the database takes the user's default collation.
You can check the default collation for a table and any column-level overrides with this query:
select table_name, default_collation, column_name, collation from user_tables join user_tab_cols using ( table_name ) where table_name = 'OLYM_ATHLETES'; TABLE_NAME DEFAULT_COLLATION COLUMN_NAME COLLATION OLYM_ATHLETES BINARY_AI IDOLYM_ATHLETES BINARY_AI ATHLETE_NAME BINARY_AI OLYM_ATHLETES BINARY_AI ATHLETE_GENDER BINARY_CI
You can also change these at any time with alter table
:
alter table olym_athletes modify athlete_name collate french_ci;
These are metadata changes, so are instant no matter how many rows are in the table.
NOTE: You must drop any indexes on the target column before you can change its collation and recreate them afterwards. Creating an index on a large table can take a long time. There are also several other restrictions for when you can change the collation of a column.
With column-level collation in place, comparisons use the specified collation by default:
select athlete_name from olym_athletes where athlete_name like '%barry%'; ATHLETE_NAME WEITZENBERG, Charles Barry DAGGER, Barry Edward DANCER, Barry DAVIS, Barry Alan DEMET-BARRY, Deirdre KELLY, Barry MAGEE, Arthur Barry MAISTER, Barry John BARRY, Kevin BARRY, William Louis
When you change the default for a table or schema, existing columns keep their current collation. Only new columns use the new default. So if you want to make all the character columns in an existing schema case-insensitive, you should:
- Change the default for the user
- Change the default for all the tables
- Drop indexes on all character columns
- Change the collation for every column
- Re-create the indexes
It is rare you'll want to do this for an entire schema. While it's handy to default people's names, addresses, product descriptions, etc. to a case-insensitive collation, applications typically store many other types of text. In some cases, these need to be case-sensitive, for example, Oracle usernames.
A better default collation is binary
. This is the fastest collation for comparisons and can use standard indexes. It also makes comparisons of column values immune to session settings, which are used in the legacy method for case-insensitive search. This is often critical for correct results.
If you've never used the collate
keyword in any create
or alter
statements, all columns have the collation using_nls_comp
, which means the columns use the legacy method.
You can also define collations in virtual columns and views. So if you often search a column with different language or sensitivity rules, you could add these for each common collation you'll work with:
alter table olym_athletes add ( athlete_name_french as ( athlete_name collate french ), athlete_name_czech as ( athlete_name collate czech ), athlete_name_binary_ci as ( athlete_name collate binary_ci ) );
NOTE: You must omit the data type of the virtual column when defining a collation. If you include the data type, the virtual column will use the table's default collation.
So this is all great if you're on 12.2 or newer, but what if you're stuck on an ancient version of Oracle Database?
How to do accent- and case-insensitive search on Oracle Database 12.1 and earlier
At this point you may be wondering: how do I do all this on vintage versions of Oracle Database I have?
We've already seen the most common workaround – converting all the characters to a standard case. You can also do case-insensitive comparisons using regexp_like
.
But what if you want this to happen transparently, like with column-level collation? Or need accent-insensitive searches?
You can do this by setting the nls_sort
and nls_comp
parameters. Check their current values with this query:
select * from nls_session_parameters where parameter in ( 'NLS_COMP', 'NLS_SORT' ); BINARY BINARY
To use a different collation for comparisons, set nls_comp
to linguistic
and nls_sort
to the collation you want to use:
alter session set nls_sort = binary_ai; alter session set nls_comp = linguistic;
You can continue to use these parameters to control searching and sorting from 12.2 onwards. If you do, the collate
operator and any column-level collations you've specified take precedence over these parameters.
Note that setting nls_comp
and nls_sort
parameters affect all comparisons in a query. For complex queries, with many comparisons and joins, this may be a big performance problem. The collate
operator gives you the flexibility to apply more expansive linguistic processing only where it is really needed.
If you've defined column-level collations and want to revert back to using the NLS parameters, change the column to use using_nls_comp:
alter table olym_athletes modify athlete_name collate using_nls_comp;
Whichever method you use to do insensitive queries, you'll want the database to use an index to make them fast. The question is: how?
How to index case-insensitive queries
If you're relying on NLS settings to do case-insensitive search, create a function-based index using nlssort
:
create index alth_name_nls on olym_athletes ( nlssort ( athlete_name, 'nls_sort = binary_ai') ); select * from olym_athletes where athlete_name = 'latynina, larisa'; ID ATHLETE_NAME ATHLETE_GENDER 10549 LATYNINA, Larisa Women ------------------------------------------------------------- | Id | Operation | Name | ------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED| OLYM_ATHLETES | | 2 | INDEX RANGE SCAN | ALTH_NAME_NLS | -------------------------------------------------------------
Queries using the collate operator can also range scan these indexes. So if you have an existing application with lots of nlssort
indexes you can start using collate
and the optimizer will still be able to use these indexes.
But once you're on 12.2 it's easier and more obvious to use collate within the index itself! For example:
alter session set nls_comp = binary; alter session set nls_sort = binary; select * from olym_athletes where athlete_name = 'latynina, larisa' collate binary_ai; ------------------------------------------------------------- | Id | Operation | Name | ------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED| OLYM_ATHLETES | | 2 | INDEX RANGE SCAN | ALTH_NAME_NLS | ------------------------------------------------------------- create index athlete_name_ci on olym_athletes ( athlete_name collate binary_ci ); select * from olym_athletes where athlete_name = 'latynina, larisa' collate binary_ci; --------------------------------------------------------------- | Id | Operation | Name | --------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED| OLYM_ATHLETES | | 2 | INDEX RANGE SCAN | ATHLETE_NAME_CI | ---------------------------------------------------------------
So however you choose to do your case-insensitive searches, you can index them for fast access.
There is one final thing to be aware of - using collate
on long strings can lead to unexpected outcomes.
Insensitive searches on long text strings
Collation keys are raw
values. This presents a couple of problems:
- The maximum size of
raw
is 2,000 bytes ifmax_string_size = standard
and 32,767 bytes if it'sextended
- Collation keys are typically much longer than the source text
So when comparing long strings with the collate
operator, there's a good chance the collated values will go beyond the raw
limit. What happens in these cases depends on whether you have extended data types enabled or not.
If these aren't (the default), then the collate
operator will only convert characters at the start of the text. Trailing parts of it are ignored.
This can lead to surprising results where strings with different endings are considered equal:
select * from olym_athletes where rpad ( athlete_name, 2000, 'trailing chars' ) = rpad ( 'LATYNINA, Larisa', 2000, 'trailing chars' ) || 'extra chars'; no rows selected select * from olym_athletes where rpad ( athlete_name, 2000, 'trailing chars' ) = rpad ( 'LATYNINA, Larisa', 2000, 'trailing chars' ) || 'extra chars' collate binary_ai; ID ATHLETE_NAME ATHLETE_GENDER 10549 LATYNINA, Larisa Women
As collations apply to all comparisons, this goes beyond basic equality checks. You can find that group by
and order by
queries give incorrect results.
Remember this happens silently. Beware!
If you are using extended data types, the behaviour changes. When the sort key overflows the (much larger) raw
limit, you'll get an ORA-12742
error:
select * from olym_athletes where rpad ( athlete_name, 30000, 'trailing chars' ) = rpad ( 'LATYNINA, Larisa', 30000, 'trailing chars' ) || 'extra chars' collate french_ai; ORA-12742: unable to create the collation key
This can lead to a poor user experience, but there's a bigger problem lurking. Hackers may be able to exploit this to carry out DoS attacks! To avoid this, ensure that the maximum length of values you collate are:
- 21,844 for the collation
binary_ci
- 4,094 bytes for monolingual or multilingual collations
- 1,560 bytes for a UCA collation
In practice, this means you need to take one or more of these defensive actions:
- Only use
collate
on columns storing short strings such as names - Design your applications to ensure one user can't insert values that causes another's to overflow
- Verify only safe strings are inserted into the table
You can do the final check in code or by adding a check constraint to the table that rejects values that could lead to an error, for example:
alter table olym_athletes add constraint max_collation_c check ( vsize ( nlssort ( athlete_name collate UCA1210_DUCET ) ) != -1 );
The comparison in the above condition is irrelevant and is always true. However, as nlssort
is evaluated when you add rows, the insert fails with etc.ORA-12742
instead of the later queries.
Note that the above issue only affects operations that use materialized collation keys behind the scene (implicitly add calls to nlssort
into expression evaluation) and does not affect many other operations such as regular expression matching, select distinct, max, min, replace, instr,
etc.
For more details on this issue, read the documentation on avoiding ORA-12742
errors.
Summary
So there you have it. Collations are an easy way to find and sort text according to the case and accent rules you want to use. Using the collate
operator gives you fine-grained control over the rules you use in any textual comparison (searching, sorting, grouping, etc.).
To simplify your code, you can push collation definitions to the columns themselves. Generally you'll want to define columns using the collation binary
. Reserve linguistic collations for the few columns where all searches need to be accent- or case-insensitive. If you regularly search data with different language settings, you can add virtual columns for each you use.
If you're on Oracle Database 12.1 or earlier, you can use the session parameters nls_comp
and nls_sort
to enable case-insensitivity. These - along with their nlssort
indexes - will continue to work when you upgrade. So you can gradually migrate your code to use collate
.
Remember that however you do linguistic comparisons, they're only safe on short strings. When used on long text, you can get wrong results. Ensure you write your application to defend against this!
We hope you enjoy the new power these options give you; we'd love to hear how you're using these.
Do you work with text in many languages? Have you used column-level collations yet? Let us know the comments!
Read more about linguistic sorting and matching in the documentation.
Get the code for this post on Live SQL
Looking for more SQL tips and tricks? Take Databases for Developers: Next Level to improve your SQL skills.
Comments