Skip to Main Content

Accents

I encountered some strange behaviour when trying to remove all characters from a varchar2, except a-z, A-Z and 0-9.
Using regular expressions, you can define that in a number of ways, such as [^A-Za-z0-9] or [^[:alnum:]]

Example 1 works as expected (see the query below):
For A1 the ë (ascii 235) is not included, since it falls outside of the [^A-Za-z0-9] range.
For A2 the ë is included, since it is first converted to a regular e character.

Example 2 is (a dummy version of) the exact same query on another (also 11gR2) database and client. Here, the result for A1 was a bit different:
For A1 the ë was included.
A2 was the same.

Probably the characterset settings of the databases and/or clients were the cause? I have included the info gathered by PL/SQL Developer for both setups.
My favourite online regular expressions tool regex101.com agrees that example 1 is as expected.

Accents Example 1

A1A2
Ab3X1Ab3eX1
Character size: 4 byte(s)
CharSetID: 873
NCharSetID: 2000
Unicode Support: True
NLS_LANG: AMERICAN_AMERICA.WE8MSWIN1252
NLS_CHARACTERSET: AL32UTF8
NLS_NCHAR_CHARACTERSET: AL16UTF16

Accents Example 2

A1A2
Ab3ëX1Ab3eX1
Character size: 1 byte(s)
CharSetID: 31
NCharSetID: 2000
Unicode Support: True
NLS_LANG: DUTCH_THE NETHERLANDS.WE8MSWIN1252
NLS_NCHAR_CHARACTERSET: AL16UTF16
NLS_CHARACTERSET: WE8ISO8859P1

Value Pairs

A simple example how to split string like 'ABC:123;XYZ:789' into multiple value pairs.

Value Pairs Example

RValue 1Value 2
ABC:123ABC123
ABCD:1234ABCD1234
WXYZ:6789WXYZ6789
XYZ:789XYZ789

Code

Region

Identification
Sequence
Title
Type
Source
Source Type
Region Source

Region

Identification
Sequence
Title
Type
Source
Source Type
Region Source