Regex To Remove Non-letter Characters But Keep Accented Letters

February 28, 2024

I have strings in Spanish and other languages that may contain generic special characters like (),*, etc. that I need to remove. But the problem is that they may also contain special

Solution 1:

I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.

Here's an example that strips non-Latin word characters from a string: http://jsfiddle.net/b3awZ/1/

var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?";
var replaced = XRegExp.replace(str, regex, "");

See also this answer by Steven Levithan himself:

Regular expression Spanish and Arabic words
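
For comparison, here is a minimal sketch of the same idea without a library. It assumes an ES2018+ engine, where the u flag enables Unicode property escapes such as \p{Script=Latin}. The helper name stripNonLatin is made up for illustration.

```javascript
// Assumption: an ES2018+ engine (Unicode property escapes need the "u" flag).
// stripNonLatin is a hypothetical helper name, not part of any library.
function stripNonLatin(input) {
  // Remove every run of characters that is neither whitespace
  // nor a character of the Latin script.
  return input.replace(/[^\s\p{Script=Latin}]+/gu, "");
}

const cleaned = stripNonLatin("¿Me puedes decir la contraseña de la Wi-Fi?");
console.log(cleaned); // Me puedes decir la contraseña de la WiFi
```

Accented letters written as a base letter plus a combining mark would lose the mark, so it may be worth calling input.normalize("NFC") first.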

Solution 2:

Instead of whitelisting characters you accept, you could try blacklisting illegal characters:

var desired = stringToReplace.replace(/[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/gi, '')
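
A small sketch of how the blacklist behaves, wrapped in a hypothetical helper called stripListed. It shows that accented letters survive, but also that any symbol missing from the list (here ¿) is left in place.

```javascript
// The same blacklist as above; the "i" flag is dropped because the class
// contains no letters. stripListed is a hypothetical helper name.
const ILLEGAL = /[-'`~!@#$%^&*()_|+=?;:",.<>{}\[\]\\\/]/g;

function stripListed(input) {
  // Delete every character that appears in the blacklist.
  return input.replace(ILLEGAL, "");
}

console.log(stripListed("(mérito), *Schönheit*!")); // mérito Schönheit
console.log(stripListed("¿Qué?"));                  // ¿Qué  (¿ is not in the list)
```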

Solution 3:

Note! This works only for 16-bit code points (the Basic Multilingual Plane), so this answer is incomplete.

Short answer

The character class for all arabic digits and latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].

To get a regex you can use, prepend /^ and append +$/. This will match strings consisting of only latin letters and digits like "mérito" or "Schönheit".

To match the non-digit, non-letter characters so that you can remove them, write a ^ as the first character after the opening bracket [, prepend / and append +/g (the g flag makes replace remove every match, not just the first).
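
The two recipes above can be sketched like this. The class body is copied from the short answer. The names LATIN_AND_DIGITS, isLatinAlnum and stripNonLatinAlnum are made up for this example.

```javascript
// Body of the character class from the short answer, kept as a raw string
// so the \uXXXX escapes reach the RegExp constructor unchanged.
const LATIN_AND_DIGITS = String.raw`0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06`;

// "prepend /^ and append +$/": whole string must consist of Latin letters and digits.
const ONLY_LATIN = new RegExp("^[" + LATIN_AND_DIGITS + "]+$");

// Negated class with the g flag: matches every run of other characters.
const NOT_LATIN = new RegExp("[^" + LATIN_AND_DIGITS + "]+", "g");

function isLatinAlnum(input) {
  return ONLY_LATIN.test(input);
}

function stripNonLatinAlnum(input) {
  return input.replace(NOT_LATIN, "");
}

console.log(isLatinAlnum("mérito"));               // true
console.log(isLatinAlnum("Wi-Fi"));                // false
console.log(stripNonLatinAlnum("¿contraseña?"));   // contraseña
```

Note that the negated class also removes whitespace. Add \s inside the brackets if spaces should be kept.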

How did I find that out? Continue reading.

Long answer: use metaprogramming!

Because JavaScript regexes had no Unicode support when this was written (ES2018 later added \p{...} escapes behind the u flag), I wrote a Python program to iterate over the whole 16-bit range of Unicode and filter by Unicode name. It is difficult to get this right manually. Why not let the computer do the dirty and menial work?

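import re
import sys
import unicodedata

# Ported to Python 3: chr() replaces unichr() and print is a function.

def unicodeNameMatch(pattern, codepoint):
  # Match object if the character's Unicode name starts with pattern
  # (case-insensitive), else None.
  try:
    return re.match(pattern, unicodedata.name(chr(codepoint)), re.I)
  except ValueError:
    # Unnamed code points (controls, surrogates, unassigned) end up here.
    return None

def regexChr(codepoint):
  # Printable ASCII is emitted literally, everything else as \uXXXX.
  return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint

names = sys.argv[1:]  # skip the script name itself
prev = None

js_regex = ""
for codepoint in range(pow(2, 16)):
  if any(unicodeNameMatch(name, codepoint) for name in names):
    if prev is None: js_regex += regexChr(codepoint)
    prev = codepoint
  else:
    if not prev is None: js_regex += "-" + regexChr(prev)
    prev = None
print("[" + js_regex + "]")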

Invoke it like this: python char_class.py latin digit and you get the character class mentioned above. It's an ugly char class, but you know for sure that you caught all characters whose names start with latin or digit (re.match anchors the pattern at the start of the name).

Browse the Unicode Character Database to view the names of all Unicode characters. The name is in uppercase after the first semicolon, for example for A it's the line

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

Try python char_class.py "latin small" and you get a character class for all latin small letters.

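Edit: There is a small misfeature (aka bug) in that \u271d-\u271d occurs in the regex: a run of a single character is still written as a range. Comparing prev with codepoint does not help here, because in the else branch codepoint is always one past prev. Remembering where the run started does (start is a new variable). Replace

if prev is None: js_regex += regexChr(codepoint)

with

if prev is None: js_regex += regexChr(codepoint); start = codepoint

and replace

if not prev is None: js_regex += "-" + regexChr(prev)

with

if not prev is None and prev != start: js_regex += "-" + regexChr(prev)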
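
The range-collapsing part of the script can also be sketched in JavaScript. This is only an illustration under a few assumptions: JavaScript has no built-in table of Unicode names, so the caller passes a predicate instead of name patterns, and the names toEscape and buildCharClass are made up. Single-character runs are written without a range.

```javascript
// Emit ASCII letters and digits literally, everything else as \uXXXX,
// so no character with a special meaning inside [...] slips through.
function toEscape(cp) {
  const ch = String.fromCharCode(cp);
  return /[0-9A-Za-z]/.test(ch) ? ch : "\\u" + cp.toString(16).padStart(4, "0");
}

// Build a character class for all code points below `limit` accepted by `matches`.
function buildCharClass(matches, limit = 0x10000) {
  let out = "";
  let start = null; // first code point of the current run, or null
  // Iterate one step past the end so a run touching the limit is closed too.
  for (let cp = 0; cp <= limit; cp++) {
    const hit = cp < limit && matches(cp);
    if (hit && start === null) {
      start = cp;
    } else if (!hit && start !== null) {
      const end = cp - 1;
      out += toEscape(start);
      if (end !== start) out += "-" + toEscape(end);
      start = null;
    }
  }
  return "[" + out + "]";
}

const isAsciiAlnum = (cp) => /[0-9A-Za-z]/.test(String.fromCharCode(cp));
console.log(buildCharClass(isAsciiAlnum, 128));       // [0-9A-Za-z]
console.log(buildCharClass((cp) => cp === 0x271d));   // [\u271d]
```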

Solution 4:

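var desired = stringToReplace.replace(/(?=[\u0000-\u007F])\W/g, '');

might do the trick: the lookahead restricts \W to the ASCII range, so every ASCII character that is not a letter, digit or underscore is removed (spaces included), and everything above U+007F is left alone.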

See also this Javascript + Unicode regexes question.

Solution 5:

If you insist on whitelisting, here is the rawest way of doing it:

Test if string contains only letters (a-z + é ü ö ê å ø etc.)

It works by keeping track of 'all' unicode letter chars.
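
As a point of comparison, here is a minimal sketch of the same whitelist test on an ES2018+ engine, where \p{L} stands for any Unicode letter. This is not the code from the linked answer, and isOnlyLetters is a made-up name.

```javascript
// Assumption: ES2018+ (the "u" flag enables \p{L}, the Unicode Letter category).
// isOnlyLetters is a hypothetical helper name.
function isOnlyLetters(input) {
  // True only for non-empty strings made up entirely of letters.
  return /^\p{L}+$/u.test(input);
}

console.log(isOnlyLetters("mérito"));  // true
console.log(isOnlyLetters("abc123"));  // false
console.log(isOnlyLetters("Wi-Fi"));   // false
```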
