Regex To Remove Non-letter Characters But Keep Accented Letters
Solution 1:
I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.
Here's an example that strips non-Latin word characters from a string: http://jsfiddle.net/b3awZ/1/
var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?";
var replaced = XRegExp.replace(str, regex, "");
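For comparison, here is a minimal sketch of the same idea without a library. It assumes a runtime that supports ES2018 Unicode property escapes (the u flag with \p{Script=Latin}); the function name stripNonLatin is made up for illustration and is not part of XRegExp.

```javascript
// Remove every run of characters that is neither whitespace nor Latin script.
// Assumption: the engine supports ES2018 Unicode property escapes.
function stripNonLatin(str) {
  return str.replace(/[^\s\p{Script=Latin}]+/gu, "");
}

console.log(stripNonLatin("¿Me puedes decir la contraseña de la Wi-Fi?"));
// "Me puedes decir la contraseña de la WiFi"
```

Digits and punctuation belong to the Common script, so they are stripped as well; add \p{N} to the class if digits should survive.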
See also this answer by Steven Levithan himself:
Solution 2:
Instead of whitelisting characters you accept, you could try blacklisting illegal characters:
var desired = stringToReplace.replace(/[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/gi, '')
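A small sketch of how the blacklist behaves, wrapped in a hypothetical helper called stripBlacklisted. It shows that accented letters survive, and also that any symbol missing from the list (here the inverted question mark) survives too.

```javascript
// The blacklist from the answer above; the i flag is dropped because the
// class contains no letters.
const ILLEGAL = /[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/g;

function stripBlacklisted(str) {
  return str.replace(ILLEGAL, "");
}

console.log(stripBlacklisted("¿Me puedes decir la contraseña de la Wi-Fi?"));
// "¿Me puedes decir la contraseña de la WiFi"
```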
Solution 3:
Note! This works only for 16-bit code points (the Basic Multilingual Plane). This answer is incomplete.
Short answer
The character class for all Arabic digits and Latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].
To get a regex you can use, prepend /^ and append +$/. This will match strings consisting of only Latin letters and digits, like "mérito" or "Schönheit".
To match non-digit or non-letter characters in order to remove them, write a ^ as the first character after the opening bracket [, prepend /, and append +/ (add the g flag if every occurrence should be replaced, not just the first run).
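The two recipes can be sketched like this. The constant and variable names are made up for illustration; the class body is copied from the short answer, and String.raw keeps the \uXXXX escapes intact so that the RegExp constructor interprets them.

```javascript
// Body of the character class from the short answer, without the brackets.
const LATIN_AND_DIGITS = String.raw`0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06`;

// Validation: the whole string consists of Latin letters and digits.
const onlyLatin = new RegExp("^[" + LATIN_AND_DIGITS + "]+$");

// Removal: every run of characters outside the class.
const notLatin = new RegExp("[^" + LATIN_AND_DIGITS + "]+", "g");

console.log(onlyLatin.test("mérito"));             // true
console.log(onlyLatin.test("Wi-Fi"));              // false
console.log("¿Me puedes?".replace(notLatin, ""));  // "Mepuedes"
```

Because the class contains no whitespace, the removal pattern also strips spaces; add \s to the class body if they should be kept.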
How did I find that out? Continue reading.
Long answer: use metaprogramming!
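Because JavaScript did not have Unicode-aware regexes when this was written (Unicode property escapes such as \p{L} only arrived with the u flag in ES2018), I wrote a Python program to iterate over the whole Basic Multilingual Plane and filter by Unicode name. It is difficult to get this right manually. Why not let the computer do the dirty and menial work?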
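from __future__ import print_function

import re
import sys
import unicodedata

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3: chr already covers all code points


def unicodeNameMatch(pattern, codepoint):
    # re.match anchors at the start, so the pattern must match the beginning of the name.
    try:
        return re.match(pattern, unicodedata.name(unichr(codepoint)), re.I)
    except ValueError:
        # unicodedata.name raises ValueError for code points without a name.
        return None


def regexChr(codepoint):
    # Printable ASCII is emitted literally, everything else as a \uXXXX escape.
    return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint


names = sys.argv[1:]  # skip the script name
prev = None
js_regex = ""
for codepoint in range(pow(2, 16)):
    if any([unicodeNameMatch(name, codepoint) for name in names]):
        if prev is None: js_regex += regexChr(codepoint)
        prev = codepoint
    else:
        if not prev is None: js_regex += "-" + regexChr(prev)
        prev = None
print("[" + js_regex + "]")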
Invoke it like this: python char_class.py latin digit
and you get the character class mentioned above. It's an ugly char class, but you know for sure that you caught all characters whose names begin with latin or digit (re.match anchors the pattern at the start of the name).
Browse the Unicode Character Database to view the names of all Unicode characters. The name is in uppercase after the first semicolon; for example, for A it's the line
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
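As a rough illustration of that line format, the following sketch splits such a line on semicolons and picks out the code point, the name, and the general category. The helper name parseUnicodeDataLine is hypothetical, and only the first three fields are interpreted here.

```javascript
// Parse one semicolon-separated line of the Unicode Character Database.
// Field 0 is the code point in hex, field 1 the name, field 2 the general category.
function parseUnicodeDataLine(line) {
  const fields = line.split(";");
  return {
    codepoint: parseInt(fields[0], 16),
    name: fields[1],
    category: fields[2],
  };
}

const entry = parseUnicodeDataLine("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;");
console.log(entry.codepoint, entry.name, entry.category);
// 65 "LATIN CAPITAL LETTER A" "Lu"

// The same kind of test the Python script applies: a case-insensitive
// match anchored at the start of the name.
console.log(/^latin/i.test(entry.name)); // true
```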
Try python char_class.py "latin small" and you get a character class for all Latin small letters.
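Edit: There is a small misfeature (aka bug) in that single-character ranges such as \u271d-\u271d occur in the regex. Perhaps this fix helps: Replace
if not prev is None: js_regex += "-" + regexChr(prev)
by
if not prev is None and prev != codepoint: js_regex += "-" + regexChr(prev)
Note, though, that in the else branch prev is always smaller than codepoint, so this condition never suppresses anything. To drop the redundant ranges, the script has to remember where the current run started and compare prev against that start instead.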
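One way to avoid single-character ranges is to remember where each run of consecutive code points starts and emit a range only when the run is longer than one character. The following is a sketch of that idea, not the author's script; toCharClass and formatRun are illustrative names, and regexChr mirrors the Python helper of the same name.

```javascript
// Printable ASCII literally, anything else as a \uXXXX escape (BMP only).
// Assumption: the input contains no class metacharacters such as ] or \.
function regexChr(codepoint) {
  if (codepoint >= 32 && codepoint < 127) return String.fromCharCode(codepoint);
  return "\\u" + codepoint.toString(16).padStart(4, "0");
}

// A run of length one becomes a single character, longer runs become a range.
function formatRun(start, end) {
  return start === end ? regexChr(start) : regexChr(start) + "-" + regexChr(end);
}

// Collapse a sorted array of code points into a character class string.
function toCharClass(codepoints) {
  let out = "";
  let start = null;
  let prev = null;
  for (const cp of codepoints) {
    if (prev !== null && cp === prev + 1) {
      prev = cp; // still inside the current run
      continue;
    }
    if (start !== null) out += formatRun(start, prev);
    start = prev = cp; // begin a new run
  }
  if (start !== null) out += formatRun(start, prev);
  return "[" + out + "]";
}

console.log(toCharClass([0x30, 0x31, 0x32, 0x271d, 0x2c7e, 0x2c7f]));
// [0-2\u271d\u2c7e-\u2c7f]
```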
Solution 4:
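var desired = stringToReplace.replace(/[\u0000-\u007F][\W]/gi, '');
might do the trick. Note that, as written, the pattern matches two characters at a time (an ASCII character followed by a non-word character), so "a,b" becomes "b"; restricting a single non-word character to the ASCII range is presumably what was intended.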
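One reading of the intent behind this pattern is "remove ASCII characters that are neither word characters nor whitespace, and leave everything outside ASCII alone". A minimal sketch of that reading uses a lookahead to intersect the two conditions; the function name stripAsciiPunctuation is made up, and this is an interpretation rather than the answer's own code.

```javascript
// The lookahead restricts the match to ASCII; [^\w\s] then picks the
// characters that are neither word characters nor whitespace.
function stripAsciiPunctuation(str) {
  return str.replace(/(?=[\u0000-\u007F])[^\w\s]/g, "");
}

console.log(stripAsciiPunctuation("¿Me puedes decir la contraseña de la Wi-Fi?"));
// "¿Me puedes decir la contraseña de la WiFi"
```

Accented letters are kept because they fail the ASCII lookahead, but so is non-ASCII punctuation such as the inverted question mark. The underscore counts as a word character and is kept as well.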
See also this Javascript + Unicode regexes question.
Solution 5:
If you must insist on whitelisting here is the rawest way of doing it:
Test if string contains only letters (a-z + é ü ö ê å ø etc..)
It works by keeping track of 'all' unicode letter chars.
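In engines with ES2018 Unicode property escapes, a letters-only test no longer needs a hand-maintained list. The following is a sketch under that assumption; isOnlyLetters is a hypothetical name, and the NFC normalization step is an addition so that letters written with combining marks are composed before matching where a precomposed form exists.

```javascript
// True if the string is non-empty and consists only of Unicode letters.
// Assumption: the engine supports \p{L} with the u flag (ES2018).
function isOnlyLetters(str) {
  return /^\p{L}+$/u.test(str.normalize("NFC"));
}

console.log(isOnlyLetters("Schönheit")); // true
console.log(isOnlyLetters("Wi-Fi"));     // false
```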