python-re: How to match an alpha character

How can I match an alpha character with a regular expression? I want a character that is in \w but is not in \d. I want it to be Unicode compatible, which is why I cannot use [a-zA-Z].

Best answer

Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.

A Venn diagram on the back of an envelope helps here. Let's look at what we don't want:

(1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _

So what we don't want is anything in the character class [\W\d_], and consequently what we do want is anything in the character class [^\W\d_].

Here is a simple example (Python 2.6).

>>> import re
>>> rx = re.compile(r"[^\W\d_]+", re.UNICODE)
>>> rx.findall(u"abc_def,k9")
[u'abc', u'def', u'k']
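
For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str patterns are matched with Unicode semantics by default, so the re.UNICODE flag is not needed. The name ALPHA and the sample characters are illustrative only. The loop checks the set reasoning above on a handful of characters; it is a spot check, not a proof.

```python
import re

# Assumption: Python 3, where str patterns use Unicode matching by default.
ALPHA = re.compile(r"[^\W\d_]+")

print(ALPHA.findall("abc_def,k9"))

# Spot check of the set reasoning: a character is in [^\W\d_] exactly when
# it matches \w, does not match \d, and is not an underscore.
for ch in "aZ_9\u00e9 ,\u0473\u0660":
    in_class = re.fullmatch(r"[^\W\d_]", ch) is not None
    expected = (
        re.fullmatch(r"\w", ch) is not None
        and re.fullmatch(r"\d", ch) is None
        and ch != "_"
    )
    assert in_class == expected
```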

Further exploration reveals a few quirks of this approach:

>>> import unicodedata as ucd
>>> allsorts = u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
...     print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']

U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w), but Python appears to interpret "digit" as "decimal digit" (category Nd), so it doesn't match \d.

U+24E8 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w.

All CJK ideographs are classed as "letters" and thus match \w.
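
The three points above can be inspected one character at a time. The following is a rough sketch, again assuming Python 3; classify is a hypothetical helper introduced here, not part of re or unicodedata. For each character it reports the Unicode general category and whether \w, \d and [^\W\d_] accept it.

```python
import re
import unicodedata

ALPHA = re.compile(r"[^\W\d_]")

def classify(ch):
    # Hypothetical helper: the general category of ch, followed by whether
    # \w, \d and [^\W\d_] each match that single character.
    return (
        unicodedata.category(ch),
        re.fullmatch(r"\w", ch) is not None,
        re.fullmatch(r"\d", ch) is not None,
        ALPHA.fullmatch(ch) is not None,
    )

for ch in "\u3021\u24e8\u4e0a":
    print("U+%04X" % ord(ch), unicodedata.name(ch), classify(ch))
```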

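Whether or not any of the above 3 points are a concern, this approach is the best you will get out of the currently released re module. Syntax like \p{letter} is in the future.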
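
Until something like \p{letter} is available in re, one way to approximate it is to filter on the Unicode general category directly. The sketch below is not a regular expression and not the approach described above; it assumes Python 3 and assumes that "letter" means a general category starting with L (Lu, Ll, Lt, Lm, Lo). The names is_letter and letter_runs are hypothetical.

```python
import unicodedata
from itertools import groupby

def is_letter(ch):
    # Assumption: "letter" means general category L* (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")

def letter_runs(text):
    # Hypothetical helper: maximal runs of consecutive letter characters.
    return ["".join(group) for letter, group in groupby(text, key=is_letter) if letter]

print(letter_runs("abc_def,k9"))
print(letter_runs("\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"))
```

Unlike [^\W\d_], this excludes U+3021 (category Nl), because it looks only at the general category and ignores numeric values. The third-party regex package on PyPI also accepts \p{L} in patterns, if an extra dependency is acceptable.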
