Word boundaries give unexpected results for words that start or end with special characters

Say I want to match the presence of the phrase Sortes\index[persons]{Sortes} in the phrase test Sortes\index[persons]{Sortes} text.

Using Python's re module, I could do this:

>>> import re
>>> search = re.escape(r'Sortes\index[persons]{Sortes}')
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

This works, but I want to prevent the search pattern Sortes from giving a positive result on the phrase test Sortes\index[persons]{Sortes} text.

>>> re.search(re.escape('Sortes'), match)
<_sre.SRE_Match object; span=(5, 11), match='Sortes'>

So I use the \b pattern, like this:

search = r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'\b'
match = r'test Sortes\index[persons]{Sortes} text'
re.search(search, match)

Now I get no match.

If the search pattern does not contain any of the characters []{}, it works. E.g.:

>>> re.search(r'\b' + re.escape(r'Sortes\index') + r'\b', r'test Sortes\index test')
<_sre.SRE_Match object; span=(5, 17), match='Sortes\\index'>

If I remove the final r'\b', it also works:

>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}'), r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

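Furthermore, the documentation says this about \b:

Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.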

So I tried replacing the final \b with (\W|$):

>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'(\W|$)', r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 35), match='Sortes\\index[persons]{Sortes} '>

Voilà, it works! What is going on here? What am I missing?

北极想你

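See what a word boundary matches:

A word boundary can occur in one of three positions:

1. Before the first character in the string, if the first character is a word character.
2. After the last character in the string, if the last character is a word character.
3. Between two characters in the string, where one is a word character and the other is not a word character.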
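To make the three positions concrete, here is a small sketch that lists every index where \b matches in a few short strings. The helper name boundary_positions is made up for this illustration and is not part of the re module.

```python
import re

def boundary_positions(text):
    """Return every index at which the zero-width word-boundary assertion matches."""
    return [m.start() for m in re.finditer(r'\b', text)]

# Start of string, both sides of the space, end of string.
print(boundary_positions('ab cd'))   # [0, 2, 3, 5]

# Boundaries sit just inside the braces, never outside them.
print(boundary_positions('{ab}'))    # [1, 3]

# No word character at all, so there is no boundary anywhere.
print(boundary_positions('} '))      # []
```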

In your pattern, }\b only matches if there is a word char right after } (a letter, digit or _).

When you use (\W|$) you require a non-word or end of string explicitly.
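A quick way to check this, reusing the strings from the question: the same \b...\b pattern fails when } is followed by a space, but matches once a letter directly follows }. The variable names below are only for illustration.

```python
import re

target = re.escape(r'Sortes\index[persons]{Sortes}')
pattern = r'\b' + target + r'\b'

followed_by_space = r'test Sortes\index[persons]{Sortes} text'
followed_by_letter = r'test Sortes\index[persons]{Sortes}text'

# "}" and " " are both non-word characters, so the trailing \b cannot match.
print(re.search(pattern, followed_by_space))          # None

# "}" followed by "t" is a non-word/word transition, so the trailing \b matches.
print(re.search(pattern, followed_by_letter).span())  # (5, 34)
```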

In such cases, I always recommend word boundaries based on negative lookarounds:

re.search(r'(?<!\w){}(?!\w)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')

Here, (?<!\w) negative lookbehind will fail the match if there is a word char immediately to the left of the current location, and (?!\w) negative lookahead will fail the match if there is a word char immediately to the right of the current location.
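If you need this in several places, the lookarounds can be wrapped in a small helper. This is only a sketch; find_whole is a hypothetical name, not something the re module provides.

```python
import re

def find_whole(needle, haystack):
    """Search for a literal needle that has no word character on either side."""
    pattern = r'(?<!\w)' + re.escape(needle) + r'(?!\w)'
    return re.search(pattern, haystack)

text = r'test Sortes\index[persons]{Sortes} text'

# The full phrase is found even though it ends with "}".
print(find_whole(r'Sortes\index[persons]{Sortes}', text).span())  # (5, 34)

# A needle made only of special characters on its edges works too.
print(find_whole('{Sortes}', text).span())                        # (26, 34)

# "Sort" is followed by the word character "e", so it is rejected.
print(find_whole('Sort', text))                                   # None
```

Note that a bare Sortes still matches in this text, because the character after it is a backslash, which is not a word character.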

Actually, it is easy to customize these lookaround patterns further: to fail the match only when there are letters around the pattern, use [^\W\d_] instead of \w; to allow matches only when the pattern is surrounded by whitespace (or the start/end of the string), use (?<!\S) / (?!\S) lookaround boundaries.
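As a rough illustration of both variants, the sketch below passes the left and right boundary in as parameters. The function search_bounded and the constant names are invented for this example.

```python
import re

def search_bounded(needle, haystack, left, right):
    """Search for a literal needle between two caller-supplied lookarounds."""
    return re.search(left + re.escape(needle) + right, haystack)

# Letter-only boundaries: digits and underscores next to the needle are allowed.
NO_LETTER_BEFORE = r'(?<![^\W\d_])'
NO_LETTER_AFTER = r'(?![^\W\d_])'
print(search_bounded('abc', 'abc1', NO_LETTER_BEFORE, NO_LETTER_AFTER).span())  # (0, 3)
print(search_bounded('abc', 'abcd', NO_LETTER_BEFORE, NO_LETTER_AFTER))         # None

# Whitespace boundaries: only whitespace or the string edges may surround the needle.
NO_NONSPACE_BEFORE = r'(?<!\S)'
NO_NONSPACE_AFTER = r'(?!\S)'
print(search_bounded('{x}', 'a {x} b', NO_NONSPACE_BEFORE, NO_NONSPACE_AFTER).span())  # (2, 5)
print(search_bounded('{x}', 'a {x}, b', NO_NONSPACE_BEFORE, NO_NONSPACE_AFTER))        # None
```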

bjstry

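I think this is the problem you are running into:

\b lands on the boundary between \w and \W. In the example that doesn't work, the final \b in '{Sortes}\b' sits between '}' and the following space, and both of those are \W: '}' doesn't match [a-zA-Z0-9_], the ordinary set for \w. With \W on both sides there is no boundary, so the match fails.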
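You can check which class each neighbouring character falls into with a few lines like these. The classify helper is just something written for this illustration.

```python
import re

text = r'test Sortes\index[persons]{Sortes} text'

def classify(ch):
    # Label a single character as \w (word) or \W (non-word).
    return r'\w' if re.fullmatch(r'\w', ch) else r'\W'

end = text.index('}') + 1    # position right after the closing brace
start = text.index('S')      # position right before the first "S"

# Around the end of the phrase: "}" then " ", both non-word, so no boundary.
print(classify(text[end - 1]), classify(text[end]))      # \W \W

# Around the start of the phrase: " " then "S", a real word boundary.
print(classify(text[start - 1]), classify(text[start]))  # \W \w
```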
