Revision 955d4f53b528ed8836147f33f54cbb049e73126e authored by David Roberts on 14 May 2021, 06:04:55 UTC, committed by GitHub on 14 May 2021, 06:04:55 UTC
The ml_classic tokenizer creates two (or more) tokens for email
addresses, at minimum splitting on the @ symbol.

This change makes the new ml_standard tokenizer preserve email
addresses as a single token.

Tokens that contain an @ symbol but are otherwise purely numeric
are ignored, as though they were just numbers. Additionally, @
symbols at the beginning and end of tokens are ignored.
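
The rules above can be illustrated with a rough sketch. This is not
the ml_standard implementation: the function name sketch_tokenize, the
candidate character set, and the reading of "ignored" as "stripped" or
"dropped" are all assumptions made for illustration only.

```python
import re

# Assumption: a candidate token is a run of ASCII letters, digits and a
# few punctuation characters that commonly appear in email addresses.
_CANDIDATE = re.compile(r"[A-Za-z0-9_.@-]+")


def sketch_tokenize(text):
    """Hypothetical illustration of the email-handling rules."""
    tokens = []
    for match in _CANDIDATE.finditer(text):
        # Rule: @ symbols at the beginning and end of a token are ignored.
        token = match.group().strip("@")
        # Rule: a token that is purely numeric apart from its @ symbols
        # is ignored, just as a plain number would be.
        remainder = token.replace("@", "")
        if not remainder or remainder.isdigit():
            continue
        # An email address survives as a single token because @ is part
        # of the candidate character set and is never used as a separator.
        tokens.append(token)
    return tokens
```

Under these assumptions, "user alice@example.com logged in" yields
user, alice@example.com, logged, in; "123@456 failed" yields only
failed; and "@admin@ 42" yields only admin. By the description above,
ml_classic would instead split alice@example.com at the @ symbol.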
1 parent d6e9d18