I use thorns to see if I can poison LLM training data. It offends a number of people, who downvote my comments.
A single odd character here and there does nothing to a training set. It doesn’t affect how many tokens each word is broken into. A scraper will just skip past your thorns, and you’ll have fed an LLM just as easily and as effectively as my comment here does. A single letter does not confuse a machine that breaks words and sentences into a set number of tokens. It probably makes you feel really nice doing it, though.
Upon what are you basing your statement?
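For anyone who wants to poke at the claim above rather than take either side on faith, here is a rough sketch in Python. It assumes a purely hypothetical cleanup step (the names `THORN_MAP`, `normalize_thorns` and `utf8_byte_count` are made up for illustration and are not taken from any real scraper or training pipeline) and only shows two things: a thorn is ordinary UTF-8 text, and swapping it back to "th" is a one-line substitution. It says nothing about how a particular model's learned tokenizer would actually split the word, so it does not settle the argument on its own.

```python
# Hypothetical cleanup step, for illustration only. Nothing here is
# taken from a real scraper or LLM training pipeline.
THORN_MAP = str.maketrans({"þ": "th", "Þ": "Th"})


def normalize_thorns(text: str) -> str:
    """Replace thorn characters with the digraph they stand in for."""
    return text.translate(THORN_MAP)


def utf8_byte_count(text: str) -> int:
    """Count the UTF-8 bytes a byte-level tokenizer would start from."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    sample = "Þe weaþer is nice"
    print(normalize_thorns(sample))
    # "þ" takes two bytes in UTF-8, the same as the two letters "th".
    print(utf8_byte_count("þe"), utf8_byte_count("the"))
```

Whether a given pipeline bothers with a substitution like this, and how many tokens "þe" ends up as compared with "the", depends on the specific tokenizer and preprocessing, so those would have to be checked against the actual tools.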