ci: add linter task "ban unicode" to protect against malicious unicode #9801
@f321x and I talked about how a malicious PR could try to add invisible unicode characters to the code and perhaps introduce backdoors/vulnerabilities that way. I think if we can add simple protection against such trickery, it might be worthwhile. I think the approach here is simple enough.
This new `ban_unicode.py` script scans the whole codebase for non-ASCII (unicode) characters and errors if it finds any, unless the character is specifically whitelisted. We can run it on the CI: its runtime is only ~2-3 seconds (+20 seconds for CI task overhead).

The motivation is to protect against homoglyph attacks, invisible unicode characters, bidirectional and other control characters, and other malicious unicode usage.
Given that we mostly expect to use ASCII characters in the source code, the most robust and generic fix seems to be to just ban all unicode usage. We only rarely use unicode characters, e.g. for ASCII drawings or in the name of a copyright-holder person. Every time we introduce a new usage (which has historically been very rare), we can just add the relevant characters to the whitelist.
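For illustration, the ban-everything-except-a-whitelist approach can be sketched roughly as below. This is a minimal sketch and not the actual contents of `ban_unicode.py`: the names `ALLOWED_CHARS`, `find_banned_chars` and `scan_tree`, the whitelist entries, and the choice to scan only `.py` files are all assumptions made for the example.

```python
from pathlib import Path
import unicodedata

# Hypothetical whitelist (assumption): e-acute and two box-drawing characters.
# Written as escapes so that this file itself stays pure ASCII.
ALLOWED_CHARS = frozenset("\u00e9\u2500\u2502")


def find_banned_chars(text, allowed=ALLOWED_CHARS):
    """Return (line, column, char) for each non-ASCII char not in `allowed`."""
    hits = []
    # Split on "\n" only: str.splitlines() also splits on U+2028/U+2029 and
    # would silently swallow those characters instead of reporting them.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127 and ch not in allowed:
                hits.append((line_no, col_no, ch))
    return hits


def scan_tree(root, suffixes=(".py",)):
    """Scan files under `root` and return one human-readable line per hit."""
    failures = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8")
            for line_no, col_no, ch in find_banned_chars(text):
                name = unicodedata.name(ch, "UNNAMED")
                failures.append(
                    f"{path}:{line_no}:{col_no}: U+{ord(ch):04X} {name}"
                )
    return failures
```

A CI wrapper would call `scan_tree(".")`, print the returned lines, and exit non-zero if the list is non-empty. Reporting the code point and its Unicode name matters here, because the offending character is often invisible in the diff.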
see https://trojansource.codes/
also https://github.com/maltfield/detect-malicious-unicode