1. 6
    1. 12

      Could also be titled “regex based security tools are garbage”. Your linter needs to follow the same parsing rules as the language it’s linting.

      1. 2

        I was surprised that the Python parser uses normalization when parsing Unicode. I fail to see the rationale to parse “𝘀𝘦𝘭𝘧” and “self” as the same token.

        1. 5

          Part of the rationale is documented here: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574

          In short, Unicode is easy to abuse to craft code that looks unsuspicious but ends up doing much more than what the eye sees. Python’s approach seems to be “what you see is what you get” (with some edge cases).
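
          A quick way to see this normalization in action is a minimal sketch using only the standard `unicodedata` module (PEP 3131 is where CPython specifies NFKC normalization for identifiers):

          ```python
          import unicodedata

          fancy = "𝘀𝘦𝘭𝘧"  # MATHEMATICAL SANS-SERIF ITALIC letters, not ASCII

          # NFKC maps the styled math letters back to plain ASCII.
          assert unicodedata.normalize("NFKC", fancy) == "self"

          # The parser applies the same normalization to identifiers,
          # so assigning through the fancy spelling defines plain `self`.
          ns = {}
          exec(fancy + " = 42", ns)
          assert ns["self"] == 42
          ```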

        2. 5

          How to allow “arbitrary” Unicode in identifiers is something that has standard recommendations from the Unicode Consortium, including a whole section on normalization.

          But there are some intuitive issues involved here:

          • If you and I both work on a codebase that allows most-of-Unicode in identifiers, and my system inputs things using decomposed (combining sequences) forms while yours inputs things using composed forms whenever possible, not doing normalization means you and I are really typing different identifiers.
          • If someone who reviews patches uses a font that doesn’t (sufficiently) visually distinguish some distinct-but-compatibility-equivalent sequences, not doing normalization means they might get tricked into accepting a patch that does something different than what it “looks like”.
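
          To make the first bullet concrete, here is a minimal sketch (standard library only) showing that composed and decomposed input differ as raw strings but collapse under normalization, which is what the Python parser does for identifiers:

          ```python
          import unicodedata

          composed = "caf\u00e9"     # 'é' as the single code point U+00E9
          decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

          assert composed != decomposed  # as raw strings, these differ
          assert unicodedata.normalize("NFC", decomposed) == composed

          # CPython normalizes identifiers (NFKC includes canonical
          # composition), so both spellings name the same variable:
          ns = {}
          exec(decomposed + " = 1", ns)
          assert composed in ns
          ```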


          And so it’s not just Python that needs to be aware of this; last I checked, Rust, for example, had similar UAX #31-based processing to account for the fact that it allows more than just ASCII in identifiers.

          1. 2

            Thanks for your reply, along with the one from @isra17. I’m pretty sure the Python development team have thought longer and harder on this than I have 😉

            (btw was unicode allowed in Python scripts (as identifiers) prior to Python 3?)

            I agree with @Student, the problem isn’t Python allowing different letter-like symbols in code, it’s naive “code scanning” software that doesn’t account for the possibility.

            1. 3

              (btw was unicode allowed in Python scripts (as identifiers) prior to Python 3?)

              • Python 2: Python source code files were assumed ASCII by default unless they declared an alternative encoding with a magic comment at the top, and identifiers were restricted to ASCII letters, digits, and underscores, and had to begin with a letter or underscore.
              • Python 3: Python source code files are assumed UTF-8 by default (though you can still add a magic encoding comment to select another encoding), and identifiers must begin with a code point having the Unicode derived property XID_Start, with all following code points having XID_Continue.
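
              A sketch of the two rule sets side by side, using only the standard library (`str.isidentifier()` checks the Python 3 XID-based rule; the regex approximates the old Python 2 rule):

              ```python
              import re

              # Python 2's identifier rule: ASCII letters, digits, underscore,
              # not starting with a digit.
              PY2_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

              # Python 3's rule (XID_Start / XID_Continue) is what
              # str.isidentifier() implements.
              assert "self".isidentifier()
              assert "𝘀𝘦𝘭𝘧".isidentifier()      # exotic letters are XID_Start too
              assert not "1abc".isidentifier()  # digits cannot start an identifier

              assert PY2_IDENT.match("self")
              assert not PY2_IDENT.match("𝘀𝘦𝘭𝘧")  # rejected under the Python 2 rule
              ```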