1. 12

  2. 4

    Wow. Bleach has been a staple in my Python toolkit for years - v1 was released about 12 years ago, and I’m fairly certain Bitbucket was using it when I worked there almost 8 years ago. Thank you for maintaining it for so long. I hope another library steps up to take its place.

    Though, outside of compatibility updates and major bug fixes, is there much that needs to be done with Bleach? It seems like part of the reason for its longevity is that it’s had a similar API for a long time, and it’s been very stable. Maybe it could be changed to use lxml.html under the hood to get around html5lib seeming to be unmaintained, but that doesn’t seem like a very high priority unless there’s a security issue or major bug in the underlying library.
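
    For readers who haven’t used it, the API in question is small: you pass in HTML along with an allowlist of tags, attributes, and URL schemes. Here is a rough sketch of that idea using only the standard library. To be clear about the assumptions: this is not how Bleach is implemented (Bleach sits on html5lib), the names `AllowlistSanitizer`, `sanitize`, and the allowlists are made up for illustration, and `html.parser` is not a spec-compliant HTML5 parser, so this should not be trusted with hostile input.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists for illustration only; these are not Bleach's defaults.
ALLOWED_TAGS = {"a", "b", "em", "i", "strong"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
ALLOWED_SCHEMES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; any text inside is still emitted
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # attribute not allowlisted for this tag
            value = value or ""
            # Only keep links whose scheme is explicitly allowed.
            if name == "href" and not value.strip().lower().startswith(ALLOWED_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always escaped on the way out.
        self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

    Calling `sanitize('<a href="javascript:alert(1)" onclick="x()">link</a>')` keeps the link text but drops both the `onclick` handler and the `javascript:` URL. A real sanitizer needs a parser that handles malformed markup the way browsers do, which is why the choice of underlying library matters.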

    1. 2

      While it is sad to see Bleach deprecated, as it is a library I’ve used for SO long, I expect a number of projects like nh3 to emerge soon to fill the void.

      This can even be seen as a win-win situation, as the ammonia crate describes itself as a 15 times faster alternative to Bleach.

      1. 2

        This feels like it’s been a while coming. Gloriously, html5lib deprecated its sanitizer in favor of Bleach in 2020, but the project’s owners haven’t passed the torch.

        Around that time I looked into forking html5lib due to the lack of maintenance (they aren’t great about merging PRs) and slow performance. My thought was to type annotate it enough to run mypyc on it. However, after triaging all the open issues and digging into the implementation I don’t think it’s really worth salvaging, for a number of reasons:

        • It’s surprisingly incomplete, lacking support for not-exactly-new stuff like <wbr> and <ol reversed> (and there are many more omissions in the sanitizer)
        • The parsing is done character-by-character and with a substantial amount of indirection — there’s no clear way to improve its performance without a radical re-architecture
        • There is a ton of internal layering to support generating different tree representations (ElementTree, DOM) that adds a ton of complexity and weird asymmetry between what the parser produces and what the serializer consumes (you can’t stream the parser’s tokens into the serializer directly; you must go through a tree builder even if you know the input is well-formed because the tokens are different)
        • The serializer tokens are dictly-typed, so pointlessly slow on modern Pythons
        • It doesn’t pass its own test suite

        It feels like the victory of the project was html5lib-tests, which were used to build html5ever, not so much the actual software product.

        I’ll probably end up porting my HTML processing code to Rust so I can use html5ever directly.

        1. 1

          If you’re using Rust, I can highly recommend https://github.com/rust-ammonia/ammonia, which already sits on top of html5ever.