    Since AVX support is detected at runtime when, for instance, byte slices are compared, skipping that runtime check should provide a tiny speedup.

      What is the penalty for detecting AVX support at runtime?

        Very little: just a move and a compare per check. Still, everything counts when those instructions are executed in a loop.
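
To make the "move and a compare" concrete, here is a minimal sketch of that kind of runtime dispatch. It assumes Go, which the thread does not state, and every name in it (`hasAVX2`, `equalGeneric`, `equalWide`, `equalDispatch`) is hypothetical rather than taken from any real library. The "wide" path is plain Go comparing 8 bytes per step, standing in for a vectorised routine so the example stays portable.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// hasAVX2 is a hypothetical flag standing in for a CPU feature bit that
// would normally be filled in once at program start-up.
var hasAVX2 = false

// equalGeneric compares two byte slices one byte at a time.
func equalGeneric(a, b []byte) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// equalWide stands in for a vectorised routine. It is NOT real AVX code:
// it compares 8 bytes per step and finishes the tail byte by byte.
func equalWide(a, b []byte) bool {
	if len(a) != len(b) {
		return false
	}
	i := 0
	for ; i+8 <= len(a); i += 8 {
		if binary.LittleEndian.Uint64(a[i:]) != binary.LittleEndian.Uint64(b[i:]) {
			return false
		}
	}
	return equalGeneric(a[i:], b[i:])
}

// equalDispatch pays for one load of hasAVX2 and one conditional branch
// on every call; that per-call check is the cost discussed above.
func equalDispatch(a, b []byte) bool {
	if hasAVX2 {
		return equalWide(a, b)
	}
	return equalGeneric(a, b)
}

func main() {
	x := []byte("the quick brown fox")
	y := []byte("the quick brown fox")
	fmt.Println(equalDispatch(x, y))
}
```

If the feature were guaranteed at build time, `equalDispatch` could call `equalWide` directly and the load and branch would disappear. That saving only matters when the call sits inside a hot loop over short inputs.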