1. 17

I’ve wanted to check user passwords against Have I Been Pwned (HIBP) for a while, but have been hesitant to add a network connection to our user signup flow.

This motivated me to build a bloom filter based on the HIBP master list to perform offline checks. It only checks the top 11 million passwords (rather than the full 550 million) to keep the size of the bloom filter down.
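
For anyone unfamiliar with the technique, here’s a minimal sketch of a bloom-filter membership check (my illustration with made-up parameters, not the actual implementation). Each password hashes to k bit positions in a fixed-size bit array, so a lookup can produce a false positive but never a false negative:

import hashlib

# Illustrative parameters; real values depend on the target false-positive rate.
M = 8 * 32 * 1024 * 1024   # filter size in bits (a 32MB array)
K = 17                     # number of hash functions

def _positions(password):
    # Derive K bit positions from a single SHA-1 via double hashing.
    d = hashlib.sha1(password.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]

def add(bits, password):
    for p in _positions(password):
        bits[p // 8] |= 1 << (p % 8)

def maybe_pwned(bits, password):
    # False is definitive; True may (rarely) be a false positive.
    return all(bits[p // 8] & (1 << (p % 8)) for p in _positions(password))

bits = bytearray(M // 8)
add(bits, "password")
print(maybe_pwned(bits, "password"))      # True
print(maybe_pwned(bits, "zx9!qv~Ae3"))    # almost certainly False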

  1. 6

    Just use look(1).

    % time look $(echo -n bla | sha1 | tr \[:lower:\] \[:upper:\]) pwned-passwords-1.0.txt
    FFA6706FF2127A749973072756F83C532E43ED02
    look $(echo -n bla | sha1 | tr \[:lower:\] \[:upper:\])   0.00s user 0.00s system 85% cpu 0.001 total
    

    It uses binary search, so it’s fast; just make sure you have the file sorted by hash. (There’s a sketch of the idea after the timings below.)

    It’s a bit slower the first time around, but that might just be because of my storage.

    % time look 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 pwned-passwords-1.0.txt
    5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8
    look 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 pwned-passwords-1.0.txt  0.00s user 0.00s system 0% cpu 0.574 total
    
    % time look 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 pwned-passwords-1.0.txt
    5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8
    look 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 pwned-passwords-1.0.txt  0.00s user 0.00s system 75% cpu 0.002 total
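
    To illustrate what look(1) is doing under the hood, here’s a rough Python equivalent: binary search over fixed-width records in the sorted file (this assumes the 1.0 file’s 40-hex-digit lines; the newer HASH:count format would need line-oriented seeking):

    import os

    REC = 41  # 40 hex chars + newline per record in pwned-passwords-1.0.txt

    def in_sorted_file(path, target):
        target = target.upper().encode("ascii")
        with open(path, "rb") as f:
            lo, hi = 0, os.path.getsize(path) // REC
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid * REC)   # jump straight to the middle record
                rec = f.read(40)
                if rec == target:
                    return True
                if rec < target:
                    lo = mid + 1
                else:
                    hi = mid
        return False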
    
    1. 3

      Nice! The joys of Unix: you’re always bound to find a neat new utility hiding away that solves your problem! I was not aware of look! Much simpler than my solution, and mine isn’t very complicated either. Thanks for sharing!

    2. 5

      Huh, I did this at work by just shoving the 20GB file into sqlite (which turned it into a 51GB sqlite3 db file), but lookups are still plenty fast, so I never bothered to trim it down. It was also so simple that I never thought to share it anywhere.

      For others interested, here is my README file for what I did:

      Created like this:

      - download the torrent file from https://haveibeenpwned.com/Passwords (currently using v2, dated 22 Feb 2018)
      - download the torrent and unzip with 7zip
      - create a sqlite file with the schema: CREATE TABLE pwned (hash TEXT, count INTEGER, PRIMARY KEY (hash));
      - convert the colon-separated dump to CSV:

      sed -e "s/:/,/g" input >output
      

      then import the CSV into sqlite:

      sqlite3 pwned-passwords.db
      .mode csv
      .import output pwned

      create hash: python -c "import hashlib; print(hashlib.sha1(b'password').hexdigest().upper())"

      lookup via sqlite: sqlite3 pwned-passwords.db 'select count from pwned where hash="5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8";'

      lookup via search.py: search.py 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8

      $ time sqlite3 pwned-passwords.db 'select count from pwned where hash="5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8";'
      3303003
      sqlite3 pwned-passwords.db   0.00s user 0.00s system 33% cpu 0.009 total
      

      and this is across an NFS share. Plenty fast enough.
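
      The search.py mentioned above isn’t shown; a minimal version against this schema might look something like the following (my guess at its shape, not the original script):

      #!/usr/bin/env python3
      # Hypothetical reconstruction of search.py: look up one SHA-1 hash
      # in the pwned table and print its breach count (0 if absent).
      import sqlite3
      import sys

      conn = sqlite3.connect("pwned-passwords.db")
      row = conn.execute(
          "SELECT count FROM pwned WHERE hash = ?",
          (sys.argv[1].upper(),),
      ).fetchone()
      print(row[0] if row else 0)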

      1. 2

        As commented downthread (and added to the readme!), this gem is 32MB, which is small enough to pull off the network / store locally without changing how you deploy your system.

        I specifically wanted to avoid depending on a network service during the user signup flow; an NFS share doesn’t pass that requirement.

        1. 1

          I just happened to run the time command across an NFS share; there is zero requirement that it be deployed in production that way (and it isn’t run across NFS on my production systems).

          Obviously your 32MB is WAY smaller than my 51GB sqlite3 file. I’m not saying your way is bad for your use case; I don’t even know your use case. I’m just sharing what I did; I wasn’t, and still am not, trying to get you to convert to my method.

          If one wants to check the entire list, my method or jomane’s method of using look are probably your best bets.

          Anyway, your solution is interesting, and if it works for you, I’m not going to try to convert you.

          1. 1

            No problem; figured it was worth contrasting the two approaches, since a few people had mentioned ‘why not check the whole list’.

      2. 2

        It’s worth checking against the entire list. Checking against the top passwords provides a degree of brute-force prevention, but the real reason to check against a leaked password list is to prevent credential stuffing. That is, if a user re-uses their password on another site that gets leaked, the exact user/password combo is out there and attackers can try it on various sites to see if there is a match. This applies even if it is a unique password at the tail of that 550 million!

        1. 1

          There’s a tradeoff to be made between the false positive rate, the number of passwords checked, and the amount of disk/network bandwidth used.

          The full list is ~11GB compressed, and the smallest bloom filter that’ll get an acceptable false positive rate on the full list is ~1GB. This gem is 32MB.
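
          For reference, the standard sizing formula bears this out: an optimal filter needs m = -n * ln(p) / (ln 2)^2 bits for n items at false-positive rate p. Plugging in a 0.1% rate (an assumed target, just for illustration):

          import math

          def bloom_bits(n, p):
              # optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2
              return -n * math.log(p) / math.log(2) ** 2

          # full 550M-hash list at an assumed 0.1% false-positive rate:
          print(bloom_bits(550e6, 0.001) / 8 / 2**30)  # ~0.92 GiB
          # top 11M hashes at the same rate:
          print(bloom_bits(11e6, 0.001) / 8 / 2**20)   # ~19 MiB

          At 32MB for 11 million entries, the gem presumably targets an even lower false positive rate than that.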

        2. 1

          Interesting. I had the chance to sit down with the safepass.me guys and go through their approach, which is similarly about optimal coverage rather than mindlessly comparing against the db itself (safepass asserts coverage higher than HIBP’s due to the way their algorithm works).

          It’s good to see innovation in this space. With forced password rotation finally being accepted as a suboptimal idea, it’s even more important that the passwords people choose are strong enough to withstand password cracking.

          1. 1

            Nice work. Do you plan to keep the filter updated on list changes?

            1. 1

              I might publish an updated one in a year or two, but the most frequently used passwords tend to be simpler and don’t change often.

              1. 1

                Ok, cool. As an idea, depending on how you build the filter, you could automatically rebuild it and release an updated version whenever the list changes, or maybe just on a fixed interval.

                1. 1

                  It could be more scripted / automated, but I’d probably schedule it manually since the master list is ~20GB.

                  It’s very easy to cut a new release, though: just point the data prep script at the master file and it’ll regenerate the bloom filter.
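
                  In case it helps anyone reproduce this, the prep step might be shaped something like the following sketch (file name and cutoff are my assumptions, not the actual values):

                  # Stream the HIBP dump, ordered by prevalence, and keep the
                  # first N hashes, i.e. the most frequently used passwords.
                  def top_hashes(path="pwned-passwords-ordered-by-count.txt",
                                 n=11_000_000):
                      with open(path) as f:
                          for i, line in enumerate(f):
                              if i >= n:
                                  break
                              yield line.split(":", 1)[0]  # lines are HASH:COUNT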

            2. -4

              this is a bad idea

              1. 3

                Why? Which part?