1. 10

  2. 7

    I worked on a startup doing this years ago.

    There’s actually a lot of low-hanging fruit in web scraping, with the usual associated legal issues.

    One challenge with both scraping and user-inputted product data is normalization. “KLNX 6PK 3CT POUCH” is understandable to both a grocer and a POS operator as a 6-pack of 3-pouch containers of Kleenex, but extracting that information into a machine-readable format is somewhat difficult.
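
    A minimal sketch of that kind of extraction, assuming a tiny hand-made abbreviation table. `BRAND_ABBREVIATIONS`, `parse_pos_description`, and the token patterns are hypothetical and not taken from any real system; real retailer data is far less regular than this.

    ```python
    import re

    # Hypothetical abbreviation table; a real one would be large and curated.
    BRAND_ABBREVIATIONS = {"KLNX": "Kleenex"}

    # Quantity tokens such as "6PK" (pack size) or "3CT" (count per pack).
    PACK_RE = re.compile(r"^(\d+)PK$")
    COUNT_RE = re.compile(r"^(\d+)CT$")


    def parse_pos_description(text):
        """Split a terse POS description into brand, pack, count, and leftovers."""
        result = {"brand": None, "pack": None, "count": None, "other": []}
        for token in text.upper().split():
            pack = PACK_RE.match(token)
            count = COUNT_RE.match(token)
            if token in BRAND_ABBREVIATIONS and result["brand"] is None:
                result["brand"] = BRAND_ABBREVIATIONS[token]
            elif pack:
                result["pack"] = int(pack.group(1))
            elif count:
                result["count"] = int(count.group(1))
            else:
                # Anything unrecognized is kept for a human (or an NER model) to review.
                result["other"].append(token)
        return result


    print(parse_pos_description("KLNX 6PK 3CT POUCH"))
    ```

    The leftover tokens ("POUCH" here) are where a rule-based approach like this runs out and where manual review or a trained model would have to take over.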

    Once you start trying to use the manufacturer prefix from the EAN/UPC to assist with manufacturer normalization, there’s another problem: often GS1 prefixes are either made up or transferred-by-acquisition so many times that they have nothing to do with the products bearing their prefix.

    Open-source NLP pipeline projects (like Stanford CoreNLP) are valuable in this pursuit; many research NERs are already trained to recognize company names and quantities as entities so there’s a lot to work off of.

    I posit that an open “crowdsourced” product database should be heavily user-dependent for normalization and should adhere to a strict schema to start; “scan a GTIN and upload a picture and name” will very quickly become almost more frustrating than having no data.

    1. 1

      What schema would you use?

      1. 5

        We used:

        Manufacturer has many manufacturers recursively (really "subsidiaries" I suppose)
        Manufacturer has many brands
        Brand has many products
        Product has many SKUs
        SKUs have many SKUs (recursive, to represent e.g. "case of bottles" or "pallet of cases of bottles")
        SKU has an association with a numeric (count) quantity
        SKU has an association with an absolute (measure) quantity
        SKUs have many barcodes

        So for example (bottom up, for a real product)

        79285032899 (UPC) -> 0.5oz, 1ct, generated ID (SKU) -> Nude Lip Gloss (Product) -> Burt’s Bees (Brand) -> Clorox
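
        The relationships listed above can be sketched with plain dataclasses. This is only an illustration of the shape of the schema, not the original implementation: the class and field names are assumptions, the manufacturer recursion is modeled as a `parent` pointer, and the case of 12 is a made-up packaging level.

        ```python
        from dataclasses import dataclass, field
        from typing import List, Optional


        @dataclass
        class Manufacturer:
            name: str
            parent: Optional["Manufacturer"] = None  # recursive: a subsidiary points at its owner


        @dataclass
        class Brand:
            name: str
            manufacturer: Manufacturer


        @dataclass
        class Product:
            name: str
            brand: Brand


        @dataclass
        class Sku:
            product: Product
            count: int                       # numeric (count) quantity
            measure: Optional[str] = None    # absolute (measure) quantity
            contains: List["Sku"] = field(default_factory=list)  # recursive packaging
            barcodes: List[str] = field(default_factory=list)


        def describe(sku):
            """Walk bottom-up from a SKU to its top-level manufacturer."""
            parts = [", ".join(sku.barcodes), f"{sku.measure}, {sku.count}ct",
                     sku.product.name, sku.product.brand.name]
            owner = sku.product.brand.manufacturer
            while owner is not None:
                parts.append(owner.name)
                owner = owner.parent
            return " -> ".join(parts)


        clorox = Manufacturer("Clorox")
        burts_bees = Brand("Burt's Bees", clorox)
        lip_gloss = Product("Nude Lip Gloss", burts_bees)
        unit = Sku(lip_gloss, count=1, measure="0.5oz", barcodes=["79285032899"])
        # Hypothetical case of 12 units; the case size is not from the thread.
        case = Sku(lip_gloss, count=12, contains=[unit])

        print(describe(unit))
        ```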

        This product is listed as “BURTS B LIP GLOSS NUDE 0.5OZ” in a SKU table from a retailer I found online, so you can see how the simple “UPC, name” tuple often doesn’t quite cut it as useful information.

        For user input I suppose an aggressive auto-complete based system with a lot of required fields could be useful, although there’s a balance to strike between ease of data entry (and therefore volume) and data quality.

        We provided a facility for users to upload pictures of products we didn’t know about and then we would manually input the information (ourselves, interns, etc.) to ensure the best possible normalization and data quality.

        Some databases I’ve seen attach even more metadata to SKU, for example “Form” (in the Burt’s Bees example, that’d be “stick”).

        One difficulty with this schema was handling GS1 prefixes to try to automatically infer a manufacturer or brand when we didn’t have the specific product. We first tried associating them with brand - that got messy fast. Then we tried associating them with manufacturer, because that’s how the registration process is supposed to work, but that turned out to be lossy. Prefixes ended up being their own enormous independent hierarchy of prefix-brands and prefix-organizations that would eventually resolve to either a brand -or- manufacturer.
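
        One way to picture that prefix hierarchy is a longest-prefix lookup where each known prefix resolves to either a brand or a manufacturer. The sketch below is an assumption-heavy illustration: `PREFIX_OWNERS` and `resolve_prefix` are hypothetical names, the "Example" entries are invented, the 6 to 10 digit range for prefix lengths is assumed, and treating the first six digits of the UPC above as a Burt’s Bees prefix is a guess rather than registry data.

        ```python
        # Hypothetical prefix table: each prefix resolves to a brand OR a manufacturer.
        PREFIX_OWNERS = {
            "792850": ("brand", "Burt's Bees"),                    # assumed from the UPC above
            "123456": ("manufacturer", "Example Manufacturing"),   # made-up entry
            "1234567": ("brand", "Example Brand"),                 # made-up, more specific entry
        }


        def resolve_prefix(gtin, table=PREFIX_OWNERS, min_len=6, max_len=10):
            """Return (kind, name) for the longest known prefix of gtin, or None."""
            for length in range(min(max_len, len(gtin)), min_len - 1, -1):
                owner = table.get(gtin[:length])
                if owner is not None:
                    return owner
            return None


        print(resolve_prefix("79285032899"))
        print(resolve_prefix("12345678901"))
        print(resolve_prefix("99999999999"))
        ```

        Trying the longest prefix first lets a more specific entry override a broader one, which is one way to cope with prefixes that have been split up or transferred between organizations.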

        1. 2

          What about just going directly to manufacturers and asking them to donate the data? I know they have to submit the data to GS1.

    2. 2

      I need an open Driver’s License database w/ images of licenses from all 50 states (of varying ages, too).

      1. 1

        That seems like it could very easily be abused…

        1. 3

          There is that, but it turns out to be very handy if you write apps that require scanning driver’s licenses to get data from them for some real-world (not evil) use cases.

          1. 1

            What data? I think the PDF417 barcode on the back is standardized by AAMVA - without any knowledge of what your use case is, could simply scanning that be a possibility?
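
            As a rough illustration of what scanning the barcode would yield: once a PDF417 reader has decoded the barcode to text, the AAMVA payload is a series of lines that each start with a three-letter element ID. The sketch below assumes a heavily simplified payload (no header or subfile designators) with made-up values; `FIELD_NAMES`, `parse_aamva_elements`, and `SAMPLE` are hypothetical names, and only a handful of element IDs are mapped.

            ```python
            # A few AAMVA element IDs mapped to readable names; the names are this sketch's own.
            FIELD_NAMES = {
                "DAQ": "license_number",
                "DCS": "family_name",
                "DAC": "first_name",
                "DBB": "date_of_birth",
                "DBA": "expiration_date",
                "DAJ": "state",
            }


            def parse_aamva_elements(payload):
                """Map known three-letter element IDs in decoded barcode text to values."""
                fields = {}
                for line in payload.splitlines():
                    line = line.strip()
                    code, value = line[:3], line[3:]
                    if code in FIELD_NAMES and value:
                        fields[FIELD_NAMES[code]] = value
                return fields


            # Made-up, simplified payload; a real one also carries a header and more elements.
            SAMPLE = "DAQ123456789\nDCSDOE\nDACJANE\nDBB01311990\nDAJGA\n"

            print(parse_aamva_elements(SAMPLE))
            ```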

            1. 2

              This is a new project for me, but I guess we want to scan because (1) not all states support this, and (2) a few states (the smart ones?) encrypt the data in the barcode.


              1. 1

                Interesting; I think the linked article is out of date, as I know that at least Georgia used to use an “encrypted” barcode format provided by “L-1 Solutions” but moved to the AAMVA-recommended PDF417 approach as of 2012.

                But if you need legacy support, I see where a historic driver’s license archive could be useful.

                Often bar bouncers will have a “big book” which they use when attempting to authenticate licenses; have you looked into obtaining one of those for your product? It sounds like “open” is useful but not a necessity.

                1. 1

                  Great idea.

                  Like this?


      2. 1

        Good news: I’ve talked to the Open Product Data team and the license for their database will be ODbL (the same as OpenStreetMap).