1. 7

Hi friends,

I’m looking for a site that recommends books based on preferences. All the sites I found depend on matching up user profiles. That is to say, you like the books that these other people like and so here are some more books for you. The flaw I see here is that you don’t usually get to see unknown authors.

I’m looking for a site/program that uses the actual text of books to match up your interests directly with books, rather than other people’s opinions of books.

If there is no such site, are you, or anyone you know, building such an engine or site?



  2. 3

    I think this would be fascinating, and I’d be especially curious how much of an improvement it actually is for discovering new or relatively unknown authors.

    This isn’t built on book texts as far as I know, but to the broader point of finding new authors, I have played around with https://www.literature-map.com/ in the past with decent results, in case you haven’t come across it yet.

    To do what you are suggesting, though, I think you could go via the public domain as @dvaun suggested (but then you have a lack-of-content problem, making it less useful), or get indexing agreements with publishers (which can be difficult).

    We kind of do this for scientific articles at my current job, and beyond the challenges of getting more indexing agreements to make it useful, there’s a non-trivial infrastructure cost for storing & processing (and potentially re-processing) all of that content.

    Would be neat if there were someone with a boatload of money willing to fund this! :)

    1. 3

      The https://www.literature-map.com/ resource and other Gnod tools have proven useful to me in the past, though the mapping of authors (or music, movies, and other Gnod projects) is quite broad in scope.

      What would be neat is an engine that works well for topic exploration within various subjects, including fiction categories (e.g. thrillers, fantasy, historical fiction) as well as fields of research.

      Having a resource like that on hand, paired with a community of folk interested in putting together curricula for entering and diving into new domains, would be awesome. I’d imagine a resource that could analyze articles and other sources (e.g. Wikipedia entries) in addition to books, and could also put together these “curriculums” or “paths”, would be great for any autodidacts exploring a new field.

      I am out of my field when discussing this, as I don’t have any ML experience. If there were a project that worked on this, though, I’d be happy to contribute in other ways.

      Would something like this need to be monetized to be sustainable?

      1. 2

        There is the cost of hosting. However, text is cheap. I can see a tool suite as follows:

        1. There is a command-line tool that, given plain text, will spit out a feature vector.
        2. The feature vectors are submitted to a central repository and collected into a file of feature vectors. This could be on GitHub.
        3. You can check out/download the feature vector file, run another command-line tool (or a simple GUI or SPA) on it, and mark your favorite books; the tool will then propose an ordered list of books you might like based on the content.
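
        A rough sketch of those three steps, just to make the idea concrete. Everything here is my assumption rather than an existing tool: the names `feature_vector` and `recommend`, the hashed bag-of-words features, the vector size `DIM`, and ranking by cosine similarity against the average of your favorites. A real engine would probably want better features (TF-IDF, topic models, embeddings).

```python
import math
import re
import zlib
from typing import Dict, List

DIM = 1024  # hypothetical vector size; larger means fewer hash collisions


def feature_vector(text: str, dim: int = DIM) -> List[float]:
    """Step 1: hash each lowercase word into one of `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z']+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(vectors: Dict[str, List[float]], favorites: List[str]) -> List[str]:
    """Step 3: rank the non-favorite books by similarity to the mean favorite vector."""
    dim = len(next(iter(vectors.values())))
    profile = [sum(vectors[t][i] for t in favorites) / len(favorites) for i in range(dim)]
    candidates = [t for t in vectors if t not in favorites]
    return sorted(candidates, key=lambda t: cosine(vectors[t], profile), reverse=True)


# Step 2 stand-in: the shared "file of feature vectors" is just a dict here.
# The titles and texts are made-up toy data.
vectors = {
    "moby": feature_vector("whale ship sea captain whale harpoon"),
    "voyage": feature_vector("ship sea storm captain island"),
    "manual": feature_vector("compiler parser bytecode function variable"),
}
print(recommend(vectors, favorites=["moby"]))
```

        In practice step 2 would be a JSON or CSV file in the shared repository instead of an in-memory dict, and steps 1 and 3 would be wrapped in small command-line entry points.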

        I really like the idea of trying this out on copyright-lapsed books hosted on Project Gutenberg, but surely someone must have done this already.

    2. 2

      I’m looking for a site/program that uses the actual text of books to match up your interests directly with books, rather than other people’s opinions of books.

      How would one go about analyzing text from recently published books and books that haven’t entered the public domain? I can think of a few sources for retrieving digital texts…but those would be illicit and not permissible for use like this. I would think that applies whether the project is for-profit or not.

      That being said, it would be neat to use books in the public domain from sites like Project Gutenberg.

      1. 1

        I would imagine a tie-up between publishing houses and the site. With the confidentiality of the text maintained via algorithmic or legal means, the aggregator would analyze the full text of each book.

        As a POC the site would use publicly available text.

        Another way is to have a program that takes the copyrighted text you have access to and generates a feature vector, which can then be uploaded to the site. I don’t know if this legally constitutes a derived work, but from a common-sense view I can’t reconstruct the original text from it, so it should be OK and not a copyright violation because it’s not a reproduction.
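
        To illustrate the “can’t reconstruct” intuition, here is a tiny sketch assuming the feature vector is a plain bag of words (my assumption, and `bag_of_words` is a made-up helper). Word order is thrown away, so two different sentences collapse to the same features. That says nothing about the legal question, and a bag of words still reveals the vocabulary of the text.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase words; all ordering information is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


a = bag_of_words("The dog bit the man.")
b = bag_of_words("The man bit the dog.")
print(a == b)  # prints True: the two sentences are indistinguishable
```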