Last week, I pulled the trigger on opening up new registrations on
PDFDATA.io (our PDF data
extraction-as-a-service). Things have settled in for our first customers without
any crises, so I’m feeling pretty good about that, as well as our choice to roll
out on Heroku as a first step.
(We’ve had a couple of requests for an option to deploy “hands-off” instances of
the service via AWS Marketplace, so it seems that people are actually using it
for more than e.g. Sharepoint and Wordpress installs. Thankfully, walking our
Heroku-expecting app over to “bare” AWS infra will be ~cake.)
The first client library we planned was for Node.js, and I mostly finished
the spike of pdfdata-node over the
weekend, and even got it published to npm. This was a bigger deal for me
than you might otherwise expect: I’ve consciously avoided working with node
directly (though I’ve abused it as a runtime for ClojureScript and such), so
this was my first foray. If anyone with suitable experience/expertise wants to
critique pdfdata-node, that would be much appreciated.
I just deployed a new rev of our API docs that includes prose and examples for
using the Node library. I continue to enjoy using
slate for those docs. The interleaving of
examples and prose for different “platforms” (curl, Node, Ruby + Python soon,
etc) has worked out as well as you could expect for a strictly text-based
documentation system. I wrote about slate some in
a previous WaYWoTW.
This week, I’m starting on the bones of an in-browser toolchain that
PDFDATA.io users will use to try out and configure the various data extraction
operations we offer. I’m actually vacillating some on the stack to be used
there: most of my recent front end work has been in ClojureScript, but insofar
as I am not the only one on the project, and I’m not planning on even being the
technical lead ~forever, a more common toolchain might make more sense right now.
Beyond that, I’ll probably be working on polishing the sales copy turds, and
hopefully getting some beach time.
PDFDATA sounds exactly like something I need. But I need it for personal use from a shitty home-made electron app, for 5-20 documents a month. I’d love to see an option involving say a one-off sign-up fee (to weed out time wasters) and simply paying 20c per document like you would on the “Freelancer” plan once you’re over the monthly limit. I’ve been using a commercial OCR app and an Automator script, which frankly sucks for my use case.
I’d be happy to hook you up with an API key that would allow that level of usage, so that I could understand your use case; email us, or ping me on twitter or irc. As you might imagine, any plan @ ~ $4.00 / mo is just not feasible for a variety of reasons.
Most of my effort this week is going to be spent tying up pre-“launch” loose
ends on PDFDATA.io, our PDF data extraction
as-a-service. I wrote about it previously.
Just yesterday, I flipped the bit to make the initial pricing public, based on
feedback from our initial customers. It’s only a first cut, and pricing is
always fraught, but what’s there now will do for at least a little while.
I’ve been idly investigating alternative distribution channels like offering
PDFDATA.io via AWS Marketplace,
Heroku add-ons, and similar. It’s hard to
tell just how viable any of them are for very specialized services like this;
nearly all of the obviously-successful offerings in those venues are
general-purpose stuff like databases, operations tooling, CMSs, and so on. I’d love to
hear from anyone that’s done well in those sorts of channels with something that
is domain-specific to some extent.
On the technical side, I’ll be smoothing over the rough
edges on the shape of the actual API, i.e. what requests and responses look
like, making sure things are set up to make the rest of the initial roadmap easy
to integrate, etc. It’s Yet Another JSON API, so I blew half a day properly
reading up on JSON-LD before coming to my senses and
rage-closing all of the corresponding browser tabs. As much as I’d like to have
a “proper” schema for something like this, nothing I’ve seen around JSON has
even close to a proportional payoff (either for us or users) given its
immaturity and structural problems. If I were shipping around XML, then
publishing a schema would be an obvious win, but XML just isn’t culturally
acceptable at this point in most of the HTTP API market AFAICT.
What I am doing is applying
prismatic schema judiciously at API and
certain backend boundaries to make sure that data going back and forth is
“valid”. That’s not a big deal, but I do want to figure out how best to flesh
those schemas out into something suitable for (automated) inclusion in the
slate docs. There’s a bunch of additional
tooling that the community has built up around schema that I haven’t explored
yet, and I’m hoping to find a gem I can build on there.
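To make the boundary-checking idea concrete: I’m doing this with prismatic schema in Clojure, but the same discipline can be sketched in plain Python. (This is a hypothetical, stdlib-only analogue, not our actual code, and the `ExtractResponse` shape below is made up for illustration.)

```python
# A toy recursive schema checker in the spirit of prismatic schema:
# a schema is a type (leaf), a dict of key -> schema, or a one-element
# list [schema] meaning "list of". Illustrative only.

def check(schema, value, path="$"):
    """Raise if `value` doesn't match `schema`; otherwise return `value`."""
    if isinstance(schema, type):
        if not isinstance(value, schema):
            raise TypeError(f"{path}: expected {schema.__name__}, got {value!r}")
    elif isinstance(schema, dict):
        for key, subschema in schema.items():
            if key not in value:
                raise KeyError(f"{path}.{key}: missing")
            check(subschema, value[key], f"{path}.{key}")
    elif isinstance(schema, list):
        for i, item in enumerate(value):
            check(schema[0], item, f"{path}[{i}]")
    return value

# A made-up response shape, purely illustrative:
ExtractResponse = {"id": str, "pages": int, "text": [str]}

check(ExtractResponse, {"id": "doc_1", "pages": 2, "text": ["a", "b"]})
```

The payoff is exactly what’s described above: requests and responses get validated at the boundary, and the schema values themselves are plain data that tooling can later render into docs.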
This isn’t something I’ve ever participated in, but here’s to new things.
I’ve been working on PDFDATA.io for around a month, but very intensively last week and this one. It’s probably my third swing at some kind of PDF data extraction as-a-service over the last 10 years or so. Maybe I’ll get it right this time!
More seriously, since PDFxStream (a commercial Java/.NET library in the same domain) has been and remains my primary asset, doing this well makes too much sense. Past failures include poor target market selection (and faulty pricing models to go with ‘em), fuzzy / overly broad product ambitions, bad technical execution, and, if I’m being really honest, a fundamental lack of motivation on my part. I think I’m set up to not make any of those errors again.
As part of this, I’ve been doing a really broad survey of modern API design, a lot of it rooted in the tradition that it’s safe to say Stripe started. A couple of things in that vein worth sharing IMO:
It’s really hard to think of API documentation from any other firm that is as pleasant to work with. Some folks originally with Tripit wrote slate, a middleman-based static site generator styled to mimic many of the key design elements that Stripe’s API reference popularized. (Slate’s original author, Robert Lord, wrote a post some years back summarizing the big ones.) I had some “fun” getting started with it earlier today (mostly due to my aggressive ineptitude with the ruby toolchain), but all’s well now.
It’s really very pleasant to work with and looks like it will end up saving me many days of rolling my own pretty / usable doc site. The only TODOs it’s created for me so far include:
Enough on that for now. I’d love any feedback anyone has on PDFDATA.io; I know there’s not much public there yet, but thoughts re: the concept, use cases you might have, etc. would be appreciated if you have them.
Oh, and I just registered grunt.fm, for yet another crazy idea. oy!
Out of interest how are you planning on dealing with the PDF’s where they have converted the data into an image to prevent the data being extracted?
In the past I’ve had to export the pdf as an image file, use imagemagick to increase the contrast and then feed the resulting image into gocr which allowed me extract most of the words which was quicker than retyping the document :~)
There’s a lot of low-hanging fruit to pick before I get to OCR topics, but it’s certainly something we’ve done a lot with in conjunction with PDFxStream. Image-based PDF documents (that don’t include a hidden text “layer”) are rarely made that way in order to thwart text extraction; the most common case is that the document is the result of a scanning process that didn’t include an OCR step. Even then, image-based PDF documents are vastly outnumbered by others.
Anyways, getting quality data out of such files isn’t a huge problem; PDFxStream provides image extraction as well, so no separate conversion step is required, and the end results are pretty good as long as the source bitmaps are of a high-enough resolution, and one is using a quality OCR toolchain.
Somewhat more interesting are cases where some useful data just happens to have been rendered as an image (figures in a chart are a common example, for reasons passing understanding), but the rest of the document is “regular” text. I’m in a good position to eliminate that as a complication: there’s just one document model, and text from an embedded PDF image via OCR will flow together with text from the PDF itself.
Thanks for the information.
@cemerick and I have been discussing CRDTs and coordination (particularly, the tuplespace style of coordination, since the article mentioned tuplespaces) and it seems like that conversation can merge (heh) with the discussion here.
Can CRDTs be used for coordination?
Two examples to think about:
A distributed service for brokering 2 player chess games. Each player should be assigned to exactly one other player, subject to the constraint that the assignments are mutual. In other words, the service creates two-element sets that are all disjoint from each other. Can CRDT sets handle this kind of coordination? (A tuplespace approach to that example: , but this is not an AP system.)
A distributed service for unique assignment of tasks to workers. Two workers should not work on the same task. If a worker dies, the task should be assigned to another worker. Simply put, each task is performed exactly once. (In tuplespace terms: .)
Just to clarify the @QuiltProject tweet in question, what I was suggesting is not a coordination or consensus mechanism baked into the semantics of CRDTs (that is definitionally excluded). Of course, one can use an out-of-band consensus mechanism to inform reconciliations that are not possible within their constraints; this is the explicit advice in the literature (and has been used in practice), but that is unattractive to me for a variety of reasons.
There’s no reason why particular computational services might not require more constrained semantics, while still using the shared CRDT substrate as a reliable communications medium. For example, it would not require much novelty to operate a consensus protocol on top of a CRDT, with the leader acknowledging certain proposed writes/operations (“blessing” them, you might say) by signing them (or otherwise indicating assent, if e.g. the CRDT is being operated in an implicitly trusted environment).
Are CRDT counters idempotent? (Or, do CRDTs always have to be idempotent?)
Because I don’t understand how they can be. If they are, and anyone can point me towards some writing or any implementation, please do.
It depends. CRDT counters as implemented in Riak are not idempotent, but for good reason.
If every client were an actor in a PN-Counter, and you wanted every player of your mobile phone game to increment the count, you would have a very wide PN-Counter (an entry per actor). If these clients all read a CRDT from a database with a RYOW consistency level (or store their own writes locally, durably), then you can have idempotent counters, by dint of the fact that the merge function (the LUB) of a CRDT is idempotent.
Carlos Baquero et al did some work around this limitation in this paper: http://arxiv.org/abs/1307.3207
In Riak, we bound the number of actors to the set of replica vnodes for a data item. This means that many, many clients send operations (inc/dec) to Riak without changing the size of the counter. The downside is “counter drift.”
If Riak receives an operation and executes it on a vnode, but fails to replicate to W-1 vnodes, or if Riak succeeds but the client never receives an “OK!”, then the operation is a partial success. This is indistinguishable from an actual failure from the client’s perspective. If the client resends the operation and there was a failure, OK! But if there was a partial success, then we double counted. This is positive counter drift. If the user chooses not to resend the op and there was a real failure, this is negative drift.
How to make this system model idempotent is the subject of current work. Trivially, you can make an idempotent counter CRDT like this:
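For instance, here is a minimal sketch in Python (illustrative only, not Riak code): the state is a grow-only set of uniquely identified operations, and the value is the sum of their amounts.

```python
# Sketch: an idempotent counter as a grow-only set of uniquely
# identified operations. Replaying an op with the same id is a no-op,
# because set union is idempotent.

class OpSetCounter:
    def __init__(self):
        self.ops = set()          # {(op_id, amount), ...}

    def apply(self, op_id, amount):
        self.ops.add((op_id, amount))

    def merge(self, other):
        merged = OpSetCounter()
        merged.ops = self.ops | other.ops
        return merged

    def value(self):
        return sum(amount for _, amount in self.ops)

c = OpSetCounter()
c.apply("client-1:op-1", 5)
c.apply("client-1:op-1", 5)   # retried op: no double count
c.apply("client-1:op-2", -2)
print(c.value())              # → 3
```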
If the client never re-uses the id, it can replay the op safely. Downside? That’s going to be a big set.
I’ve seen implementations where an ID lasts for some period of time, and is then dropped. And so you keep a set of ids + some roll up integer per actor, and you’re idempotent for failures that last for some period of time, after which retrying will lead to drift.
In general, the update/mutate function on state-based CRDTs is not idempotent, but the merge function is. If you send CRDTs between nodes in your system (including the clients) and you can always RYOW (either via client-local storage, or from a database), then you have idempotent operations.
Usually they are not. At least the ones implemented by Basho for Riak are not.
Yes, the operations performed over CRDTs are always idempotent. Riak’s counters (implemented here AFAIK) are state-based PN-counters, described in Section 3.1.3 of the Shapiro et al. CRDT techreport.
Each actor (or client, in the case of Riak) maintains its own count of increment and decrement operations; upon read, these are merged to yield the current value of the counter. The idempotency isn’t that an actor can issue an increment multiple times, and only the first will be acknowledged; it’s that one actor’s impact on the counter is not applied multiple times despite repeated replications / merging of its state.
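To illustrate that point, here’s a minimal state-based PN-counter sketch in Python (not Riak’s implementation): merge takes the pointwise max of per-actor totals, so re-merging the same actor state any number of times doesn’t change the value.

```python
# Sketch of a state-based PN-counter: each actor keeps its own
# increment and decrement totals; merge is the pointwise max, which is
# idempotent, commutative, and associative.

class PNCounter:
    def __init__(self):
        self.incs = {}   # actor -> total increments
        self.decs = {}   # actor -> total decrements

    def increment(self, actor, n=1):
        self.incs[actor] = self.incs.get(actor, 0) + n

    def decrement(self, actor, n=1):
        self.decs[actor] = self.decs.get(actor, 0) + n

    def merge(self, other):
        merged = PNCounter()
        merged.incs = {k: max(self.incs.get(k, 0), other.incs.get(k, 0))
                       for k in self.incs.keys() | other.incs.keys()}
        merged.decs = {k: max(self.decs.get(k, 0), other.decs.get(k, 0))
                       for k in self.decs.keys() | other.decs.keys()}
        return merged

    def value(self):
        return sum(self.incs.values()) - sum(self.decs.values())

a, b = PNCounter(), PNCounter()
a.increment("A", 3)
b.decrement("B", 1)
once = a.merge(b)
twice = a.merge(b).merge(b)   # merging b's state again changes nothing
print(once.value(), twice.value())   # → 2 2
```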
If you do want actors to only be able to increment once, then you can build a CRDT counter comprised of two sets, one to track increments, the other to track decrements; getting the counter’s value requires taking the difference of the sets' cardinalities. An increment in “userland” would add an element to the increment set with a tag uniquely identifying the actor, likewise for decrements.
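A sketch of that two-set construction in Python (illustrative only, not an actual Basho data type):

```python
# Sketch of a counter built from two sets: each actor can affect the
# count at most once per set, because adding the same tag twice is a
# no-op. Value = |increments| - |decrements|.

class TwoSetCounter:
    def __init__(self):
        self.incs = set()   # tags of actors that incremented
        self.decs = set()   # tags of actors that decremented

    def increment(self, actor):
        self.incs.add(actor)

    def decrement(self, actor):
        self.decs.add(actor)

    def merge(self, other):
        merged = TwoSetCounter()
        merged.incs = self.incs | other.incs
        merged.decs = self.decs | other.decs
        return merged

    def value(self):
        return len(self.incs) - len(self.decs)

c = TwoSetCounter()
c.increment("alice")
c.increment("alice")   # second increment from the same actor: no effect
c.increment("bob")
c.decrement("carol")
print(c.value())       # → 1
```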
(I don’t work for Basho, so perhaps someone who does can correct me if I’m wrong somehow.)
I like this article, but I think that it assumes that all APIs are RPC. Many are, but ‘real REST’ is different enough that it tackles a lot of the things the author has beef with.
It would be nice for the author to belabor not the API itself, but the statefulness of historical RPC mechanisms. We all should be talking about how systems are composed using protocols, of which ‘real REST’ is an instance. APIs are fine when failure has been abstracted away, either by being rare or by being handled for you.
(Author here, hi.)
I picked on HTTP APIs a.k.a. “REST” the most because that’s what people use. I don’t believe that ‘Real REST’ addresses any of the substantive issues I described that are rooted in the RPC heritage and the programming models within which we might implement REST today. It is quite different from RPC, but still shares many of the same failings.
‘Real REST’ does provide a set of semantics that are more useful than e.g. HTTP APIs, but they either (a) continue to deny the fundamental nature of the network, or (b) leave the details up to every individual implementing a REST service or client. It also opens up a whole new can of worms with “hypertext” (viz. HATEOAS), which is both under- and over-specified as a data representation and as a mechanism for coordinating activity between two actors. Speaking of “two actors”, REST is predicated on two-party client/server interactions, and all that that entails. I could go on, but then it’d be another blog post. :-)