1. 5

    OK I LOVE LOVE LOVE this idea in concept, but for me the big question is security. You’re inviting people to expose their machines to the internet from behind NAT and other things that generally protect end users.

    What steps have the authors taken to ensure that people aren’t unintentionally hanging a big ol’ “HACK ME I’M CLUELESS!” sign out for everyone to see?

    1. 9

      I’m the author, hello!

      I think it’s important to differentiate between the incidental risks that are “just part of the deal” and the specific risks associated with “whatever the self-hoster chooses to do”.

      It sounds like you are specifically concerned with “whatever the self-hoster chooses to do”. I haven’t spent much time attempting to address this yet. Even just minimizing the incidental risks, the stuff that I have control over as a Greenhouse developer, has been challenging.

      I have thought about it a lot, though. It’s not in the application yet, but I thought about implementing sophisticated “Are you sure you want to do this?” dialogs. I could come up with multiple different heuristic factors (well-known folder locations, file permissions, total number of files, histogram of file types, etc.) to attempt to warn users when they are trying to serve something publicly that maybe they shouldn’t.

      The same goes for ports. If you serve port 22 publicly on Linux, your SSH server configuration had better be up to snuff! I could identify certain ports or names of programs that open ports which should probably never be served publicly, or should at least incur a stern warning to the self-hoster.
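
      A minimal sketch of what such warning heuristics could look like, purely for illustration; the folder list, risky-port list, and thresholds below are hypothetical, not features Greenhouse has today:

      ```python
      import os

      # Hypothetical heuristics; none of this exists in the application yet.
      SENSITIVE_FOLDERS = {"/", "/etc", "/home", os.path.expanduser("~")}
      RISKY_PORTS = {22: "SSH", 445: "SMB", 3389: "RDP", 5900: "VNC"}

      def folder_warnings(path, max_files=10_000):
          """Collect reasons to show an "Are you sure you want to do this?" dialog."""
          warnings = []
          if os.path.abspath(path) in SENSITIVE_FOLDERS:
              warnings.append(f"{path} is a well-known system or home folder")
          file_count = sum(len(files) for _, _, files in os.walk(path))
          if file_count > max_files:
              warnings.append(f"{path} contains {file_count} files, which is a lot to publish")
          return warnings

      def port_warnings(port):
          """Warn when a port that should probably stay private is about to go public."""
          if port in RISKY_PORTS:
              return [f"port {port} is usually {RISKY_PORTS[port]}; exposing it publicly is risky"]
          return []
      ```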

      Ultimately, I think the responsibility for security has to fall at least in part on the self-hoster. That’s just part of the deal with self-hosting. With great power comes great responsibility. Greenhouse is supposed to empower self-hosters. If they don’t have enough power to become dangerous, I would have failed!

      I’d like to point out that it’s possible to publish stuff you didn’t mean to publish with Google Drive or other SaaS products as well. Isn’t that how a whole lot of “hacks” happen? Someone finds an open S3 bucket somewhere with global anonymous read permissions? This isn’t a problem unique to Greenhouse.

      The self-hoster’s responsibility for not just the configuration and data, but also the process(es), is unique to Greenhouse though. So insecure processes like ancient Apache versions, vulnerable WordPress plugins, etc., are a concern.

      But I’d like to point out, if someone is setting up WordPress on their own server, like a Raspberry Pi for example, I don’t think they are risking much more than they would be if they were setting up WordPress on a cloud instance.

      In terms of the incidental risks that all Greenhouse users have to take, I’ve done almost everything I can to minimize them, and I am committed to continuously improving the security of the system in general.

      • It’s not using a VPN, it’s using a reverse tunnel. This means that the absolute minimum (only what the user requested) is exposed to the servers that Greenhouse operates and to the internet at large.

      • The software running on the self-hoster’s computer is in charge of what ports can be connected to / what folders are public. It doesn’t take orders from anywhere. The “client” (the self-hosting software) is as authoritative as possible. All of this helps mitigate attacks that originate on the servers that Greenhouse operates.

      • The self-hosting software runs under its own operating system user account. This helps minimize the blast radius in case it were to become compromised somehow. It also allows the self-hosting software to have exclusive read access on its secret files, like its TLS encryption keys.

      • I am bundling in an up to date version of Caddy Server. As far as I know, Caddy Server does not contain any serious vulnerabilities.

      • All of the communications between the processes involved in the self-hosting software are using TLS. All but one of them use mTLS. With some additional research into how all of the different operating systems handle secret values that specific programs should have access to, I can probably get it to 100% mTLS. (A rough illustrative sketch follows below.)
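
      For illustration only, here is roughly what requiring mTLS between two local processes looks like with Python’s standard ssl module. The certificate file names and port are made up for the sketch; this is not Greenhouse’s actual code.

      ```python
      import socket
      import ssl

      # Hypothetical file names; in practice each process would have exclusive
      # read access to its own private key, as described above.
      CA_CERT = "greenhouse-internal-ca.pem"
      SERVER_CERT, SERVER_KEY = "daemon.pem", "daemon-key.pem"

      # Server side: require a client certificate signed by the internal CA (mTLS).
      ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
      ctx.load_cert_chain(certfile=SERVER_CERT, keyfile=SERVER_KEY)
      ctx.load_verify_locations(cafile=CA_CERT)
      ctx.verify_mode = ssl.CERT_REQUIRED  # this is what makes the TLS *mutual*

      with socket.create_server(("127.0.0.1", 9443)) as server:
          with ctx.wrap_socket(server, server_side=True) as tls_server:
              # The handshake fails unless the client presents a certificate
              # signed by the internal CA.
              conn, addr = tls_server.accept()
              print("authenticated peer:", conn.getpeercert()["subject"])
      ```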

      1. 3

        Wow thank you for your thoughtful and considered response!

        I totally agree. I do think it’s ultimately the user’s responsibility. Part of why I asked is because I’ve struggled with this myself.

        Something I’ve been trying to write forever is this idea I have called OpenParlour. It’s meant to be software that will allow anyone to run a small forum server for their small group - e.g. friends, family, hobby groups or whatever. I’d always thought I’d use something like ngrok to enable them to open a port to the internet, but I’ve struggled with the idea of responsibility around this.

        Sure, I’ll do everything in my power to ensure that the server is secure, but ultimately that only goes so far. An attacker might leverage flaws in the language or tools I use as one example. And I keep thinking “So I’m inviting people to create a security risk for themselves that they might not be ABLE to understand or come to grips with!”.

        However I think what you’re proposing is something a little different, and I think the ethics might be a bit clearer for you because the nature of the service you’re offering (host anything) implies that the user takes responsibility for the risks.

        Good luck and I look forward to tracking your progress!

        1. 4

          An attacker might leverage flaws in the language or tools I use … So I’m inviting people to create a security risk for themselves that they might not be ABLE to understand or come to grips with!

          I used to work in full stack development / DevOps for media and IoT companies… The way I see it, no matter where you look, it’s imperfect software all the way down, no matter whether you are using a service that’s offered by “the professionals” or something which is open, libre, and powered by “passionate hobbyists”.

          In my experience in the industry, oftentimes the professionals and the executives steering them are not incentivized to give a shit about security anyways. So I would like to believe that in many cases the “little guy” products are actually more secure. Everyone is taking a huge risk by playing the centralized SaaS game anyways. I think that by comparison, it’s perfectly acceptable for us to ask folks to trust our software! Especially when we do everything in the open and discuss / accept contributions in the open as well!

      2.  

        from behind NAT

        NAT does not provide any protection or security for end users.

        other things that generally protect end users.

        Other things being a firewall for restricting (but not preventing) access to internal services, and ultimately authentication and encryption, i.e. mTLS, for preventing unauthorized access. At that point the service should be secure enough to be potentially accessible on the Internet.

      1. 3

        This idea could be extended by using an actual cryptocurrency PoW (or mining pool PoW) and using it as both a captcha AND a source of income for your users. You could provide easier challenges that can be solved in a few seconds, and every once in a while a user might find a solution to a harder challenge that yields actual currency.

        1. 2

          I thought about this, and I decided I wanted the opposite. My reasons were:

          • Complexity: I want the captcha to be as simple as possible, both to set up and to use.
          • Ease of abuse: if I use a common scheme that has real hash power behind it, the “captcha solving botnet” that other folks posted about could probably be replaced by a single retired ASIC.

          So I specifically chose Scrypt parameters that are very different from those supported by Scrypt ASICs designed to mine Litecoin.

          I have heard that Monero mining can’t really be optimized much, is that true? I don’t know much about it. I suppose if there is a mining scheme out there that truly resisted being GPU’d or ASIC’d this could be possible. I wonder if the app would eventually get flagged as malicious by Google Safe Browsing because it’s running a known Monero miner script in the user’s browser XD

          1. 3

            I feel that if the goal is to keep bots out, you are kind of out of luck, because the computing power of anyone running a bot will overwhelmingly be much greater than that of any of your human users. A captcha is only good for sending away automated, generic bots or tools. Anyone who really wants to scrape your site won’t be stopped by any captcha. So if we agree on that, the algorithm used should not really matter, even if specific hardware already exists for the PoW.

            As for Safe Browsing, I don’t think that would be an issue since you are mining from the website, not an extension or ads. Safe Browsing should only be flagging websites that distribute malware/unwanted software executables and phishing.

            1. 1

              Anyone that really wants to […] won’t be stopped

              Exactly, this is a drive-by-bot / scattershot deterrent. Agreed that when facing a targeted attack against a specific site, a different strategy is needed.

              the algorithm used should not really matter, even if specific hardware already exists for the PoW.

              For SHA256, I can buy a USB ASIC for $200 that can hash faster than 2 million CPUs. I think that’s a meaningful difference. Much more meaningful than the difference between one user and a botnet, probably even more meaningful than the difference between a user’s patience and a bot’s patience.

              AFAIK, Litecoin uses a slightly modified version of Scrypt, and its “CPU and Memory Cost” (N) / “Block Size” (r) parameters are set quite low. This means that Scrypt ASICs designed for Litecoin can’t execute the hash with larger N and r parameters like the ones which would be used for key-derivation (or in my case, anti-spam).

              According to this 2014 paper the hash rates of GPU Scrypt implementations fall off quickly as N and r are increased. In fact, for the exact parameters I use, N = 4096 and r = 8, they cite the 2014 GPU performing a measly 3x faster than the CPU in my 2019 laptop (see section 5.2.4). So for a modern expensive GPU, that might be something like 100-300x faster? I’m not sure, but it’s certainly different from 2 million times faster. I believe this was actually a design goal of Scrypt from the beginning: it’s intentionally hard to accelerate it to insane hash rates.
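
              As a rough, hedged illustration of why the parameters matter: Python’s hashlib can run Scrypt with both the Litecoin-style parameters commonly cited as N=1024, r=1 and the larger parameters mentioned above (N=4096, r=8). The memory figure follows from Scrypt’s roughly 128·N·r byte working set; the timings will of course vary by machine.

              ```python
              import hashlib
              import os
              import time

              def scrypt_cost(n, r, p=1):
                  """Time one Scrypt hash and report its rough memory requirement (128 * N * r bytes)."""
                  salt = os.urandom(16)
                  start = time.perf_counter()
                  hashlib.scrypt(b"challenge-nonce", salt=salt, n=n, r=r, p=p,
                                 maxmem=64 * 1024 * 1024, dklen=32)
                  elapsed = time.perf_counter() - start
                  print(f"N={n:5d} r={r}: ~{128 * n * r // 1024:5d} KiB of memory, {elapsed * 1000:.1f} ms")

              scrypt_cost(1024, 1)  # Litecoin-style parameters: ~128 KiB per hash, friendly to ASICs
              scrypt_cost(4096, 8)  # the anti-spam parameters discussed here: ~4 MiB per hash
              ```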

              As an aside, I have a friend who took an alternate route, purely based on security through obscurity. They put a “write me a haiku” field on the registration page of their web app. It completely defeated every single bot. I opted for PoW instead of a pure “obscurity-based” approach because I wanted to show/argue that we can come up with bot deterrent strategies which truly scale: even if there are millions of sites using the same deterrent software, it should still be effective against folks who want to hit all 1 million of them. While I doubt my project will ever grow to that scale, I thought it was fun to try to design it to have that potential.

              1. 1

                Exactly, this is drive-by-bot / scattershot deterrent.

                Then why does it matter if using a known PoW allows some attacker to be 2 million times faster? You can expect that any targeted attack will be a few thousand times better than your average user, even with custom Scrypt parameters. So does it really matter that an attacker is a few thousand or a million times faster? He’s probably done scraping or spamming your site by the end of the day either way. At least with the known PoW you might have made a few bucks.

                1. 2

                  It’s because like I said, I primarily care about dragnet / “spam all forms on the internet” type stuff.

                  The former (dragnet scraping) is a privacy concern; the latter (spamming every form on the internet) represents all the spam I’ve ever had to deal with in my life. No one has ever “targeted” me or my site, it’s just stupid crawlers that post viagra ads on every single unsecured form they can find. I think that being targeted is actually very rare and it happens for political/human social reasons, not as a way to make a profit.

                  The weight of the PoW (from the crawler’s perspective) matters because if it’s extremely light (SHA256) they can simply accept it and hash it at scale for cheap. If it’s heavy (Scrypt with a fat memory cost parameter) they literally can’t. It would cost an insane amount to solve gazillions of these per year. Even if they invest in the GPU farm, it will only make it hundreds of times faster, not millions. And if you have a GPU farm, can’t you make more money renting it to corporations for machine learning model development anyways??

                  Like others have mentioned, that cost can be driven down by botnets. But like I have argued in return, IMO that level of investment is unlikely to happen, and if it does, I’ll be pleasantly surprised.

        1. 4

          My first thought is it would be interesting to have this sort of thing as an email spam filter, then I remembered that Hashcash is a thing and upon inspection it does basically exactly this. Plus it can solve The Mailing List problem (there are valid use cases for sending majillions of emails) by having the client whitelist the sending email address/server; the mailing list server can just refuse to send an email to a client requiring proof of work, or can do the proof of work once to send an email to the user asking them to whitelist it.

          Anyone know why this sort of system hasn’t caught on? Just lack of support?

          1. 1

            I think so. Similar to the reason why we still have to support STARTTLS for email. Email is simply impossible to upgrade with breaking changes because of the proliferation of diverse, out-of-date email servers & the risk of messages disappearing into the ether once you start fully committing to deviation from the lowest common denominator.

            It could also be that a PoW requirement on email would be lobbied out of existence by the MailGuns and MailChimps of the world, as it would disproportionately impact them.

          1. 3

            I like the proof of work idea, but it would be extremely annoying to have such a captcha without anything to fill out. Just sitting there waiting would be too annoying. Something that happens while you type, not so much (except for password manager users like me). Another thought: would it be better to use input events somehow? Isn’t it the case that trusted input events can’t be faked by scripts in typical browsers?

            1. 1

              If you wish to experience it yourself, here’s a test showing the captcha being used as a bot deterrent in front of a media file: every time you navigate to this URL it will re-direct you to a new captcha challenge w/ 5 bits of difficulty: https://picopublish.sequentialread.com/files/aniguns.png

              The difficulty is tweakable. I think I used 8 bits of difficulty and specifically waited for one that took abnormally long when I was capturing the screencast I used as the GIF on the README page.
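
              For anyone curious what “bits of difficulty” means in practice, here is a hedged sketch of the mechanic, assuming a challenge counts as solved when the hash has at least that many leading zero bits. The real project uses Scrypt; SHA-256 is used below only to keep the sketch short, and the expected work is roughly 2^bits attempts.

              ```python
              import hashlib
              import os

              def leading_zero_bits(digest: bytes) -> int:
                  """Count the leading zero bits of a hash digest."""
                  bits = 0
                  for byte in digest:
                      if byte == 0:
                          bits += 8
                          continue
                      bits += 8 - byte.bit_length()
                      break
                  return bits

              def solve(challenge: bytes, difficulty: int) -> int:
                  """Try nonces until the hash clears the difficulty; expected work is ~2**difficulty tries."""
                  nonce = 0
                  while leading_zero_bits(hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()) < difficulty:
                      nonce += 1
                  return nonce

              challenge = os.urandom(16)
              print("5 bits (about 32 tries on average):", solve(challenge, 5))
              print("8 bits (about 256 tries on average):", solve(challenge, 8))
              ```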

              Isn’t it so that trusted input events can’t be faked by scripts inside the typical browsers?

              Are you referring to the ways that Facebook attempts to prevent script kiddie bots from interacting on their platform(s)? Yes, a simple version of such a thing may work as an effective heuristic to get rid of non-browser bots and simplistic browser automation bots without being privacy invasive. Maybe that’s a good idea for a feature of version 2 🙂

              1. 2

                It looks like this is tied to IP address + browser user agent. Once I load the above page once in my browser I can hit it as many times and as fast as I’d like with curl provided that I pass the same user-agent from my browser.

                1. 1

                  Heh, did you read the code or find that out yourself? I guess I’m impressed either way :P

                  Yes, that’s how I set it up for this particular “picopublish” app, independent of the PoW Captcha project. If you want to see the bot deterrent over and over, you have to re-navigate to the original link without the random token, or else change your UA/IP. I got the idea from the way GoatCounter counts unique visits.
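
                  A hedged sketch of that GoatCounter-style idea, deciding whether a visitor should see the challenge again based on a hash of IP + User-Agent; the time window and in-memory storage here are made up for illustration:

                  ```python
                  import hashlib
                  import time

                  SEEN = {}                     # visitor fingerprint -> time they last passed the challenge
                  WINDOW_SECONDS = 8 * 60 * 60  # hypothetical: re-challenge after 8 hours

                  def fingerprint(ip: str, user_agent: str) -> str:
                      """Hash IP + User-Agent so the raw values never need to be stored."""
                      return hashlib.sha256(f"{ip}|{user_agent}".encode()).hexdigest()

                  def needs_challenge(ip: str, user_agent: str) -> bool:
                      """True if this IP + User-Agent combination should be shown the PoW challenge."""
                      last_pass = SEEN.get(fingerprint(ip, user_agent))
                      if last_pass is not None and time.time() - last_pass < WINDOW_SECONDS:
                          return False  # same browser on the same IP (e.g. curl with a copied UA) sails through
                      return True

                  def record_pass(ip: str, user_agent: str) -> None:
                      SEEN[fingerprint(ip, user_agent)] = time.time()
                  ```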

                  1. 2

                    I fiddled with it in the browser. I do some web scraping, and from that I’m pretty familiar with the process of peeling away one option or feature at a time from HTTP requests until the server finally denies the request.

                2. 1

                  Ah yes I see, the five bit version is absolutely bearable. Not much longer than an extensive page change animation. If this is enough to keep bots out (I guess it is), I would totally go for it.

              1. 18

                Neat idea. I’m not sure this is a captcha, but rather just a rate limiter.

                1. 13

                  So much this. A proof-of-work scheme will up the ante, but not the way you think. People need to be able to do the work on the cheap (unless you want to put mobile users at a significant disadvantage) and malware/spammers can outscale you significantly.

                  Ever heard of parasitic computing? TLDR: it’s what kickstarted Monero. Any website (or an ad in that website) can run arbitrary code on the device of every visitor. You can even shard the work, do it relatively low-profile if you have the scale. Even if pre-computing is hard, with ad networks and live action during page views an attacker can get challenges solved just-in-time.

                  1. 9

                    The way I look at it, it’s meant to defeat crawlers and spam bots; they attempt to cover the whole internet and want to spend 99% of their time parsing and/or spamming. But if this got popular enough to prompt bot authors to take the time to actually implement WASM/WebWorkers or a custom Scrypt shim for it, they might still end up spending 99% of their time hashing instead.

                    Something tells me they will probably give up and start knocking on the next door down the lane. And if I can force bot authors to invest in a $1M USD+ /year black hat “distributed computing” project so they can more effectively spam Cialis and Michael Kors Handbags ads, maybe that’s a good thing? I never made $1M a year in my life, probably never will, but I would be glad to be able to generate that much value though.

                    If it comes down to a targeted attack on a specific site, captchas can already be defeated by captcha farm services or various other exploits (https://twitter.com/FGRibreau/status/1080810518493966337). Defeating that kind of targeted attack is a whole different problem domain.

                    This is just an alternate approach to put the thumb screws on the bot authors in a different way, without requiring the user to read, stop and think, submit to surveillance, or even click on anything.

                    1. 9

                      This sounds very much like greytrapping. I first saw this in OpenBSD’s spamd: the first time you got an SMTP connection from an IP address, it would reply with a TCP window size of 1, one byte per second, with a temporary failure error message. The process doing this reply consumed almost no resources. If the connecting application tried again in a sensible amount of time then it would be allowed to talk to the real mail server.

                      When this was first introduced, it blocked around 95% of spam. Spammers were using single-threaded processes to send mail and so it also tied each one up for a minute or so, reducing the total amount of spam in the world. Then two things happened. The first was that spammers moved to non-blocking spam-sending things so that their sending load was as small as the server’s. The second was that they started retrying failed addresses. These days, greytrapping does almost nothing.
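
                      For illustration, the greylisting half of that idea boils down to roughly the following (spamd’s one-byte-per-second TCP window trick is a separate layer; the delay and retention values here are invented):

                      ```python
                      import time

                      GREYLIST = {}                    # (client IP, sender, recipient) -> time first seen
                      MIN_RETRY_DELAY = 5 * 60         # hypothetical: must retry at least 5 minutes later
                      MAX_RETRY_WINDOW = 12 * 60 * 60  # hypothetical: forget entries older than 12 hours

                      def smtp_decision(client_ip: str, sender: str, recipient: str) -> str:
                          """Temp-fail unknown (ip, sender, recipient) tuples; accept sensible retries."""
                          key = (client_ip, sender, recipient)
                          now = time.time()
                          first_seen = GREYLIST.get(key)
                          if first_seen is None or now - first_seen > MAX_RETRY_WINDOW:
                              GREYLIST[key] = now
                              return "451 temporary failure, please retry later"  # naive spam bots never come back
                          if now - first_seen < MIN_RETRY_DELAY:
                              return "451 temporary failure, please retry later"  # retried too quickly
                          return "250 OK"                                         # behaved like a real mail server
                      ```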

                      The problem with any proof-of-work CAPTCHA system is that it’s asymmetric. CPU time on botnets is vastly cheaper than CPU time purchased legitimately. Last time I looked, it was a few cents per compromised machine and then as many cycles as you can spend before you get caught and the victim removes your malware. A machine in a botnet (especially one with an otherwise-idle GPU) can do a lot of hash calculations or whatever in the background.

                      Something tells me they will probably give up and start knocking on the next door down the lane. And if I can force bot authors to invest in a $1M USD+ /year black hat “distributed computing” project so they can more effectively spam Cialis and Michael Kors Handbags ads, maybe that’s a good thing?

                      It’s a lot less than $1M/year that they spend. All you’re really doing is pushing up the electricity consumption of folks with compromised computers. You’re also pushing up the energy consumption of legitimate users as well. It’s pretty easy to show that this will result in a net increase in greenhouse gas emissions, it’s much harder to show that it will result in a net decrease in spam.

                      1. 2

                        These days, greytrapping does almost nothing.

                        postgrey easily kills at least half the SPAM coming to my box and saves me tonnes of CPU time

                        1. 1

                          The problem with any proof-of-work CAPTCHA system is that it’s asymmetric. [botnets hash at least 1000x faster than the legitimate user]

                          Asymmetry is also the reason why it does work! Users probably have at least 1000x more patience than a typical spambot.

                          I have no idea what the numbers shake out to / which is the dominant factor, and I don’t really care; the point is that I can still make the spammers’ lives hell & get the results I want right now (humans only past this point) even though I’m not willing to let Google/CloudFlare fingerprint all my users.

                          If botnets solving captchas ever becomes a problem, wouldn’t that be kind of a good sign? It would mean the centralized “big tech” panopticons are losing traction. Folks are moving to a more distributed internet again. I’d be happy to step into that world and work forward from there 😊.

                        2. 5

                          captchas can already be defeated by […] or various other exploits (https://twitter.com/FGRibreau/status/1080810518493966337)

                          An earlier version of Google’s captcha was automated in a similar fashion: they scraped the images and did a Google reverse image search on them!

                          1. 3

                            I can’t find a link to a reference, but I recall a conversation with my advisor in grad school about the idea of “postage” on email, where a proof of work would need to be done for each message sent to a server. Similar idea of reducing spam. It might be something in the literature worth looking into.

                            1. 3

                              There’s Hashcash, but there are probably other systems as well. The idea is that you add an X-Hashcash header with a comparatively expensive hash of the content and some headers, making bulk emails computationally expensive.
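
                              For a concrete picture, a version-1 Hashcash stamp has roughly the shape 1:bits:date:resource:ext:rand:counter, and the receiver checks that the SHA-1 of the whole stamp starts with the claimed number of zero bits. A hedged sketch of the verification side (field handling simplified; date/expiry and double-spend checks omitted):

                              ```python
                              import hashlib

                              def verify_hashcash(stamp: str, required_bits: int = 20, resource: str = "") -> bool:
                                  """Verify a version-1 stamp of the form ver:bits:date:resource:ext:rand:counter."""
                                  fields = stamp.split(":")
                                  if len(fields) != 7 or fields[0] != "1":
                                      return False
                                  claimed_bits = int(fields[1])
                                  if claimed_bits < required_bits or (resource and fields[3] != resource):
                                      return False
                                  digest = hashlib.sha1(stamp.encode()).digest()
                                  # The first `claimed_bits` bits of the 160-bit SHA-1 digest must all be zero;
                                  # the sender finds a counter that makes this true by brute force.
                                  return int.from_bytes(digest, "big") >> (160 - claimed_bits) == 0
                              ```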

                              It never really caught on; I used it for a while years ago, but I’ve never received an email with this header since 2007 (I just checked). It seems it’s used in Bitcoin nowadays, according to the Wikipedia page, but it started out as an email thing. Kind of ironic really.

                              1. 1

                                “Internet Mail 2000” from Daniel J. Bernstein? https://en.m.wikipedia.org/wiki/Internet_Mail_2000

                            2. 2

                                That is why we can’t have nice things… It is really heartbreaking how almost every technological advance can and will be turned to something evil.

                              1. 1

                                The downsides of a global economy for everything :-(

                            3. 3

                              Captchas are essentially rate limiters too, given enough determination from abusers.

                              1. 4

                                Maybe. The difference I would make is that a captcha attempts to assert that the user is human where this scheme does not.

                                1. 2

                                  I mean, objectively, yes. But, since spammers are automating passing the “human test” captchas, what is the value of that assertion? Our “human test” captchas come at the cost of impeding actual humans, and are failing to protect us from the sophisticated spammers, anyway. This proposed solution is better for humans, and will still prevent less sophisticated attackers.

                                    If it can keep me from being frustrated that there are 4 pixels on the top-left tile that happen to actually be part of the traffic light, then by all means, sign me the hell up!

                            1. 2

                              [a sequence of] cherry-pick merges work better than rebase

                              Isn’t that usually how rebase is implemented, though? At least with all of the git clients I have used, that seemed to be what it was doing under the hood :\

                              I thought it was fine; you don’t have to squash the commit history when you rebase, and honestly IMO it’s fine to give people the option. The team can make and enforce their own standards on how verbose/ugly or minimal/clean the commits are.

                              1. 11

                                It’s time the software industry started serving its boss, the user.

                                I don’t think the user has ever really been the boss. The entity cutting the checks for developing the software is the boss, and they never really cared about the user, they just care about making the next sale or accumulating the next gigabyte of PII in their database to raise their stonks valuation.

                                This rant might as well be about industrial civilization as a whole; you could touch on all the same themes talking about Hollywood movies, cheap plastic crap, gas station food / McDonald’s food, Apple products, planned-obsolescence light bulbs that are designed to burn out, etc. I feel like this is frustration with the human condition being channeled / projected onto frustration with software.

                                1. 2

                                  I think that ideas like this and Gemini are fine, and cool, but as soon as the author claims that this is the right direction to go for ALL web publishing, I get extremely skeptical.

                                  I think the theory behind these ideas, the problems with the web of today, the problems with web browsers, etc, is all very valid WITHIN ITS NICHE of blogging and self-publishing and I agree with it WHEN APPLIED TO THAT NICHE. However, I don’t think that’s the main value proposition of the internet. It’s an analysis of the current situation that only looks at the consumption side, and mostly only looks at the negative side effects of the consumption side.

                                  But there is more stuff going on than that. There is also the production side: Web sites and web applications that people use because they make our jobs easier, not because we want to read or watch something for fun. We can’t forget that the reason the web exploded like it did was not just because “consumers wanna consume”, it was also because it enabled things that previously seemed like economic miracles. Zero-congestion warehousing. Knowing things in advance before they happen. Getting a text on your phone when your package gets delivered. Increased safety for folks in traditionally dangerous occupations. Generally increased automation. Remote collaboration. The ability for our economy to survive a deadly and extremely infectious pandemic. The list goes on and on and on.

                                  We are not just consumers. There is more going on in the world than that. Web applications are economic miracle workers, and not just when they have business investment behind them. I think that as generations Y & Z age, they are gaining invaluable wisdom about the internet that their parents lacked… I think they might seek out healthier relationships with technology, personal and community ownership over technology, etc, both out of necessity and out of desire for the comfort of privacy and safety. And when they do, I doubt they will want to give up on web applications along the way.

                                  1. 1

                                    Thanks for this comment, it’s articulated a lot of what bugs me about the doom-and-gloom crowd who equate all Javascript with surveillance capitalism.

                                  1. 9

                                    Sponsors: Warner Music Group, Universal Music Group, DARPA

                                    What the fuck ?

                                    1. 41

                                      It is a joke; it refers to GitHub Copilot and its carelessness about licenses. If you look at the code, you will see that after some sleep time it will serve you back your original file :D

                                      1. 4

                                        I assumed that there was a real NN behind this satire until I read this thread. I think that the problem here is that the website miscommunicates its purpose.

                                        Also I couldn’t find any direct reference to the source code, and a quick search on DuckDuckGo and GitHub doesn’t show up anything.

                                        1. 5

                                          Is the source code for Copilot available somewhere? I doubt it, but I wonder whether it would change anything if it were.

                                          1. 4

                                            Control-U on the webpage

                                            I think sometimes we forget that websites are code too

                                          2. 5

                                            Lame. At least actually train a NN.

                                            1. 5

                                                If you receive copyrighted material and process it, in addition to the associated costs and computational power, how much would you risk in legal terms?

                                              1. 3

                                                  Copyrighted material must be processed in order to play it; by the very nature of how computers work, the material must be copied in part or in full a number of times during processing. There is an actual exemption in copyright law to allow for this, otherwise the very act of playing back material would be illegal by the letter of the law.

                                              2. 4

                                                Q: What would the NN actually do? You want just enough learning/wiggle room for it to be controversial like Microsoft Copilot, methinks. Perhaps a NN that generates a song inspired by the input song, with a slider for how similar you want the song to be.

                                                Then you could break it down by degree - at what point is the song “the same song with a note or two different”, vs “a different song that shares most of the notes”?

                                          1. 2

                                            I think custody of the data is the original & most relevant reason this kind of stuff is so important. Data and processes encode power, and I feel like people need to grab onto any scrap of power they can find in the sorta dystopic world we are living in. There is a reason why the author says:

                                            …pick social media software that [allows local-only posting]. I would go so far as to say this is a necessary feature for group cohesion, and my hope is that implementors of decentralized social media software come to understand it’s important.

                                            IMO, the data for everyone’s online life spreading everywhere uncontrollably, or worse, being locked behind a wealthy stranger’s razor wire and armed guards takes a huge psychological toll on people. A self-hosted server doesn’t have to completely replace the public corporate-owned web, it just has to offer a viable alternative so folks can try out something different and experience what it’s like, give them that outlet to post and otherwise use the server with peace of mind that ultimately, it’s operated by someone they know personally and can trust.

                                            I have been using these sort of self-hosted social media and chat servers (matrix, jitsi meet, mastodon, gitea, etc) ever since the beginning of the pandemic and it has really helped me a lot in terms of social wellbeing and I think mental health too. I would definitely recommend it. If you are interested in an invite, DM me.

                                            1. 1

                                              Are there any benefits in terms of data locality for this method vs. sequential file storage/access? It would be nice to see some benchmarks if so.

                                              1. 2

                                                Yeah, it would be fun to try some benchmarks. My intuition tells me that the method I’ve developed is mostly useful for large universes with lots of data and queries of different sizes and dimensions, tall, wide, small, large, etc. Or if you want to begin indexing data without knowing what the queries will look like in the future.

                                                If you only have to support one kind of query, like a square of a certain size, indexing the data as [x,y] or [y,x] with the first, most significant dimension rounded to approximately the size of your query rectangle would probably be just as performant for reads. You would do multiple range queries over those “rows” or “columns” like I described in my “naive approach” example. But as soon as the size of the queried area can vary by orders of magnitude this approach starts to break down, and you start seeing an unnecessary trade-off between wasted IO bandwidth and number of range queries.
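
                                                  A hedged sketch of that “naive” sliced layout, assuming keys of the form (x rounded down to a slice, y): a query rectangle then becomes one range scan per slice it touches, and the comments note where the trade-off shows up (the slice width and key encoding are placeholders, not from the actual project):

                                                  ```python
                                                  SLICE_WIDTH = 32  # hypothetical slice size; only ideal when queries are about this wide

                                                  def key(x: int, y: int):
                                                      """Index key: the most significant part is x rounded down to its slice.
                                                      In a real key-value store the full x would also be appended to keep keys unique."""
                                                      return (x // SLICE_WIDTH, y)

                                                  def range_queries(x0: int, y0: int, x1: int, y1: int):
                                                      """Yield one (start_key, end_key) range scan per slice the query rectangle overlaps."""
                                                      for slice_id in range(x0 // SLICE_WIDTH, x1 // SLICE_WIDTH + 1):
                                                          yield (slice_id, y0), (slice_id, y1)

                                                  # A query much wider than SLICE_WIDTH touches many slices (many range scans);
                                                  # a query much narrower than SLICE_WIDTH still reads a whole slice's width of
                                                  # entries for every row it touches (wasted IO bandwidth).
                                                  print(list(range_queries(10, 10, 200, 20)))
                                                  ```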

                                                  1. 1

                                                    Nice, thanks, that’s awesome!

                                                    Can you give a little interpretation on this, though? The “number of range queries per rectangle” is high while the query time is small for small to tiny queries. This means that the Hilbert curve method makes many range queries in a rectangle compared to the sliced method but still outperforms it by 2x-7x?

                                                    1. 2

                                                      I think it’s because bandwidth is the limiting factor in my test in that case. Look at the “wasted bandwidth” metric for the tiny queries; the query time graph looks a lot like the wasted bandwidth graph. The CPU also has to do work to throw out all the entries outside the rectangle. A wasted bandwidth metric of 2 means that there were twice as many entries found outside of the query rectangle as there were entries found inside. And it goes up to 18 in the worst case!! In other words, for the 32-pixels-wide slice (universe is 16 slices tall) only about 5% of the data that the query was hitting was actually inside the rectangle.

                                                      I will say I did this benchmark in 1 day and I didn’t try to optimize it much, so it’s possible that there are other bottlenecks getting in the way of the accuracy of the results. So take it with a grain of salt.

                                                      The main thing I wanted to show was how the space filling curve method works no matter what the scale of the query is. Even though they look different because the graphs have different scales, the amount of wasted bandwidth and range-queries-per-rectangle stays constant across all the different query sizes for the hilbert index.

                                                      Also, you can tune its performance characteristics a little bit at query time – in other words different kinds of queries can be tuned individually with no extra cost. While the sliced method outperforms the space filling curve method when the slice size is tuned for the query, the problem is you have to re-index the data every time you want to tune it to a different slice size.

                                                1. 1

                                                  What do you mean by sequential file storage/access? Do you mean scanning through the whole spatial index data?

                                                1. 4

                                                    I used to work in the geospatial space and around 10 years ago we built a spatial indexing system using z-order curves a.k.a. Morton curves (https://en.wikipedia.org/wiki/Z-order_curve) on top of HBase. We took advantage of the sorted nature of HBase and the way one can do scans through the data. The keys were using base-4 encoding, which allowed us to “zoom in” and “zoom out” by simply changing the key prefix length of the scan we were doing.

                                                  Good times!

                                                  1. 1

                                                      Oh I can totally see how that would work too. It’s easy to see the benefit of the Morton curve in terms of how the algorithm to generate the keys and query ranges is so simple by comparison.

                                                      It has the same problem that the Hilbert curve does where there are some areas that are close in terms of [x,y] but far away in terms of how far along the curve they are. So if you have a query rectangle which spans those areas, it’s possible it would be better to split it up into two or three rectangles.

                                                    For the visually inclined here’s a picture showing how the base-4-numbered morton curve quadrants would be laid out:

                                                    https://picopublish.sequentialread.com/files/morton-comic.jpeg

                                                    And just for comparison here is the same, but with the hilbert curve and base-10 numbers:

                                                    https://picopublish.sequentialread.com/files/hilbert-comic.jpg

                                                    1. 1

                                                        Can you expand on the idea of base-4 encoding and zooming?

                                                      1. 4

                                                          The z-order curve is really expressing a grid system using squares. Squares are further subdivided into squares etc. etc. You can take that as fine as you want, but typical is 7 or 8 digits (at least when we did it). The curve is calculated by interleaving the digits (as integers) of the coordinates. See Wikipedia for pictures.

                                                          That means base-4 tells you the sub-grid quite easily. All things that are in the same sub-grid have the same prefix in base-4 encoding, and that logic works on any level. Imagine the world projected flat and you have one big square. The square can be divided into 4 more squares. Each of these can be further subdivided. Everything that starts with 0 is, however, in the first sub-square of our first level. If the interleaved value of the curve starts with a 1, we know that it cannot be in the first sub-square. This goes on and on. So each digit in the number really becomes a zoom level.

                                                          Now if I want to do a spatial query, I can calculate a prefix to scan for a given bounding box and run the scan against a system like HBase. You basically tell HBase where to start reading the data. The client can calculate the prefix by itself since the algorithm is so simple and static.
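
                                                          A hedged sketch of that scheme: interleave the bits of x and y so each base-4 digit picks one of four sub-squares, and the shared key prefix of a bounding box’s corners identifies the smallest sub-square containing it (the digit count and API here are illustrative, not the original system):

                                                          ```python
                                                          def morton_base4(x: int, y: int, levels: int = 8) -> str:
                                                              """Interleave the bits of x and y; each base-4 digit selects one of four sub-squares."""
                                                              digits = []
                                                              for level in range(levels - 1, -1, -1):
                                                                  digit = ((x >> level) & 1) << 1 | ((y >> level) & 1)
                                                                  digits.append(str(digit))
                                                              return "".join(digits)

                                                          def common_prefix(a: str, b: str) -> str:
                                                              """The shared prefix of two corner keys is the smallest sub-square containing both."""
                                                              prefix = []
                                                              for ca, cb in zip(a, b):
                                                                  if ca != cb:
                                                                      break
                                                                  prefix.append(ca)
                                                              return "".join(prefix)

                                                          # Two nearby points share a long prefix (a small sub-square, i.e. a deep zoom level);
                                                          # the common prefix of a bounding box's corners is what a prefix scan would start from.
                                                          p1, p2 = morton_base4(100, 200), morton_base4(101, 201)
                                                          print(p1, p2, "-> scan prefix:", common_prefix(p1, p2))
                                                          ```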

                                                        Does that make sense?

                                                    1. 2

                                                        If you think of the iDevices as something that Apple owns and consumers get to use for a fee, then disallowing side-loading makes sense. I think that consumers have been forced into this arrangement over time, and for the most part, they don’t even realize that this is the deal they have been offered. When it comes to ownership of computers, they’ve been slowly boiled like frogs ever since the 90s. So, if you ask consumers what they want, they will give confusing or non-sequitur answers, because their conscious-mind model of what’s going on with their phone is significantly different from what is actually happening.

                                                        I think most people probably have a subconscious simmering unease about the whole situation though, especially considering how the phone has become almost like an extension of the human body and mind. So folks are losing their bodily autonomy; slowly, rights that were supposed to be protected by various national and international laws and conventions are slipping away. But most people don’t have the time, energy, resources etc. to actually address these tensions and come to terms with the reality of their existence. So they just carry on with that weight on their subconscious.

                                                        I’m kind of confused about why it’s ok that I’m forced to install whatever software Apple decides, and even have my access to software limited by them, but it’s not ok for Apple to allow some 3rd party in Podunkville to attempt to mandate that their citizens/students/concertgoers/etc. install a sketchy 3rd party app. Especially when, like folks have mentioned,

                                                      iOS is safer is because of OS-level security features.

                                                        IMO it’s the same problem whether a 3rd party does it or Apple does it, and arguably it’s worse with only Apple, because then there are no less-sketchy alternatives (outside of, you know, not using Apple products).

                                                      Ultimately I think that people who want to own their computers have given up on Apple since many years ago. So the question of ownership when it comes to Apple products is a bit of a moot point.

                                                      1. 4

                                                        If you think malware is great, if you think being ordered on pain of losing your livelihood to install surveillance apps is wonderful and should be encouraged as widely as possible, if you think the most important freedom is the freedom to be hurt, badly and repeatedly, and then be told it’s your own fault and you deserved it for not being sufficiently technical, then you might agree with the comment above.

                                                        If, on the other hand, you think that framing a comment in such a way as to try to force a particular hyperbolic framing and set of ideas onto anyone who disagrees is a terrible rhetorical device, then you might not like the comment above. And the commenter above might gain some insight, from the first paragraph of this, into why.

                                                      1. 1

                                                          Working on implementing the desktop app for my new cloud service called Greenhouse. The service is going to be similar to PageKite but cheaper, more fully-featured, and more secure by default. The desktop app is an fbs-based (Python + Qt) thing which comes with a background daemon; the background daemon runs Caddy server to manage Let’s Encrypt TLS certificates, and threshold, my TCP reverse-tunnel application. Later on I’ll build a CLI that talks to the same daemon. It has to support Mac, Windows, and Linux.

                                                        Plan is to allow everyone, not just “tech wiz” users, to easily own & host websites and servers accessible on the internet from their own home. Self-hosting, so they don’t have to give up ownership of their data and processes to a 3rd party.

                                                        1. 4

                                                          A lot of the criticism I see leveled against Gnome centers around the way they often make changes that impact the UX but don’t allow long-time users to opt-out of these new changes.

                                                            The biggest example for me: there was a change to the Nautilus file manager where it used to be that you could press a key with a file explorer window open, and it would jump to the 1st file in the currently open folder whose name starts with that letter or number. They changed it so it opens up a search instead when you start typing. The “select 1st file” behavior is (was??) standard behavior in Mac OS / Windows for many many years, so it seemed a bit odd to me that they would change it. It seemed crazy to me that they would change it without making it a configurable option, and it seemed downright Dark Triad of them that they would make that change, not let users choose, and then lock / delete all the issue threads where people complained about it.

                                                            It got to the point where people who cared, myself included, started maintaining a fork of Nautilus that had the old behavior patched in, and using that instead.

                                                          What’s stopping people who hate the new & seemingly “sadistic” features of gnome from simply forking it? Most of the “annoying” changes, at least from a non-developer desktop user’s perspective, are relatively surface level & easy to patch.

                                                          1. 3

                                                              Wow, I thought I was the only one who thought that behavior was crazy. Since the early ’90s, my workflow for saving files was: “find the dir I want to save the file in,” then “type in the name.” In GNOME (or GTK?) the file dialog forces me to reverse that workflow, or punishes me with extra mouse clicks to focus the correct field.

                                                            I have never wanted to use a search field when trying to save a file.