1. 2

    Well the biggest problem is there aren’t great complete “solutions” for creating repeatable deployments that aren’t using containers.

    You can get a long way with a proper Salt deployment for at least machine setup and configuration. I don’t have a huge amount of experience there but folks I work with are beginning to do this with software that simply cannot be containerized.

    1. 6

      Well the biggest problem is there aren’t great complete “solutions” for creating repeatable deployments that aren’t using containers.

      There most certainly are.

      1. 1

        Neato. Thanks for the info.

      2. 1

        Interesting. Do you know of “non-great” solutions?

        Background is I am currently developing a lightweight CI solution without docker (because I found that either they are big like Jenkins/Gitlab or they are built to use docker). So I am wondering if it would make sense to develop this into the direction of deployments or create a sibling lightweight program for deployments.

        1. 1

          Honestly, not much that isn’t tied to huge monolithic systems.

          I’m not a huge fan of Docker and Kubernetes but I recognize their value in codifying deployment into code. I don’t deploy many services as a hobby so I haven’t sorted out the non-container solution.

          At work we are heavily Kubernetes and Docker so I work within that framework. It has warts but it works well enough for an environment that runs ~4k containers within K8.

          1. 1

            We’ve been using a combination of Terraform (initial setup) plus Nix/NixOS for our containerless deployments. There is also an autoupdater running on each machine that monitors the CI for new derivations and does an update if there is, with rollback in case of a problem.
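
            The update-then-roll-back step could look roughly like the sketch below. This is only an illustration, not our actual updater: `activate` and `is_healthy` are hypothetical callbacks standing in for whatever switches the machine to a build and checks it afterwards, and the build names are made up.

            ```python
            from typing import Callable


            def update_with_rollback(
                current: str,
                candidate: str,
                activate: Callable[[str], None],
                is_healthy: Callable[[], bool],
            ) -> str:
                """Switch to `candidate`; fall back to `current` if the check fails.

                Returns the build that is active when the function finishes.
                """
                if candidate == current:
                    return current  # nothing new on the CI, nothing to do
                activate(candidate)
                if is_healthy():
                    return candidate
                activate(current)  # roll back to the last known-good build
                return current


            # Usage with in-memory stand-ins (hypothetical build names).
            history = []
            active = update_with_rollback(
                "build-41", "build-42", history.append, lambda: False
            )
            ```

            Here the health check fails, so `history` records the attempted switch followed by the switch back, and `active` ends up as the old build.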

            1. 1

              Ansible? I find it perfectly acceptable for small to medium sized deployments. Can do orchestration too, although of course specialized tools like Terraform are better at that.

          1. 6

            I’ve worked on designing, building and deploying a couple of large systems at Uber/Skype/Microsoft and reviewed the design of several more of them. None of them follow the documentation advice or approaches given in this post, not even close.

            However when we need to explain to someone else (new developer, product owner, investor, …) how the application works, we need something more… we need documentation. But what documentation options do we have that can express the whole application building blocks and how it works?!

            This post then goes through formats like UML, 4+1 Architectural view model, ADS, C4, dependency diagrams and application map. I’ve heard of some of them (e.g. UML taught extensively at universities), but haven’t used them and find it unlikely I’ll reach for these tools. Here’s why.

            What I’ve found to work best when documenting architecture is this: Simplicity.

            1. Whiteboard should be your top choice. If you can’t explain the architecture of your system/app on a whiteboard to another person or a team starting at the high-level and digging into the components, you have a problem. You either cannot step back to being high level enough to start with a simple state. Or you don’t understand how key parts of the system interact. Go back and practice until you get there.
            2. Simple documentation with simple diagrams based on what you would draw on a whiteboard. If you can explain how it works on a whiteboard, write it down into a document. Write it using clear and easy to follow language.
            3. Architecture is about tradeoffs. Be clear on what tradeoffs you choose. My beef with much of the “let’s find the best way to document software architecture” is how these discussions seem to sideline the fact that architecture is just a tool to choose certain tradeoffs for certain short- or long-term benefits. A good architecture reasoning should explain what tradeoffs it’s chosen and what goals it’s optimizing for. This detail is a lot more important in implementing it correctly, than having perfect documentation.

            This is the approach I’ve seen people and teams use when building systems at places like Uber, Facebook, Microsoft, Google, Amazon. Amazon has the famous 6-pager narrative, which teams I’ve heard follow more or less for new systems as well. At Uber, we write down and share system design proposals via an RFC-like format (I’ve written more about that here).

            I’ve seen heavyweight, formal architecture documentation indicate a slow-moving company/team. When we rebuilt the payments system at Uber, we used the lightweight design documentation I mentioned above, completed the planning and prototyping in 2 months, and built and rolled it out in 6 months. I talked with a peer who was redesigning the payments system at a major bank. They were using formal methodologies (UML et al) and were 2 years into the planning, without having started development. A key difference between how the two places work is that this bank had architects who only did paper architecting and were not involved in coding it. At most major tech companies, by contrast, the team that comes up with the design also codes and ships it, keeping focus, engagement and accountability higher.

            Apply the KISS principle to architecture planning and prefer iteration over perfection. Keep it stupid simple. This worked really well for me and it continues to work well for most of the major tech companies.

            1. 3

              I think however that none of the two approaches (Uber vs bank) is better or worse. It just works for both companies respectively. (I do not want to create the impression that you said anything negative about the bank situation; you did not. “slow-moving company/team” and “2 years into the planning” itself does not mean anything negative at all, it’s only natural that banks with massive legacy systems that have to run and be maintained are extremely slow; I just think that many people here on lobste.rs probably think that “slow-moving” == bad).

              Uber on the one hand is a small and young tech company, thus for them it works that a few engineers design a new system in a short amount of time.

              A bank has a massive amount of legacy systems, a lot of legal requirements to follow etc. and to get a grasp of all these requirements and compare them to the newly planned system a lot of communication and planning ahead is needed. The result still will be less than optimal, but from recent experience I just think that is how it goes when communication spans several departments (and developers get re-scheduled from and to other projects).

              I mainly work on quite small systems in automotive, but even I recognize how much communication overhead you start to get as soon as multiple departments get involved and you have to combine all of their input into a new solution.

              I also do not use any of the methods from research (e.g. the 4+1 diagram) described in the article above. I currently try to write a one-page overview for each new project I start, and in these I usually include 1 or 2 charts of the existing situation and the planned architecture.

              1. 2


                Generally I just say cover: what and why.

                What, from a high-level down to whatever level detail you need.

                Why, why did you make the decisions you made? again, start from a high-level and go down in detail as needed.

                Which is basically what you said, just put differently.

                1. 2

                  I also believe that documenting decisions in an RFC-like way is one of the most valuable points. We call these “design decision documents”. Documenting context, tradeoffs, and assumptions is the most important part there. The chosen solution is apparent later because it got implemented, but the “why” is easily lost, especially in large long-running projects where many people come and go.

                  What I like about these templates is that they provide a checklist of things to consider. Even if you keep things simple and fitting on a whiteboard it makes sense to do a stakeholder analysis and define your quality goals. Maybe half an hour is already enough in your project but it can result in some surprising insights.

                  I work in automotive. Due to a customer-supplier relationship (requirements engineering) and safety regulations we are more process-heavy. Once that level is required, there is one additional aspect I can recommend: Model your architecture! I have heard many interpretations of what that could mean. The relevant point is that multiple diagrams are kept consistent by a common data structure (the model). Generate code from that data structure to keep the code base consistent with the architecture model.

                  1. 1

                    The relevant point is that multiple diagrams are kept consistent by a common data structure (the model).

                    Can you elaborate a bit more on this point or do you have some links / papers I should read? I am currently learning for a university exam in “Software Architecture” and one of the many questions they deal with in the material for this lecture (but do not present a final solution) is: How can you make sure that multiple diagrams are always consistent?

                    (If I remember the lecture correctly they actually ask this question about a thing they call “architectural structures” and then show “architectural views” as a solution, which are the views also used in the 4+1 diagram; but then again you can ask the question how to ensure that the 4+1 diagrams are always consistent)

                    1. 2

                      A good way is to have a tool for that. Enterprise Architect and similar tools can do it. I’m not sure whether any tool supports 4+1, but let’s assume there is one. The paper by Kruchten linked in the article clearly considers an embedded system. If you develop cloud software, there is usually not much of a physical layer. Maybe you pick an AWS region?

                      Well, let’s pick an example which should apply everywhere. The logical view is essentially class diagrams. The development view is essentially the repository/file structure. The consistency question is: How can we ensure that all classes we draw in UML also exist in the code and vice versa?

                      1. I’d say the usual industry approach for consistency is to do it manually. Of course that never really works in all details but it is good enough for legal requirements.
                      2. Generate code from UML.
                      3. Generate UML from the code.
                      4. Have tool support to update UML from code changes and vice versa. Allegedly IBM Rational can do that but I have never heard someone praising it.
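
                      Whichever approach you pick, the check itself boils down to comparing two sets of names. A minimal sketch, assuming the model can be exported as a plain set of class names and the code base is Python (the model, the source snippet and the function names are all made up for illustration):

                      ```python
                      import ast


                      def classes_in_source(source: str) -> set[str]:
                          """Collect the names of all classes defined in a source string."""
                          tree = ast.parse(source)
                          return {
                              node.name
                              for node in ast.walk(tree)
                              if isinstance(node, ast.ClassDef)
                          }


                      def consistency_report(modeled: set[str], source: str) -> dict[str, set[str]]:
                          """Compare modeled class names against the classes found in code."""
                          coded = classes_in_source(source)
                          return {
                              "missing_in_code": modeled - coded,
                              "missing_in_model": coded - modeled,
                          }


                      # Hypothetical model and code base, for illustration only.
                      MODEL = {"Order", "Invoice"}
                      CODE = "class Order:\n    pass\n\nclass Payment:\n    pass\n"
                      report = consistency_report(MODEL, CODE)
                      ```

                      Both sets in `report` being empty would mean the diagram and the code agree, at least on class names.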

                      There are nuances and variants in these approaches. If you use C++, then you could generate only header files from UML.

                      I guess the reason why your lecture does not provide a final solution is because there are many possibilities and it depends on the circumstances which one is best.

                      Thinking of UML, we can be more specific and many tools have support there. For example, a tool could ensure that a sequence diagram only uses classes and methods which are modeled in a class diagram. The UI might only provide a drop-down menu with the modeled classes. This is what I actually had in mind by having a single model behind multiple views. The “data structure” here is an XMI file or a server database.
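
                      To illustrate the single-model idea, here is a small sketch of such a check. It assumes the model is reduced to a mapping from class names to method names and a sequence diagram to a list of messages; real tools work on XMI or a database, and every name below is hypothetical:

                      ```python
                      # The shared model: class name -> methods from the class diagram.
                      CLASS_MODEL = {
                          "Cart": {"add_item", "checkout"},
                          "PaymentGateway": {"charge"},
                      }

                      # A sequence diagram reduced to (receiver class, method) messages.
                      SEQUENCE = [
                          ("Cart", "checkout"),
                          ("PaymentGateway", "charge"),
                          ("PaymentGateway", "refund"),  # not modeled, on purpose
                      ]


                      def unmodeled_messages(model, sequence):
                          """Return messages whose class or method is not in the model."""
                          return [
                              (cls, method)
                              for cls, method in sequence
                              if method not in model.get(cls, set())
                          ]


                      problems = unmodeled_messages(CLASS_MODEL, SEQUENCE)
                      ```

                      A drop-down menu in the UI is the same check applied up front: the user can only pick entries that are already in `CLASS_MODEL`.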

                1. 10

                  With the built-in container support in SystemD you don’t even need new tools:


                  …and with good security if you build your own containers with debootstrap instead of pulling stuff made by random strangers on docker hub.

                  1. 8

                    The conflict between the Docker and systemd developers is very interesting to me. Since all the Linux machines I administer already have systemd I tend to side with the Red Hat folks. If I had never really used systemd in earnest before maybe it wouldn’t be such a big deal.

                    1. 5

                      …and with good security if you build your own containers with debootstrap instead of pulling stuff made by random strangers on docker hub.

                      I was glad to see this comment.

                      I have fun playing with Docker at home but I honestly don’t understand how anyone could use Docker Hub images in production and simultaneously claim to take security even quasi-seriously. It’s like using random npm modules on your crypto currency website but with even more opaqueness. Then I see people arguing over the relative security of whether or not the container runs as root but then no discussion of far more important security issues like using Watchtower to automatically pull new images.

                      I’m no security expert but the entire conversation around Docker and security seems absolutely insane.

                      1. 4

                        That’s the road we picked as well, after evaluating Docker for a while. We still use Docker to build and test our containers, but run them using systemd-nspawn.

                        To download and extract the containers into folders from the registry, we wrote a little go tool: https://github.com/seantis/roots

                        1. 2

                          From your link:

                          Inside these spaces, we can launch Linux-based operating systems.

                          This keeps confusing me. When I first saw containers, I saw them described as light weight VM’s. Then I saw people clarifying that they are really just sandboxed Linux processes. If they are just processes, then why do containers ship with different distros like Alpine or Debian? (I assume it’s to communicate with the process in the sandbox.) Can you just run a container with a standalone executable? Is that desirable?


                          Does anyone know of any deep dives into different container systems? Not just Docker, but a survey of various types of containers and how they differ?

                          1. 4

                            Containers are usually Linux processes with their own filesystem. Sandboxing can be good or very poor.

                            Can you just run a container with a standalone executable? Is that desirable?

                            Not desirable. An advantage of containers over VMs is in how easily the host can inspect and modify the guest filesystem.

                            1. 5

                              Not desirable.

                              Minimally built containers reduce attack surface, bring down image size, serve as proof that your application builds in a sterile environment and act as a list with all runtime dependencies, which is always nice to have.

                              May I ask why isn’t it desirable?

                              1. 1

                                You can attach to a containerized process just fine from the host, if the container init code doesn’t go out of its way to prevent it.

                                gdb away.

                              2. 3

                                I’m not sure if it’s as deep as you’d like, but https://www.ianlewis.org/en/tag/container-runtime-series might be part of what you’re looking for.

                                1. 1

                                  This looks great! Thank you for posting it.

                                2. 3

                                  I saw them described as light weight VM’s.

                                  This statement is false, indeed.

                                  Then I saw people clarifying that they are really just sandboxed Linux processes.

                                  This statement is kinda true (my experience is limited to Docker containers). Keep in mind more than one process can run on a container, as containers have their own PID namespace.

                                  If they are just processes, then why do containers ship with different distros like Alpine or Debian?

                                  Because containers are spun up based on a container image, which is essentially a tarball that gets extracted to the container process’ root filesystem.

                                  Said filesystem contains stuff (tools, libraries, defaults) that represents a distribution, with one exception: the kernel itself, which is provided by the host machine (or a VM running on the host machine, à la Docker for Mac).

                                  Can you just run a container with a standalone executable? Is that desirable?

                                  Yes, see my prometheus image’s filesystem, it strictly contains the prometheus binary and a configuration file.

                                  In my experience, minimising a container image’s contents is a good thing, but for some cases you may not want to. Applications written in interpreted languages (e.g. Python) are very hard to reduce down to a few files in the image, too.

                                  I’ve had most success writing minimal container images (check out my GitHub profile) with packages that are either written in Go, or that have been around for a very long time and there’s some user group keeping the static building experience sane enough.

                                  1. 3

                                    I find the easier something is to put into a docker container, the less point there is. Go packages are the ideal example of this: building a binary requires 1 call to a toolchain which is easy to install, and the result has no library dependencies.

                                  2. 2

                                    They’re not just processes: they are isolated process trees.

                                    Why Alpine: because the images are much smaller than others.

                                    Why Debian: perhaps because reliable containers for a certain application happen to be available based on it?

                                    1. 1

                                      Afaik: Yes, you can and yes, it would be desirable. I think dynamically linked libraries were the reason why people started to use full distributions in containers. For a Python environment you would probably have to collect quite a few different libraries from your OS to copy into the container so that Python can run.

                                      If my words are true then in the Go environment you should see containers with only the compiled binary? (I personally installed all my go projects without containers, because it’s so simple to just copy the binary around)

                                      1. 3

                                        If you build a pure Go project, this is true. If you use cgo, you’ll have to include the extra libraries you link to.

                                        In practice, for a Go project you might want a container with a few other bits: ca-certificates for TLS, /etc/passwd and /etc/group with the root user (for “os/user”), tzdata for timezone support, and /tmp. gcr.io/distroless/static packages this up pretty well.

                                        1. 1

                                          You can have very minimal containers. Eg. Nix’s buildLayeredImage builds layered Docker images from a package closure. I use it to distribute some NLP software, the container only contains glibc, libstdc++, libtensorflow, and the program binaries.

                                    1. 10

                                      I switched off of Google products about 6 months ago.

                                      What I did was I bought a Fastmail subscription, went through all my online accounts (I use a password manager so this was relatively easy) and either deleted the ones I didn’t need or switched them to the new e-mail address. Next, I made the @gmail address forward and then delete all mail to my new address. Finally, I deleted all my mail using a filter. I had been using mbsync for a while prior to this so all of my historical e-mail was already synced to my machine (and backed up).

                                      Re. GitHub, for the same reasons you mentioned, I turned my @gmail address into a secondary e-mail address so that my commit history would be preserved.

                                      I still get the occasional newsletter on the old address, but that’s about it. Other than having had to take a few hours to update all my online accounts back when I decided to make the switch, I haven’t been inconvenienced by the switch at all.

                                      1. 4

                                        It’s really exciting to see people migrating away from Gmail, but the frequency with which these posts seem to co-occur with Fastmail is somehow disappointing. Before Gmail we had Hotmail and Yahoo Mail, and after Gmail, perhaps it would be nicer to avoid more centralization.

                                        One of the many problems with Gmail is their position of privilege with respect to everyone’s communication. There is a high chance that if you send anyone e-mail, Google will know about it. Swapping Google out for Fastmail doesn’t solve that.

                                        Not offering any solution, just a comment :) It’s damned hard to self-host a reputable mail server in recent times, and although I host one myself, it’s not really a general solution.

                                        1. 5

                                          Swapping Google out for Fastmail solves having Google know everything about my email. I’m not nearly as concerned about Fastmail abusing their access to my email, because I’m their customer rather than their product. And with my own domain, I can move to one of their competitors seamlessly if ever that were to change. I have no interest in running my own email server; there are far more interesting frustrations for my spare time.

                                          1. 2

                                            I can agree that a feasible way to avoid centralization would be nicer. However, when people talk about FastMail / ProtonMail, they still mean using their own domain name but paying a monthly fee (to a company supposedly more aligned with the customer’s interests) for being spared from having to set up their own infrastructure that: (A) keeps spam away and (B) makes sure your own communication doesn’t end up in people’s Junk folder.

                                            To this end, I think it’s a big leap forward towards future-proofing your online presence, and not necessarily something comparable to moving from Yahoo! to Google.

                                            1. 3

                                              for being spared from having to set up their own infrastructure that: (A) keeps spam away and (B) makes sure your own communication doesn’t end up in people’s Junk folder.

                                              I’m by no means against Fastmail or Proton, and I don’t think everyone should set up their own server if they don’t want to, but it’s a bit more nuanced.

                                              Spamassassin with default settings is very effective at detecting obvious spam. Beyond obvious spam it gets more interesting. Basically, if you never see any spam, it means that either you haven’t told anyone your address, or the filter has false positives.

                                              This is where the “makes sure your own communication doesn’t end up in people’s Junk folder” part comes into play. Sure, you will run into issues if you set up your server incorrectly (e.g. as an open relay) or aren’t using the best current practices that help other servers see whether email that uses your domain in From: is legitimate and report suspicious activity to the domain owner (SPF, DKIM, DMARC). A correctly configured server SHOULD reject messages that are not legitimate according to the sender domain’s stated policy.

                                              Otherwise, a correctly configured server SHOULD accept messages that a human would never consider spam. The problem is that certain servers are doing it all the time, and are not always sending DMARC reports back.

                                              And GMail is the single biggest offender there. If I have a false positive problem with someone, it’s almost invariably GMail, with few if any exceptions.

                                              Whether it’s a cockup or a conspiracy is debatable, but the point remains.
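
                                              To make the “stated policy” part concrete: a domain publishes its DMARC policy as a TXT record of `tag=value` pairs, and the `p` tag tells receivers what to do with mail that fails the checks. A rough sketch of how a receiver might read it follows. The record and domain are placeholders, `disposition` is a hypothetical helper, and real receivers also weigh tags such as `pct` and `sp` that are ignored here.

                                              ```python
                                              def parse_dmarc(record: str) -> dict[str, str]:
                                                  """Split a DMARC TXT record into its tag=value pairs."""
                                                  tags = {}
                                                  for part in record.split(";"):
                                                      if "=" in part:
                                                          key, value = part.split("=", 1)
                                                          tags[key.strip()] = value.strip()
                                                  return tags


                                              def disposition(record: str, passed_checks: bool) -> str:
                                                  """Decide what to do with a message under this policy."""
                                                  if passed_checks:
                                                      return "accept"
                                                  policy = parse_dmarc(record).get("p", "none")
                                                  actions = {"reject": "reject", "quarantine": "junk"}
                                                  return actions.get(policy, "accept")


                                              # Example record for a placeholder domain.
                                              RECORD = "v=DMARC1; p=reject; rua=mailto:dmarc@example.com"
                                              ```

                                              The `rua` address is where aggregate reports are supposed to be sent, which is exactly the feedback that goes missing when a receiver junks mail without reporting back.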

                                            2. 2

                                              We’re not going to kill GMail. Let’s be realistic, here. Hotmail is still alive and healthy, after all.

                                              Anyone who switches to Fastmail or ProtonMail helps establish one more player in addition to GMail, not instead of it. That, of course, can only be a good thing.

                                              1. 1

                                                Just to bring in one alternative service (since you are right, most people here seem to advise Fastmail, Protonmail): I found mailbox.org one day. No experience with them though.

                                              2. 1

                                                I still get the occasional newsletter on the old address, but that’s about it.

                                                Once you have moved most things over, consider adding a filter on your new account to move the forwarded mails to a separate folder. That way it becomes immediately clear what fell through the cracks.

                                                1. 1

                                                  Sorry, I wasn’t clear. E-mails sent to the old address are forwarded to the new one and then deleted from the GMail account. When that happens I just unsubscribe, delete the e-mail and move on. It really does tend to only be newsletters.

                                                  I suppose one caveat to my approach and the reason this worked so well for me is that I had already been using my non-gmail address for a few years prior to making the change so everyone I need to interact with already knows to contact me using the right address.

                                              1. 1

                                                It might not be totally thought through, but I think the situation might also have an advantage: In my opinion many developers rely too heavily on Cloud services (may it be IaaS like AWS or SaaS like Slack) and thus totally depend on their availability. Even for many company-internal installations we do not care about replicating the dependencies and if the dependencies in the Python Package Index or similar go away we cannot install our package anymore.

                                                If you know that everything might be blocked at any time you will develop countermeasures. You will set up a local repository for all dependencies, you will not rely on cloud services so heavily (but e.g. for SaaS prefer self-hosted solutions and if you really want to use Docker create your own images).

                                                1. 8

                                                  It might not be totally thought through, but I think the situation might also have an advantage

                                                  I’d take the fattest docker image that ever existed over that advantage, and I’ve been through that scenario where everything was blocked, literally all of the internet was cut off for 6 whole months (Tripoli, Libya, FEB/2011 - AUG/2011).

                                                  This is a bit like saying a drought might be advantageous because most people in the developed world are so used to the scarcity of food that they throw a lot of it away.

                                                  I do agree with your general point though, especially about the importance of replicating the dependencies on internal servers.

                                                1. 4

                                                  I gave up on the captcha. Will this end at some point? I think I filled 4 or 5 of the captcha screens before I wanted to throw my PC out of the window.

                                                  The funny thing is how you can relate many components to modern web pages: Cookie banners, message boxes popping up on the bottom right, fullscreen hover popups about “Please subscribe to our newsletter”…

                                                  1. 3

                                                    The captcha ended for me after three screens – checking pictures containing “glasses”, “checks”, and “bows”.

                                                    Maybe the form checks if you marked the pictures accurately. Did you notice that the checkboxes aren’t associated with the picture close above them, but the picture far below them? To see the checkboxes for the top row of pictures, you have to scroll up. I’m not sure if it mattered, but I didn’t select the pictures of glass windows since I would call them “glass”, not “glasses”, and I didn’t select the single picture of chess pieces that were not in check.

                                                    1. 3

                                                      I noticed the checkboxes being above the pictures (but also well done that you have to scroll up a bit to see this). My mistake might have been the glasses. There was one image with multiple panes(?) of glass, which I marked as “glasses”.

                                                      But having to repeat this over and over was also a nice feeling of surfing the web with a VPN and getting Google captchas. I feel that when you use a VPN you get the most difficult ones and have to do like 5 or more to reach a website.

                                                  1. 13

                                                    As someone who works on a large system that sees its fair and regular share of outages, I see about 95% of outages caused by bad deploys, the mitigation being rolling the deploy back. The remaining 5% are largely unexpected faults like network cables physically being cut, power being cut across a whole data center or other hard-to-test-for cases. Note that the bad deploy can often be at a third party we rely on, such as a payments processor or cloud provider.

                                                    This is also the reason that at critical times, when we want to minimise downtime, we put a “code freeze” no-deployment policy in-place. Most large companies do this around e.g. the winter holidays (Christmas and New Years), when many people are away and/or large traffic spikes are expected for commerce. Same with Thanksgiving or Black Friday in the US.

                                                    And the crazy thing is, code freezes work like a charm! Having been through multiple weeks-long code freezes, I can say this is the time when there are virtually no outages or pages. On the flip side, I have observed a lot more outages once the code freeze is lifted and many times more changes hit production than before.

                                                    1. 9

                                                      I used to work for a large company where we had services which we deployed many times each day. One demo season we decided not to deploy anything for a while so that the televised presentations would work without a hitch.

                                                      Unfortunately we hadn’t noticed a memory leak in one of the services which meant that it would fall over after a few days. This is one of the few times I’ve found that not releasing a change has caused a problem.

                                                      1. 1

                                                        This is also the reason that at critical times, when we want to minimise downtime, we put a “code freeze” no-deployment policy in-place.

                                                        I think I read the same idea in the “Site Reliability Engineering” book from Google. They promote the idea that you define a target uptime percentage (e.g. 99.9%) and keep pushing changes as long as you’re above that target. If you fall below it, you’re not allowed to push changes anymore.
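
                                                        To make the arithmetic concrete, here is a rough sketch of that rule. The function names and the 30-day window are my own assumptions for illustration, not taken from the book.

                                                        ```python
                                                        def error_budget_remaining(slo: float, total_minutes: float, downtime_minutes: float) -> float:
                                                            """Return the unused downtime allowance in minutes (negative if overspent)."""
                                                            allowed_downtime = (1.0 - slo) * total_minutes
                                                            return allowed_downtime - downtime_minutes


                                                        def may_deploy(slo: float, total_minutes: float, downtime_minutes: float) -> bool:
                                                            """Allow pushes only while measured availability still meets the target."""
                                                            return error_budget_remaining(slo, total_minutes, downtime_minutes) >= 0


                                                        # Assumed 30-day window: 43,200 minutes, so a 99.9% target allows about 43.2 minutes of downtime.
                                                        WINDOW = 30 * 24 * 60
                                                        print(may_deploy(0.999, WINDOW, 20))  # 20 minutes down: still within the allowance
                                                        print(may_deploy(0.999, WINDOW, 60))  # 60 minutes down: allowance exhausted
                                                        ```

                                                        In practice the downtime figure would come from monitoring data rather than a hand-typed number.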

                                                      1. 2

                                                        I have a similar setup. I run https://betterdev.link and originally used Mailchimp to scale to 2k. It’s free and great, with a lot of features, and I would say that if you have the money, Mailchimp is great.

                                                        But due to cost, I switched to AWS SES (https://github.com/yeo/betterdev.link/blob/master/baja/fanout.go) and it’s way cheaper.

                                                        However, what I do is quite unusual and interesting, so I think I should share it.

                                                        I let Mailchimp handle the subscription form and the unsubscribe link. This frees me from dealing with spam and GDPR, since Mailchimp handles those.

                                                        When I need to send out a newsletter, I export the Mailchimp contacts to a CSV file.

                                                        1. 2

                                                          OT: Firefox somehow displays a security warning when I try to open https://betterdev.link . Not sure if it’s just me, but just thought I’d let you know.

                                                          1. 1

                                                            Not only you. The certificate seems to have expired sometime today: “Expires on: July 4, 2019”.

                                                            1. 1

                                                              Oh, thank you. The auto-renew broke due to a recent refactor :(. Just fixed it.

                                                            2. 1

                                                              Do you export a CSV file by hand or programmatically?

                                                              1. 2

                                                                I do it programmatically. They have an API here: https://developer.mailchimp.com/documentation/mailchimp/guides/how-to-use-the-export-api/

                                                                APIKey can be generated from here: https://us16.admin.mailchimp.com/account/api/

                                                                Now you just do this

                                                                curl -X POST -F "apikey=$MAILCHIMP_APIKEY" -F "id=list-id" https://us16.api.mailchimp.com/export/1.0/list/

                                                                You will get the subscribers in a format similar to CSV.
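
                                                                A small sketch of turning that response into records. It assumes the body is one JSON array per line with the column names in the first line; that shape is an assumption here and should be checked against the real response. The function name and the sample data are made up.

                                                                ```python
                                                                import json


                                                                def parse_export(body: str) -> list[dict]:
                                                                    """Turn line-delimited JSON arrays (header row first) into dicts."""
                                                                    rows = [json.loads(line) for line in body.splitlines() if line.strip()]
                                                                    if not rows:
                                                                        return []
                                                                    header, records = rows[0], rows[1:]
                                                                    return [dict(zip(header, record)) for record in records]


                                                                # Made-up sample in the assumed shape, not real API output.
                                                                sample = '["Email Address","First Name"]\n["a@example.com","Ann"]\n["b@example.com","Bob"]\n'
                                                                subscribers = parse_export(sample)
                                                                print([s["Email Address"] for s in subscribers])
                                                                ```

                                                                From there, writing a real CSV file is a few lines with csv.DictWriter.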

                                                                1. 1

                                                                  Ok, I didn’t know MailChimp provided such an API.
                                                                  I guess you still need to include the unsubscribe link in your emails before sending. How do you get that information for each of the email addresses you get by calling that endpoint? Does the returned CSV include that (or anything that allows you to guess it)?

                                                                  1. 1


                                                                    They have an unsubscribe link. When you go into Form Design and then select the Unsubscribe form, you can get that URL. It looks like this:


                                                                    Downside: when users visit that unsubscribe form, they have to type in their email address in order to unsubscribe. It doesn’t support one-click unsubscribe. I think if you use their Template Render API you will be able to generate a unique one-click unsubscribe link per user.

                                                                    1. 1

                                                                      and btw, hit me up on vinh@getopty.com I can share more information about it.

                                                              1. 1

                                                                Since you explicitly asked for suggestions regarding cookie walls: Googlebot as user-agent might work

                                                                (cf. my comment https://lobste.rs/s/6yo4wi/totext_py_convert_url_rss_feed_text_with#c_53ujqr )

                                                                1. 1

                                                                  Thanks for the suggestion! The user agent is hard coded but could be integrated in the cookie workarounds. For Twitter I already get some sort of security token and some cookie stuff otherwise they serve a 403.

                                                                  One Dutch site, tweakers.net, gives you an immediate IP ban if you set your user agent to Googlebot. But they do provide an HTTP header, X-cookies-accepted: 1, to skip the cookiewall. I’d love a uniform way like that for all sites.
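
                                                                  For reference, a minimal sketch of setting both a user agent and that opt-in header with urllib. The helper name and the user-agent string are hypothetical, and the header is taken as described above without having been verified against the site.

                                                                  ```python
                                                                  import urllib.request


                                                                  def build_request(url: str, skip_cookiewall: bool = False) -> urllib.request.Request:
                                                                      """Prepare a request with a custom user agent and an optional opt-in header."""
                                                                      headers = {"User-Agent": "example-feed-reader/0.1"}  # hypothetical UA string
                                                                      if skip_cookiewall:
                                                                          # Header name and value as described in the comment above; not verified here.
                                                                          headers["X-cookies-accepted"] = "1"
                                                                      return urllib.request.Request(url, headers=headers)


                                                                  req = build_request("https://tweakers.net/", skip_cookiewall=True)
                                                                  print(req.header_items())
                                                                  # To actually fetch the page: urllib.request.urlopen(req, timeout=10).read()
                                                                  ```

                                                                  Keeping the per-site header in one place like this would make it easy to add other workarounds next to it.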

                                                                  1. 1

                                                                    Just had another idea: ublock origin somehow makes the popup at derstandard.at go away after 1 second. I did not check how they do this. They might either automatically click a button or they might set the required cookies. In the second case, maybe you could download the relevant rules that ublock origin uses and use them for your project.

                                                                    Might also be “I don’t care about cookies” doing this… I did not check which plugin makes the popup go away :)

                                                                1. 1

                                                                  Fantastic! This is a lot better than my solution to that problem.

                                                                  Regarding cookiewalls, I’d be surprised if urllib couldn’t supply/spoof cookies. Are you planning to allow user-agent spoofing too? A lot of sites simply won’t serve content if they don’t recognize your user-agent string as belonging to a major browser.

                                                                  1. 1

                                                                    Cookiewalls would require me to reverse engineer each site. The script now sets the user agent to Tiny Tiny RSS, but does support custom headers and cookies. I guess when I want to scrape a site, I’ll add a specific handler if it’s behind a cookiewall.

                                                                    And, if your script works for you, why would it be bad? w3m seems simpler than python with a boatload of external libraries just to get some text ;)

                                                                    1. 2

                                                                      Using Googlebot as the user agent in my experience also solves problems with many sites. I would prefer not to spoof the user-agent, but what should we do if the websites break the internet…

                                                                      E.g. Austrian newspaper derStandard gives human visitors (and also my crawler when I was still crawling with my custom user-agent) a big banner before you can read the news. Unlike on most other websites the content was not available in the source code, either. If you do not have the right cookie set(?) they just serve you a website without the actual content. Until you click the banner.

                                                                      Thus, I started using Googlebot as user-agent for derStandard and now it works fine.

                                                                      Bonus point: German newspaper Zeit has a group of articles in between paywalled and public (require free registration). With user agent Googlebot you do not need a registration.

                                                                      (These experiences come from running a newspaper monitoring service for almost two years now)

                                                                      1. 1

                                                                        Could you tell me more on the newspaper monitor?

                                                                        1. 1

                                                                          I started crawling derstandard.at in summer 2017 and added more newspapers later. Now I started setting up some analysis, but that’s still early-alpha and not much there yet. I still have to think about stuff to analyse and program it.

                                                                          This was inspired by a talk called SpiegelMining at the Chaos Communication Congress. I think a categorization for each newspaper would be nice (like “far right”, “liberal”, …). But first I want to have a nice site setup with the simple statistical data like number of articles per week/year, categories, trends in category distributions (what becomes more important), …

                                                                          One or two weeks before the parliament elections in Germany I focused on the coverage regarding parties. There were two different results from this:

                                                                          1. Right-wing AfD was mentioned in a very large number of article titles, but in the article bodies it was not so extreme
                                                                          2. It was possible to estimate the election results with an error of ~5% only based on how often party names were mentioned in articles

                                                                          My goal is to have monitoring of the most important newspapers across the whole of Europe.
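
                                                                          The mention counting behind the second result can be sketched roughly like this. It is a simplified illustration, not the actual analysis code; the function name, party names and article snippets are placeholders.

                                                                          ```python
                                                                          import re
                                                                          from collections import Counter


                                                                          def mention_shares(articles: list[str], parties: list[str]) -> dict[str, float]:
                                                                              """Count whole-word party mentions and normalize them to shares that sum to 1."""
                                                                              counts = Counter({party: 0 for party in parties})
                                                                              for text in articles:
                                                                                  for party in parties:
                                                                                      pattern = r"\b" + re.escape(party) + r"\b"
                                                                                      counts[party] += len(re.findall(pattern, text))
                                                                              total = sum(counts.values())
                                                                              if total == 0:
                                                                                  return {party: 0.0 for party in parties}
                                                                              return {party: counts[party] / total for party in parties}


                                                                          # Made-up snippets and placeholder party names, purely for illustration.
                                                                          articles = ["PartyA and PartyB debate taxes", "PartyA wins poll", "PartyB responds to PartyA"]
                                                                          shares = mention_shares(articles, ["PartyA", "PartyB"])
                                                                          print(shares)
                                                                          ```

                                                                          Running it separately on titles and on bodies would show the kind of gap described in the first result.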

                                                                      2. 1

                                                                        w3m seems simpler than python with a boatload of external libraries just to get some text ;)

                                                                        It’s more that, by processing RSS with regex, I’m probably getting both false positives and false negatives.

                                                                        Also, w3m -dump produces hard newlines at some assumed terminal width, which is a problem if I want to then process the results or view at a different terminal width. For instance, I occasionally scrape sites for use in training markov models, & because distinguishing paragraph breaks from other kinds of whitespace produces better results in markov model output for longform training data, I usually need to jump through hoops to guess which newlines are real and which were injected by the browser in order to reconstruct a more ‘natural’ rendering of the text. Because the python stuff here seems to be extracting text as opposed to attempting a console-based ‘rendering’ of the document, it seems like it’s less likely to try to reconstruct frames or tables, interpret layout markup, or inject its own newlines based on the value of $COLS.
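
                                                                        The simplest version of that hoop-jumping looks something like the sketch below. It assumes blank lines mark the real paragraph breaks and every other newline was injected by wrapping, which is exactly the guess that can go wrong on lists or preformatted text. The function name is made up.

                                                                        ```python
                                                                        import re


                                                                        def unwrap(text: str) -> list[str]:
                                                                            """Split on blank lines, then join each paragraph's wrapped lines with spaces."""
                                                                            paragraphs = re.split(r"\n\s*\n", text.strip())
                                                                            return [" ".join(line.strip() for line in p.splitlines()) for p in paragraphs]


                                                                        dumped = "First paragraph that was\nwrapped at a fixed width.\n\nSecond paragraph,\nalso wrapped.\n"
                                                                        for paragraph in unwrap(dumped):
                                                                            print(paragraph)
                                                                        ```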

                                                                    1. 1

                                                                      Does anyone know if and how this will impact open source? I’m thinking of platforms like Github and package managers (e.g. npm and Hex both have commercial interests afaik).

                                                                      1. 3

                                                                        Github is exempt from the much discussed article 17 (formerly 13; user generated content), because of the redefinition of “online content-sharing service provider”. The EU defines:

                                                                        “Providers of services, such as […], open source software-developing and-sharing platforms, […] are not ‘online content-sharing service providers’ within the meaning of this Directive.”

                                                                        But the question is how definite this list is. This is what troubles me and what also Wikipedia has criticized. They write a list of a few services, put a “such as” before the list and expect that everything is fine.

                                                                        They list a few not-for-profit examples in this list, but for example not a “not-for-profit soccer movie hosting service”. Is my theoretical not-for-profit soccer movie hosting service ¹ now exempt or not? I guess not, but the list is prefixed with “such as” which for me means that it is not a definitive list. And it’s not a definition like “not-for-profit services such as not-for-profit encyclopedia, not-for-profit …”, it’s a list of “services such as”. The EU does not even know what kind of services they want to exclude from article 17, only that they want to exclude “services such as”. lmao.

                                                                        ¹ not so theoretical indeed, because there is or was a German service that hosted short clips from 4th or 5th (or even lower?) league soccer matches; I cannot find them anymore and maybe they are offline, because some years ago they got into big trouble with the soccer association for copyright infringement or licensing problems or so (I think all clubs playing in the league are members of the association and the association said the website was violating their exclusive right to publish soccer videos, even though the association of course has zero interest in selling clips from the 5th league…).

                                                                      1. 26

                                                                        This law targets link aggregators like lobste.rs explicitly. Nobody has any idea how it should work in practice but presumably linking to the BBC now incurs an obligation to pay. A sad day for the internet as yet another promising future is crushed beneath those who use the law as a means to oppress the populace.

                                                                        1. 4

                                                                          How can a site like this one be affected by this law?

                                                                          Correct me if I’m wrong, but, lobste.rs:

                                                                          1. doesn’t make money off the content they host,
                                                                          2. it hosts links (not quotes, not summaries, …), they are giving the original author more visibility;
                                                                          3. it also hosts discussion, which I believe is still a human right.

                                                                          If someone were to accuse Lobsters of damaging content creators (which is what this law is all about, isn’t it?) how would that differ from taking me to court for sharing a link in a closed chat group, or even IRL?

                                                                          Lobsters is by the community, for the community, it’s not one large corp promoting their stuff (I could see the argument made against HN as it’s YCombinator’s site and it hosts some publicity from time to time). That does not differ IMO from sharing things IRL, and we surely won’t stop sharing links with our friends, will we?

                                                                          If this law goes against Lobsters for those reasons, then I will understand all the noise made around this directive.

                                                                          1. 2

                                                                            it hosts links (not quotes, not summaries, …)

                                                                            well, depends on the links you have. technically, torrent sites which host only magnet links should be fine, too, but aren’t.

                                                                            1. 3

                                                                              Torrent sites and the like are link aggregators for copyrighted material. It’s a link to something that’s already illegal to distribute, therefore torrent sites are redistributing copyrighted material, which goes against copyright.

                                                                              But lobste.rs’ submissions link to direct sources of information, where you can see the content how its creator wanted it to be. Sometimes paywalled articles are downvoted here because not everyone can read them. If lobste.rs were redistributing copyrighted material it wouldn’t be paywalled.

                                                                              A clear example of the opposite is https://outline.com, which displays the content of a third party without its consent, and without keeping the shape and form of how the author wanted it to be.


                                                                              • This link is not illegal. It’s the primary source of the content, you are seeing the content how it was meant to be and from the creator’s own site.
                                                                              • This link is illegal, it’s a secondary source of copyrighted material, if the creator decides to paywall it, modify it, put ads on it, etc. They can’t.

                                                                              Lobsters links are to the direct (sometimes indirect, but it’s an unwritten rule to post the direct) source of the topic.

                                                                              I don’t know whether the EU-approved directive would put Lobsters in the “copyright infringement” bucket. If it does, then I repeat my previous point: if sharing links online with other people is illegal, where do you draw the line so that sharing links with IRL friends isn’t, because that would be an obvious violation of free speech?

                                                                              1. 3

                                                                                Agreed. Threadstarter/OP is being way hyperbolic.

                                                                                This is probably not the best law that could have been enacted, but it’s also a fact that American companies like Google and Facebook have been lobbying heavily against it. A lot of online rhetoric is reflective of this.

                                                                              2. 1

                                                                                Such links are off-topic for this site.

                                                                                Only legitimate pointer to a torrent link for this site’s rules is for an open-source distribution. But in that case, it’s more appropriate to link to that project’s release page.

                                                                            2. 7

                                                                              actually lobste.rs is exempt because it earns less than 10 million euros, and if it’s for educational or research use it’s exempt as well.

                                                                              just as sites like Wikipedia are exempt

                                                                              1. 26


                                                                                According to Julia Reda you have to fulfill all (not any) of those three criteria:

                                                                                • Available to the public for less than 3 years
                                                                                • Annual turnover below €10 million
                                                                                • Fewer than 5 million unique monthly visitors

                                                                                Since lobste.rs is nearly seven years old, an upload filter is required.
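
                                                                                The “all, not any” point is easy to get wrong, so here it is as a sketch. It encodes only the three bullets as summarized above, not the directive’s actual wording. The class and field names are made up, and apart from the age the example numbers are placeholders.

                                                                                ```python
                                                                                from dataclasses import dataclass


                                                                                @dataclass
                                                                                class Service:
                                                                                    age_years: float
                                                                                    annual_turnover_eur: float
                                                                                    monthly_unique_visitors: int


                                                                                def is_exempt(s: Service) -> bool:
                                                                                    """True only if all three thresholds are met (logical AND, not OR)."""
                                                                                    return (
                                                                                        s.age_years < 3
                                                                                        and s.annual_turnover_eur < 10_000_000
                                                                                        and s.monthly_unique_visitors < 5_000_000
                                                                                    )


                                                                                # Only the age (nearly seven years) comes from the comment; the other numbers are placeholders.
                                                                                old_small_site = Service(age_years=7, annual_turnover_eur=0, monthly_unique_visitors=100_000)
                                                                                print(is_exempt(old_small_site))  # fails the age criterion despite low turnover and traffic
                                                                                ```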

                                                                                1. 4

                                                                                  Reads like it was designed specifically to prevent newcomers to the field.

                                                                                  Clever, and not subtle at all. I’m surprised to see this coming from the EU.

                                                                                2. 10

                                                                                  You have to distinguish between former article 11 and former article 13 (now article 15 and article 17).

                                                                                  Article 17 (requirement to try to get licenses and if you cannot get a license make sure that no licensed content gets uploaded) has the limitation regarding company size and age (as correctly stated by qznc) and Wikipedia is exempt from this.

                                                                                  Article 15 however (requirement to license newspaper excerpts) does not have exemptions (according to my current knowledge). I guess however that all newspapers will again give Google a royalty-free license, because they fear that they will get fewer visitors without Google. Thus, in effect the only affected services are small services. Article 15 has these limits (imo not codified directly in the article, but in an annotation to the article): “The rights provided for in the first subparagraph shall not apply to private or non-commercial uses of press publications by individual users.”, but I am not sure how to interpret this “non-commercial uses by individual users” (it’s a similar grammatical construction in German).

                                                                                  German Wikipedia stated that they are exempt from “article 13” (now 17). They mention their implications by article 11 (now 15), but do not mention that they are exempt from it. They state “[Article 11] could complicate our research for sources on current topics” (“Dies könnte auch unsere Quellenrecherche bei aktuellen Themen deutlich erschweren.”)

                                                                                  1. 1

                                                                                    I guess however that all newspapers will again give Google a royalty-free license, because they fear that they will get less visitors without Google.

                                                                                    I can’t be certain, but a discussion between some newspaper owners on a BBC Radio 4 weekly programme on the state of media pointed out that similar laws already existed in both Germany and Spain (I think I remember the countries right), rendering Google News illegal in those countries and therefore unavailable. I don’t know if these new EU directives differ from those countries’ initial versions, but their laws stated clearly that a license fee must be charged, therefore free licensing became illegal. The discussion revolved around how damaging it was to a number of publications which obtained the majority, if not all, of their revenue-generating traffic from Google.

                                                                                  2. 7

                                                                                    Found something: I think lobste.rs is exempt from article 17 (article 13), because of the definitions in article 2. An “online content-sharing service provider” according to that definition “promotes [the content uploaded by its users] for profit-making purposes”. I think lobste.rs does not want to make any money? And then there comes the list of “educational or research use” that yakamo refers to. However, that’s only for article 17.

                                                                                    For article 15 (article 11) the relevant term is “information society service providers” and that is defined as: “‘information society service’ means a service within the meaning of point (b) of Article 1(1) of Directive (EU) 2015/1535”.

                                                                                    Directive (EU) 2015/1535 in turn defines: “‘service’ means any Information Society service, that is to say, any service normally provided for remuneration, at a distance, by electronic means and at the individual request of a recipient of services.” (but I have trouble with “normally provided for remuneration”, because in Germany we have some terms that sound like they would mean “commercial”, but in fact they don’t).

                                                                                1. 10

                                                                                  It’s going to be interesting to see how much this is going to affect the future of how the WWW functions. GDPR sure didn’t manage to be as severe a measure as we’d hoped it would be. Heck, I’m having trouble getting the relevant authorities to understand clear violations that I’ve forwarded to them, which then end up just being dismissed.

                                                                                  But this law here is of course not for the people, no… This is here for the copyright holders, and they carry much more power. So will this actually result in the mess we expect it to be?

                                                                                  1. 25

                                                                                    GDPR and the earlier cookie law have created a huge amount of pointless popup alert boxes on sites everywhere.

                                                                                    1. 10

                                                                                      The one thing I can say is that, due to the GDPR, you have the choice to reject many cookies which you couldn’t do before (without ad-blockers or such). That’s at least something.

                                                                                      1. 10

                                                                                        Another amazing part of GDPR is data exports. Before hardly any website had it to lock you in.

                                                                                        1. 4

                                                                                          You had this choice before though, it’s normal to make a cookies whitelist for example in firefox with no addons. The GDPR lets you trust the site that wants to track you to not give you the cookies instead of you having personal autonomy and choosing not to save the cookies with your own client.

                                                                                          1. 26

                                                                                            I think this attitude is a bit selfish since not every non-technical person wants to be tracked, and it’s also counter-productive, since even the way you block cookies is gonna be used to track you. The race between tracker and trackee can never be won by either of them if governments don’t make it illegal. I for one am very happy about the GDPR, and I’m glad we’re finally tackling privacy at scale.

                                                                                            1. 2

                                                                                              it’s not selfish it’s empowering

                                                                                              if a non-technical person is having trouble we can volunteer to teach them and try to get browsers to implement better UX

                                                                                              GDPR isn’t governments making tracking illegal

                                                                                              1. 15

                                                                                                I admire your spirit, but I think it’s a bit naive to think that everyone has time for all kinds of empowerment. My friends and family want privacy without friction, without me around, and without becoming computer hackers themselves.

                                                                                            2. 18

                                                                                              It’s now illegal for the site to unnecessarily break functionality based on rejecting those cookies though. It’s also their responsibility to identify which cookies are actually necessary for functionality.

                                                                                          2. 4

                                                                                            In Europe we’re starting to sign GDPR papers for everything we do… even for buying glasses…

                                                                                            1. 12

                                                                                              Goes on to show how much information about us is being implicitly collected in my honest opinion, whether for advertisement or administration.

                                                                                              1. 1

                                                                                                Most of the time, you don’t even have a copy of the document, it’s mostly a tl;dr document full of legal jargon that nobody reads… it might be a good thing, but far from perfect.

                                                                                          3. 4

                                                                                            “The Net interprets censorship as damage, and routes around it.”

                                                                                            1. 22

                                                                                              That old canard is increasingly untrue as governments and supercorps like Google, Amazon, and Facebook seek to control as much of the Internet as they can by building walled gardens and exerting their influence on how the protocols that make up the internet are standardized.

                                                                                              1. 13

                                                                                                I believe John Gilmore was referring to old-fashioned direct government censorship, but I think his argument applies just as well to the soft corporate variety. Life goes on outside those garden walls. We have quite a Cambrian explosion of distributed protocols going on at the moment, and strong crypto. Supercorps rise and fall. I think we’ll be OK.

                                                                                                Anyway, I’m disappointed by the ruling as well; I just doubt that the sky is really falling.

                                                                                                1. 4

                                                                                                  I agree that it is not the sky falling. It is a burden for startups and innovation in Europe though. We need new business ideas for the news business. Unfortunately, we have now committed to life support for the big old publishers like Springer.

                                                                                                  At least, we will probably have some startups applying fancy AI techniques to implement upload filters. If they become profitable enough then Google will start its own service, which will be free (in exchange for sniffing all the data of course). Maybe some lucky ones get bought before they go bankrupt. I believe this decision is neutral or positive for Google.

                                                                                                  The hope is that creatives earn more, but Germany already tried it with the ancillary copyright for press publishers (German: Leistungsschutzrecht für Presseverleger) in 2013. It did not work.

                                                                                                  1. 2

                                                                                                    Another idea I had for a nice AI startup: summarizing news with natural language processing. I do not see how writing news with an AI would be illegal; only copying the words/sentences would be illegal.

                                                                                                    Maybe, however, you cannot make public where you aggregated the original news that you feed into your AI :)

                                                                                                2. 4

                                                                                                  Governments, corporations, and individual political activists are certainly trying to censor the internet, at least the most popularly-accessible portions of it. I think the slogan is better conceptualized as an aspiration for technologists interested in information freedom - we should interpret censorship as damage (rather than counting on the internet as it currently works to just automatically do it for us) and we should build technologies that make it possible for ordinary people to bypass it.

                                                                                              2. 2

                                                                                                I can see a real attitude shift coming when the EU finally gets around to imposing significant fines. I worked with quite a few organisations that’ve taken a ‘bare minimum and wait and see’ attitude who’d make big changes if the law was shown to have teeth. Obviously pure speculation though.

                                                                                              1. 3

                                                                                                I checked the text a few days ago and I found some sections that were not so widely discussed very interesting. For example, the directive has an exemption regarding copyright for text and data mining for scientific organisations. This is similar to something we already have in Germany. However, the directive also includes this sentence that in my opinion tooootally is a section for the industry:

                                                                                                Article 3 (3): “Rightholders shall be allowed to apply measures to ensure the security and integrity of the networks and databases where the works or other subject matter are hosted. Such measures shall not go beyond what is necessary to achieve that objective.”

                                                                                                This is something I have not seen in German law, yet, and in my opinion means that copyright holders can declare which technology universities have to use if they want to do data mining / text mining on their material… (luckily according to the second sentence only to some extent).

                                                                                                Another detail I liked is in article 17 (the famous upload filtering article): “If no authorisation is granted, online content-sharing service providers shall be liable for unauthorised acts of communication to the public […] unless the service providers demonstrate that they have: (a) made best efforts to obtain an authorisation, and […] [b) upload filters] [c) take down notice]”

                                                                                                I probably misunderstand this one, but to me this “and” seems as if you have to prove that you tried to get a license, otherwise you are still liable for any copyrighted content that gets uploaded, even if you employ upload filters (and there is a false negative). That would essentially make this whole directive a “force the whole internet to request licenses from copyright holders”.

                                                                                                1. 5

                                                                                                  In case somebody is interested, I think this is the full text: http://www.europarl.europa.eu/doceo/document/A-8-2018-0245-AM-271-271_EN.pdf

                                                                                                  German: http://www.europarl.europa.eu/doceo/document/A-8-2018-0245-AM-271-271_DE.pdf

                                                                                                  Other languages can be selected by changing DE/EN in the URL, e.g. Croatian would be A-8-2018-0245-AM-271-271_HR.pdf and so on.

                                                                                                  1. 7

                                                                                                    Can anyone provide a simple bullet-point list of why this is a bad thing? It might help when discussing it with folks.

                                                                                                    1. 16

                                                                                                      Just off the top of my head:

                                                                                                      Newspaper licensing (former article 11):

                                                                                                      • still unclear what is “very short excerpt” (afaik no ruling in Germany yet and we had this law for a few years), even titles might not be “very short excerpts”
                                                                                                      • all newspapers gave Google a royalty-free license in Germany, because they were afraid of losing visitors -> thus it only helps Google and harms competitors and small services
                                                                                                      • essentially in my opinion it becomes impossible to start a startup company that somehow collects news (news younger than 2 years), because you’d have to request a license from each newspaper

                                                                                                      Licenses for user-generated content platforms (former article 13):

                                                                                                      • all experts say that you cannot create a reliable upload filter that can distinguish between original and satire (and similar problems), thus leading to a lot of overfiltering
                                                                                                      • getting licenses from all major labels is simpler for the big players than for small players
                                                                                                      • the “exemption” for small startups (must be younger than 3 years AND have fewer than 5 million users per month AND a turnover below 10 million Euro) is essentially useless, because all companies will have to use filters after 3 years (in the public communication they always acted as if they had used OR, but it is an AND).
                                                                                                      • the only ones that have enough money and knowledge to implement upload filters are Google, Facebook etc. Thus, they will be the only ones that can host content in compliance with the law or (which is very probable) they will provide Filter-as-a-Service to other companies.
                                                                                                      • Filter-as-a-Service will lead to routing of all uploads through Google / Facebook, thus this might have data protection problems
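
                                                                                                      To make the AND/OR point about the startup exemption concrete, here is a minimal Python sketch. The thresholds are the ones quoted in the list above and have not been checked against the directive text; `Platform`, `exempt_with_and` and `exempt_with_or` are made-up names for illustration only.

```python
from dataclasses import dataclass

# Thresholds as quoted in the comment; assumptions, not verified legal values.
MAX_AGE_YEARS = 3
MAX_MONTHLY_USERS = 5_000_000
MAX_TURNOVER_EUR = 10_000_000


@dataclass
class Platform:
    age_years: float
    monthly_users: int
    turnover_eur: int


def exempt_with_and(p: Platform) -> bool:
    """All three conditions must hold (the AND reading)."""
    return (
        p.age_years < MAX_AGE_YEARS
        and p.monthly_users < MAX_MONTHLY_USERS
        and p.turnover_eur < MAX_TURNOVER_EUR
    )


def exempt_with_or(p: Platform) -> bool:
    """Any single condition is enough (the OR reading)."""
    return (
        p.age_years < MAX_AGE_YEARS
        or p.monthly_users < MAX_MONTHLY_USERS
        or p.turnover_eur < MAX_TURNOVER_EUR
    )


# A tiny platform that has simply existed for four years.
small_but_old = Platform(age_years=4, monthly_users=20_000, turnover_eur=50_000)

print(exempt_with_and(small_but_old))  # False: age alone removes the exemption
print(exempt_with_or(small_but_old))   # True: still small by users and turnover
```

                                                                                                      Under the AND reading, a platform loses the exemption just by getting older than three years, no matter how small it stays.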

                                                                                                      There’s also the less famous former article 12 (there were some talks about it on the Saturday protests):

                                                                                                      • this one seems to make authors pay a bigger share of their earnings to the publishing companies

                                                                                                      In general, the way the supporters of the new law behaved during the whole process was disgraceful (they were extremely rude), especially the German party “CDU/CSU in the European Union” (to the point that the national parts of the CDU/CSU had to try to fix it…)

                                                                                                      • When there were a lot of emails to the politicians about the law they called us bots.
                                                                                                      • When there were a lot of phone calls to the European Parliament they called it a campaign controlled by the US companies.
                                                                                                      • When there were massive protests in many European cities last Saturday they called us “bought by so called NGOs”.

                                                                                                      1. 2

                                                                                                        Thanks for putting this together!

                                                                                                        1. 1

                                                                                                          Thanks, very helpful!

                                                                                                      1. 3

                                                                                                        Haven’t heard about any hype around monorepos. Is this actually happening? I am structuring one repo per project for my git projects and it works well.

                                                                                                        At work we have a few SVN repositories that contain more than one project. These contain tools for one specific software or a specific topic and we usually set them up so that each tool has its own dedicated subfolder. For example Repo/ToolA and Repo/ToolB. This allows us to checkout only the specific subfolder in Jenkins.

                                                                                                        1. 6

                                                                                                          If you have a large organization where many services interact with one another, a monorepo is fantastic because you can change the interface between services in a single commit (and have CI fail if you didn’t realize that some additional team was depending on your interface).

                                                                                                          1. 1

                                                                                                            I see your point with this one, but with the cons given from your question and static code checks being a pro, I would still go for multiple repositories and check the compatibility between the components with a system test suite or so. But maybe we are talking about different sizes here, plus I see that luckily other people have some ideas and solutions for your problem.

                                                                                                            Just to make sure we are not talking about totally different things. I think in my company (~10k employees) we have a few software packages we sell that are developed each by around 20 people, might also be a few more up to 50. And since these are monolithic desktop applications I assume that these reside in one repository each. This makes 5 repositories for 5 software products. A monorepo for this situation would now mean that 5 software products get developed in 1 repository.

                                                                                                            So, if you’re talking about the field of microservices (which I expect from the other answers) then we’re definitely talking about different situations :D

                                                                                                            Maybe there is also a bit of a distinction in usefulness between desktop applications and hosted-only software. If we ship our desktop product P in version 1.2 to customer C, that customer expects us to keep the API stable for a few more versions. This means we have to keep the API backwards-compatible for a few months (in practice I think it’s years) anyway.

                                                                                                            1. 1

                                                                                                              Monolithic desktop applications don’t tend to have frequently-changing network/library interfaces, so I doubt the benefits would be as visible there, yeah

                                                                                                        1. 3

                                                                                                          I’m not especially familiar with Docker - what’s the benefit of using it instead of a package manager here? It looks pretty similar to a source-based package where it’s a list of dependencies and a build script.

                                                                                                          1. 3

                                                                                                            Not much of a Docker fan here - I use it in some situations at the moment and provide one of my projects as docker images for convenience for users.

                                                                                                            I think the main point of using images/containers is if you have large scale deployments, do not want to care what runs where, and want additional features such as automatic restart, automatic scaling, and blue-green deployment (even though much of that is not part of docker itself, but maybe kubernetes can handle it?).

                                                                                                            Another advantage would be simple distribution of the container to any system. As long as your users have docker you don’t have to create packages for different distributions / Windows and they can still run it with one command.

                                                                                                            However, I’d say that most people that use containers would not really need it. I have two servers with several projects running and this goes quite well with ansible deployments. Scaling out with ansible is also no problem - as long as you do not need the dynamic allocation.

                                                                                                            But if you can actually profit from docker then the article’s method seems quite nice. Most docker images are around 100MB or more (I think I have one that is almost 1GB, just because I did not optimize anything…) and a few days ago I deleted around 60GB of docker image data from my hard disk.

                                                                                                            1. 2

                                                                                                              It’s all about trust.

                                                                                                              If you inspect the Dockerfiles, you’ll see each source tarball has a SHA-256 checksum associated with it that gets checked during build. This means source integrity is hard-coded. Anyone can download the corresponding tarball and check that it matches the one in the Dockerfile.

                                                                                                              During the build stage, those tarballs are downloaded and the build only proceeds if integrity is verified.

                                                                                                              You don’t have that kind of protection when downloading from a repository, because the binary is already built and trust is put on the repository maintainer. Now, this means trust is put on the NGINX/OpenSSL/zlib/etc. maintainers to release trusted tarballs on their websites, but if we can’t trust the devs with that task, that’s a whole different problem.
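
                                                                                                              A minimal Python sketch of this kind of pinned-checksum check, assuming the expected SHA-256 digest is hard-coded next to the download step. The function names and the stand-in payload are hypothetical; the actual Dockerfiles may do this differently (for example with a shell checksum tool).

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_tarball(data: bytes, expected_sha256: str) -> None:
    """Raise ValueError unless the data matches the pinned digest."""
    actual = sha256_of(data)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )


# Stand-in for a downloaded source tarball (hypothetical content).
tarball = b"pretend this is a source tarball"

# In a real build this hex string would be hard-coded in the build file;
# it is computed here only so that the example is self-contained.
pinned = sha256_of(tarball)

verify_tarball(tarball, pinned)  # passes silently, the build may proceed

try:
    verify_tarball(tarball + b" with a tampered byte", pinned)
except ValueError as exc:
    print("build aborted:", exc)
```

                                                                                                              The point is only that the build refuses to continue when the downloaded bytes differ from what was pinned.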

                                                                                                              1. 5

                                                                                                                I said source package. Gentoo, Nix, probably most package managers that can build from source include this by default, and it’s a single build command in the unpacking phase if not.

                                                                                                              2. 2

                                                                                                                I think the goal is to just let the container loose on infrastructure meant to handle ports, logs, hardware, lifetimes, without ever coming close to treating the install like a pet.

                                                                                                                1. 1

                                                                                                                  Here’s an intro that covers a few uses.

                                                                                                                1. 2

                                                                                                                  I am in favor of this. Just a few days ago I argued with some friends about Amazon blocking even one of our bots, which only extracts the page title of a URL, if it visits amazon.com too often in a short time. On the other hand, scrapinghub claims that Amazon is one of their customers. This means Amazon actively uses scraping, but blocks even simple bots from their own website.
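
                                                                                                                  For a sense of how simple such a bot can be, here is a rough Python sketch of just the title extraction, using only the standard library. It is not the actual bot; `TitleParser` and `extract_title` are hypothetical names, and fetching the page (and any rate limiting) is left out.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace and return None for a missing title.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


page = "<html><head><title>  Example\n Page </title></head><body><h1>Hi</h1></body></html>"
print(extract_title(page))  # Example Page
```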

                                                                                                                  I also think that people should know that public data is actually public (considering privacy) and companies claiming that they prevent scraping etc. just obfuscate the whole situation. Your data is still 100% public.

                                                                                                                  In the EU we also have something called sui generis database rights. This is copyright on databases without any requirements on originality of the content. The only requirement is some form of “investment” (financial or work) into the database. You are allowed to copy “irrelevant” parts of a database. Unfortunately, all the German case law I have found was about websites that do ad-hoc scraping, like plane ticket search (a user performs a search and the platform forwards the search to all the websites). These have all won their court cases, because the courts said that each user request must be considered individually, and individually they only scrape an irrelevant part of the whole database. But I did not find any German decisions about situations where people actually scraped and stored content.

                                                                                                                  This sui generis database protection even protects collections of documents that otherwise would not have any copyright at all. E.g. German laws are not copyright protected. However, the company that publishes all laws in Germany (“Bundesanzeiger”) claims that it has copyright protection on the collection of laws. Thus, you are allowed to copy the laws themselves (because they are not copyrighted), but you are not allowed to copy the database (i.e. the collection of the laws). In reality this means you can copy the laws, but you cannot tell anybody that you got them from Bundesanzeiger (because that would mean that you used their copyright protected catalogue).

                                                                                                                  1. 1

                                                                                                                    That is a good anecdote about Amazon. Indeed, companies want access to information themselves but do not want others to have it, so that they keep a competitive advantage.

                                                                                                                  1. 7

                                                                                                                    I scrape the web professionally, and have for several years now. I don’t see anything morally objectionable about scraping itself; it’s just a form of data extraction. Some people do shady things with scraping, or stupidly degrade other people’s services with it, but if information is on a publicly available website, it is asking to be extracted and processed. “Information wants to be free” etc etc

                                                                                                                    1. 1

                                                                                                                      Mind sharing more information about your profession? I am currently starting an open source platform (on top of scrapy and mainly for myself, but I try to make it clean FOSS). In simpler words, I just standardize data storage, some pipelines etc. and build analysis presentations on top of it (e.g. monitoring of border wait times).

                                                                                                                      I would be interested in what kind of customers you have (small companies, large companies) and how large the scrapes are (1k pages per scrape, 1m pages per scrape, …). Or are there no direct customers and you somehow make products from the gathered information and sell the products?

                                                                                                                      Plus, do you have to deal with legal protection of databases in your jurisdiction (the EU has it…) and how do you handle it?