1. 1
    01:arrow/ (pr/4574) $ echo hi >> README.md 
    01:arrow/ (pr/4574✗) $ false
    01:arrow/ (pr/4574✗) ! 
    

    The git branch is shown in green, with a red dirty marker if the workspace is dirty. The $ turns into a red ! if the previous command did not exit with 0. The path is truncated to the tip of the tree (the topmost directory).

    1. 6
      char buf[src_size];
      

      Won’t this fail if the source file is larger than RAM? I’m guessing more robust solutions (cp, rsync) don’t have this issue.

      1. 3

        It should fail if it’s larger than the available stack size, I think.

        1. 1

          How about char* buf?

          1. 3

            I don’t see how that helps. You’d need a sliding window of a set size, say 1 GB, that is emptied after that portion is copied, or you could just ignore the problem if it’s for personal use.

            1. 1

              What do you mean a sliding window?

              1. 3
                1. 3

                  It’s a buffer implementation; you’d need something like this for a robust copy solution. If you don’t care about supporting larger files, you can ignore this.

                  If you do care about supporting larger files: create a buffer of, say, 1 GB, load the first 1 GB of the source file, copy it to the destination, and rinse and repeat until the whole file is copied. You might need to seek as well, but I think not, since I believe C’s read advances the file position.
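
                  A minimal sketch of that chunked approach in C (hypothetical copy_chunked helper, untested, and using a 1 MiB buffer rather than 1 GB; a real tool would also handle EINTR and partial writes more carefully):

                    #include <stdio.h>

                    /* Copy src to dst in fixed-size chunks so memory use stays bounded,
                       regardless of how large the source file is. */
                    static int copy_chunked(FILE *src, FILE *dst) {
                        static char buf[1 << 20];               /* 1 MiB window */
                        size_t n;
                        while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
                            if (fwrite(buf, 1, n, dst) != n)
                                return -1;                      /* write error */
                        }
                        return ferror(src) ? -1 : 0;            /* distinguish EOF from read error */
                    }

                  As noted above, fread/fwrite advance the file position themselves, so no explicit seek is needed.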

              2. 2

                You’ve changed the code now to just do:

                  char* buf;
                  fread(buf, 1, src_size, src);
                

                Won’t that just fail since buf is uninitialized?

                1. 0

                  I tested it, it didn’t

                  1. 4

                    You’re relying on undefined behavior then, which is inadvisable.

                    1. 1

                      Are you joking? Even the most cursory check triggers the warning:

                      $ x86_64-w64-mingw32-gcc  -Wall   copy.c
                      copy.c: In function ‘copy’:
                      copy.c:14:3: warning: ‘buf’ is used uninitialized in this function
                      [-Wuninitialized]
                      
                      1. 1

                        I think OP is learning C.

              1. 2

                If this is for Linux, then I think sendfile() would be fastest, as it happens entirely within the kernel. If you can’t use sendfile() (old Linux kernel, non-Linux Unix system), then I think calling mmap() on the file being copied, then madvise(MADV_SEQUENTIAL) followed by write() would be a good thing.
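
                Roughly, the sendfile() route looks like this (a Linux-specific sketch with error handling trimmed; sendfile() advances the offset itself, so the loop just covers short transfers):

                  #include <sys/sendfile.h>
                  #include <sys/stat.h>

                  /* Copy in_fd to out_fd entirely inside the kernel (Linux). */
                  static int copy_sendfile(int in_fd, int out_fd) {
                      struct stat st;
                      if (fstat(in_fd, &st) < 0)
                          return -1;
                      off_t off = 0;
                      while (off < st.st_size) {
                          if (sendfile(out_fd, in_fd, &off, st.st_size - off) <= 0)
                              return -1;
                      }
                      return 0;
                  }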

                1. 1

                  No need to write(); just mmap() the output file too, and even use MADV_DONTNEED.

                  1. 0

                    Not Linux-specific

                  1. 3

                    A small and tractable project like the one you just listed seems a perfect way to do it (if you have free time). I would advise using a modern language like Go or Rust, since the ecosystem and tooling make it easier to get started.

                    Some other learning projects: https://cstack.github.io/db_tutorial/

                    1. 0

                      This db tutorial looks amazing. I was thinking about using C, but perhaps Rust or Go is the right way forward. I’ve been getting similar feedback. Especially around Go since it’s used in production quite a bit. On the other hand Rust is genuinely a step forward in terms of PL tech.

                      1. 1

                        What fsaint said. I’ve done a small CI system in Golang and CLI tools like ps and a pseudo-MIPS assembler in Rust. Great languages. If you want to do PL-type stuff, the Writing an Interpreter in Go and Writing a Compiler in Go books are good.

                    1. 10

                      If you find this interesting, you absolutely should read The Third Manifesto (PDF Warning).

                      It is an attempt to put relational database languages on a firm, type-safe basis while still leveraging the power of the relational algebra. It’s also just a really fun read.

                      1. 3

                        EdgeDB/EdgeQL conforms to many (but not all) of the D proscriptions.

                        1. 6

                          1st1 says in their profile that they’re an EdgeDB employee. Are you?

                          1. 3

                            Yes, he’s EdgeDB co-founder and the author of the linked blog post.

                            1. 12

                              Ok, thanks. Just like to be sure about that when I’m reading articles about products and related tech. This is a good one. Welcome to Lobsters to the both of you. :)

                              1. 4

                                Thank you! :)

                          2. 3

                            Python is an unusual choice for a database. Do you not expect latency/performance to be an issue?

                            1. 4

                              We’ve managed to squeeze a lot of performance out of Python and there are ways to do way more. Python has been amazing at allowing us to iterate quickly. That said, the current plan is to slowly start rewriting critical parts in Rust.

                              1. 2

                                :D

                                Any plans for a Rust language interface? I have a medium sized database-oriented project in Rust and it would be cool to try rewriting part of it for EdgeDB.

                                1. 2

                                  Not yet. Maybe you want to help us implement it? :)

                                  1. 3

                                    I would love to, in my Copious Free Time. …of which I have none. It’s very tempting though.

                                2. 1

                                  Sounds good, thanks for the insight :)

                                3. 2

                                  We’ve posted some initial benchmarks here: https://edgedb.com/blog/edgedb-1-0-alpha-1/

                                4. 2

                                  Could you elaborate?

                              1. 4

                                This is why I recommend everyone use an ad blocker. Not privacy, not a categorical opposition to advertising, but because the ad networks have terrible quality control. They let nonsense like JavaScript injection, coin miners, and simple fraud through all the time.

                                Television advertisements have legal and network-wide controls in place to ensure that the ads don’t lie. This isn’t from a place of the goodness of their hearts, but rather because if people don’t trust ads to live up to even a minimal amount of truth (“puffery” is okay, but claiming to relieve headaches without clinical trials, or simply eating your money and riding off into the sunset, are not), then they won’t actually buy anything.

                                Internet ads, on the other hand, are as likely to deliver fake antivirus products and phishing attacks as non-advertised sites are, and they get mixed in with sites that are otherwise good. I’m never going to intentionally click a banner ad, because I don’t trust them, and they’re not as relevant as the organic results are. So why would you want to see them?

                                1. 4

                                  It’s also an amazing distribution channel for malware. You can programmatically select by multiple criteria like geography, browser, and platform type.

                                  Say, for example, you target only devices that you’ve seen spend the weekend at Mar-a-Lago and the week in Washington, and selectively distribute your JavaScript 0-day there. All you need is an Apple IFA and the hope that someone plays a game with a viewport embedded.

                                  It enables any actor to run JavaScript on virtually any browser.

                                1. 2

                                  There’s a poor man’s technique to avoid auditing. The third-party ad tracker (which Google doesn’t control) is configured to do dynamic dispatch on geolocation. If the request comes from California/New York (or from a Google network block), the tracker forwards to the legitimate site. Otherwise you land on the bad actor’s site.

                                  1. 16

                                    Checklist for using mmap on files:

                                    • No portability (Linux/BSD only). On the surface it’s portable, but the failures aren’t.
                                    • Local files on friendly file systems only; reading the (fs) source is the only way to know the failure modes, the mmap manpage is useless here.
                                    • File size known, pre-allocated, and static for the duration of the mmap. The size can be increased, but beware of mremap (see the later point).
                                    • Known and controlled bounds on the number of mmap invocations. Mappings are a limited resource, like file descriptors; see sysctl vm.max_map_count.
                                    • Lifetime management is of utmost priority. Calls to munmap are carefully tracked. Make sure that every pointer into the unmapped regions is reclaimed/invalidated. This is probably the most difficult part. If your pointers are not wrapped with a handle on the mmap resource, you likely have a use-after-free bug lying somewhere.
                                    • Always assume mremap will give you a new address, or fail under MREMAP_FIXED. This is especially relevant to the previous point.
                                    • Using a signal handler to recover from mmap failures is pure madness and a daunting task to make re-entrant. The rule of thumb is that the only thing you can do in a non-crashing signal handler is flip an atomic.

                                    If you meet all the previous points, go ahead and use mmap. Otherwise trust the kernel and use pread/pwrite. mmap is amazing, but it’s a double-edged nuclear foot-gun.
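
                                    For the simple read-only case, a hedged sketch of what the lifetime point looks like in practice (hypothetical with_mapped_file helper, error handling abbreviated; note that on Linux, I/O errors on the mapping surface as SIGBUS rather than a return code, which is part of the non-portable failure mode above):

                                      #include <fcntl.h>
                                      #include <sys/mman.h>
                                      #include <sys/stat.h>
                                      #include <unistd.h>

                                      /* Map a file read-only, hand it to use(), then tear the mapping down. */
                                      static int with_mapped_file(const char *path,
                                                                  void (*use)(const char *, size_t)) {
                                          int fd = open(path, O_RDONLY);
                                          if (fd < 0) return -1;
                                          struct stat st;
                                          if (fstat(fd, &st) < 0) { close(fd); return -1; }
                                          char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                                          close(fd);                      /* mapping stays valid after close */
                                          if (p == MAP_FAILED) return -1; /* check MAP_FAILED, not NULL */
                                          use(p, st.st_size);             /* no pointer may outlive munmap below */
                                          munmap(p, st.st_size);
                                          return 0;
                                      }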

                                    1. 3

                                      mmap() is POSIX, so any Unix system should support it (for instance, Solaris does). I agree, but would also add:

                                      • Make sure the file DOES NOT CHANGE from outside your program. So don’t go overwriting the file with a script while a program has mmap()ed it. Bad things happen.
                                      • It can be used to map memory without a file, using MAP_ANONYMOUS. I would wager this is probably the most common use of mmap().
                                      • Once you mmap() a file, you can close the file descriptor. At least, this works on Linux, Mac OS X, and Solaris in my experience (of mmap()ing a read-only file)
                                      1. 5

                                        Note that MAP_ANONYMOUS is not in POSIX. The spec is actually quite lengthy, and if you read it carefully, there are a lot of caveats and wiggle room left for implementations to provide only a minimal version.

                                        https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html

                                        1. 2

                                          That’s surprising (but it shouldn’t be—I mean, the mem*() functions from Standard C were only marked async safe a few years ago in POSIX). On the plus side, it appears that Linux, BSD and Solaris all support MAP_ANONYMOUS.

                                        2. 5

                                          I meant that the ways things go wrong with mmap are not portable. If you’re building a library or a long-running daemon, this is critical. Other points I forgot in my list:

                                          • mmap/munmap is a costly syscall. A corollary of this and the third point in the original list is that you should not mmap many small files; a few large blocks is the optimal case.
                                          • You can only sync on page boundaries, so if you write, it needs to be in aligned blocks of 4k, otherwise you’ll trigger some serious write amplification on SSDs (see the alignment sketch below).
                                          • msync is tricky: MS_ASYNC is a no-op on Linux and doesn’t report failures, so you might never know whether a write really succeeded. You also have to know about the vm.dirty* sysctls, which are directly correlated with async writeback.
                                          • Huge pages don’t work with file-backed mmap (which wouldn’t make sense anyway: there’s only a single dirty bit for the whole page!).
                                          • If you have many virtual memory mappings, the chance of a TLB miss increases and your performance can decrease depending on your access patterns.

                                          I’m starting to realize that the mmap man page should be rewritten to include all of these pitfalls :)
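
                                          For the page-boundary point, the usual pattern is to round the dirty range out to page granularity before syncing, along these lines (hypothetical sync_range helper; MS_SYNC so failures are actually reported, unlike MS_ASYNC on Linux):

                                            #include <sys/mman.h>
                                            #include <unistd.h>

                                            /* msync() wants a page-aligned start address: round the dirty byte
                                               range [off, off+len) out to page boundaries first (assumes the
                                               page size is a power of two, which it always is in practice). */
                                            static int sync_range(char *map, size_t off, size_t len) {
                                                size_t page = (size_t)sysconf(_SC_PAGESIZE);
                                                size_t start = off & ~(page - 1);                  /* round down */
                                                size_t end = (off + len + page - 1) & ~(page - 1); /* round up */
                                                return msync(map + start, end - start, MS_SYNC);
                                            }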

                                          1. 2

                                            At work, we have a near-perfect use case for mmap()—we map a large file shared read-only (several instances of the program share the same data) on the local filesystem (not over NFS) that contains data to be searched through, and it’s updated [1] periodically. That’s the only case I’ve seen mmap() used [3].

                                            [1] By deleting the underlying file [2], then moving in the new version. Our program will then pick up the change (I don’t think the system supports file notification events, so it periodically polls the timestamp of the file—this is fine for our use case), mmap() the updated file, and when that succeeds, munmap() the old one.

                                            [2] Unix file semantics to the rescue! Since there’s still a reference to the actual data, the file is still around, but the new copy won’t overwrite the old copy.

                                            [3] Although it’s possible the underlying runtime system uses it for memory allocation.
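
                                            A rough sketch of the remap-on-update dance described in [1] (hypothetical maybe_remap helper, simplified: the real thing also has to make sure no reader still holds a pointer into the old mapping before retiring it):

                                              #include <fcntl.h>
                                              #include <sys/mman.h>
                                              #include <sys/stat.h>
                                              #include <unistd.h>

                                              /* Called periodically: if the data file's mtime changed, map the new
                                                 version, and only then drop the old mapping. The old file's data
                                                 stays readable through the old mapping even after the file has
                                                 been replaced on disk. */
                                              static void maybe_remap(const char *path, char **data,
                                                                      size_t *len, time_t *seen) {
                                                  struct stat st;
                                                  if (stat(path, &st) != 0 || st.st_mtime == *seen)
                                                      return;                      /* missing or unchanged */
                                                  int fd = open(path, O_RDONLY);
                                                  if (fd < 0) return;
                                                  char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
                                                  close(fd);
                                                  if (p == MAP_FAILED) return;     /* keep serving the old data */
                                                  if (*data) munmap(*data, *len);  /* retire the old mapping */
                                                  *data = p; *len = st.st_size; *seen = st.st_mtime;
                                              }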

                                            1. 1

                                              When working with IO devices whose bandwidth is equal to or greater than the memory bandwidth (on my desktop, I’m capped at 10-12G/sec on a single core, or 48-50G/sec for all cores), you’re eliding one copy of the data, effectively reaching the upper bound (instead of half).

                                      1. 3

                                        Thanks for sharing your learnings! This couldn’t have come at a better time, since I’m considering learning Rust by implementing a TUI as well. Don’t ask me for what, I’m embarrassed to say, but maybe I’ll share it as well once it reaches a satisfactory state.

                                        1. 4

                                          This morning it crossed my mind that I would like a gmail TUI client.

                                          1. 4

                                            Lynx used to work before they killed noscript support. :(

                                            But some other terminal browser may work. browsh comes to mind, although I personally think it’s pretty heavyweight.

                                            1. 1

                                              It appears you can still browse gmail without js. I have to use gmail at work, and after disabling js, I can load a 2005-era gmail webpage.

                                              “You are currently viewing [Company] Mail in basic HTML. Switch to standard view | Set basic HTML as default view”

                                              1. 3

                                                Per the blog post, they disabled JavaScript-free logins. So you’d have to log in with one browser and somehow copy the cookies into lynx and hope it works.

                                                https://security.googleblog.com/2018/10/announcing-some-security-treats-to.html

                                                1. 1

                                                  Hmm…It’s still been working for me. I’ve been using the HTML GMail interface through www/links+ (links2 to the linux folk) for a long time. And I can log in with 2FA just fine.

                                            2. 2

                                              Oh I fully agree. I can’t be bothered to configure mutt and the other 5 tools you need. Also likely I’d be happy with just the basics but in TUI form.

                                              (My project is not about that)

                                              1. 1

                                                Well, depending on your editor preference, there’s a vim plugin, or you can use emacs’ included email client.

                                                Alternately, there was a project several years back called sup that aimed to put a gmail-like interface in the terminal, for any mail server. Sup is written in ruby.

                                                1. 5

                                                  I’m writing a TUI MUA in Rust. I haven’t released it yet because I don’t have much free time. I have some early screenshots on its mockup site: https://meli.delivery

                                                  1. 2

                                                    That does look really promising!

                                                  2. 1

                                                    sup is indeed really good. I can’t remember why exactly I stopped using it.

                                                    EDIT: now I remember: sup is no longer supported (for instance https://github.com/sup-heliotrope/sup/issues/546). There are instructions around to get it to work on an older ruby version (https://wiki.archlinux.org/index.php/sup#Configuration) but sup-config still crashes. Plus it needs offlineimap + mailfilter + msmtpd … that’s a bit too much for my taste

                                                    1. 1

                                                      it’s a turnoff, I want to edit a config with credentials and be done with it.

                                                      1. 1

                                                        Absolutely. If you find that or write that, let me know, I want to try it out.

                                                    2. 1

                                                      interesting.

                                                1. 5

                                                  It looks like an example of this is Instagram’s “explore” feed. 2-3 years ago it was really interesting and showed relevant and weird pics for me. Then it suddenly started to show only things I especially hate: non-thematic video bloggers, cars, dogs, hunting, rap and rappers’ fashion. Now it consists of “average dull” content, like ubiquitous video bloggers that dye their hair every day, highly promoted “funny cats”, and reviews of decorative cosmetics. This might be just because they removed machine learning altogether and replaced it with a simpler “what’s popular” algorithm, but the same also happened to ads in the FB ad network.

                                                  1. 7

                                                    The client is the advertiser, they care about how much inventory they can fill.

                                                    1. 1

                                                      AFAIK, there’s no option for buying placement on Explore, at least not in the public interface for advertisers, last time I looked. Most of the content I see there is not strictly commercial, just on topics that are the opposite of what I’m interested in.

                                                      1. 1

                                                        But those topics might be the most valuable ones for them, right? Like, if no one buys ads targeted at niche X, then Instagram doesn’t have any incentive to promote niche X.

                                                    2. 4

                                                      I feel like often when a model goes off the rails, it’s because the loss function being optimized isn’t appropriate for the domain. What’s popular tends to be what’s recommended if the algorithm doesn’t think it has enough data on you. But that’s a pretty weird situation if you’ve been on the platform for years and the recommendations used to be good.

                                                      1. 2

                                                        Just speculating here, but are you privacy-conscious when using the Internet? If so, there might not be enough “hooks” in your behavior that Instagram can latch onto.

                                                        1. 1

                                                          I’m using an ad blocker, and it cuts out Facebook buttons, but I’ve been using it for the last 5-7 years and the recommendations worked before. It’s not on all sites, and I use no fancy things like separating cookies across tabs. As I understand it, website visits are not the most important data anyway; I leave more information on Instagram itself and in other places where I’m logged in. Also, mostly only mainstream news websites have Facebook trackers; what’s linked here on Lobsters usually has Google Analytics at most, and I doubt Facebook can buy data from Google.

                                                          1. 1

                                                            Thanks for the clarification and expansion.

                                                            Instagram in its entirety for me represents “digital sharecropping” in its most extreme form so far. There’s barely any pretense to serve the needs of its users - the people producing the content. The only entities who matter are the advertisers.

                                                            1. 1

                                                              How is it different from, say, good old Flickr? Users generate content, and the platform shows advertisements too. The only ethical advantage of Flickr is that it has a field to mark your photos with CC licenses, so others can use your photos. For non-CC photos, it even has anti-download measures, almost like Instagram. Instagram has a far worse community, being mainstream and mobile-oriented, but that’s not related to the advertising model.

                                                              Almost every text/photo/video publishing platform works that way, except maybe Medium. But Medium is much worse and shady as hell.

                                                              1. 2

                                                                Flickr vs IG is a difference in degree, not in kind, I agree.

                                                                I concede that IG probably has an order of magnitude more users than Flickr at its peak.

                                                                Below is a list of stuff that Flickr has that IG lacks:

                                                                User control

                                                                • Users can organize their photos in groups and sets, and post to other groups. Group owners can set rules.
                                                                • Relevant tags - tags can be for the user only, not signal intent like on IG
                                                                • Rich API - I have Lightroom integration, have a plethora of apps to choose from, and I’ve written my own script to download my images
                                                                • Original size images - with access control
                                                                • Follower access control - friends/family can have more rights than the general public

                                                                Economics/financing

                                                                I wish I had numbers on this, but I’ve heard it reported that Flickr is/was self-sufficient from paid users (I have a paid account). Advertising is extra, but not the driving force.

                                                                Flickr is (now) owned by SmugMug, another image hosting site with paid tiers. Their business model is to host images, facilitate print sales, give photographers a platform etc.

                                                                Conclusion

                                                                IG now is a social/news network with images.

                                                                Flickr is an image hosting platform with social features.

                                                                For me as a photographer, I much prefer the level of control Flickr gives me. However I am not a seller of photos. For those that are, IG is absolutely essential, and its constant changing of timeline sorting, visibility algorithms etc. screws those photographers over. But that’s fine, because those photographers are not the customer of IG.

                                                                (Edit spelling)

                                                      1. 6

                                                        This is some serious persistence. Bravo Hugo!

                                                        1. 3

                                                          Cool, thanks for posting. I might be able to use this in my own compiler project, since I have many of the same requirements. Building a domain specific IR + optimization passes from scratch is a big job that I’ve been putting off. A library that takes care of all the boilerplate seems attractive.

                                                          Here’s a question. My language can be either interpreted or compiled. Compilation is required for execution on a GPU, or for fast execution on a CPU, but compilation is slow. The interpreter works by quickly compiling to an intermediate representation, then interpreting that IR. It starts instantly, with no discernible lag, and that’s helpful when using the REPL interface or doing live coding.

                                                          In the optimizer that I want to build, constant folding and partial evaluation will be very important. This could lead to the optimizer containing a copy of the interpreter, and I don’t want to maintain two interpreters in parallel. So the question is, can I design an IR that serves as both the executable format for the interpreter, and also as the input and output of the optimizer? Is this a thing that people do? Are SSA, CPS or ANF better or worse for this?

                                                          1. 3

                                                            For a somewhat related paper, see Adaptive Execution of Compiled Queries.

                                                            1. 1

                                                              Your questions remind me of this post describing WebKit’s IRs and an effort to replace LLVM IR with a custom domain-fit IR: https://webkit.org/blog/5852/introducing-the-b3-jit-compiler/

                                                              The WebKit posts by Filip Pizlo about the evolution of Webkit’s JS engine are fascinating.

                                                            1. 4

                                                              If you are doing request/response services that rarely mutate the request buffer, consider using other serialization methods so you can get zero-copy performance.

                                                              1. 2

                                                                What do you mean zero-copy performance? Zero copies of what?

                                                                1. 14

                                                                  When protobufs are unpacked, they’re copied from the serialized binary stream into structs in the native language. FlatBuffers are an alternative, also by Google, that don’t need to be unpacked at all. The data transmitted over the network is directly useable.

                                                                  Interestingly, there are also some zero copy, zero allocation JSON decoders, like RapidJSON. Its methods return const references to strings directly in the source buffer rather than copying them out. But of course it still needs to parse and return numbers and booleans, rather than using them directly from the message like FlatBuffers.

                                                                  The biggest problem with copying is copying large chunks of binary data out of the message. Suppose you wanted to implement a file storage API using gRPC. Right now to handle a read call the server would have to copy the data into memory, copy it into the serialization buffer, and send it. It would be much better to avoid that extraneous copy for serialization.

                                                                  Our internal protobuf implementation handles this with something called cords—essentially large copy-on-write shared buffers—but cords aren’t open source yet. You can see references to ctype=CORD in the protobuf open source code and docs, and there’s a little bit of discussion here on the public mailing list.

                                                                  1. 2

                                                                    +1 to this. In a real-world test case with ~webscale~ traffic, the heap fragmentation caused by unserialising ~10,000 protobufs per minute was enough to inexorably exhaust available memory within minutes, even with jemalloc and tuning to minimise fragmentation, and after doubling the memory available a few times to check that it wouldn’t cap out. I kept bumping into cord references online and wishing they were part of the open-source implementation.

                                                                    Swapped out protobuf for a zero-copy solution (involving RapidJSON! :D) — which meant swapping out gRPC — and memory use became a flat line. We’ve become somewhat avoidant of gRPC since this and some other poor experiences.

                                                                    1. 4

                                                                      That’s weird, 10k protobufs per minute isn’t very many. As you might imagine, we do a lot more at Google and don’t have that problem.

                                                                      Since you mention cords, were these protobufs with large strings?

                                                                      What did you tune in jemalloc? Was this in a containerized environment? Did you limit the max number of arenas?

                                                                      1. 3

                                                                        Since you mention cords, were these protobufs with large strings?

                                                                        Yes – the documents were about as simple as it gets, two strings. One huge, one tiny. The response to most requests was a repeated string, but we found that returning an empty array didn’t affect the heap fragmentation – just parsing the requests was enough.

                                                                        What did you tune in jemalloc?

                                                                        Honestly, I tried a bit of everything, but first on the list was lg_extent_max_active_fit, as well as adjusting the decay options to force returning memory to the OS sooner (and so stave off the OOM killer). It performed much better than the default system malloc, but increasing the traffic was enough to see the return of the steady increase in apparent memory use.

                                                                        (At any point in time, turning off traffic to the service would cause the memory use increase to stop, and then after some minutes, depending on decay options, memory would return to baseline. I mention this explicitly just to make sure that we’re 100% sure there was no leak here – repeated tests, valgrind, jemalloc leak reporting, etc. all confirmed this.)

                                                                        Was this in a containerized environment?

                                                                        Yes, Kubernetes. This does complicate things, of course.

                                                                        Did you limit the max number of arenas?

                                                                        No, I didn’t – the stats didn’t give me any off feelings about how threads were being multiplexed with arenas. (Things looked sensible given what was going on.) I did try running jemalloc’s background thread, but as you might expect, that didn’t do much.

                                                                        1. 2

                                                                          Ah. I ask about arenas because of this problem. In that example it happened with glibc, but the same could happen with jemalloc.

                                                                          I ask about containers because max arena count is often heuristically determined from core count, and containers expose the core count of the host system. You can easily run e.g. a 4 core container on a 40 core server and container-unaware code will incorrectly believe it has 40 cores to work with. I believe jemalloc defaults to 4 arenas per core, so 160 arenas in that example. That could massively multiply your memory footprint, just as it did in the linked post.

                                                                          If you didn’t notice a surprisingly large amount of arenas in the stats, that probably wasn’t the issue.

                                                                          At Google all binaries are linked with tcmalloc. I don’t know whether that matters, but it’s another possible difference.

                                                                          If parsing empty protobufs was enough to cause memory fragmentation, I doubt cords would have made a difference either. But I agree, I wish they were open source. I’m sure they’re coming at some point, they just have to be detangled from Google internal code. That’s the whole point of Abseil, extracting core libraries into an open source release, so Google can open source other things more easily.

                                                                          1. 1

                                                                            Aaaah, ouch, yes, that makes sense; that could easily have bitten me, and I just got lucky that our machines had only 4 cores. I do wonder about tcmalloc.

                                                                            If parsing empty protobufs was enough to cause memory fragmentation, I doubt cords would have made a difference either.

                                                                            I may have been a little unclear – we were never parsing empty protobufs, always valid (full) requests, but we changed it so we returned empty/zero results to the RPCs, in case constructing the response protobufs were responsible for the memory use. So it’s possible cords would have helped some, but I have my doubts too.

                                                                            Abseil looks neat! I’m glad such a thing exists.

                                                                          2. 2

                                                                            Apache Arrow uses gRPC with effectively a message similar to yours, some metadata and a giant binary blob. It is possible to use zero-copy:

                                                                            https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/serialization-internal.h

                                                                            1. 1

                                                                              Whew! That is interesting. Thank you for the link, I’ll digest this. Unfortunately the project my experience was with is dead in the water, so this will have to be for the future.

                                                                    2. 1

                                                                      The contents (or portions thereof) of the input buffer.

                                                                      As an example, if what you’re parsing out of the buffer is a collection of strings (typical of an HTTP request, for instance), zero-copy parsing would return pointers into the buffer, rather than copying sections of the buffer out into separately-allocated memory.

                                                                      It’s not always beneficial (for instance, if you keep parts of the request around for a long time, they will force you to keep the entire buffer allocated; or if you need to transform the contents of the buffer, separate allocations will be required anyway), but in the right circumstances, it can speed parsing significantly.

                                                                      Neither JSON (due to string escaping and stringified numbers) nor protobuf (varints, mostly) are terribly well-suited to zero-copy parsing, but some others serialization formats (Cap’n Proto being the one I’m most aware of) are specifically designed to enable it.
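
                                                                      In C terms, a zero-copy parser hands back slices of the original buffer instead of fresh allocations, roughly like this (hypothetical types, purely for illustration):

                                                                        #include <stddef.h>

                                                                        /* A "slice" just points into the receive buffer: no allocation and
                                                                           no copy, but it is only valid while the buffer itself is alive. */
                                                                        struct slice {
                                                                            const char *ptr;
                                                                            size_t      len;
                                                                        };

                                                                        /* Zero-copy: return a view into buf rather than malloc+memcpy. */
                                                                        static struct slice field_at(const char *buf, size_t off, size_t len) {
                                                                            struct slice s = { buf + off, len };
                                                                            return s;
                                                                        }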

                                                                      1. 1

                                                                        AFAIK, the binary format of protobuf allows you to do zero-copy parsing of string/binary?

                                                                        1. 1

                                                                          Yes, definitely; string and bytes fields’ data is unchanged and whole in the serialised form.

                                                                  1. 2
                                                                      • Submit a PR I’ve been working on for the last 2 weeks, implementing basic compute kernels for Arrow.
                                                                    • Attending rstudio::conf 2019
                                                                    1. 1

                                                                      “Subscription required”

                                                                      1. 2

                                                                          Heh, sorry, I forgot to put in the subscriber link; see @azdle’s comment, or maybe a mod can update the link.

                                                                      1. 4

                                                                        I’ve heard that bash scripts should always be written with -e, and I usually do use that flag, but if I remember correctly it halts test-suite execution prematurely (like if you’re testing a failure case) if you’re doing TDD in bash with shunit2.

                                                                        1. 7

                                                                          You can turn it off again with set +e. But usually you’d “handle” the exit status with an if or a trailing && or ||.

                                                                          1. 3

                                                                            Thanks. I love sharing and learning tips like this; bash is such an essential skill.

                                                                          2. 5

                                                                            I’d say -eu is the bare minimum.

                                                                          1. 14

                                                                              I usually expose -x via a flag like --trace/--debug. You can also invoke it manually with bash -x ze_script. I wouldn’t turn it on by default.

                                                                            1. 2

                                                                              I definitely agree. I think it’s tempting to think trace output like this means you don’t have to work on log messages and error messages for human beings, and the result is a giant wall of pretty inscrutable output to sift through to sort out what went wrong.

                                                                              1. 2

                                                                                -x is really useful when writing Dockerfiles, so you can do:

                                                                                RUN set -x \
                                                                                    && my-command \
                                                                                    && my-2nd-command
                                                                                

                                                                                  This way you get a clear picture of what’s happening during image builds.

                                                                                EDIT: formatting

                                                                              1. 1

                                                                                Not thread safe, don’t do this at home.

                                                                                1. 10

                                                                                  In the comment section:

                                                                                  But Redis has a ton of references inside the source code, API, and is a mess at this point.

                                                                                    On this date, there are 61 matches for “slave” and 66 for “master” in the source code of Redis. That’s definitely not a ton.

                                                                                    Other projects, arguably bigger than Redis, did the change long ago: Django, CouchDB, Drupal. I don’t see how this is particularly an issue of backward compatibility either, as dealing with backward compatibility is something every user of open-source software deals with; software is software.

                                                                                  1. 9

                                                                                      The GitHub search result is misleading because it doesn’t match substrings. If you look at the results, GitHub is not matching SLAVEOF, for example, which is a Redis command and so must be maintained for backward compatibility.

                                                                                    1. 9

                                                                                      Can’t count how many times I’ve cursed at how bad the github search is. Stop developing new useless features, fixing search should be priority #1.

                                                                                      1. 1
                                                                                    2. 2

                                                                                      61 matches for “slave”

                                                                                      antirez did a better count and came up with 1500; 60 is a garbage number.

                                                                                      1. 1

                                                                                        I’d search and replace it, then duplicate the tests referring to PRIMARY and reintroduce all, or at least enough, tests for the master/slave API so it is not a breaking change. That way the project has basically shown good intentions and is on the right track. The deprecated API can stay for a few years; I don’t think many will be offended by that.

                                                                                        1. 1

                                                                                          A lot of open source software around us has been working really hard to stay backwards compatible, like most libc implementations, many kernels within reason, etc.

                                                                                          this is common for things for which an incompatible change will cause a cascade of everything else on your system becoming incompatible too.

                                                                                          recently libstdc++ was forced to make a backwards incompatible change by a C++ language standard change, and they attempted the insane move of shipping two major versions in one .so file and trying to match symbols.

                                                                                          1. 1

                                                                                            It’s not just the source. It’s also the documentation: both official docs and already-published blog posts and articles that would now use outdated terminology and might confuse readers, users’ internal documentation, perhaps config files in millions of installs, etc. A change such as this one has a pretty large aggregate cost.

                                                                                            That’s not a terminal argument: it should be weighed against the harm caused by not making the change. But it should also be weighed against the harm that could have been prevented instead, because the people making the change spent their time on it rather than on preventing different harms. And neither weighing is simple, because there is no method to value harms, no calculus to sum them, and no way to predict the future. But anyone who denies that all of those would be necessary for a clear-cut judgment in this case is viewing the world as simpler than it is, and can be called out for doing so.