Threads for objectif_lune

  1. 1

    Company: Wellcome Trust

    Company Site: https://wellcome.org

    Position(s): Software Engineer (Python), Senior Security Engineer, Senior Software Engineer (JavaScript)

    Location: REMOTE (In UK), London

    Description: Wellcome is a politically and financially independent global charitable foundation, funded by a £38.2 billion investment portfolio.

    Tech Stack: Python, JavaScript, AWS, Azure

Compensation: Competitive salary per position, loads of benefits and perks, and annual performance-rating-based bonuses

    Contact: See the job listings here https://wellcome.org/jobs and apply through the site

    1. 2

How large is the data in uncompressed CSV format? How long does it take to load with the sqlite CLI?

      1. 4

I can answer my own question: the 100,000,000 rows sqlite3_opt.py produced when I ran it created a 1.5 GiB SQLite3 file, and a 1.8 GiB CSV file when I dumped it out. If you just wrote zeros to a file on NVMe-backed storage, files of that size could be generated in 0.5 - 0.8 seconds. So 1B rows, or ~15 GiB in SQLite format, could be written in 8 seconds if they were being copied from another NVMe-backed storage device. That leaves another 52 seconds for all the overhead of generating random data and writing it out in SQLite3’s internal layout.

        1. 3

So do you estimate it will be faster than the fastest Rust version? That was 100M rows in 33 seconds.

I’d expect you could do a .import of a 1.5 GB CSV file on a modern machine in less than 30 seconds?

          Also this architecture is parallelizable over 2 cores – one generating CSV, and one sqlite process writing the data.

          My instinct was that the “shell version” of this program would be the fastest :)

          1. 3

Without the overhead of generating the dataset, loading the CSV version into SQLite3 via a single .import command took 3m5s on my 2020 MBP with the DB on an SSD. This works out to about 10 MiB/s.

            $ cat load.sql 
            PRAGMA journal_mode = OFF;
            PRAGMA synchronous = 0;
            PRAGMA cache_size = 1000000;
            PRAGMA locking_mode = EXCLUSIVE;
            PRAGMA temp_store = MEMORY;
            .mode csv
            .separator ","
            .timer on
            .import sqlite3_opt.csv user
            
            $ time sqlite3 import.db < load.sql
            
            real    3m5.419s
            user    2m46.685s
            sys     0m12.391s
            

I’m not sure if there is a good way to break past the single-core bottleneck when loading data in. I can see one core sitting at 100% while the import above happens.

Even on Presto+HDFS clusters, creating a new table based on a SELECT * from another table results in each node in the cluster building its own section of the dataset on a single core. There seems to be some enforced linearization there as well. Using Zstd instead of Zlib for compression improves performance by ~30% at best.

            Can anyone shed some light on getting past the single-core compute restrictions of SQLite when loading CSV data in?
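The only idea I’ve come up with so far is a fan-out/fan-in approach: split the CSV, import each shard into its own database on its own core, then merge with ATTACH. This is an untested sketch, assuming a hypothetical three-column user table and a reasonably recent sqlite3 CLI (which runs each command-line argument in order):

# Untested sketch: parallel shard imports, then a single-threaded merge.
# Assumes shard0.csv .. shard3.csv were pre-split with e.g. split(1).
import sqlite3
import subprocess
from multiprocessing import Pool

SHARDS = ["shard0.csv", "shard1.csv", "shard2.csv", "shard3.csv"]

def load_shard(csv_path):
    db_path = csv_path + ".db"
    # Each shard gets its own sqlite3 process, so CSV parsing and
    # B-tree building happen on separate cores.
    subprocess.run(
        ["sqlite3", db_path,
         "CREATE TABLE user (area TEXT, age INTEGER, active INTEGER);",
         ".mode csv",
         f".import {csv_path} user"],
        check=True,
    )
    return db_path

with Pool(len(SHARDS)) as pool:
    shard_dbs = pool.map(load_shard, SHARDS)

# The merge is still single-core, but it copies rows that are already
# parsed and typed, which should be cheaper than parsing CSV.
merged = sqlite3.connect("import.db")
merged.execute("CREATE TABLE IF NOT EXISTS user (area TEXT, age INTEGER, active INTEGER)")
for db_path in shard_dbs:
    merged.execute("ATTACH ? AS shard", (db_path,))
    merged.execute("INSERT INTO user SELECT * FROM shard.user")
    merged.commit()
    merged.execute("DETACH shard")
merged.close()

No idea whether the final merge eats the savings; I’d want to measure before claiming a win.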

            1. 1

              Thanks for the info! I wonder if the CSV parser is kinda slow? I hope I will find time to test it out, as I’m kinda surprised that 100M rows or 1.5 GB can’t be loaded faster than 3 minutes.

I was also wondering whether a single INSERT statement of multiple rows would be faster. That way you use the SQL parser instead of the CSV parser.

              INSERT into user VALUES (row 1 ...), (row 2 ...), ... (row N ...) ;
              

              I think one issue is that the example code uses an if statement and separate INSERT statements, and I think the Rust version does too:

              https://github.com/avinassh/fast-sqlite3-inserts/blob/master/sqlite3_opt.py#L26

              Basically I think you could factor that code into data, and it would be faster (?). I guess the CSV method is already doing that though. I’d be interested to see your code and maybe run it on my machine.
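To make that concrete, here’s the kind of rework I have in mind (untested, and I’m guessing at the schema from the repo): compute each row’s values up front, NULLs included, and hand sqlite one parameterized statement plus many rows:

# Hypothetical rework of the inner loop: the if/else decides a value
# (area or NULL) instead of choosing between two different INSERT
# statements, so one prepared statement covers every row.
import random
import sqlite3

conn = sqlite3.connect("batch.db")
conn.execute("CREATE TABLE IF NOT EXISTS user (area TEXT, age INTEGER, active INTEGER)")

def make_row():
    # Made-up value distributions; the point is that the branch is now
    # folded into the data, not the statement.
    area = str(random.randint(100000, 999999)) if random.random() < 0.5 else None
    return (area, random.randint(1, 100), random.randint(0, 1))

rows = (make_row() for _ in range(100_000))
conn.executemany("INSERT INTO user VALUES (?, ?, ?)", rows)
conn.commit()

If I understand the docs right, executemany reuses a single prepared statement, so the SQL parser runs once instead of per row.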

              I don’t think sqlite has any multicore / multithreading features – which is why I think splitting it up into 2 processes (generation and loading) is probably a decent win.

              1. 2

I did a quick test comparing inserting via CSV versus SQL. It seems SQL triggers about the same number of calls to fdatasync, but they take about 3x longer.

                $ cat test1.csv
                1,hello,123
                
                $ cat load.sql
                .mode csv
                .separator ","
                .import test1.csv test
                
                $ cat load2.sql
                INSERT INTO test (a,b,c) VALUES (1, 'hello', 123);
                
                $ strace -wc sqlite3 import.db < load.sql
                
                % time     seconds  usecs/call     calls    errors syscall
                ------ ----------- ----------- --------- --------- ----------------
                 78.02    0.005216        1304         4           fdatasync
                  4.29    0.000287           7        41           mmap
                  2.53    0.000169           8        19         1 openat
                  1.67    0.000111         111         1           execve
                ...
                ------ ----------- ----------- --------- --------- ----------------
                100.00    0.006685                   213        10 total
                
                
                $ strace -wc sqlite3 import.db < load2.sql
                % time     seconds  usecs/call     calls    errors syscall
                ------ ----------- ----------- --------- --------- ----------------
                 79.94    0.015514        3878         4           fdatasync
                  3.61    0.000701          17        41           mmap
                  2.34    0.000454         453         1           execve
                ...
                ------ ----------- ----------- --------- --------- ----------------
                100.00    0.019406                   209        10 total
                

I suspect something like valgrind could give a much more authoritative answer on the number of opcodes needed for the operations above.

            2. 1

I was going to mention that there’s the .import function as well, and I think Python’s sqlite3 package has the ability to call functions from the package on the db. I was loading a large dataset on Friday, and a 6.44 GB CSV with about 6 columns took single-digit minutes using .import, without any of the additional performance settings.

            3. 1

              You could pipe the CSV output from Python to the sqlite tool instead of writing it to a file first.

              But the tool isn’t doing anything magic, so I wouldn’t expect it to be faster than a C or Rust program. This would just be a fast solution involving a lot less coding.
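A rough sketch of what I mean (untested; it assumes a reasonably recent sqlite3 CLI, which runs each command-line argument in order, and a user table matching the made-up row shape below):

# Untested sketch: generate CSV on one core while a sqlite3 process
# imports it on another, with no intermediate file.
import random
import subprocess

loader = subprocess.Popen(
    ["sqlite3", "import.db", ".mode csv", ".import /dev/stdin user"],
    stdin=subprocess.PIPE,
)
for _ in range(100_000_000):
    # Made-up row shape; match it to the real user table.
    line = "{},{},{}\n".format(
        random.randint(100000, 999999),
        random.randint(1, 100),
        random.randint(0, 1),
    )
    loader.stdin.write(line.encode())
loader.stdin.close()
loader.wait()

The dot-commands are passed as arguments so the sqlite3 process’s stdin stays free for the CSV data.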

          1. 13

I’ll tell you why I’m super-happy to see this. Not because it puts some money and stability under the people working on, for instance, tokio (though that is really awesome), and not because it will help Rust and the ecosystem around it move that little bit faster.

No, I’m happy to see this because I now don’t have to try and make the argument “It’s great! Mozilla uses it for stuff!”, which is what almost every discussion about adopting Rust as a programming language in an organisation generally degrades into once you realize no one understands what the hell a “borrow checker” is.

And yeah, I get that evaluating a language for use in enterprise / large organisations is a lot more nuanced than that, but when you’re talking to higher-ups, being able to point to real-world, large-scale, vocal adopters of something sometimes wins you half the battle.

            1. 4

This is effectively where I was headed about 1/1.5 years ago with disease surveillance / outbreak management software. We operated in a number of places which didn’t have infrastructure, didn’t like the idea of cloud or have the money for it, but could get hardware easily (laptops, phones, little mini servers). In a context where Excel dominates, whatever you build needs to feel as ubiquitous as Excel, and web apps just can’t be that, no matter how hard we try. Even with Electron, there are certain things that require too much working around or are just impossible to make work the way you want.

At the end of my contract I had a fully offline, native desktop application that was using CRDTs to sync data between local instances of the application over TCP/UDP, using multicast to detect other peers, and using Bluetooth to allow transfer between mobile devices, etc. You could open the application on a laptop or mini tower somewhere and configure it as a “transit hub”, which basically acted as a sort of dumping ground for data which came near it. Data was synced across all the nodes in the graph via a gossip-like protocol, where a given user could be carrying the CRDTs for data they didn’t even know they had on their device. It was made so that a significant portion of users could operate completely offline, and their data would “find a way” to make it to the transit hubs and eventually to the regional or country-level offices.

It was frankly pretty awesome, and one of the things I’m most proud of building out of anything I’ve ever built; my walls were plastered with diagrams, reams of paper everywhere while I figured things out. Then I went into contract limbo, got fed up, and finally walked away from the project after 6 months of not being paid while still working on it. About 6 months later, Coronavirus came… The project was handed off to a company in India, all that work was tossed out for an app that required an always-on connection to the internet, and as far as I’m aware the super-ambitious roadmap I had built for the project was tossed along with it. It saddens me; it had such potential to do some real good. It was already active in most of the outbreaks of the last 5-6 years, like Ebola, Dengue, the Rohingya crisis, Syria, all over the Pacific… it had potentially curbed hundreds of deadly outbreaks in those regions…

              1. 3

Kids, kids, it’s all kids. Playing some Mario U with them. We can’t afford a Switch and games, so I cobbled together enough trade-ins of old games/devices to get a Wii U and some games. The kids are also addicted to watching Bricks o Brian mario videos on Amazon.

I’ve got a few things I’m mulling over whether to keep working on. I’m pretty happy where I am at work, and now that we’re a bit more financially stable, I mostly just want to build things that I want in my free time, instead of things to try and make money off of.

                1. 8

When my kids come home at 3-ish and rush in the door to tell me all the exciting things they did at school today. I get hugs, stories, and a lot of shouting and screaming because they missed me.

                  1. 17

                    This is really hard to read on mobile.

                    1. 33

                      Much like YAML!

                    1. 2

Electron without electron. I can see the benefits that Electron has had in terms of making desktop applications easier to build for a whole lot of people, but I hate the cruft and janky dependencies of NodeJS and the whole dependence on embedding Chromium. I want a native desktop framework that has a good solid HTML/CSS interpreter and an embedded UI programming language which has more type-safety, doesn’t end up being a resource hog, and is somewhat aligned with the language used for the lower-level programming of the application. It should compile down to a manageable static binary size. The CSS, LESS and JS concepts of UI development work for people; people understand them, so it has to be something parallel to that, but with all the extra cruft stripped off and handled by the underlying language the framework is built in. It should also expose a GL context into the HTML/views, to render highly customized graphics / bypass the HTML rendering engine from the backend.

Also, an out-of-the-box CRDT-based (or similar) embeddable database library with a gossip/viral-like syncing protocol, for deploying desktop, embedded and mobile applications in places where connectivity is typically pretty bad and it’s easier to shunt data along through other devices until connectivity is stable.

                      1. 1

                        What’s a good candidate for cross-platform GUI? Qt?

                        1. 2

Qt5 is nice, but if you go the QML way it’s sort of a bastardization of HTML/CSS, and when I’ve used it, it doesn’t feel as easy to use as it could be. I’d like a frontend developer to be able to build out the UI with something they can easily pick up and that has parallels to the browser context.

Also, Qt5 licensing is awkward to understand, which I think is why it doesn’t get as much uptake as it should. It’s a great framework/application building platform, but I think the license is just too confusing for people who want to dip their toe in. When you hold GUI frameworks up beside each other, you look at Qt5 and see a really robust platform for cross-platform development, but then you see the license terms, and a lot of people get scared away by them. There’s not a lot Qt can do about that, as it comes from the libraries they use, and the reason the framework is as good as it is is that they’ve had commercial sales to underpin the work.

But there are a lot of semantics about how C++ works in there, and about how the whole QML + JSON/JS hybrid thing works, that just irk me. I want to be able to hire a frontend developer to work on an application from the pool of available web developers and have minimal friction in their transition to a desktop context (this is one of the main factors in why Electron is as popular as it is). There are some awesome frontend devs/designers out there, and I think we sort of shat the bed on making usable / understandable tools for them to build UIs on native platforms.

Qt has QWebEngine, but that just embeds Chromium again, so you’re basically back to the same thing as Electron.

                      1. 2

Working on wrapping up a first staging deployment of a fairly sizeable data-sharing platform at work before I take my first proper annual paid leave next week, for the first time in I think over eight years. I can’t quite remember; my brain is doing this thing where I break out in sweats thinking about how I’m going to afford not being paid for a week, then realizing that I don’t have to worry about it…

On the side I’ve got a disease surveillance platform I’m working on. I’ve broken a few pieces out of it (text editor, UI components, general components, and a few other bits) into a smaller task management / knowledge management app to put up in the app stores, to try and generate a bit of revenue to feed into the larger application. It’s taking more time than I thought it would with a full-time job going and doing this stuff on the side, but it’s bought me the time to think through more thoroughly how I implement certain aspects of things, especially data syncing and conflict management / resolution.

                        1. 2

Stripping the frame on an ’83 Yamaha XV920 and trying to find a local CNC shop to mill me a new top yoke and spacers so I can fit a Yamaha R6 front end on it.

                          Taking a break from spending the last couple weeks jumping between a task/knowledge management desktop app and a massive disease surveillance platform I’ve been working on for a while.

Getting ready for my first paid vacation in about 10 years, a week from now. Feels weird. I still get heart palpitations when I think about it, as my default state is wondering how I’m going to recoup the money lost from taking the week off. But I don’t have to worry about that!!

                          1. 3

Just to say, the disclaimer is gold. Good on the author for being up front. A lot of articles just dive in and state things as if they’re the de facto way to do something; a concerning majority of them are very wrong and occasionally very dangerous.

                            Having that disclaimer up front sets the playing field pretty clearly.

                            1. 2

                              Thank you! :)

                              I have noticed this a lot, where the author of the tutorial acts as if that is the only way to do it. I wanted to change that and make sure any readers do not assume my, or anyone’s, way is best.

                            1. 24

                              Don’t. They’re loud, heavy, inconvenient, and expensive.

                              1. 3

To add to this: unless you specifically want to learn about enterprise technologies (RAID, SFPs, etc.), go with consumer hardware. You’ll save money on parts, electricity, mounting, etc., and won’t have to deal with the noise as much. NUCs are great if you want something small and powerful(-ish), or Mini-ITX or mATX for small-form-factor custom builds. The consumer space has a lot more flexibility than enterprise hardware, which lets you fine-tune your build for whatever use case you have.

                                1. 1

                                  I second the NUC comment. I have two set up at home and they’re awesome.

                                2. 2

                                  Agreed.

                                  I had a rack-mount 2U Compaq DL380 with all the bells and whistles that I got for free during the dot-com bust. It was a pain to live with:

                                  • Weird boot process that only worked with Windows and Red Hat Linux
                                  • So many fans that even though it was in the basement I could hear it on the ground floor
                                  • It cost $10/mo in electricity, even after I removed one of the two power supplies
                                  • Not that fast, except for the disks which were 15k RPM

                                  Do the planet and yourself a favor and go with consumer-grade hardware.

                                  1. 1

                                    Do the planet and yourself a favor and go with consumer-grade hardware.

I’m not an environmentalist, but the prospective impact on my power bill has me concerned. $10/month to run a fast machine would be okay, though. I’ll have to do some more research into TCO, I think.
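As a rough sanity check, assuming electricity at $0.10/kWh: $10/month buys about 100 kWh, and 100 kWh over a ~730-hour month works out to roughly 140 W of continuous draw. That sounds about right for an older 2U server idling, while a consumer build idling at 30-50 W would be closer to $3-5/month.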

                                  2. 2

                                    They also eat power. See vermaden’s posts (e.g. https://vermaden.wordpress.com/2019/04/03/silent-fanless-freebsd-server-redundant-backup/ ) on running a simple server if you really need something.

(Bias: I’ve done the FreeNAS thing and paid into the myths of server hardware / ECC RAM / etc. While it’s come in handy at times (IPMI is convenient), it’s also been a time burden when it comes to doing maintenance or making any slight customization to the setup beyond what FreeNAS wants. If your needs are largely just storage, consider just leveraging your fast internet connection and paying a cloud provider. Remember, even with a server-grade NAS you’re arguably still going to be paying someone for off-site backup.)

                                  1. 1

I was already working from home, and have been for over 10 years. I also grew up in a really remote area of northern Canada, barely had any neighbours, and worked jobs that put me out in the forest for long periods of time, so the pandemic didn’t take much adjustment for me. My wife wanted to quit her job anyway, so the pandemic was an easy out for her. The financial side worries me a bit; food costs where we are seem to be going up steadily. And I worry about my kids and the fact that they’re not getting a lot of social time with friends. But mentally I’m good, I still have a really good job, and I haven’t ended up in divorce court.

                                    1. 2

I don’t know about the graph rendering, but YouTrack has the dependency/relationship aspects in the bag. You could probably generate a chart of the dependencies from the API. It feels a bit awkward at first, but YT is really malleable. Once you really start to dig in and learn the shortcuts and such, you can power through a lot of things that are really restrictive or awkward in other apps/products.

                                      1. 1

Thanks, I will look into that. I have been happy with IntelliJ, so I already have a bias towards anything JetBrains makes.

                                      1. 1

Lots of infrastructure and build-pipeline work, writing some distributed services using Rust, and a lot of Kubernetes.

On the side, liaising with my brother, who is on the other side of the world, to figure out a plan to start getting equipment and gear together to do Enduro motorcycle rallies (some in the US, some in the EU and Africa) once the competitions start opening up, with the ultimate goal of eventually attempting Dakar. Just getting to grips with all the costs, equipment needs, how to find sponsors, and all that fun stuff.

                                        1. 1

In general, if you don’t care, you don’t care. No one is forcing you to make a change to your repos or change your behaviour. There aren’t any git police going around handing out citations for branch naming conventions. There may be a few people who are pushing harder than others, who are out there asking owners and maintainers of code to make changes that they feel might alienate or otherwise upset people. They don’t have any authority, and you don’t have to listen to them. I don’t think anyone is going to turn around and label the maintainers of various repos as racists just because they haven’t bothered to change a legacy term that’s a holdover from previous systems.

You’re not going to get in trouble for not renaming your branches, and the general attitude I’m seeing in some places, of people feeling like they’re under attack when they have full control over the configuration and setup of their git repos, is frankly a little ridiculous.

The change in and of itself (for some repos) is mostly trivial; you might have to make some tertiary changes around build systems and the like, and modify your typing behaviour a little. If your repo is private, or it’s public and no one on your team really cares about the change, then fine, don’t make the change. It’s as simple as that: there are so many repos out there across all the services that you’re pretty much guaranteed to be in amongst a sea of other repos that similarly haven’t changed. You can have and communicate your opinion on it; you’re absolutely free to do that. You’re even free to shame or look down on others for making the change, if that’s what you really want to do.

                                          We’re making the change and I don’t have to justify it or defend it to anyone outside of my team.

At the end of the day, for me and for the team that I am a part of, the history of the term, its etymology, and how it’s used in this context or in other similar contexts such as whitelist/blacklist don’t really matter. What really matters is the answer to the question: “Does the use of this word, in this context, have the potential to make anyone of any background uncomfortable, or to have any other negative emotional/psychological effect on that person?”

If the answer is even remotely a yes, then my team and I need to change it. It’s a no-brainer; I would rather remove the risk than gamble with someone else’s well-being and mental health any day. But others may not feel that way, and all the power to them. Do what works for you, do what you feel is right, just try not to trample on anyone who chooses a different path.

                                          1. 9

                                            You’re not going to get in trouble for not renaming your branches

                                            Let’s revisit this in two years time. ;-)

                                          1. 8

Karabiner for me, by far. I have access to over 500 actions/hotkeys without moving my fingers at all. Caps Lock is too far, and so is the Return key, so my Return key is mapped to s+f: I press the s key, then the f key. My Control key is the a key. Most people map Control to Caps Lock, but why map it to Caps Lock when your fingers already hover over a key on the home row?

                                            It’s pretty powerful stuff. My config. https://github.com/nikitavoloboev/dotfiles/blob/master/karabiner/karabiner.edn

                                            1. 2

                                              I love Karabiner. It is an essential part of my workflow on Mac. Vim bindings, layers, media controls, screen organization, all on my 60% keyboard with Mac-compatible keys is amazing.

                                              1. 1

This looks awesome. I see that your config is in EDN format, but I’m seeing JSON in the docs. Do you just convert yours to JSON?

                                                1. 2

Yeah, it’s converted from EDN to Karabiner JSON using Goku (https://github.com/yqrashawn/GokuRakuJoudo)

                                                2. 1

This looks amazing. Admittedly I looked at Karabiner when I had a 60% keyboard, but didn’t realize it could do all these extra things! I’m going to have to give this a try this weekend, and may steal some of your config to test it out.

                                                1. 1

                                                  1. zsh
                                                  2. neovim
                                                  3. iterm2
                                                  4. JetBrains DataGrip

I used IntelliJ for a few years, but found it got bogged right down once you get into large-scale multi-language projects. For a while there was a significant typing delay, where you would hammer on the keys and then watch the editor trying to catch up; that finally did my head in and made me quit it and move back to vim/nvim. I still pop it open from time to time for the debugging UI, but other than that I’m in vim 99.9% of the time.

I tried Kakoune for a bit as well, but just never really committed to it. I write code across Rust, Python, Go, C++, JavaScript, HTML/CSS, and a lot of infra and config stuff.

                                                  1. 1

Reviewing notes on a concept for a disease surveillance data aggregation platform for some countries, working on a research data collaboration/sharing tool, and trying to expedite development on it to get some data sharing happening between orgs on COVID-19 research.

                                                    1. 21

This is why people are interested in CRDTs. Or at least, it’s why I’m interested in CRDTs. There seems to be a lot of activity in this space at the moment; two projects that come to mind are Automerge/Hypermerge and OrbitDB. The irritating thing about all of these projects right now is that they pretty much require you to write your application in JavaScript, which I really don’t want to do. However, I’m contributing to the Rust port of Automerge, so hopefully soon we’ll be able to write Automerge applications in Rust.
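For anyone who hasn’t met them yet, here’s a minimal sketch of about the simplest CRDT there is, a grow-only counter, just to show the property that makes them interesting: merging is commutative, associative, and idempotent, so replicas converge no matter what order updates arrive in. (Illustrative only; this is not Automerge’s API.)

# Minimal grow-only counter (G-Counter) CRDT sketch.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> that replica's increments

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-replica max: safe to apply in any order, any number of times.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5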

                                                      1. 4

This. I’ve been trying to build something on top of SQLite3 to do this, as I need it for Rust. All the projects I come across for this type of thing seem to be geared towards Node.js-style applications.

                                                        1. 1

I have noticed something very similar; I see a lot of Node.js and JS-in-general projects cropping up, such as SSB and OrbitJS. Not sure what that means for the ecosystem, actually.

                                                          1. 1

I would say that interactive use is probably the main driver, so people have written these in frontend-ish languages. I don’t think it means anything about the technology or theory.

                                                            1. 1

                                                              I would say your assumptions are reasonable. Thanks for sharing.

                                                        2. 1

                                                          That’s a good new keyword for my research, thanks!