1. 13
  1.  

  2. 5

    I can’t decide whether this is a good read because of the technical insight or because it’s written so funnily. Anyway, good read!

    1. 3

      Or you can just brute-force the problem with CLI tools, a pinch of regex, and a whole lot of trust in the format. In my experience, real-world XML files are written with consistent field ordering and are actually easiest and fastest to parse with regexes, especially when you only need to do it once. I am guessing the following code is also an order of magnitude faster than your solution.

      unzip -p wcproduction.zip \
      | iconv -f UTF-16LE -t UTF-8 \
      | rg -o '<wcproduction .*?>.*?</wcproduction>' \
      | rg '.*<api_st_cde>(.*)</api_st_cde><api_cnty_cde>(.*)</api_cnty_cde><api_well_idn>(.*)</api_well_idn><pool_idn>(.*)</pool_idn><prodn_mth>(.*)</prodn_mth><prodn_yr>(.*)</prodn_yr><ogrid_cde>(.*)</ogrid_cde><prd_knd_cde>(.*)</prd_knd_cde><eff_dte>(.*)</eff_dte><amend_ind>(.*)</amend_ind><c115_wc_stat_cde>(.*)</c115_wc_stat_cde><prod_amt>(.*)</prod_amt><prodn_day_num>(.*)</prodn_day_num><mod_dte>(.*)</mod_dte>.*' -r '$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14' \
      | tee wcproduction.csv
      

      Once the raw extraction is done, any further processing is also much easier to do with other tools (e.g. sort -t, -k12 to sort the data by the 12th field, or a small Python script if something fancier is needed).
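      For the "something fancier" case, a minimal Python sketch could look like the following. It assumes the CSV has the 14 columns in the order the regex above emits them and that field 12 (prod_amt) is numeric. The function name sort_by_field is made up for illustration, and unlike plain sort -t, -k12 it compares only that one field, numerically by default.

```python
import csv
import io


def sort_by_field(csv_text, field=12, numeric=True):
    """Sort CSV rows by a 1-based field index (hypothetical helper).

    Assumption: every non-empty row has at least `field` columns and,
    when numeric=True, that column parses as a number.
    """
    # csv.reader yields an empty list for blank lines, so drop those.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]

    def key(row):
        value = row[field - 1]
        return float(value) if numeric else value

    return sorted(rows, key=key)
```

      Reading wcproduction.csv into a string and passing it to sort_by_field returns the rows as lists, ready to be written back out with csv.writer or filtered further.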

      Sure, this is not pretty, but if it only needs to work once, who cares? (Although it is probably a good idea to check that wc -l wcproduction.csv gives the same count as unzip -p wcproduction.zip | iconv -f UTF-16LE -t UTF-8 | rg -o '<wcproduction' | wc -l, but sanity checks like that are also missing from the presented Rust code.)
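      The same sanity check can be sketched in Python, assuming the XML has already been decoded to text and the CSV has one record per line. The helper name record_counts_match is hypothetical, and it counts non-empty CSV lines rather than newline characters, which is a slight departure from wc -l.

```python
def record_counts_match(xml_text, csv_text, open_tag="<wcproduction"):
    """Compare the number of opening record tags with the number of CSV rows.

    Returns (matches, records, lines). The default open_tag mirrors the
    rg -o '<wcproduction' pattern; the closing tag starts with "</" and
    therefore is not counted.
    """
    records = xml_text.count(open_tag)
    lines = sum(1 for line in csv_text.splitlines() if line.strip())
    return records == lines, records, lines
```

      If the two counts differ, some record did not match the field-order regex, and the CSV should not be trusted without a closer look.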

      It was still very nice to see that writing this kind of one-off program in Rust was rather straightforward and didn’t actually require that much code. That said, the parser implementation seems way, way more complex than this problem needs.

      1. 3

        I sort of appreciate that you took the time to write a regex that seems functionally equivalent, but if you are already going to the effort of writing it and making this offhand comment about performance, I think you should validate that claim. As somebody who has used both Rust regex and quick-xml to parse XML, I am fairly sure there is no difference in performance, generally speaking.

        1. 3

          I’ll test it out as soon as I can get my hands on the original file; the website is currently giving me a connection timeout.