1. 15
  1. 12

    This is what happens when a programmer thinks they can broadly apply programming ideals to non-programming contexts. Any system that could “check data” would probably require a full understanding of reality itself. I won’t stop the author from dreaming, their heart is clearly in the right place but perhaps a smaller goal would be better.

    1. 3

      Any system that could “check data” would probably require a full understanding of reality itself.

      This could be true but I wonder how far you could get simply by building a large corpus of data errors in writing. Perhaps if you had a dataset of 100k or 1M examples, you could train a blue squiggly model to highlight potential data errors with decent accuracy.

      1. 3

        Let’s say you could and had done it. Would newsrooms use it? Programmers are not, by and large, using proof-of-correctness tools despite them being available, right? Why would journalism be different?

        1. 3

          Programmers are not, by and large, using proof-of-correctness tools despite them being available, right?

          True enough that there’s consistent turnover among good researchers who leave the sub-field questioning whether their work was worth anything. The only thing keeping me in is that we seem to have hit a tipping point, with the tools getting great (some free) and a few FAANGs reporting positive results.

          One of my side projects over the past year has been designing the right circumstances to act as a catalyst and tip a large number of developers over that point. I have pieces of a proposal done on it. If that happens, there’s another possibility: straight-up forcing companies to do it. Developers getting easy, cheap results is a prerequisite, though.

          1. 2

            Programmers do use linters when they’re convenient. The key is to make it really convenient, like the way Google Docs points out potential typos.

          2. 3

            That’s several orders of magnitude below what you would need. There’s probably enough complexity in the subject of “eggs” alone to warrant 100k examples. Wild vs. farmed. Cooked vs. raw. Various types of cooked egg. Various types of omelette. Various cooking styles within a specific type of omelette. Outliers such as platypus eggs, ostrich eggs, fish eggs.

            So for example, “The average farmer makes 32 eggs a day”. You can see how the various categories and natures of “eggs” create a wide variety of expected outputs. This is a particularly contrived example; once you start examining real examples, you will see why this requires a more or less full understanding of reality as we know it.

            1. 1

              To clarify, I meant 100k or 1M examples of English sentences with data errors labelled. You’d train the model to classify whether a statement refers to a big dataset or not, and if the sentence is not linked to a dataset, trigger an error.

              So the checker wouldn’t have any knowledge of the domains, it would just be trained on what types of sentences require data backing.

              So all sentences of the form “The average X …” might be highlighted, while a sentence like “My favorite X” would not.
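
              As a rough sketch of that kind of surface-pattern pass (the patterns here are invented placeholders, not a real ruleset):

                import re

                # Invented placeholder patterns suggesting a claim needs data backing.
                # A real system would learn these cues from the labelled corpus.
                NEEDS_DATA = [
                    re.compile(r"\bthe average \w+", re.IGNORECASE),
                    re.compile(r"\bmost (?:people|\w+s)\b", re.IGNORECASE),
                ]

                def needs_dataset(sentence):
                    """Flag sentences that assert something about a population."""
                    return any(p.search(sentence) for p in NEEDS_DATA)

                print(needs_dataset("The average farmer makes 32 eggs a day."))  # True
                print(needs_dataset("My favorite farmer lives nearby."))  # False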

              1. 1

                Measuring whether a piece of information is likely or not requires a universal inference engine, and that requires knowledge of the domains. Such a tool would need a tremendous understanding of reality. Tagged/categorized English sentences won’t save you here. The world is an outstandingly complex place, and it defies categorization. You would need an astronomical amount of data, and you would also need to know that the data you provided is actually accurate and unbiased. Respectfully, your project as stated is unattainable. There may be a good idea near it, but in its current form it’s not good.

                1. 1

                  You could be right, but I respectfully disagree. I train and use text classification models at least monthly. The technology is very mature and accessible, even to an amateur in NLP like me. My hunch is that putting together a dataset of 100k to 1M sentences, with a binary label of “needs dataset”, may yield a useful classifier.
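
                  For what it’s worth, here is roughly what I have in mind, sketched with scikit-learn (the corpus file and its columns are hypothetical; assembling the labelled data is the real work):

                    import csv
                    from sklearn.feature_extraction.text import TfidfVectorizer
                    from sklearn.linear_model import LogisticRegression
                    from sklearn.pipeline import make_pipeline

                    # Hypothetical corpus: sentence text plus a 0/1 "needs_dataset" label.
                    with open("needs_dataset_corpus.csv") as f:
                        rows = list(csv.DictReader(f))
                    sentences = [r["sentence"] for r in rows]
                    labels = [int(r["needs_dataset"]) for r in rows]

                    # TF-IDF + logistic regression is a cheap, strong baseline; word
                    # bigrams capture surface cues like "the average X".
                    model = make_pipeline(
                        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                        LogisticRegression(max_iter=1000),
                    )
                    model.fit(sentences, labels)

                    # The "blue squiggly" score for a new sentence:
                    print(model.predict_proba(["The average commute is 45 minutes."])[0][1])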

                  1. 2

                    There is a huge and usually unrecognised chasm in AI development between the concept of ‘something that gives a user some useful hints and saves them time’ and ‘something that completely solves the problem in an automated way including all edge cases’.

                    While we are looking at language changes, I think we should come up with a shorthand for talking about this. So many arguments about what is possible in AI seem to involve people not acknowledging which of these they mean. You seem to be saying we could make a useful ‘hinting’ AI for this, and others are saying we cannot make a universal data-error classifier. I agree with both positions and see no contradiction.

                    1. 1

                      I did say there is an idea nearby this idea that is probably a good one. However, the solution provided (using NLP and sentence prediction) probably won’t meaningfully address the problem. It would highlight nearly every sentence that included a number or proposition. How can you hint, or even be remotely accurate, without knowing the priors? How many light switches are in the room you are in? I can guess 1, because I have prior knowledge about light switches and offices. I might be wrong, but it’s a theoretically useful guess. By contrast, an AI is not yet meaningfully capable of predicting arbitrary facts about reality, especially useful ones. Additionally, there’s a great risk of bias in your algorithm: garbage in, garbage out. Crowdsourcing this would just replicate the Twitter bias echo chamber. It won’t help the current problems we face, unfortunately.

                      I will say, there probably is a useful idea nearby this one; it just isn’t specifically this idea. Knowing whether facts should exist to substantiate a claim is a simpler problem, but I think English is still quite slippery in this regard. A better approach would be to come at it, as others have suggested, from a linguistics perspective of highlighting words that imply certainty. However, OP has also stated that they do not feel these represent a large share of the errors made.

                    2. 1

                      Again, I think if you were trying to classify statements about eggs, that might be a little shy of what you’d need, but it could work. All conceivable statements are a much broader set of propositions. I think you should try to prove me wrong, though. Put in as much effort as you can for, say, a month, and let me know how it went. There is no better teacher than executing an idea.

            2. 1

              Gödel would like to have a word about the possibility.

              1. 1

                That’s a good point. I should state I’m more interested in a “good-enough” data checker that relies on heuristics than precise logic.

                I’m more of an empiricist than a mathematician. I’d much rather fly in an airplane that’s flown 1,000 times than one that hasn’t flown yet but was “proven” to work by something like Lean or Coq.

                1. 1

                  Bayesian inference is great! It is not unlimited in reach or power. For example, how likely is it that the following number in an article I read is accurate?

                  “around 72,000”

                  You can say, well, you did qualify it with “around”, so that helps; less precise predictions are more likely to be correct than more precise ones. However, without truly understanding what I’m talking about, you can’t meaningfully say it’s even remotely accurate. I could be saying “there are around 72,000 hours in a day”. In this case it’s the number of people in the US who died from alcohol-related causes. Does that sound high to you? Does that sound low to you? That depends entirely on your priors, which of course your inference engine will not have, because it does not have an understanding of reality, just small factoids. If anything, you’d be better off simply stating whether a number seems accurate with literally no context. This is not a strong idea, but it’s a more attainable one. 4? Idk, sounds fishy.
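
                  To make that concrete, here is a toy scoring sketch (both priors are invented for illustration): the same number is absurd under one prior and plausible under another, so the checker is useless without domain knowledge.

                    import math

                    def surprise(value, prior_mean_log10, prior_sd_log10):
                        """How many prior standard deviations (log10 scale)
                        the claimed value sits from what we expected."""
                        return abs(math.log10(value) - prior_mean_log10) / prior_sd_log10

                    claim = 72_000

                    # Invented priors: "hours in a day" centers near 10^1.4 (~25);
                    # "annual US alcohol-related deaths" near 10^4.9 (~80,000).
                    print(surprise(claim, 1.4, 0.3))  # ~11.5 sigma: absurd
                    print(surprise(claim, 4.9, 0.3))  # ~0.1 sigma: plausible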

                  1. 1

                    This is good feedback that is helping me hone my thinking.

                    I’m not so much interested in verifying the logic of the conclusions, but in throwing errors if the data, reduction code, or context is missing. Then it would be up to the reader to judge whether the conclusions are valid.

                    So at this point the blue squiggly would just appear on sentences not linked to data that should be.

                    1. 1

                      What determines whether data “should be” present, or makes it implicit that it isn’t or can’t be? I think your idea may struggle with entire articles being marked blue.

            3. 12

              There is value to imprecise communication: language-oriented thinkers find it easier to imagine things they find easy to articulate. We have a hard time imagining what questions to ask if we haven’t seen variations of those questions phrased as declarations. Even Lojban has the capacity to be vague, so long as you specify how vague you are being.

              I’m all for higher standards in reporting. There are some reasons why this isn’t completely compatible with open data sets.

              One reason is accessibility: folks who can understand a statement like “tensions rise in the middle east” may not be able to properly interpret a graph showing (for instance) number of violent incidents, and such a graph will need to be structured in such a way that the story it tells is sensible (even when justifying that structure – what counts as a violent incident, what sources we use for those numbers, where to cut off the beginning and end, how we set error bars, etc – may be arbitrary or may require hundreds or thousands of pages of very technical argumentation to be reasonably justified to a general audience).

              Another reason is the special role of secrecy in journalism: the most important information a journalist has is often secret, and leaked under conditions of anonymity, so a source cannot be listed; were the condition of anonymity lifted, the information would not have been disclosed (since the leaker is liable to be arrested or killed, or at least lose their job).

              Should open data sets be made available when possible? Absolutely. And, it makes sense to provide graphs & other, more numeric interpretive frameworks (along with justifications for how they are presented) too. But, journalism is also meant to give voice to conditions that are more nebulous, before the stage where it’s even possible to imagine how to apply scientific rigor. There is a material basis for a feeling that ‘tensions are rising’, & that basis absolutely could be manufactured, but even when it’s genuine it may not be properly instrumented. We don’t have the average heart rate of everybody in the region (and after all, the boundaries of what even constitutes ‘the middle east’ are fuzzy).

              This brings me to my second point. Strict definitions that do not change run counter to the function of informal writing! Most words do not have a well-defined boundary around what they refer to, but only a locus – a center point in meaning-space that corresponds to the most well-accepted example (around which we have a fog of less and less universally-accepted examples). Informal writing, including journalism, functions as a conversation in which the locus-point of words’ meanings are pushed in the direction of making those words more useful.

              The concept of ‘the middle east’ is fuzzy, the subject of absurd levels of vitriol among people who have emotional fixations on it (like Taleb, for instance), and really doesn’t and can’t have a strict definition that properly corresponds to what we intend to mean when we say it (since, like all ideas about identity and nationality, it’s more of a state of mind and a sense of association with some locus than a well-understood and well-defined cohort). Words shift meanings over time based on feedback loops between different usages, in an evolution that makes them more and more useful for describing current circumstances. One could very well argue that rising-tension is part of the ‘middle east’ state of mind & say that, were there ‘peace in the middle east’ it would no longer be what we call ‘the middle east’ simply because the network of associations and attributes that form the halo around that concept are mostly both cause and result of a long history of conflict and tension.

              This is how language works – and this is how language must work, because without this iterative process, we could not easily imagine & talk about things for which there were not already existing words. We would be trapped in the realm of the known-known, and could not grow intellectually except individually.

              There’s a third issue here. Journalism is already very expensive and (often) very slow. Enforcing raised standards universally is liable to make it both slower & more expensive. Expensive journalism results in centralization, typically around privileged viewpoints: the handful of people who can afford the new, more-expensive journalism will dictate, through their interests, what gets covered & what gets ignored – a loss that cannot be identified from the new open data sets, because what was once covered will no longer be covered or measured and will simply be invisible. Slow journalism means that important decisions cannot be made based on marginally-flawed reporting and must therefore be made based on no reporting at all. Journalists are trained to consider the ethical ramifications of these trade-offs & to do as much checking as is feasible without reporting too late to be of any use at all.

              We cannot rely upon the good hearts of journalists or institutions alone, but neither open data nor strict definitions will solve those problems. As much as I hate to say it, we have no choice but to rely upon individual media literacy here – media literacy to counterbalance bias in coverage, to identify and point out flaws and errors, and to do semi-journalistic groundwork.

              I am not proposing a market of ideas. Markets have accumulation effects which lead to counterproductive incentives (like the tabloid stock phrase “if it bleeds, it leads”). We need something more like a Condorcet vote or Delphi pool of ideas, where random errors cancel out entirely and biases are blunted by the same mechanism.
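
              As a toy illustration of that cancellation (all numbers invented), compare the typical error of one noisy estimator with the error of the pooled median:

                import random
                import statistics

                random.seed(0)
                TRUTH = 100.0

                # 101 independent, unbiased but noisy estimates of one quantity.
                estimates = [random.gauss(TRUTH, 20.0) for _ in range(101)]

                individual = statistics.mean(abs(e - TRUTH) for e in estimates)
                pooled = abs(statistics.median(estimates) - TRUTH)

                print(f"typical individual error: {individual:.1f}")  # around 16
                print(f"error of pooled median:   {pooled:.1f}")      # much smaller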

              1. 3

                That’s a good point that we always need an arena for loosely typed language to try out new ideas and extend the boundaries of knowledge.

                Strict definitions that do not change

                I’m very much in favor of changing definitions as understanding improves. This is why something like automated data checking could be very helpful. If your definitions changed, you could see what later statements needed to be updated.
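
                A minimal sketch of the dependency tracking I’m imagining (everything here is made up): statements pin the definition version they relied on, so a definition change flags exactly the stale ones.

                  # Current version of each definition.
                  definitions = {"unemployment": 3}

                  # Each statement records which definition version it used.
                  statements = [
                      {"text": "Unemployment fell to 4%.",
                       "uses": "unemployment", "version": 3},
                      {"text": "Unemployment peaked in 2010.",
                       "uses": "unemployment", "version": 2},
                  ]

                  def stale(statements, definitions):
                      """Statements written against an outdated definition."""
                      return [s for s in statements
                              if s["version"] != definitions[s["uses"]]]

                  for s in stale(statements, definitions):
                      print("needs review:", s["text"])  # flags the 2010 claim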

                About the expense of data driven journalism, I agree. Hopefully technological advances will lower the cost. Yours is a very good point about the importance of timeliness in reporting.

                1. 1

                  It seems like what you’re looking for is social science. There’s a lot of good social science, & it’s done up to these standards. Social science & journalism both need to exist (and they feed into each other). If you’re not in a situation where you need the special attributes of journalism (timely access to sometimes-secret information), it’s possible to just read social science papers instead.

              2. 7

                It is very expensive and time consuming to build datasets and make data driven statements without data errors, so am I saying until we can publish content free of data errors we should stop publishing most of our content? YES! If you don’t have anything true to say, perhaps it’s best not to say anything at all.

                If you truly believe this, then why have you written this blog post? It is teeming with uncited statements, as you yourself note.

                1. 1

                  I think it is fine to post content with data errors as long as you are aware of them and disclose them.

                  1. 2

                    But you didn’t even mark and disclose all of your data errors.

                    1. 1

                      “click the button below to highlight just some of the data errors on this page alone.”

                      1. 2

                        Yes. And you didn’t bother to highlight all of your data errors. I find this telling, because you’re asking people to do a huge amount of work to mark data errors, and you didn’t even do it to completion in a single post.

                        1. 1

                          Exactly. It’s a huge amount of work to mark data errors. So I’m hoping someone invents something new to make that less work.

                2. 5

                  One issue here is going to be the streetlight effect:

                  A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, “this is where the light is”.

                  Data is all well and good, but some things are hard to measure. And just because something is hard to measure doesn’t mean it’s unimportant. Sometimes the most important things are the most difficult to measure.

                  1. 1

                    I like that anecdote. Thanks for sharing.

                    This is a very good point. A rule of thumb I use is that 80% of the most important things to measure are currently not only not measured but are completely unknown.

                    Doing a better job of organizing what we do know is important, but making it easier to discuss what we don’t yet know (perhaps by making it easier to “invent” new hypothetical datasets and generate synthesized data) is important too.

                    I’ve been wearing fitness trackers for 6 years and have lots of valuable data on sleep patterns, exercise, heart rate, etc., but I look forward to the day when I also have data on blood sugar, stress, energy levels, and a lot more.

                    I really like the process of creating definitions for hypothetical measurements and synthesizing data for them.

                    One dream is to have a repository of the definitions of millions of types of datasets, to make it easier to see what we have good measurements for and to figure out which areas are dark, where we need to invent new things to illuminate them.

                    1. 1

                      Thanks! I wasn’t familiar with that term from Linguistics. Definitely seems relevant. These sorts of links help me clarify my thoughts a bit.

                      The class of errors I’m talking about generally involve large arrays of events and not singletons. The Evidentiality concept seems to be focused on singletons (“I saw he ate it”). I don’t think there are so many errors made on those singletons. Generally I think the problem is when people try to speak about a big collection of things as if they actually had a big dataset.

                    2. 4

                      I’d just like to note that this was a great read, very nicely written. And while I feel the article goes way too far in judging things erroneous, I think there’s a valuable idea in there, even if it’s just that some Wikipedia-style attempt at marking up references could be useful in news articles. So perhaps don’t try to decide what’s a data error and flag it, but do try to find unsupported claims?

                      1. 1

                        Thanks for the kind words.

                        And while I feel the article goes way too far in judging things erroneous, I think there’s a valuable idea in there.

                        I’m glad you got that. I am similar to you in that I’m not 100% confident in the post but writing it (and the resulting conversations and references people have provided) has gotten me closer to understanding the problem.

                        So perhaps don’t try to decide what’s a data error and flag it, but do try to find unsupported claims?

                        This sounds like a good pivot. Perhaps a model could identify when a sentence is making a claim that should have a backing dataset. What I would like to see then is a direct link to the computable dataset (and ideally an interactive page showing the work of deducing the statement). The worst case is not having any link to data at all. The second worst is regurgitating a statement from a non-computable PDF; a toy ranking along these lines is sketched below.

                        I would say the closest thing to my dream system is currently Wolfram Alpha/Mathematica.
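
                        To make “worst” and “second worst” concrete, that ranking might look like this (all names and URLs are invented):

                          # Hypothetical sketch: rank a claim's data backing, best to worst.
                          def backing_quality(claim):
                              """0 = computable dataset linked, 1 = non-computable
                              source (e.g. a PDF), 2 = no link at all."""
                              url = claim.get("data_url")
                              if url is None:
                                  return 2
                              if url.endswith(".pdf"):
                                  return 1
                              return 0  # e.g. a CSV or a queryable endpoint

                          claims = [
                              {"text": "Average commute is 45 minutes.",
                               "data_url": "https://example.org/commutes.csv"},
                              {"text": "Crime is at a record high.",
                               "data_url": "https://example.org/report.pdf"},
                              {"text": "Most people agree.", "data_url": None},
                          ]
                          for c in sorted(claims, key=backing_quality):
                              print(backing_quality(c), c["text"])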

                      2. 3

                        Some quick numbers for people not familiar with the world of programming languages. Around 10,000 computer languages have been released in history (most of them in the past 70 years). About 50-100 of those have more than a million users worldwide, and the names of some may be familiar even to non-programmers: Java, JavaScript, Python, HTML, or Excel.

                        Where is the dataset for this claim! The irony :p

                        1. 3

                          Seems I’m among the few here who like this idea. Great minds think alike, I guess. :D

                          1. 3

                            I’m not sure the author understands the world. When a newspaper puts out a headline like “This is the worst …” they are not attempting to tell the truth or convey information. They are attempting to sell newspapers. The important (and increasingly rigorous) data analysis occurs in the backroom where it is decided what and how to present the raw information in order to maximize revenue on a day to day basis.

                            In earlier times this computation was based on editorial intuition. The decision to publish the story “Man bites dog” on page 1 and to relegate the story “Dog bites man” to page 20 is based on sound economic principles, even though the latter story indicates an important problem with rabid dogs in the city, while the former story is just titillating because this is unlikely to be a common problem that needs fixing.

                            In contemporary times newspapers have much better tools that allow editorial boards to more precisely track the sentiment of their readers and target groups and prioritize and editorialize information to maximize revenue from these target groups.

                            The article proposes a solution for a problem the newspapers do not have. You may say that it proposes a solution for a problem society (aka we the people) have. However, after observing people for a while I am no longer convinced the tail wags the dog. We get stupid news because we are stupid. There is no math that will fix that. We have the leaders we deserve and the news we asked for.

                            1. 3

                              It’s a fair point and I’m not surprised things are the way they are when you think of the economics of it.

                              I think we could change the economics through novel innovations that lower the cost of data-driven writing, or by ending subsidies for the intellectual junk food industry.

                              If the problem was just confined to news, I don’t think it’d be so bad. But data errors are everywhere (even in your and my comments!). You see them in research, at the hospital, at the store, in government policy making, etc.

                              because we are stupid.

                              I agree, haha. That’s why we need things like spell check (and hopefully someday, data check!).

                            2. 2

                              I’ve learned enough about PBT Omega to think that it would be relevant to the author’s interests.

                              Systems of property based types (PBT) overcome many of the limitations of current automated systems because they are built on a logical foundation that differs remarkably from those of conventional digital computers and artificial intelligence (AI). In particular, systems of property based types can validly characterize anything imaginable, answer questions in the absence of complete information, and calculate with infinite structures and domains. The power of PBT systems derives from their unique semantic structure, ability to guarantee valid results, expressiveness, and use of abstraction, generalization and analogy.

                              1. 2

                                How are these different from dependent types?

                                1. 4

                                  I dived down the rabbit hole for ya. “Property-based types” are types defined by predicates. So you can have a “dog” type, a “dog named ‘spot’” type, a “dog named ‘spot’ currently within 30 feet of some cat” type, etc. If a value (“example”) satisfies a type’s predicate, then it counts as that type.
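
                                  In code, my reading of it looks something like refinement types; a quick sketch of my own (this is a guess, not anything from Fisher’s materials):

                                    class PredicateType:
                                        """A type is just a named predicate; a value
                                        counts as the type iff the predicate holds."""
                                        def __init__(self, name, predicate):
                                            self.name = name
                                            self.predicate = predicate

                                        def admits(self, value):
                                            return self.predicate(value)

                                    Dog = PredicateType(
                                        "dog", lambda v: v.get("species") == "dog")
                                    DogNamedSpot = PredicateType(
                                        "dog named 'spot'",
                                        lambda v: Dog.admits(v)
                                        and v.get("name") == "spot")

                                    spot = {"species": "dog", "name": "spot"}
                                    print(Dog.admits(spot),
                                          DogNamedSpot.admits(spot))  # True True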

                                  Cool idea, but everything the guy’s written or said is setting off crackpot alarms. Maybe someone else has done a better take ¯\_(ツ)_/¯

                                  1. 1

                                    Thanks! Will keep an eye on the term PBT, and hope to see some code soon.

                                  2. 1

                                    My guess (haven’t seen any code or any other material outside of that site) is that it’s a system of types with inheritance and mixins.

                                    Dependent types, as I understand them, are a kind of runtime type system where the type of something depends on the value of something else.

                                    My guess is that PBT is sort of like property based testing, in that you can provide more and more info about your types and enhance the semantics of existing programs that use those types.

                                    I could be wildly off, but that’s my hunch from the little I see available on that site.

                                    1. 2

                                      The whole point of dependent types is to turn runtime state checks into compile-time errors. All implementations that I know of are static.

                                  3. 2

                                    This site is giving me really bad vibes. Do you have any sources that show it in action, or by people other than D A Fisher?

                                    1. 2

                                      I don’t. It’s largely theoretical right now, but some (reportedly very slow and resource-hungry) implementation exists. I can’t find a recording of the intro talk I attended, but this recording of Capabilities & Limitations of Classical Logic is like chapter 2 or 3 of learning about Dr. Fisher’s ideas.

                                    2. 1

                                      Interesting. Thanks for sharing. At first glance it makes sense to me, and I see how it could be relevant.

                                      I would love to see some code to get a better sense of it.

                                      Is there a relation to the old Omega variant of Haskell?

                                      1. 2

                                        a relation to the old Omega variant of Haskell?

                                        I don’t think so. Dr. Fisher’s production experience is mostly in Ada and C, if I recall correctly.

                                    3. 2

                                      Anyone else reminded of Newspeak?

                                      1. 2

                                        Now that you mention it, me! Good reference. I just went down a little Wikipedia hole looking up various constructed languages that I hadn’t come across before (like Basic English).

                                        I think it’s a good point that forcing languages on people has a dystopian aspect to it. I hope someone will figure out a data-checked sublanguage that would let lazy people like me write fluently with strong data backing.

                                      2. 2

                                        It’s Vienna Circle all around…

                                        1. 1

                                          Great reference, thank you. There is nothing new under the sun.

                                          I think we only need the most basic of logic checkers or algorithms. The main need is for widely available, usable datasets and transparent reductions. I think we could get very far with “good enough” heuristics.
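
                                          For instance, a “transparent reduction” could be as small as this sketch (the file and its numbers are invented): the published sentence is generated from the dataset, so the claim and its backing can’t drift apart.

                                            import csv
                                            import statistics

                                            # Hypothetical dataset: one row per farm.
                                            with open("egg_production.csv") as f:
                                                dozens = [float(r["dozens_per_week"])
                                                          for r in csv.DictReader(f)]

                                            avg = statistics.mean(dozens)
                                            # The reduction is the citation: sentence,
                                            # code, and data ship together.
                                            print(f"The average farm produced {avg:.1f} "
                                                  f"dozen eggs per week (n={len(dozens)}, "
                                                  f"source: egg_production.csv).")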