1. 44
    1. 8

      How do we know the barrier is broken? What are we measuring against? Is it just “vibes”?

      I don’t use these models, so as an outsider looking in, I wish the article linked to some benchmarks.

      1. 4

        The previous barrier was “no one has built a model that even comes close to GPT-4 in the last 12 months”.

        Now we have models that are arguably within spitting distance, and in my experience Claude 3 Opus has stepped ahead. That’s a huge milestone!

        1. 2

          I trust vibes more than evals, to be honest.

        2. 5

          I was up far too late experimenting with Claude 3 last night. I think it’s a little less powerful overall for manipulating short segments of code, but for files over 5000 tokens (roughly 500 lines) it has a clear advantage in remembering what it just did and not going off the rails. I think it’s a very promising tool for writing or refactoring code, and that’s not even what it was intended for. Such an exciting time!
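
          For anyone who wants to reproduce the experiment, the core of it is just one long-context call. Here’s a minimal sketch using Anthropic’s Python SDK; the file name, prompt, and token limit are illustrative, not what I ran verbatim:

          ```python
          # Minimal sketch, assuming the official `anthropic` SDK and an
          # ANTHROPIC_API_KEY set in the environment.
          import anthropic

          client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

          with open("legacy_module.py") as f:  # hypothetical ~500-line file
              source = f.read()

          message = client.messages.create(
              model="claude-3-opus-20240229",
              max_tokens=4096,
              messages=[{
                  "role": "user",
                  "content": "Refactor this file to remove duplication, "
                             "keeping behaviour identical:\n\n" + source,
              }],
          )
          print(message.content[0].text)
          ```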

          1. 9

            One other thing to note: the LMSYS Leaderboard, which puts each of these LLMs in head-to-head blind-judged matchups, now has all the top-tier models within spitting distance of each other. That seems to imply these companies are reaching some sort of plateau, and it might not really matter which of the top-tier models you choose.
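
            (For anyone unfamiliar: the leaderboard derives its ratings from those blind pairwise votes, Elo-style. Below is a toy sketch of the idea; the K-factor, starting rating, and battle log are made up, and LMSYS’s actual methodology is more involved.)

            ```python
            # Toy Elo update over pairwise "battles": a rough sketch of the
            # idea behind arena-style leaderboards, not LMSYS's exact method.
            from collections import defaultdict

            K = 32  # arbitrary K-factor
            ratings = defaultdict(lambda: 1000.0)  # everyone starts at 1000

            def update(winner: str, loser: str) -> None:
                expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
                ratings[winner] += K * (1.0 - expected)
                ratings[loser] -= K * (1.0 - expected)

            # Hypothetical battle log: (winner, loser) pairs from blind votes.
            battles = [("model-a", "model-b"), ("model-b", "model-c"),
                       ("model-a", "model-c"), ("model-b", "model-a")]
            for w, l in battles:
                update(w, l)

            print(sorted(ratings.items(), key=lambda kv: -kv[1]))
            ```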

            1. 7

              There may be a bit of a plateau right now, but OpenAI haven’t shipped a big upgrade in quite a while, and Llama 3 looks to have a ton more resources behind it than Llama 2 did. I wouldn’t be surprised to see the state of the art shift quite dramatically within the next few months.

              (I wouldn’t mind if it stays quiet for a while, would give us all the chance to figure out the current batch properly.)

              1. 3

                It only really tests a single use case, though: chatbot. GPT is a pretty good chatbot, but in my experience it stands apart in its instructability and adaptability to other use cases, as when used programmatically in a pipeline. I’m not sure about Claude on that front yet.
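
                To illustrate what I mean by using it programmatically in a pipeline, here’s a minimal sketch with OpenAI’s Python SDK, where a system prompt pins the model to a strict output format that downstream code parses. The task, prompt, and JSON shape are invented for illustration:

                ```python
                # Minimal pipeline sketch, assuming the official `openai` SDK
                # and an OPENAI_API_KEY set in the environment.
                import json
                from openai import OpenAI

                client = OpenAI()  # reads OPENAI_API_KEY

                def classify(ticket: str) -> dict:
                    resp = client.chat.completions.create(
                        model="gpt-4",
                        messages=[
                            {"role": "system",
                             "content": 'Reply with JSON only: '
                                        '{"category": "<string>", "urgent": <bool>}'},
                            {"role": "user", "content": ticket},
                        ],
                    )
                    # Instructability matters here: one stray sentence breaks the parse.
                    return json.loads(resp.choices[0].message.content)

                print(classify("The login page has been down since this morning."))
                ```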

            2. 1

              That refactoring example is impressive. I’ve been skeptical of tools like Copilot because they seem to bias the operator toward adding new stuff rather than understanding and evolving what’s there, which is a great way to make an incoherent mess of a codebase. If you could refactor like this across a whole codebase rather than a little script, it’d save a lot of regexing and manual tedium.
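
              Roughly what I’m imagining, as a sketch: walk the tree, push each file through a model call, and review the diffs afterwards. `refactor_source` here is a hypothetical stand-in for whatever model call you’d use:

              ```python
              # Sketch of codebase-wide refactoring: walk a repo and feed each
              # file to a model. `refactor_source` is a hypothetical stand-in;
              # in practice you'd review diffs rather than trust it blindly.
              from pathlib import Path

              def refactor_source(source: str) -> str:
                  raise NotImplementedError("call your LLM of choice here")

              for path in Path("my_repo").rglob("*.py"):  # hypothetical repo path
                  refactored = refactor_source(path.read_text())
                  # Write alongside the original so nothing is overwritten.
                  path.with_suffix(".py.refactored").write_text(refactored)
              ```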

              1. [Comment removed by author]

                1. 2

                  What do you mean? Would you want it to be /un/available to more of the world’s population?

                  If “unavailable” is a typo and you meant “available”, then you’re wrong: Claude Pro (and therefore, I assume, Claude) is available in India, which alone is clearly more than 0.05% of the world’s population.

                  1. 2

                    Ah, that checks out then. I had read that Claude Opus was only available in the US and the UK, but that was clearly wrong, so I’ll delete my comment (I’m also getting flagged a lot, for whatever reason).

                    And yeah, “unavailable” was a typo, and my math was also off; it should have been 5%. (I posted at 4:30 because I couldn’t sleep – clearly not my brightest moment.)

                    1. 4

                      No worries, we’ve all had one of those nights :)

                      1. 3

                        India alone accounts for more than 5% of the world population. ;)