ChatGPT API (and why it won’t improve AI devtools)

In this issue we look at why the ChatGPT API won’t improve AI devtools, which LLM breakthroughs would be game changing, Flan-UL2 20B, and the leak of Facebook’s LLaMa weights.

Louis’ AI Devtools: Tuesday 7th March, 2023

Two years ago YC backed my cofounder Gabe and me to build bloop, a natural language code search engine. Each week I’ll sift through the noise to try and understand what might stick to the wall and why.

What’s Trending: ChatGPT API (and why it won’t improve AI devtools) 🔥

The new ChatGPT API (gpt-3.5-turbo) is 10X cheaper than davinci-003, but it doesn’t move the needle on the LLM metrics that matter most to AI devtool builders: speed, prompt size and instruction training.

Inference speed is mostly influenced by completion length. Using SoTA models, a classification step that returns one token might take 0.5s, but generating a 500 token summarised explanation might be closer to 20s. For a given completion length, the only way to go faster is to use smaller models.
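To make the numbers above concrete, here is a rough back-of-envelope sketch. It assumes latency grows linearly with completion length, which is a simplification, and the two constants are simply fitted to the 0.5s and 20s figures quoted above rather than measured from any particular API. The function name and parameters are hypothetical.

```python
def estimate_latency_s(completion_tokens, first_token_s=0.5, per_token_s=0.039):
    """Rough latency estimate for a completion of the given length.

    Assumption: latency is linear in completion length. Both constants are
    fitted to the two data points in the text (1 token ~ 0.5s,
    500 tokens ~ 20s), not benchmarked.
    """
    if completion_tokens < 1:
        raise ValueError("completion_tokens must be at least 1")
    # The first token carries the fixed cost; every extra token adds a fixed increment.
    return first_token_s + (completion_tokens - 1) * per_token_s


if __name__ == "__main__":
    for n in (1, 50, 500):
        print(f"{n:>4} tokens: ~{estimate_latency_s(n):.1f}s")
```

Under these assumptions even a 50 token completion takes a couple of seconds, which is why interfaces that need instant responses tend to rely on smaller models or very short completions.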

Prompt size correlates with model size. As model size is currently increasing exponentially, so is prompt size. The model with the highest limit is Anthropic’s Claude at 8k tokens, but GPT-4 is rumoured to have 32k, possibly available sometime in H1 2023.

Instruction trained models can perform previously unseen tasks with zero-shot or few-shot prompting (few-shot meaning that examples are included in the prompt). There are still a few unresolved issues across all models, like hallucination and rambling.
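As an illustration of the difference between zero-shot and few-shot prompting, here is a minimal sketch of a prompt builder. The "Input:" / "Output:" layout and the function name are assumptions made for this example, not a format required by any particular model or API.

```python
def build_prompt(instruction, query, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples).

    `examples` is a sequence of (input, output) pairs. The Input/Output
    layout is an illustrative convention only.
    """
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    # The prompt ends with an open "Output:" for the model to complete.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


if __name__ == "__main__":
    instruction = "Classify the query as 'search' or 'explain'."
    examples = [
        ("where is the auth middleware defined?", "search"),
        ("what does this regex do?", "explain"),
    ]
    print(build_prompt(instruction, "find usages of parse_config", examples))
```

Calling `build_prompt` without `examples` gives the zero-shot version of the same prompt. Each example added costs prompt tokens, which ties few-shot prompting back to the prompt size limit discussed above.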

So what would be game-changing? Well, that depends on your use case…

Code completion (Copilot, Tabnine) uses smaller models for speed, because autocomplete only works as an interface if suggestions are instant. Suggestions are based on the local context (current, nearby and recent files), so the smaller prompt size is not an issue. Parth Thakkar wrote an excellent blog on Copilot prompt construction. Instruction training would make a big difference here, as Copilot has a tendency to ignore instructions.

I’d like a function please not a comment!

Code search is a long-tail problem. The most valuable searches are the most complex, and they are often infrequent. Whether the index is public (Phind) or private code (buildt, bloop), prompt size limits the number of files that can be considered when answering a question. Prompt size therefore needs to increase before the most complex queries in the largest codebases can be answered.
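To show how prompt size caps the number of files a code search tool can consider, here is a minimal sketch that greedily packs ranked files into a token budget. It is not how Phind, buildt or bloop actually work. The 4 characters per token estimate is a crude heuristic, and all names and defaults are hypothetical.

```python
def estimate_tokens(text):
    """Crude token estimate. Assumption: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def pack_files(ranked_files, prompt_limit, reserved_for_answer=500):
    """Greedily select files, most relevant first, that fit in the prompt.

    `ranked_files` is a list of (path, content) pairs sorted by relevance.
    `reserved_for_answer` leaves room for the model's completion.
    Returns the selected paths and the estimated tokens used.
    """
    budget = prompt_limit - reserved_for_answer
    selected, used = [], 0
    for path, content in ranked_files:
        cost = estimate_tokens(content)
        if used + cost > budget:
            continue  # too big for what is left; a smaller file may still fit
        selected.append(path)
        used += cost
    return selected, used


if __name__ == "__main__":
    files = [
        ("auth/middleware.py", "x" * 8000),  # ~2000 tokens
        ("auth/session.py", "x" * 8000),     # ~2000 tokens
        ("auth/utils.py", "x" * 4000),       # ~1000 tokens
    ]
    for limit in (4000, 8000):
        paths, used = pack_files(files, prompt_limit=limit)
        print(f"limit={limit}: {len(paths)} files, ~{used} tokens")
```

With the smaller limit one of the three relevant files has to be dropped; with the larger limit all of them fit. A query whose answer spans many files is exactly the long-tail case that suffers most from a small prompt.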

And then there’s everything else: typically non-coding tasks where speed isn’t as important, e.g. anything that can run in a CI job (Codeball). Many of the same considerations that apply to code search also apply here, making prompt size today’s limiting factor on usefulness.

New and useful tools 🐣 

  • Intellicode - In IDE code examples for any library

  • GitGPT - Use Git in natural language (OSS)

  • Yolo - Natural language to shell commands (OSS)

  • Adrenaline - Use AI to fix code errors

  • SwePT - Create a pull request for file changes in natural language (OSS)

  • Mendable - Chat with technical documentation

Research and resources 🛠️

New open source model released: Flan-UL2 20B. Outperforms Flan-T5 XXL and has a 2048 token window. I think this makes it the best open source LLM available today.

Last week I mentioned that FB released an ‘open’ 65B parameter LLM without open sourcing the weights. This week, someone leaked the weights, hilariously advertising the torrent link in a GitHub PR 😆 

Many AI devtools leverage LLMs for direct prediction, e.g. generating code line ranges or index values. A new paper from Derek Chen demonstrates SoTA performance by training a small LM on synthetic data generated by an LLM.

Jobs 👷