Tl;dr: Companies with technology that allows them to uniquely generate the data needed to train and fine-tune models are well positioned to create enduring value in the age of AI. The best AI companies may be those building in atoms and not just bits.
The pace of development in AI has given many the feeling that the ground is shifting under their feet. While incredibly exciting, this has led to a fair amount of anxiety among entrepreneurs who are wondering if there’s any true defensibility in what they’re building. A battle-tested strategy in startups is to build a product that’s at least 10x better, 10x cheaper, or 10x easier than what exists while you march toward a long-term moat. But given how quickly AI development is advancing, a 10x product of last month may be obsolete this month. The fear is real.
How fast are things moving? The following big releases all happened in a single week, from March 12 to March 19:
OpenAI released GPT-4
Anthropic released Claude
Stanford students released Alpaca
Google released PaLM API
Google announced Med-PaLM 2
Google added AI to Workspace
PyTorch 2.0 was released
Microsoft announced Microsoft 365 Copilot
In a single week, we saw more big announcements in AI than we had seen in any prior year. Over the course of that week, workflows that were previously the sole focus of some startups became well-executed features of companies with massive distribution. Other companies woke up to discover that their product functionality could now be replicated with simple written English.
That week even raised the question of whether foundation models are defensible. A team of Stanford students took a very clever approach: they started with LLaMA 7B, an openly released LLM from Meta, and fine-tuned it on 52,000 instruction-following demonstrations generated with OpenAI’s text-davinci-003. The resulting LLM, dubbed Alpaca 7B, can run on a MacBook and has roughly comparable performance to OpenAI’s GPT-3.5, which relies on massive cloud compute. It took OpenAI 4.5 years and over $1B raised to launch GPT-3. Alpaca? Training costs totaled less than $600 (not a typo). It’s a very big deal.
The next week was no less exciting. OpenAI released ChatGPT plugins, potentially obsoleting another round of LLM startups. Google released Bard, its LLM-based chat interface. And NVIDIA released foundation-models-as-a-service, potentially abstracting away something that previously required many AI scientists.
What a whirlwind. Ideas that seemed solidly defensible even one month ago might not seem so today. Many companies that were recently the “ChatGPT for X” realized they may no longer have any product advantage after ChatGPT plugins launched. While incredibly exciting, this is also giving many people, including experienced founders and product builders, AI anxiety. Some have even attempted to coin phrases for the feeling. Successful founders have gone as far as to suggest that, if they were thinking of starting something new in this environment, they would instead pause and let things settle first.
So where is defensibility in the age of AI? There’s one clear answer. Defensibility in the age of AI can be found by building with atoms, not just bits.
Because robotics hasn’t kept pace with AI, companies that build physically aren’t at risk of being disrupted by AI in the near future. Companies like SpaceX, Cover, or Solugen will gain efficiencies by implementing AI but have little to fear from the latest LLM. But these are not, at their core, AI companies.
What about companies that are AI-first? As building foundation models becomes easier, and as building on top of existing models is sped up, companies that rely on publicly accessible or easily obtained data will likely face fierce competition. Products that are 10x better today may be clearly inferior in a few months. This is unique in the history of technology. Never before have capabilities jumped so far, so fast. Luckily, for AI companies, defensibility can be found in data.
Every model requires data to train with and then more data to fine-tune. DeepMind’s AlphaFold beautifully illustrates the importance of this data. By elucidating the likely structures of nearly all known proteins and making these predictions freely available, AlphaFold helped biology take a giant step forward. From accelerating basic science to enabling better bioengineering to advancing therapeutics development to helping unlock biomanufacturing, AlphaFold is a big deal.
So why were we able to accomplish so much more with AI in protein structure prediction than in other areas of biology? To understand this, we turn to the world of atoms, not bits.
The key enabler for using ML to predict protein structure was the abundance of high-quality ground truth data to train the model on. AlphaFold leaned heavily on two datasets in particular: the Protein Data Bank (PDB) and UniProt.
PDB includes the precise 3-D structure of nearly 175,000 proteins. UniProt is a protein sequence and function database that contains over 200M protein sequences, with over 550,000 of them manually annotated. Building both databases required technology from the world of atoms. The scale, diversity, and precision of the UniProt dataset would not have been possible without next-generation sequencing technologies, while resolving the protein structures in the PDB relied on X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. Without these existing datasets, which required advancements in the world of atoms to collect, AlphaFold simply wouldn’t be possible.
Imagine these public databases didn’t exist and a startup had developed the only technology that could find the 3-D structure of proteins en masse. No one but that startup would have been able to develop AlphaFold’s capabilities. Quite the moat!
Biology, alongside chemistry, is particularly ripe for advancement with AI because it’s inherently “structured”. Biological reactions happen in largely predictable ways, following the laws of physics, and yet we can’t gain much from the mathematical, non-AI modeling approaches we use in physics. Biology is also inherently high-dimensional in a way that requires increasingly powerful computation to interpret and understand the data. Unfortunately, there are many areas of biology where equivalents of PDB and UniProt don’t yet exist. In these areas, the potential for massive advancements using AI is limited.
DeepMind co-founder Demis Hassabis says “there’s still obviously a lot of biology, and a lot of chemistry, that has to be done.” What’s the key to unlocking similar AI advancements across those other areas of biology and chemistry? Mainly the ability to access large, high-quality datasets. If a startup can develop technology that allows for the creation of that sort of data, then it can rocket the field forward while relying on a pretty strong moat. Any startup that alone can access the best data for training and fine-tuning will have strong defensibility. Those with technologies for gathering higher-quality, higher-resolution, higher-dimensional, and/or better-annotated data are well positioned to capitalize on upcoming advances in AI. Startups that can uniquely generate those datasets at high throughput and low cost hold the keys to the castle.
One might argue that with enough clever AI engineering or enough compute, the need for high-quality training and fine-tuning data could be circumvented. A counter-argument here is OpenAI’s now-abandoned effort in robotics. OpenAI co-founder and Chief Scientist Ilya Sutskever was recently asked why they couldn’t make enough progress. The issue? Not enough quality data. Ilya said “back then it really wasn't possible to continue working in robotics because there was so little data. Back then if you wanted to work on robotics, you needed to become a robotics company. You needed to have a really giant group of people working on building robots and maintaining them. And even then, if you’re gonna have 100 robots, it's a giant operation already, but you're not going to get that much data. So in a world where most of the progress comes from the combination of compute and data, there was no path to data on robotics.” OpenAI co-founder Wojciech Zaremba, who was the robotics project lead, said at the time “it turns out that we can make gigantic progress whenever we have access to data” and explained that it therefore made more sense to focus on the “plenty of domains that are very, very rich with data.”
If a startup can quickly, cheaply, and uniquely generate the data needed in a domain to train and fine-tune AI models, it can unlock incredible value. The best way of ensuring that you have proprietary access to a dataset needed to train a model is to generate it yourself. Building and working in atoms can help.
If your technology is built in atoms and lets you collect high-quality data at high throughput and low cost, then rapid advances in AI should excite you, not scare you. In that case, every leap forward in AI, rather than potentially obsoleting your product or company, simply helps you derive even more value from your unique dataset. This isn’t just true for biology. This sort of technology can come in many forms: assays to see new biology, space-based imaging technologies, new sensors based on advancements in materials science, or even an ability to synthesize a critical component in a design–build–test–learn cycle.
Companies with technology that allows them to uniquely generate the data needed to train and fine-tune models are well positioned to create enduring value in the age of AI. The best AI companies may be those building in atoms and not just bits. Create the data, control your future.
Thanks to Ela Madej, Josie Kishi, Elliot Hershberg, and Gaurab Chakrabarti for offering feedback on drafts of this post.
Hi Seth, I think you articulated the importance of the shift in focus to accruing high quality datasets very clearly. It's the natural consequence of "the marginal cost of intelligence approaching zero", as Sam Altman puts it.
Are there any examples of non-government organizations that stand to make money off of the atoms to bits moat? These are great examples but they're all government and grant-funded projects, not organizations around profit-driven business models. Thanks!