The next stage of artificial intelligence may not be decided only by who has the biggest model, the fastest chip, or the most impressive demo. It may be decided by who can access the right data legally, safely, and at scale.
That is the problem Bobby Samuels is trying to solve through Protege, an AI data platform built around one of the most important questions in the industry: where will high-quality training and evaluation data come from now that public internet data is no longer enough?
For years, AI companies leaned heavily on public web data. That helped push generative AI forward, but it also created a new challenge. The easiest data to collect has already been used. The most valuable data often sits inside private systems, healthcare networks, media libraries, enterprise databases, audio archives, and other places that are hard to access responsibly.
This is where Protege comes in. Instead of treating data as an afterthought, Bobby Samuels is building a company that focuses on the data layer itself. Protege connects data owners with AI builders, helping both sides work together through licensed, curated, and AI-ready datasets.
Who is Bobby Samuels?
Bobby Samuels is the co-founder and CEO of Protege, a company founded in 2024 to help unlock trusted real-world data for AI development. His work sits at the intersection of data access, privacy, compliance, and machine learning, which makes Protege more than a simple marketplace.
In a fast-moving AI market, many startups are focused on building models, apps, agents, or workflow tools. Bobby Samuels has taken a different route. He is working on the supply chain behind those products. If AI systems need better data to become more useful, then companies like Protege may become a key part of how the next generation of AI is built.
The story is also shaped by the team around him. He founded Protege with Travis May, Engy Ziedan, and Richard Ho. Travis May brings deep experience from the data world, including leadership roles at LiveRamp and Datavant. That matters because the problem Protege is solving is not only technical. It is also commercial, legal, operational, and trust-based.
What Protege is building
Protege is designed to make proprietary real-world data usable for AI development. In plain terms, the company helps organizations that own valuable datasets safely license and prepare that data for AI builders.
That may sound straightforward, but the reality is complicated. Data can be fragmented across systems. It may contain sensitive information. It may need to be cleaned, structured, de-identified, labeled, or formatted before it can help a model. It also has to be licensed in a way that respects the rights of data owners and reduces risk for AI companies.
Protege tries to handle this middle layer. It works with data providers, prepares datasets for AI use, and gives AI teams a more reliable path to real-world data. The company has focused on areas such as healthcare data, media content, audio recordings, speech data, video data, medical imaging, and motion capture.
This positions Protege as part marketplace, part data infrastructure company, and part trust layer for AI development.
Why AI has a serious data problem
The AI industry has spent years talking about models and compute. Those areas still matter, but data is becoming a bigger bottleneck.
Modern AI models need large amounts of data to train, fine-tune, test, and evaluate. Public data can help, but it has limits. The internet does not contain every type of human knowledge. It also does not always reflect the real-world environments where AI systems are expected to perform.
For example, a model that works well on general text may still struggle with clinical workflows, medical scans, customer support tasks, financial documents, media archives, or physical-world movement. These areas often require specialized data that is not freely available online.
That creates a gap between what AI companies want to build and the data they can actually access. Bobby Samuels has framed this as one of AI’s biggest constraints. If models are going to improve in complex real-world settings, they need better examples from the real world.
How Bobby Samuels saw the opportunity
The opportunity behind Protege is not just that AI companies need more data. It is that they need better data with clear rights, responsible access, and practical delivery.
Many organizations own data that could be valuable for AI. Healthcare providers have clinical notes, medical images, lab data, and patient journeys. Media companies have video, audio, and editorial archives. Enterprises have operational records, support conversations, workflow data, and domain-specific knowledge.
But most of these organizations are not set up to sell AI-ready datasets. They may not know how to price the data, structure a licensing deal, protect privacy, or prepare the dataset in a useful format. On the other side, AI developers do not want to negotiate hundreds of one-off deals just to collect the data needed for training or evaluation.
Bobby Samuels saw a market where both sides needed a trusted connector. Data owners needed a safe way to participate in the AI economy. AI builders needed a faster and cleaner way to access useful data. Protege was built to sit between them.
Protege’s approach to trusted AI training data
The strength of Protege is in the way it treats data as a product that needs care before it reaches an AI team.
A raw dataset is rarely ready for model development. It may be messy, incomplete, sensitive, poorly labeled, or stored in a format that is difficult to use. Protege helps turn these kinds of data assets into something AI teams can work with.
That can include sourcing data from trusted providers, handling licensing agreements, supporting privacy-conscious workflows, curating datasets, improving data structure, and making sure the output fits the needs of model builders.
For AI companies, this can reduce friction. Instead of spending months trying to find a niche dataset, negotiate terms, and prepare the data internally, they can work through a platform built for that purpose.
For data owners, it creates a new revenue path. A hospital network, media library, or enterprise data holder may have valuable information, but without the right infrastructure, that value stays locked away. Protege gives those owners a way to license data while keeping control and working through a more structured process.
Why healthcare became an important early market
Healthcare is one of the clearest examples of why real-world data matters.
AI has huge potential in healthcare, from clinical documentation and medical coding to imaging analysis, diagnostics, research, and treatment support. But healthcare data is also some of the most sensitive data in the world. It is fragmented across hospitals, clinics, labs, imaging centers, and electronic health record systems.
That makes healthcare data both valuable and difficult to work with. A model trained only on general medical text may not understand the full complexity of real patient care. Clinical decisions often depend on messy context, incomplete records, changing symptoms, test results, images, and judgment from multiple specialists.
Protege has leaned into this challenge by working with de-identified health records, medical imaging, clinical notes, and other healthcare-related datasets. The goal is not simply to collect more data. It is to help AI builders work with data that better reflects real clinical settings.
For Bobby Samuels, healthcare also shows why trust matters. If a platform can help unlock sensitive data in a responsible way, it can prove its value in one of the hardest possible markets.
Protege’s expansion beyond healthcare
Although healthcare is important, Protege is not only a healthcare AI data company. Its broader vision is to become a platform for real-world data across many domains.
That includes media data, audio data, speech datasets, video content, motion capture, and other multimodal formats. This expansion matters because AI is no longer only about text. Models are increasingly expected to understand images, sound, movement, video, human behavior, and physical-world tasks.
A voice AI system needs speech data that reflects real accents, tone, pacing, and background noise. A video model needs footage that shows real scenes, actions, and context. A robotics or physical AI system may need motion data that captures how people and objects move in the real world.
This is why Protege is focused on real-world, multimodal data. As AI moves into more complex use cases, the training data needs to become more diverse, more specific, and more closely tied to actual human activity.
Funding momentum and market validation
The growth of Protege has attracted serious investor attention.
The company raised a $10 million seed round in 2024, then announced a $25 million Series A in August 2025. In January 2026, Protege announced a further $30 million Series A extension led by Andreessen Horowitz, bringing total funding to $65 million since its founding.
Those numbers matter because they show how quickly investors have started to view AI data infrastructure as a major category. In the early generative AI wave, much of the attention went to model labs, application startups, and compute infrastructure. Protege’s funding suggests that data access is now being treated as a core layer of the AI stack.
Investors backing Protege include Andreessen Horowitz, Footwork, CRV, Bloomberg Beta, Flex Capital, Shaper Capital, and Liquid 2 Ventures. That investor group reflects a broader belief that high-quality, rights-protected, real-world data will be essential as AI systems become more commercial, regulated, and domain-specific.
What makes Bobby Samuels’ Protege story different
The success story of Bobby Samuels is interesting because it is not built around hype alone. Protege is focused on a quiet but difficult problem that sits underneath the AI industry.
Many people see AI progress through the lens of the final product. They look at a chatbot, a search tool, a coding assistant, or a medical AI system. But behind every useful AI product is a long chain of data decisions.
What data was used? Was it licensed? Was it high quality? Does it represent the real world? Can it be used safely? Does it help the model perform in the specific domain where the product will be used?
These questions are becoming harder to ignore. Bobby Samuels is building Protege around the idea that data is not a side issue. It is one of the main inputs that will decide which AI systems work well and which ones fall short.
That makes Protege’s work important beyond its own growth. It points to a shift in the AI industry from scraping what is easy to licensing what is valuable.
How Protege helps data owners create new value
One of the most important parts of Protege’s model is the data owner side.
Many organizations have valuable data but no clear path to participate in AI development. They may worry about privacy, compliance, intellectual property, reputation, or operational burden. They may also lack the technical team needed to prepare data for AI customers.
Protege gives these organizations a structured way to turn private data assets into revenue. Instead of letting valuable datasets sit unused, data owners can work with Protege to license those assets under more controlled terms.
This matters because the future of AI data should not only benefit model builders. Data owners also need to be compensated and protected. A healthier AI data economy depends on both sides getting value from the exchange.
How Protege helps AI builders build better models
For AI developers, the value of Protege is practical. Better data can lead to better models.
If a company is building a healthcare AI product, general web text may not be enough. It may need clinical examples, medical images, billing patterns, doctor-patient documentation, and real workflow data. If a company is building a voice model, it may need audio that captures real speech variety. If a company is building a video or physical AI system, it may need datasets that reflect movement, action, and context.
Protege helps AI builders find and use these types of datasets more efficiently. That can support model training, fine-tuning, benchmarking, and evaluation. It can also help teams reduce the time spent hunting for data and focus more on building products that work.
In a market where many AI companies are competing on quality, reliability, and domain performance, access to the right dataset can become a major advantage.
DataLab and the next layer of Protege’s ambition
In 2026, Bobby Samuels also introduced DataLab at Protege, a research-focused effort aimed at closing the data gap in AI.
The idea behind DataLab is that AI needs serious research not only around models and chips, but also around datasets themselves. The quality of a dataset affects how a model behaves. Poor data can lead to weak performance, bias, unreliable outputs, or misleading evaluation results.
A dedicated data lab can focus on questions such as dataset quality, benchmark reliability, multimodal healthcare evaluation, data contamination, factuality, and representational bias. These are not small issues. They shape whether AI systems can be trusted in real use.
This move shows that Protege is not only trying to broker access to data. It is also trying to help define what high-quality AI data should mean.
The bigger impact of Protege on the AI industry
The rise of Protege fits into a larger shift in artificial intelligence.
The first wave of generative AI was driven by scale. Bigger models, more compute, and more public data created a rapid jump in capability. The next wave may be more focused on specificity. Companies will need models that understand industries, workflows, environments, and real-world edge cases.
That requires better data pipelines. It also requires more transparent data relationships. As copyright, privacy, and compliance questions grow, AI companies will need cleaner ways to show where their data comes from and why they have the right to use it.
This is where licensed data platforms could become more important. If Protege can help create a market where data is sourced responsibly, prepared carefully, and delivered efficiently, it could help shape a more sustainable AI ecosystem.
For Bobby Samuels, that is the bigger achievement. He is not just building a company around current demand. He is building around a structural problem that is likely to become more important as AI becomes more powerful.
What entrepreneurs can learn from Bobby Samuels
There are several useful lessons in the way Bobby Samuels is building Protege.
The first lesson is to solve a painful bottleneck. Protege is not chasing a surface-level trend. It is focused on a problem that many AI teams feel directly: access to trusted, high-quality data.
The second lesson is to build trust before scale. In sensitive markets like healthcare and proprietary data exchange, growth matters only if both sides believe the system is safe and useful.
The third lesson is to understand the full market. Protege has to serve data owners and AI builders at the same time. That requires more than software. It requires partnerships, compliance awareness, data operations, and strong commercial execution.
The fourth lesson is to build where timing matters. AI models have improved quickly, but data access has not kept pace. Bobby Samuels built Protege at a moment when the market was starting to feel that gap more clearly.
The fifth lesson is to work on infrastructure that becomes more valuable as the market grows. If AI keeps expanding into healthcare, media, robotics, enterprise workflows, and regulated industries, the need for trusted real-world data will likely grow with it.