On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia’s vast knowledge more accessible to artificial intelligence models. The initiative, called the Wikidata Embedding Project, applies vector-based semantic search, a technique that helps computers understand the meaning of and relationships between words, to the existing data on Wikipedia and its sister platforms, which comprises nearly 120 million entries.
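At its core, vector-based semantic search maps each entry to a numeric vector and retrieves the entries whose vectors sit closest to a query’s vector. The sketch below illustrates the general idea with the open-source sentence-transformers library and a small in-memory index; the model name and the toy entries are illustrative assumptions, not part of the Wikidata project’s actual pipeline.

```python
# Minimal illustration of vector-based semantic search (not Wikidata's actual stack).
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# A small general-purpose embedding model; the real project may use different models.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy "entries" standing in for Wikidata items.
entries = [
    "Marie Curie, physicist and chemist who conducted pioneering research on radioactivity",
    "Claude Shannon, mathematician and engineer who worked at Bell Labs",
    "Eiffel Tower, wrought-iron lattice tower in Paris",
]

# Embed the entries once, then embed the query and rank by cosine similarity.
entry_vecs = model.encode(entries, normalize_embeddings=True)
query_vec = model.encode(["scientist"], normalize_embeddings=True)[0]

scores = entry_vecs @ query_vec  # cosine similarity, since vectors are normalized
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {entries[idx]}")
```

The two people-related entries score higher for the query “scientist” than the unrelated one, even though none of them contains the literal keyword.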
The project also includes new support for the Model Context Protocol, a standard that helps AI systems communicate with data sources. Together, these improvements make the data more accessible to natural language queries from large language models. The project was undertaken by Wikimedia’s German branch in collaboration with the neural search company Jina.AI and DataStax, a real-time training data company owned by IBM.
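The Model Context Protocol standardizes how an AI client discovers and calls tools exposed by a data source. The sketch below uses the official `mcp` Python SDK to connect to a server over stdio and list its tools; the server command is a hypothetical placeholder, since Wikimedia has not published the details of its MCP endpoint here.

```python
# Hedged sketch of an MCP client session (Python SDK: pip install mcp).
# "wikidata-mcp-server" is a hypothetical command, not a published Wikimedia tool.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(command="wikidata-mcp-server", args=[])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # discover what the server exposes
            for tool in tools.tools:
                print(tool.name, "-", tool.description)


asyncio.run(main())
```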
Wikidata has offered machine-readable data from Wikimedia properties for years, but the existing tools only allowed for keyword searches and queries in SPARQL, a specialized query language. The new system is designed to work better with retrieval-augmented generation (RAG) systems, which let AI models pull in external information, giving developers a chance to ground their models in knowledge that has been verified by Wikipedia editors.
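For comparison, the pre-existing route looks like this: a SPARQL query sent to Wikidata’s long-standing public query service, asking for a few items whose occupation (P106) is scientist (Q901). This example uses only the existing endpoint, not the new embedding system.

```python
# Querying Wikidata the existing way: SPARQL against the public endpoint.
# Assumes: pip install requests
import requests

SPARQL = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P106 wd:Q901 .                      # occupation: scientist
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "wikidata-example/0.1"},
    timeout=30,
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])
```

Queries like this require knowing Wikidata’s property and item identifiers up front, which is exactly the barrier the semantic search layer is meant to lower.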
The data is structured to provide crucial semantic context. For example, querying the database for the word “scientist” will produce lists of prominent nuclear scientists as well as scientists who worked at Bell Labs. The results also include translations of the word “scientist” into different languages, a Wikimedia-cleared image of scientists at work, and extrapolations to related concepts like “researcher” and “scholar.”
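In a retrieval-augmented setup, results like these would typically be folded into the prompt an LLM sees. The snippet below is a generic sketch of that step, built around a mocked result shaped like the “scientist” example; the field names are illustrative and do not reflect the Embedding Project’s real response format.

```python
# Generic RAG-style grounding: turn a retrieved, Wikidata-like result into LLM context.
# The dictionary is mocked to mirror the "scientist" example; its schema is illustrative.
retrieved = {
    "label": "scientist",
    "description": "person who systematically gathers and uses research and evidence",
    "translations": {"de": "Wissenschaftler", "fr": "scientifique", "es": "científico"},
    "related": ["researcher", "scholar"],
    "examples": ["prominent nuclear scientists", "scientists who worked at Bell Labs"],
}


def build_context(item: dict) -> str:
    """Flatten a retrieved item into a plain-text block for an LLM prompt."""
    lines = [
        f"Label: {item['label']}",
        f"Description: {item['description']}",
        "Translations: " + ", ".join(f"{lang}: {t}" for lang, t in item["translations"].items()),
        "Related concepts: " + ", ".join(item["related"]),
        "Example groupings: " + "; ".join(item["examples"]),
    ]
    return "\n".join(lines)


prompt = (
    "Answer using only the context below.\n\n"
    + build_context(retrieved)
    + "\n\nQuestion: What related concepts does Wikidata link to 'scientist'?"
)
print(prompt)
```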
The new project arrives as AI developers are actively searching for high-quality data sources to fine-tune their models. Training systems have grown more sophisticated, often assembled as complex training environments rather than simple datasets, but they still require carefully curated data to function well, and deployments that demand high accuracy make the need for reliable sources particularly urgent. While some might look down on Wikipedia, its data is significantly more fact-oriented than catchall datasets like the Common Crawl, a massive collection of web pages scraped from across the internet.
In some instances, the push for high-quality data has had expensive consequences for AI labs. In August, Anthropic offered to settle a lawsuit with a group of authors whose works had been used as training material, agreeing to pay $1.5 billion to end any claims of wrongdoing.
In a statement, Wikidata AI project manager Philippe Saadé emphasized the project’s independence from major AI labs and large tech companies. The Embedding Project launch, he said, demonstrates that powerful artificial intelligence does not have to be controlled by a handful of companies; it can be open, collaborative, and built to serve everyone.

