Context:
The future of Artificial Intelligence (AI) cannot be secured by regulation alone. To ensure AI is safe and trustworthy for everyone, we must complement regulation with policies that promote high-quality data as a public good. This approach is essential for fostering transparency, creating a level playing field, and building public trust. Only by providing fair and broad access to data can we fully realize AI’s potential and distribute its benefits equitably.
Relevance:
GS3- Awareness in the fields of IT, Space, Computers, Robotics, Nano-technology, Bio-technology and issues relating to Intellectual Property Rights.
Mains Question:
What role does data play in the functioning of Artificial Intelligence (AI)? How can AI help in the preservation of cultural heritage and traditional knowledge? (10 Marks, 150 Words).
Data and AI:
- Data is the lifeblood of AI. In this context, the principles of neural scaling are straightforward: the more data, the better (a stylized form of this relationship is sketched after this list).
- For example, the more diverse and voluminous human-generated text available for unsupervised learning, the better Large Language Models (LLMs) will perform.
- Alongside computing power and algorithmic innovations, data is arguably the most crucial driver of progress in the field.
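The ‘more data, the better’ intuition is usually formalized as a power-law scaling relation between model loss and dataset size. The sketch below is a stylized form in the spirit of published scaling-law studies; the symbols and the illustrative range for the exponent are assumptions, not figures from this article.

```latex
% Stylized data-scaling relation: as the number of training tokens D grows,
% the model's test loss L(D) falls as a power law towards an irreducible floor E.
L(D) \;\approx\; E + \frac{A}{D^{\alpha}}, \qquad \alpha > 0
% Empirically fitted exponents tend to be small (roughly 0.1 to 0.4), so more
% data keeps helping, but with diminishing returns per additional token.
```

Those diminishing returns per token are precisely why the appetite for ever-larger corpora keeps growing rather than shrinking.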
Paucity of Continuous Data:
- However, there is a problem. Humans do not produce enough digital content to sustain these ever-growing models.
- Current training datasets are already enormous: Meta’s Llama 3, for instance, was trained on 15 trillion tokens, over 10 times the size of the British Library’s book collection (a rough back-of-envelope check follows this list).
- A recent study suggests that the demand for high-quality text is such that we might reach a ‘peak data’ scenario before 2030.
- Other studies warn about the risks of public data contamination by LLMs themselves, leading to feedback loops that amplify biases and reduce diversity.
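A rough back-of-envelope check of the Llama 3 comparison above, assuming about 14 million printed books in the British Library and on the order of 100,000 tokens per book (both round figures are illustrative assumptions, not numbers from the article):

```latex
% Illustrative arithmetic only; the book count and tokens-per-book are assumed.
14\times 10^{6}\ \text{books} \times 10^{5}\ \tfrac{\text{tokens}}{\text{book}}
  \;\approx\; 1.4\times 10^{12}\ \text{tokens},
\qquad
\frac{15\times 10^{12}\ \text{tokens}}{1.4\times 10^{12}\ \text{tokens}} \;\approx\; 10
```

On these assumptions, a single frontier model already consumes roughly ten British Libraries’ worth of text, which is consistent with the ‘peak data’ projections mentioned above.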
AI Winter:
- Concerns about an ‘AI winter’ highlight the relentless data race in which researchers and industry players are engaged, sometimes compromising quality and ethics.
- A notable example is ‘Books3,’ a collection of pirated books believed to have been used to train several leading LLMs.
- Whether this practice falls under the fair-use doctrine remains a matter of ongoing legal debate.
- More troubling is the hoarding of these books without any clear guiding principle.
- Even though progress is being made, partly due to regulation, LLMs are still primarily trained on an opaque mix of licensed content, ‘publicly available data,’ and ‘social media interactions.’
- Studies indicate that these data reflect and sometimes even worsen existing distortions in our cyberspace, creating a predominantly anglophone and present-centric world.
The Absence of Primary Sources:
- The idea that Large Language Models (LLMs) are trained on a comprehensive collection of human knowledge is a fanciful delusion. Current LLMs are far from the universal library imagined by thinkers like Leibniz and Borges.
- While repositories of stolen texts like ‘Books3’ may include some scholarly works, these are mostly secondary sources written in English—commentaries that barely scratch the surface of human culture.
- Notably absent are primary sources and their diverse languages: archival documents, oral traditions, forgotten books in public collections, and inscriptions on stone—the raw materials of our cultural heritage.
- These documents represent an untapped reservoir of linguistic data. Take Italy, for example. The State Archives of Italy alone house at least 1,500 kilometers of shelved documents (measured linearly)—not counting the vast holdings of the Vatican.
- Estimating the total volume of tokens that could be derived from this heritage is challenging (an illustrative back-of-envelope estimate follows this list).
- However, considering the hundreds of archives spread across the continents, it is reasonable to believe they could match or even exceed the data currently used to train LLMs.
- If harnessed, this data would not only enrich AI’s understanding of humanity’s cultural wealth but also make that wealth more accessible to the world.
- These sources could revolutionize our understanding of history while safeguarding the world’s cultural heritage from neglect, war, and climate change.
- Additionally, these sources promise significant economic benefits. Released into the public domain, such large pools of free and transparent data would help neural networks scale up and allow smaller companies, startups, and the open-source AI community to develop their own applications, leveling the playing field against Big Tech and fostering global innovation.
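For a sense of scale, here is an illustrative back-of-envelope estimate for the Italian State Archives alone, assuming roughly 7,500 sheets per linear metre of shelving and a few hundred tokens per sheet; all of these densities are assumptions chosen for illustration, not figures from the article.

```latex
% Illustrative estimate only; sheet and token densities are assumed.
1{,}500\ \text{km} = 1.5\times 10^{6}\ \text{m}, \qquad
1.5\times 10^{6}\ \text{m} \times 7{,}500\ \tfrac{\text{sheets}}{\text{m}}
  \;\approx\; 1.1\times 10^{10}\ \text{sheets}

1.1\times 10^{10}\ \text{sheets} \times 250\ \tfrac{\text{tokens}}{\text{sheet}}
  \;\approx\; 3\times 10^{12}\ \text{tokens}
```

Even on such conservative assumptions, one country’s state archives would yield a few trillion tokens; hundreds of comparable collections worldwide could therefore plausibly rival the roughly 15 trillion tokens used to train today’s largest models.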
Examples from Italy and Canada:
- Advancements in the digital humanities, particularly through AI, have significantly reduced the cost of digitization, allowing us to extract text from printed and manuscript documents with remarkable accuracy and speed (a minimal text-extraction sketch follows this list).
- Italy recognized this potential and allocated €500 million from its ‘Next Generation EU’ package for the ‘Digital Library’ project.
- Unfortunately, this ambitious initiative, aimed at making Italy’s rich heritage accessible as open data, has since been deprioritized and restructured, showing a lack of foresight.
- Canada’s Official Languages Act offers a valuable lesson here. Although initially criticized as wasteful, this policy mandating bilingual institutions eventually produced one of the most valuable datasets for training translation software: the bilingual proceedings of the Canadian Parliament (the Hansard).
- However, recent discussions about adopting regional languages in the Spanish Cortes and European Union institutions have overlooked this important aspect.
- Even supporters have failed to acknowledge that promoting the digitization of low-resource languages would bring complementary cultural, economic, and technological benefits.
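To illustrate how cheap bulk text extraction has become, here is a minimal sketch of the printed-document case using the open-source Tesseract engine through the pytesseract wrapper. The folder layout and the choice of the Italian language pack are assumptions for illustration; handwritten archival material generally requires dedicated handwritten-text-recognition (HTR) models, which are not shown.

```python
# Minimal OCR sketch: pull plain text out of scanned printed pages.
# Assumes Tesseract is installed with its Italian language data ('ita')
# and that page scans live under ./scans/ as PNG files (hypothetical paths).
from pathlib import Path

from PIL import Image      # pip install pillow
import pytesseract         # pip install pytesseract

def ocr_page(image_path: Path, lang: str = "ita") -> str:
    """Run Tesseract OCR on one scanned page and return its plain text."""
    with Image.open(image_path) as page:
        return pytesseract.image_to_string(page, lang=lang)

if __name__ == "__main__":
    corpus = []
    for scan in sorted(Path("scans").glob("*.png")):
        text = ocr_page(scan)
        corpus.append(text)
        print(f"{scan.name}: extracted {len(text.split())} words")

    # The pooled text could then be cleaned, deduplicated, and tokenized
    # before being released as open training data.
    Path("corpus.txt").write_text("\n\n".join(corpus), encoding="utf-8")
```

For manuscripts, open HTR tools such as Kraken, or platforms like Transkribus, play the role that Tesseract plays for print.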
Conclusion:
As we accelerate the digital transition, we must not overlook the immense potential of our world’s cultural heritage. Digitizing it is crucial for preserving history, democratizing knowledge, and enabling truly inclusive AI innovation.