Language models like GPT-4 and Claude have proven to be powerful tools in various applications, yet the shroud of secrecy surrounding their training data has raised concerns. In a bid to address this issue, the Allen Institute for AI (AI2) is taking a groundbreaking step by introducing a substantial text dataset named Dolma, which is both freely accessible and open to scrutiny.
Dolma, short for "Data to feed OLMo's Appetite," is the cornerstone of AI2's upcoming open language model project, OLMo (Open Language Model), and is designed to foster collaborative research within the AI community. Just as the OLMo model is intended to be freely modifiable and usable, AI2 advocates the same transparency and openness for the dataset that fuels its creation.
The unveiling of Dolma marks AI2’s first “data artifact” connected to OLMo. Luca Soldaini of AI2 elaborates on the data selection process, the reasoning behind specific methodologies, and the efforts to refine the dataset for optimal utilization in AI systems. While AI2 plans to provide a comprehensive paper detailing their work, a blog post currently offers insights into their approach.
Major AI players such as OpenAI and Meta publish some statistics about the datasets behind their models, but keep much of the detail proprietary. This approach not only hampers comprehensive scrutiny and advancement but also fuels speculation about potential ethical and legal concerns surrounding data acquisition, including the possibility of incorporating unauthorized content.
AI2’s illustrative chart underlines the gaps in information provided by even the largest and most recent models. Questions arise regarding the omitted details and reasons behind their exclusion. AI2’s initiative stands in contrast, aiming to furnish the AI community with a comprehensive understanding of dataset sources, processing steps, and decisions like text quality assessment and privacy preservation.
In an AI landscape marked by fierce competition, it is within the rights of companies to safeguard the intricacies of their model training processes. However, this strategy poses challenges for external researchers, as it renders their datasets and models inscrutable and challenging to replicate or study.
Dolma, introduced by AI2, is positioned as an antidote to such opacity. Distinguished by its openly documented sources and methodologies, Dolma surpasses previous open efforts in both scale and accessibility. At roughly 3 trillion tokens (the word-fragment units in which language models measure text), it is a pioneering endeavor that also prioritizes ease of use and permissive access.
Operating under the “ImpACT license for medium-risk artifacts,” Dolma’s usage guidelines require interested parties to:
- Provide contact information and articulate intended use cases.
- Disclose any derivative creations arising from Dolma.
- Distribute these derivatives under the same licensing terms.
- Refrain from applying Dolma to prohibited uses such as surveillance or disinformation.
For individuals concerned about their personal data being inadvertently included in the dataset, AI2 offers a specific removal request form. This form caters to individual cases, ensuring that privacy concerns are addressed on a case-by-case basis.
If these provisions align with your requirements, access to the Dolma dataset can be obtained through Hugging Face, ushering in a new era of transparency and collaboration in the realm of AI research.