AI Singapore, Google to build regional dataset for LLMs

AI Singapore and Google Research Asia-Pacific will build a corpus of open-source training data through Project SEALD to drive the development of LLMs in Southeast Asia

Aaron Tan, Informa TechTarget

Published: 11 Mar 2024 10:30

AI Singapore and Google Research Asia-Pacific have teamed up on a research project to build a corpus of training data that can be used to train, fine-tune and evaluate large language models (LLMs) in Southeast Asian languages.

Dubbed Project SEALD (Southeast Asian Languages in One Network Data), the initiative will drive the creation of training data in Indonesian, Thai, Tamil, Filipino and Burmese, and is expected to improve the performance of LLMs such as AI Singapore’s Sea-Lion.

Sea-Lion was trained primarily on data sourced from the internet, specifically the publicly available CommonCrawl dataset. However, much of the data was of low quality and had to be cleaned and pre-processed before it could be used to train the model.

Through Project SEALD, researchers will develop translocalisation and translation models, establish best practices for instruction tuning datasets, create tools to enable translocalisation at scale, and publish pre-training recipes for Southeast Asian languages.

The project will also involve industry players in areas such as data collection, curation and quality checks, as well as academia on evaluation and benchmarking techniques. Its datasets and output will be open-sourced to advance the development of regional LLMs and support local use cases.

For example, the project team is looking to improve communications with migrant workers in Singapore who may be more fluent in their mother tongues than in English. The datasets from Project SEALD could capture linguistic nuances within this community, enabling employers and the government to engage migrant workers more effectively.

Yolyn Ang, vice-president for knowledge and information partnerships at Google Asia-Pacific, said that by focusing on Southeast Asian languages, Project SEALD will significantly improve the existing corpus and evaluation benchmarks for these languages. “This will open new opportunities and make AI more inclusive, accessible, and helpful for individuals and businesses throughout the region,” she added.

Meanwhile, AI Singapore (AISG) is collaborating with Indonesian, Malaysian, and Vietnamese entities to develop datasets and applications for regional LLMs. It has also engaged partners in Thailand, the Philippines, and Indonesia to build resources on regional language syntax and semantics.

“The Sea-Lion LLM project has always been about building a community and ecosystem that will continuously work together to enhance the quality of the Sea-Lion data corpus and continuously improve Sea-Lion’s capabilities,” said Leslie Teo, AISG’s senior director of AI products.
