Nvidia releases synthetic dataset to support Singapore’s AI ambitions
The AI chip giant has developed a synthetic dataset of personas to help developers build AI models that understand Singapore’s demographic and cultural nuances without using personally identifiable information
Nvidia has released a synthetic dataset of Singapore personas to help local developers build artificial intelligence (AI) models that better reflect the city-state’s cultural and linguistic diversity.
Dubbed Nemotron-Personas-Singapore, the dataset was launched in collaboration with AI Singapore (AISG) and contains 888,000 synthetic personas, which are fictional profiles that reflect Singapore’s demographic distribution, cultural traits and other characteristics.
Training or fine-tuning models on this data allows developers to create AI agents that understand Singapore’s multi-cultural nuances without using sensitive real-world data.
Most foundation models are trained on publicly available information from the internet, which is predominantly English and Western-centric. This often results in models that misinterpret cultural facts or fail to grasp local intent, posing challenges for regional developers.
The Nemotron-Personas-Singapore dataset addresses the issue by providing 148,000 records, each with six variations of a persona, for a total of 888,000 personas. The records cover 38 distinct fields, ranging from basic demographics to contextual details such as occupation and life stage. All are grounded in Singapore’s public census data, along with name distribution data from NLB Name Authorities and CEA salesperson information on data.gov.sg.
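To make the record-to-persona arithmetic concrete, here is a minimal Python sketch. It is illustrative only: the field names (`age`, `occupation`, `personas`) are hypothetical and are not taken from the dataset's actual schema. It simply shows how records that each carry six persona variations expand into individual personas.

```python
# Illustrative only: the field names are hypothetical, not the dataset's real schema.
RECORDS = 148_000          # number of records reported for the dataset
VARIATIONS_PER_RECORD = 6  # persona variations reported per record


def total_personas(records: int, variations: int) -> int:
    """Return the persona count when every record carries `variations` personas."""
    return records * variations


def flatten(record: dict) -> list[dict]:
    """Expand one record into one row per persona variation, copying the shared fields."""
    shared = {key: value for key, value in record.items() if key != "personas"}
    return [{**shared, "persona": text} for text in record["personas"]]


sample = {
    "age": 34,              # hypothetical demographic field
    "occupation": "nurse",  # hypothetical contextual field
    "personas": [f"variation {i}" for i in range(1, VARIATIONS_PER_RECORD + 1)],
}

rows = flatten(sample)
print(total_personas(RECORDS, VARIATIONS_PER_RECORD))  # 888000
print(len(rows))                                       # 6
```

Under these assumptions, 148,000 records with six variations each yield the 888,000 personas cited for the dataset.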
Because the data is entirely synthetic and generated by NeMo Data Designer, Nvidia’s synthetic data generation microservice, government agencies and businesses can build AI applications that reflect the local population while avoiding the legal and ethical risks associated with using personally identifiable information.
For example, financial institutions could build AI applications that perform persona-based evaluations, bias testing, suitability checks, and stress testing for vulnerable scenarios without reusing sensitive customer data. In healthcare, the dataset could be used to develop patient-facing chatbots and medical translation systems that work across patient demographics, literacy levels, and care settings.
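A persona-based bias test of the kind described could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the persona fields (`age_band`, `monthly_income`), the `toy_decide` function standing in for a model under test, and the choice of the largest gap in approval rates between groups as the metric. It is not Nvidia's or AISG's method.

```python
from collections import defaultdict


def approval_rates(personas, decide, group_field):
    """Return the share of approved personas for each value of `group_field`."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for persona in personas:
        group = persona[group_field]
        totals[group] += 1
        if decide(persona):
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def max_gap(rates):
    """Return the largest difference in approval rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical synthetic personas; the field names are invented for this sketch.
personas = [
    {"age_band": "21-35", "monthly_income": 4200},
    {"age_band": "21-35", "monthly_income": 2800},
    {"age_band": "65+", "monthly_income": 4200},
    {"age_band": "65+", "monthly_income": 2800},
]


def toy_decide(persona):
    """Stand-in for the model under test: approves on income alone."""
    return persona["monthly_income"] >= 3000


rates = approval_rates(personas, toy_decide, "age_band")
gap = max_gap(rates)
print(rates, gap)
```

In practice, `toy_decide` would be replaced by a call to the application being evaluated, and a large gap between groups with otherwise similar profiles would flag the system for closer review.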
The launch of the dataset comes amid growing interest in sovereign AI, where countries seek to build and control their own AI infrastructure and intelligence rather than rely on models imported from US and Chinese tech giants.
“Singapore has established itself as a leader in building AI systems that are both innovative and responsibly governed,” Nvidia said in a blog post announcing the initiative. “Through interoperable governance frameworks, applied privacy research, and clear guidance on synthetic data, the country has demonstrated that AI sovereignty is ultimately about trust, transparency, and alignment with local norms.”
The dataset is licensed under Creative Commons Attribution 4.0 (CC BY 4.0), which allows for both commercial and public-sector use. It works with Nvidia’s Nemotron models and other open-source large language models (LLMs), such as AISG’s Sea-Lion, which was specifically built to understand the languages and contexts of Southeast Asia.
The Singapore release is part of a wider rollout of Nvidia’s synthetic persona collection, which includes similar datasets for other markets such as the US, Brazil, Japan, and India.
Earlier, in 2024, AISG also teamed up with Google Research Asia-Pacific on a research project to build a corpus of training data that can be used to train, fine-tune, and evaluate LLMs in Southeast Asian languages including Indonesian, Thai, Tamil, Filipino, and Burmese.
The datasets and output from the project, which involved industry players in areas such as data collection, curation, and quality checks, as well as academia on evaluation and benchmarking techniques, were open-sourced to advance the development of regional LLMs and support local use cases.
Read more about AI in ASEAN
- Malaysia’s Ryt Bank is using its own LLM and agentic AI framework to allow customers to perform banking transactions in natural language, replacing traditional menus and buttons.
- Vietnam’s Techcombank has built AI capabilities to deliver hyper-personalised offers to 15 million customers and expand its footprint beyond its traditional affluent base.
- Indonesia’s GoTo has migrated half its infrastructure to Alibaba Cloud, paving the way for AI initiatives to solve real-world business problems and support local languages.
- Underwater video cameras equipped with AI smarts have been deployed in the waters off Pangatalan island in the Philippines to track the barometers of coral reef health.
