
Indian large language models gain momentum

Indian large language models trained on Indic languages are now being used by businesses and governments to better serve the needs of a diverse multilingual country

India is home to great writing systems such as the Brāhmī and the Kharosthī that date back centuries. It’s also a land where languages such as Sanskrit, Tamil and Kannada have an oral history stretching back more than 1,000 years.

Today, modern India boasts 22 official languages, numerous dialects and a heterogeneous terrain where tone, accent and vocabulary can vary every few kilometres. This cultural diversity lends itself to the development of local large language models (LLMs) that better suit the needs of a multilingual country.

Most global LLMs are trained largely on English-language data sourced from Reddit and the wider internet, which is not representative of India, said Ankush Sabharwal, CEO of CoRover, the conversational artificial intelligence (AI) startup behind BharatGPT, India’s homegrown generative AI (GenAI) initiative.

“When Indians ask questions, the way they ask and their approach is different,” said Sabharwal, noting that BharatGPT has been powering virtual assistants in well over 100 live implementations.

Sabharwal claimed that CoRover has amassed over 1.3 billion users who can converse with the company’s virtual assistant in 22 languages by text and 14 languages by voice. CoRover is already running pilots with a range of customers, from banks and the Chennai police force to Indian states working on e-government projects.

Another Indian LLM initiative that has been making headway is Krutrim. Backed by ride-hailing giant Ola, Krutrim was trained on over two trillion tokens and is claimed to have a large representation of Indic language tokens.

There’s also the Google-backed Project Vaani, which has 12 states and 80 districts represented in one of the largest datasets of Indian dialects. The data is expected to be open sourced through platforms such as Bhashini under the Ministry of Electronics and Information Technology’s national language translation initiative, and spur the development of automatic speech recognition and natural language processing technologies that better understand how Indians speak.


Another LLM, OpenHathi 7B Hi, is said to rival GPT-3.5’s performance in Indic languages. Developed by AI startup Sarvam AI, it was trained in a two-stage process that included bilingual language modelling so that the model can operate cross-lingually across tokens. OpenHathi 7B Hi was built on Meta’s Llama 2 7B.

When translating Devanagari Hindi to English, OpenHathi 7B Hi outperformed GPT-3.5 and GPT-4 on the Bilingual Evaluation Understudy (BLEU) metric, which measures the similarity between machine-translated text and high-quality reference translations.

It trailed slightly behind IndicTrans2 and Google Translate, both of which have been fine-tuned for translation. In translating English to Devanagari Hindi, the model was on par with IndicTrans2 and Google Translate, and outperformed GPT-3.5 and GPT-4 by a fair margin.
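To give a sense of what a BLEU score captures, here is a minimal sketch in Python. It is a simplified, sentence-level version that assumes whitespace tokenisation and a single reference translation; the function name simple_bleu and the example sentences are illustrative and do not come from Sarvam AI's evaluation. Published comparisons typically use corpus-level scoring with standard tooling, so the numbers from this sketch are not comparable to the results described above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (illustrative only)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped overlap: an n-gram is credited at most as often as it
        # appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or overlap == 0:
            # No smoothing in this sketch: any empty precision gives zero.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty discourages candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the train to delhi leaves at noon"
print(simple_bleu("the train to delhi leaves at noon", reference))
print(simple_bleu("the train to delhi departs at noon", reference))
```

An exact match scores 1.0, while changing a single word lowers the score because fewer of the longer word sequences still match the reference.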

Powering local applications

There is plenty of scope for local governments and businesses to embrace the use of Indian LLMs. Indian Railways’ ticketing platform, IRCTC, for example, has used its AskDisha conversational AI chatbot to reduce costs and generate revenue. Powered by CoRover, the chatbot resolves over 150,000 passenger queries daily with 90% accuracy.

Indraprastha Gas, a natural gas distribution company, is also using a chatbot dubbed Ask Maitri for users to file complaints and enquire about products and services across different channels. Within six weeks of the chatbot’s launch, the company reduced call volumes by over 35%.

There’s also Asian Tobacco Company, whose Ask ATC bot has reduced machine downtime by more than 30%, with support for vernacular languages, including Kannada, Tamil and Telugu.

LLM challenges

Inadequate data, scalability and gaps in access are issues that developers of Indian LLMs will need to grapple with.

Sumit Singh, CEO and co-founder of DashLoc, an AI marketing platform that uses a local LLM to power its services, pointed to challenges such as acquiring high-quality training data and balancing accuracy with computational resources.

Gaurav Kheterpal, founder and CEO of Vanshiv Technologies, a technology consulting firm, believes the data is out there, and that it’s just a question of collating, cleansing and utilising it.

“The big question, of course, is around how such data can be responsibly used by these LLMs without compromising privacy and ensuring there are no infringements of any sort,” he said.

India’s Ministry of Electronics and Information Technology (MeitY) has recently given a firm reminder to LLM players of their regulatory responsibilities.

In an advisory released on 1 March 2024, MeitY said organisations would need to seek permission from the government to use and provide under-tested or unreliable AI models and software to users in the country.

After permission has been granted, the AI models and software can be deployed “only after appropriately labelling the possible and inherent fallibility or unreliability of the output generated”.

Rajeev Chandrasekhar, union minister of state for electronics and information technology, later clarified that the advisory was aimed at large platforms and not startups.
