This is a guest post for the Computer Weekly Developer Network written by Aparna Lakshmiratan, VP of Product at Snorkel AI – a company known for its technology that equips enterprises to build or adapt foundation models (FMs) and Large Language Models (LLMs) to perform with high accuracy on domain-specific entities.
Lakshmiratan writes in full as follows …
When done correctly, data labelling can improve the developer experience.
That assertion might sound strange. The data labelling process often represents a necessary and unpleasant slog. But taking a programmatic, iterative approach can build cross-functional understanding between subject matter experts and data teams as they encode organisational knowledge, helping them build better, more valuable models faster.
Data scientists and machine learning practitioners know how to handle data and build models. Point them at a desired outcome and they will use data to build an application that achieves that outcome efficiently.
But that outcome may not always align perfectly with the business’s evolving needs. In siloed organisations, a leader makes a plan and each team executes its portion. When the original goal is off-target, the final execution will be as well. This can lead to an unpleasant experience for model developers when disappointing results in production force an immediate overhaul.
By engaging with the logic behind labelling, developers can spot those disconnects and adjust accordingly.
Outsourced data labelling breaks the chain of understanding
Outsourced data labelling exacerbates the problem of siloed understanding. When outsourcing, companies provide the labelling contractor with raw data along with guidelines for when to apply each label.
Leaving aside the inherent security risks, this approach breaks the chain of knowledge within your organisation. Your subject matter experts define rules. The contractor applies them. Your data team receives labelled data with little to no understanding of why those labels apply. This leaves scant opportunity to learn previously unseen contours of the problem and adapt to them.
Investigating underlying assumptions
Programmatic labelling can alleviate the challenge of siloed knowledge through hands-on collaboration. Data scientists and internal experts work together to codify hard-earned intuition into scalable functions. Sometimes that’s simple keyword searches. Sometimes that’s sophisticated calls to large language models. Regardless, this process forces a conversation across teams that helps investigate underlying assumptions.
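As an illustrative sketch of what "codifying intuition into functions" can look like, here are two hypothetical keyword-based labelling functions with a naive aggregation step (the label names, keywords and aggregation logic are made up for illustration, not any particular vendor's API):

```python
# Illustrative labelling functions: each encodes one expert rule and
# abstains when the rule does not apply. All names are hypothetical.
ABSTAIN, LOAN, LEASE = -1, 0, 1

def lf_keyword_loan(doc: str) -> int:
    # Expert rule: loan contracts tend to mention a promissory note.
    return LOAN if "promissory note" in doc.lower() else ABSTAIN

def lf_keyword_lease(doc: str) -> int:
    # Expert rule: lease agreements name a lessee.
    return LEASE if "lessee" in doc.lower() else ABSTAIN

def label(doc: str) -> int:
    # Naive aggregation: the first non-abstaining function wins.
    # Real systems weigh and reconcile conflicting votes instead.
    for lf in (lf_keyword_loan, lf_keyword_lease):
        vote = lf(doc)
        if vote != ABSTAIN:
            return vote
    return ABSTAIN
```

Because each rule is an ordinary function, experts and data scientists can read, debate and revise it together, which is exactly where the cross-team conversation starts.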
Some rules defined by the subject matter experts will be too broad and lead to false positives. The defined set of rules may also leave large parts of the data untouched. Or, perhaps, the process reveals that the defined label schema doesn’t fit the application’s actual needs.
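One quick way to surface those gaps, sketched here with made-up documents and a single hypothetical rule, is to measure how much of the corpus the rules actually touch:

```python
# Sketch: computing the coverage of a single expert rule over a
# toy corpus. The documents and the rule are hypothetical.
docs = [
    "Signed promissory note for a 30-year mortgage.",
    "Quarterly earnings summary.",
    "Office lease naming the lessee and lessor.",
]

def rule_mentions_note(doc: str) -> bool:
    # Expert rule: loan documents mention a promissory note.
    return "promissory note" in doc.lower()

covered = sum(rule_mentions_note(d) for d in docs)
coverage = covered / len(docs)
print(f"coverage: {coverage:.0%}")  # only 1 of 3 documents matched
```

Low coverage tells the team the rule set leaves data untouched; inspecting the matches tells them whether a rule is too broad.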
We once worked with a large U.S. bank that needed a model to classify loan documents. The bank couldn’t outsource data labelling due to the documents’ sensitivity. So, internal experts sorted the contracts into eight categories by hand. This took six months. Then, the line of business leaders realised that the task actually called for 30 categories. Faced with the prospect of another six-month labelling project, the bank looked for another solution and settled on programmatic labelling, much to their internal experts’ relief.
The scale of that outcome was extreme, but the pattern is common. People working on labelling projects frequently discover that schemas need adjustments. Perhaps a less-important label occurs so rarely that it should be ignored. Other labels may need to be combined or split. Programmatic labelling allows subject matter experts and machine learning practitioners to discover and account for these initial shortcomings in-flight instead of waiting for deployment feedback.
Understanding the “why”
Programmatic labelling helps data scientists and machine learning practitioners understand the “why” of a project.
Building a model involves tradeoffs. Data scientists love to talk about headline metrics such as accuracy, precision and F1 scores, but business settings demand more nuance.
In the case of the bank above, the institution needed to identify which customers it was legally obligated to pay. Misclassifying those contracts carried far greater risk than misclassifying others.
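That asymmetry can be made explicit in the decision logic. As a minimal sketch with hypothetical cost figures, a team might flag a contract for human review whenever the expected cost of missing an obligation outweighs the cost of an unnecessary review:

```python
# Sketch: a cost-sensitive decision threshold. The cost ratio is a
# hypothetical stand-in for whatever the business actually faces.
COST_FALSE_NEGATIVE = 10.0  # missing a contract the bank must pay
COST_FALSE_POSITIVE = 1.0   # needlessly reviewing an ordinary contract

def flag_for_review(p_obligated: float) -> bool:
    # Flag when the expected cost of ignoring the contract exceeds
    # the expected cost of reviewing it.
    expected_miss_cost = p_obligated * COST_FALSE_NEGATIVE
    expected_review_cost = (1 - p_obligated) * COST_FALSE_POSITIVE
    return expected_miss_cost > expected_review_cost
```

With a 10:1 cost ratio the break-even probability drops to 1/11, so even fairly uncertain predictions get routed to a human, which is precisely the kind of tradeoff headline accuracy metrics hide.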
Understanding the “why” behind a model helps data teams make the right tradeoffs and create the greatest possible value for the company. Ideally, this leads to fewer revisions, more projects completed and happier developers.