Yahoo develops and contributes to the Apache open source project Hadoop. Rising data volumes and the high licensing fees of proprietary software were the key reasons behind the company’s deployment of an open source BI tool, says Rohit Chatter, BI & Data Architect - Data & Insights Group (R&D), Yahoo India.
Why open source BI tool
A decision to invest in an open source BI tool was taken following the company’s search deal with Microsoft in July 2009. “Yahoo Search, being powered by Microsoft’s Bing, was expected to result in larger datasets running into petabytes,” Chatter recalls.
The open source BI tool project was conceived when data analysts and internal account managers at Yahoo demanded that data be processed through cubes for complex report generation. Slicing and dicing unknown data and presenting it in a dimensional form were Yahoo’s main business requirements.
The company was already using Oracle’s BI tool, but its high licensing costs became a challenge as data volumes rose. The company therefore opted for an open source BI tool, Apache Hadoop, to save on licensing costs. Hadoop was chosen because Yahoo invests in the Apache Hadoop project (an open source Java framework) and supports it through the Yahoo Developer Network.
“Yahoo wanted to use grid using Hadoop and PIG to implement enterprise data warehouse to handle petabytes of data. The action plan decided upon was to develop reporting using private cloud model,” Chatter informs.
As the business models of the two companies were different, the parameters used to measure business performance needed to be revised. The existing BI system housed a few hundred reports that spanned units such as marketplace, finance ops, sales, and a few others. These now had to be redelivered in a short span of time with equal functionality.
This posed a challenge given the complexity of the deal, the systems involved, and the immovable date for the deal going live. Yahoo took a two-pronged strategy:
- It provided operational reporting through a commercial BI tool over the RDBMS.
- It built a basic reporting system on top of the Hadoop grid, using a metadata layer that defined the dimensional model for ASCII feeds.
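The article does not describe how this metadata layer was implemented. Purely as an illustration of the idea, the Python sketch below shows one way a small metadata description could tell a reporting job which columns of a delimited ASCII feed are dimensions and which are measures, and then roll the measures up along any chosen dimensions. The feed layout, field names, and the functions parse_feed and roll_up are all hypothetical and are not taken from Yahoo's system; a grid-scale version would run the same logic as Hadoop or Pig jobs rather than in a single process.

```python
from collections import defaultdict

# Hypothetical metadata layer: describes how to read a delimited ASCII feed
# as a dimensional model. Every field name here is invented for illustration.
FEED_METADATA = {
    "delimiter": "\t",
    "fields": ["date", "region", "product", "clicks", "revenue"],
    "dimensions": ["date", "region", "product"],
    "measures": {"clicks": int, "revenue": float},
}


def parse_feed(lines, meta):
    """Turn raw ASCII lines into dicts keyed by the field names in the metadata."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines in the feed
        values = line.split(meta["delimiter"])
        row = dict(zip(meta["fields"], values))
        # Cast measure columns to their numeric types; dimensions stay as text.
        for name, cast in meta["measures"].items():
            row[name] = cast(row[name])
        yield row


def roll_up(rows, meta, group_by):
    """Sum every measure for each combination of the requested dimensions."""
    unknown = [d for d in group_by if d not in meta["dimensions"]]
    if unknown:
        raise ValueError(f"not a dimension: {unknown}")
    totals = defaultdict(lambda: {m: 0 for m in meta["measures"]})
    for row in rows:
        key = tuple(row[d] for d in group_by)
        for m in meta["measures"]:
            totals[key][m] += row[m]
    return dict(totals)


# Made-up sample feed, one record per line, tab-separated.
SAMPLE_FEED = [
    "2010-01-01\tIN\tsearch\t10\t1.5",
    "2010-01-01\tUS\tsearch\t20\t4.0",
    "2010-01-02\tIN\tdisplay\t5\t0.5",
]

# "Slice" the feed by region: one row of totals per region.
report = roll_up(parse_feed(SAMPLE_FEED, FEED_METADATA), FEED_METADATA, ["region"])
```

Because the dimensional model lives in the metadata rather than in the code, a new report along a different dimension (for example `["date", "product"]`) needs only a different `group_by` list, not a new parser.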
What Hadoop delivers
Although Yahoo has not been able to quantify the return on investment, the benefits of deploying the open source BI tool were quick to show. The number of people required to handle reporting requirements has decreased by almost half.
The time taken to generate reports has also fallen considerably. “Reports that used to take five days for generation now take one day. Business users get more granular data and analytical information even after mining and processing terabytes of data. Frameworks can also be customized to support user defined functions,” says Chatter.
Future development at Yahoo will focus on the user interface and the needs of power users.