A New Datawarehouse Strategy for Software-as-a-Service Companies
Many Software-as-a-Service (SaaS) start-ups discover enormous value in aggregating data to provide their customers with insights into industry trends and best practices.
For example, Intuit asks new QuickBooks customers a few questions and then automatically configures a chart of accounts based on the aggregated experience of other similar customers. Healthcare industry SaaS providers help their physician practice customers discover and notify patients who are candidates for a new vaccine. These use cases and thousands more depend on Datawarehouses as the platform for data aggregation and analytics.
When SaaS startups first establish their Datawarehouse platform, they often start by building a 'Star Schema Architecture' Datawarehouse on the same commodity hardware and open-source software platform they use for their transactional SaaS offering. This is a good place to start: it is low-cost and familiar. But as the SaaS provider grows, data volumes grow too, and new requirements surface to add data sets and use more sophisticated analytics. The commodity and open-source platform quickly becomes a bottleneck. The capacity needs of the data Extract-Transform-Load (ETL) process crowd out reporting, and queries run more slowly, because the platform does not scale linearly. When this happens, a new architectural approach may be called for.
A Datawarehouse based on a Massively Parallel Processing (MPP) architecture, or MPPA, eliminates the Star Schema and its fact tables, indexes and data marts, which reduces Datawarehouse storage needs by 50%. Furthermore, an MPP architecture uses only the dimension table data and compresses it by 50% or more. The two reductions compound: half of the original storage remains once the star-schema structures are gone, and compressing that half by 50% leaves one quarter. Ultimately, Datawarehouse storage requirements can be reduced by 75% or more.
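The storage arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `remaining_fraction` is hypothetical, and the two 50% figures are the ones quoted in this article, not measurements.

```python
def remaining_fraction(*reductions):
    """Return the fraction of original storage left after applying
    each fractional reduction in turn (reductions compound)."""
    remaining = 1.0
    for r in reductions:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reduction must be between 0 and 1")
        remaining *= 1.0 - r
    return remaining


# Figures quoted in the article (assumed here, not measured):
schema_reduction = 0.50  # dropping fact tables, indexes and data marts
compression = 0.50       # compressing the remaining dimension data

remaining = remaining_fraction(schema_reduction, compression)
total_reduction = 1.0 - remaining

print(f"Storage remaining: {remaining:.0%}")        # 25%
print(f"Total reduction:   {total_reduction:.0%}")  # 75%
```

Because the reductions multiply rather than add, a higher compression ratio on the dimension data pushes the total beyond 75%, which is consistent with the "or more" in the claim.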
In addition, eliminating star-schema fact tables, indexes and data marts means the ETL process can be much more streamlined. SaaS companies that have switched to an MPP architecture have re-purposed more than 50% of their ETL capacity.
One example of an MPPA platform combines IBM's PureData for Analytics (formerly Netezza) with Fuzzy Logix analytics software.
EMR case study
One SaaS provider of electronic medical records (EMR) systems to physician practices was growing both organically and by acquiring other SaaS providers. Its challenge was to add the historical data from the acquired systems to the Star-Schema Datawarehouse. This required transforming each of the acquired EMR records to match the related dimension table structure and updating key field values to match the formats and values of the existing dimension data. This looked like a prohibitively expensive effort.
Instead, this SaaS provider migrated to an MPP architecture using IBM PureData for Analytics and Fuzzy Logix analytics software. They loaded their existing Datawarehouse dimension data and reduced their storage requirement by more than 50%. They loaded the dimension data from the acquired systems as well, without transforming it to a single record layout and without the fact tables and data marts. This reduced the ETL platform capacity requirements by more than 50% as well, which allowed them to re-purpose most of their ETL platform.
Ultimately, the classification and clustering algorithms that enabled the Datawarehouse to identify candidate patients for new vaccines and other treatments ran in a small fraction of the time required by the Star-Schema Datawarehouse, because the Fuzzy Logix analytics software took advantage of the distributed processing capability of the MPP hardware.
The SaaS provider then had more capacity available for future acquisitions and a repeatable and efficient process for extracting and loading the data without significant transformation.
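To make the transformation effort in this case study more concrete, the sketch below shows the kind of per-record conforming step a star-schema load would have needed for each acquired system. Everything in it is hypothetical: the field names, the `FIELD_MAP` and `GENDER_CODES` tables, the key prefix and the `conform_record` helper are illustrative assumptions, not the provider's actual schema or process.

```python
# Hypothetical mapping from one acquired system's field names to the
# column names of the existing patient dimension table.
FIELD_MAP = {"pt_id": "patient_key", "dob": "birth_date", "sex": "gender"}

# Hypothetical recoding of key field values to the existing dimension's values.
GENDER_CODES = {"M": "male", "F": "female"}


def conform_record(record, key_prefix):
    """Reshape one acquired record to the (assumed) dimension layout."""
    # Rename fields to match the dimension table structure.
    conformed = {dst: record.get(src) for src, dst in FIELD_MAP.items()}
    # Rewrite the key so it cannot collide with existing dimension keys.
    conformed["patient_key"] = f"{key_prefix}-{record['pt_id']}"
    # Recode values to match the existing dimension's value set.
    conformed["gender"] = GENDER_CODES.get(record.get("sex"), "unknown")
    return conformed


acquired = {"pt_id": 1042, "dob": "1980-03-15", "sex": "F"}
row = conform_record(acquired, key_prefix="ACQ1")
print(row)
```

A mapping like this has to be written, tested and maintained for every acquired system and every dimension, which is where the cost accumulates. The approach described in the case study loads the acquired records in their original layout and skips this step.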
For a demonstration of the differences between a star-schema Datawarehouse based on a commodity and open-source platform and an MPPA Datawarehouse, contact HHG.