One of the things that strikes me the most about the data science community outside of Silicon Valley is how afraid people seem to be of large datasets. Indeed, not a day goes by that I don’t hear from users of my own open datasets, including those at universities with access to substantial computing clusters and HPC centers, complaining that hundreds of gigabytes, let alone terabytes, are so far beyond their ability to analyze as to be utterly inaccessible to them. How have we reached a world in which five years ago Google engineers were casually running sorting benchmarks on 50PB, three years ago Facebook was generating 4 petabytes of new data per day, and today real-world companies are storing hundred-petabyte archives in Google’s BigQuery platform, yet outside Silicon Valley I hear data scientists talk about analyzing terabytes as pushing the boundaries of the possible?
Having spent nearly a decade in the NSF-funded supercomputing world, working my way up from high school intern to undergraduate student, graduate student and then staff affiliate of the supercomputer center that brought Mosaic (the first graphical web browser) to the masses, I saw firsthand the academic world’s fixation on processing power over storage capability. At least in the United States, academic supercomputers were historically designed to run scientific simulations and thus emphasized processing power over memory and storage. At the same time Google was running petabyte sorting benchmarks, I was struggling to store just a few terabytes of data in the academic world, at times seeing less than 5MB/s of disk performance because of system designs that didn’t account for multiple users all placing heavy loads on the disk system.
Yet, fast forward 18 years, and storage space and I/O performance are still the greatest limiting factors for big data research in the academic world. It was therefore astonishing to me when I first started working more closely with Silicon Valley half a decade ago that, for the first time in my professional career, the notion of petabytes and analyses using tens of thousands of processors was simply the norm, rather than a dream decades in the future.
On a daily basis I see myriad academic papers and press releases cross my inbox touting the latest groundbreaking data analysis, only to find that the “petabytes” or “hundreds of terabytes” touted in the announcement was just the size of the original corpus from which the researchers extracted a few tens of gigabytes for their actual analysis. A “10 petabyte” analysis becomes “we used a search tool to extract 10 gigabytes from the 10 petabyte archive and our analysis examined just those 10 gigabytes.” Search Google Scholar and you’ll turn up myriad data science papers that open by touting the petascale world of industry and Silicon Valley before doubling back to the tens-of-gigabytes analysis the authors themselves performed.
This raises a fascinating question – why are petabyte analyses so elusive in a world drowning in data? Likely the single biggest reason is cost. Ordering 125 8TB external USB drives from Amazon would give you a combined total of around a petabyte of disk, but then you have to factor in the additional space needed for RAID5 or RAID6 to protect your data. It is also unlikely that you could simply daisy-chain 125+ USB drives to a single PC to create a usable one-petabyte partition, and even if you could find a way of making it work, the performance would be simply unusable without adding a cluster of machines to spread the load. Even then, if every drive performed perfectly, if you bought enough computers to fully saturate all of those drives, and if your data was laid out to allow 180MB/s sustained reads per drive, you’d still be looking at half a day to conduct a complete table scan of all of your data – and you likely wouldn’t have enough CPU power to do much with that data once you’d read it.
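The "half a day" figure above follows directly from the numbers in the paragraph. A quick sketch of the arithmetic (using decimal units, 1 TB = 10^6 MB, and assuming every drive sustains its full sequential read rate – wildly optimistic for a real USB array):

```python
# Back-of-the-envelope check of the DIY petabyte array described above.
# The figures (125 drives, 8 TB each, 180 MB/s sustained reads) come from
# the text; everything else is ideal-case arithmetic.
DRIVES = 125
DRIVE_TB = 8
READ_MB_S = 180  # optimistic per-drive sustained sequential read

total_tb = DRIVES * DRIVE_TB          # 1,000 TB, i.e. about one petabyte raw
aggregate_mb_s = DRIVES * READ_MB_S   # 22,500 MB/s if every drive is saturated

total_mb = total_tb * 1_000_000       # decimal units: 1 TB = 10^6 MB
scan_hours = total_mb / aggregate_mb_s / 3600

print(f"{total_tb} TB raw capacity, {aggregate_mb_s} MB/s aggregate throughput")
print(f"Full table scan: {scan_hours:.1f} hours")  # ~12.3 hours, i.e. half a day
```

Even this best case ignores RAID parity overhead, USB bus contention, and seek patterns, all of which would only make the scan slower.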
In short, even in today’s world of commodity 10TB USB drives, building a production one-petabyte storage system that is both durable and high-performance is far from cheap, and the maintenance, power and cooling costs mean it isn’t something you can run from your spare bedroom as a hobby project.
Yet, the unimaginable economies of scale that Google, Amazon and the rest of the commercial cloud vendors are able to achieve have allowed them to commoditize petascale storage. Google’s Coldline storage costs just $7,000 a month per petabyte, and the data can be accessed instantly (with milliseconds of delay) and securely downloaded anywhere in the world. Moreover, included in this cost is not only all of the hardware needed to store that data, but the power and cooling and some of the world’s best system administrators and engineers managing it all and keeping everything secure. Keep in mind as well that the way Google and the other cloud vendors achieve their high levels of availability and durability is by making multiple copies of your data (Google Persistent Disk makes 3.3 copies of your data, for example). This means that for $7,000 a month you are storing just a petabyte of data, but behind the scenes you may actually be consuming several petabytes of physical hard drives to provide the necessary redundancy and durability. You can even analyze your data piecemeal by streaming it through a set of Compute Engine instances, taking advantage of Google’s pricing model, in which data transfers between its storage and compute offerings in the same location are free.
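To make the redundancy point concrete, here is the arithmetic implied above, taking the article’s $7,000/month-per-petabyte figure and, purely for illustration, applying the 3.3× replication factor the article cites for Persistent Disk (the actual replication scheme for Coldline is not stated here):

```python
# Illustrative arithmetic using the figures quoted in the text: $7,000/month
# per logical petabyte, and ~3.3 physical copies kept for durability (the
# 3.3x figure is the one cited for Google Persistent Disk; applying it to
# other storage classes is an assumption for illustration only).
MONTHLY_COST_USD = 7_000
LOGICAL_PB = 1.0
REPLICATION = 3.3

physical_pb = LOGICAL_PB * REPLICATION                    # ~3.3 PB of actual disk
cost_per_logical_gb = MONTHLY_COST_USD / 1_000_000        # $/GB/month you pay
cost_per_physical_gb = cost_per_logical_gb / REPLICATION  # spread over real disks

print(f"Physical footprint behind 1 logical PB: {physical_pb:.1f} PB")
print(f"You pay ${cost_per_logical_gb:.4f}/GB/month; "
      f"per physical GB that is ${cost_per_physical_gb:.4f}")
```

In other words, the vendor is delivering multiply-replicated, professionally managed storage for well under a cent per gigabyte per month.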
Finally, for customers who don’t need to access their stored data instantly (such as offsite backups, where an access delay of minutes to hours may be fine), Amazon’s Glacier storage offers a petabyte of storage even more cheaply, at just $4,000 a month, all with incredible redundancy and durability.
Storing a petabyte of data in the cloud is fairly trivial, but how do you actually analyze it? Enter the emerging world of cloud-based petascale analytics platforms like Google BigQuery. BigQuery essentially bolts a massive, purpose-built, on-demand analytics cluster on top of Google’s global storage infrastructure, allowing you to rent thousands or even tens of thousands of processors in durations of seconds to crunch through your data. Storing a petabyte in BigQuery has the same pricing as the rest of Google’s cloud storage offerings, down to around $10,000 a month per petabyte for long-term storage. Some of Google’s commercial customers store more than a hundred petabytes in BigQuery and add a petabyte or more a day.
Yet, where tools like BigQuery truly shine is in their ability to leverage the cloud to brute force through all of that data. In the case of BigQuery, a single line of ordinary SQL can perform a table scan over an entire petabyte in just under 3.7 minutes. Conducting an analysis over the entire Internet Archive’s 20-year 15PB web archive would take just under 56 minutes. Most importantly, since BigQuery is essentially a specialized cluster rented in time intervals of seconds, performing petabyte analyses doesn’t require purchasing racks of permanent hardware or even spinning up a massive cloud cluster – you can simply rent precisely the amount of compute power it requires to process your data for that single query and have, for a few brief minutes, a few thousand processors capable of brute forcing their way through a petabyte in 3.7 minutes.
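The Internet Archive figure above is just linear scaling of the 3.7-minutes-per-petabyte rate. A minimal sketch, assuming scan time grows proportionally with data size (which holds only if the rented cluster’s throughput stays constant):

```python
# Scaling the article's BigQuery figure: a full table scan of 1 PB in ~3.7
# minutes. The linear-scaling assumption is mine, not a BigQuery guarantee.
MINUTES_PER_PB = 3.7

def scan_minutes(petabytes: float) -> float:
    """Estimated scan time, assuming throughput is constant across sizes."""
    return petabytes * MINUTES_PER_PB

print(f"1 PB:  {scan_minutes(1):.1f} min")
print(f"15 PB: {scan_minutes(15):.1f} min")  # 55.5 min -- "just under 56 minutes"
```

Compare that 55-minute figure with the half-day ideal-case scan of the DIY drive array earlier: the cloud cluster is roughly an order of magnitude faster, with no hardware to own.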
Putting this all together, we see that Silicon Valley has not only commoditized the petabyte, but has gone further, leveraging the unimaginable scale of its data centers to bring on-demand petascale “big data” analysis to reality. Looking to the future, data scientists who limit their analyses to examining micro extracts will find themselves further and further behind in a world in which we can explore the nuanced patterns of petabytes in just minutes. In short, through the power of the commercial cloud, we as data scientists need not fear the petabyte any longer.
This article was written by Kalev Leetaru from Forbes and was legally licensed through the NewsCred publisher network.