

2:01 pm ET
Jun 14, 2015

Big Data

IBM Wants to Push Spark, Real-Time Big Data Tool, Into Mainstream





  • By Steve Rosenbush
    • @Steve_Rosenbush
    • steven.rosenbush@wsj.com
    IBM’s Watson Experience Center in New York
    Steve Rosenbush/WSJ

    International Business Machines Corp. has thrown its weight behind Spark, an increasingly popular tool that is used to analyze large amounts of data in real time. Spark opens up all sorts of emerging business applications, such as the ability to quickly target ads toward people passing in front of a digital billboard.

    The number of such uses is expected to greatly expand as Internet-connected sensors become pervasive in physical objects, a phenomenon known as the Internet of Things. By working to push Spark into the mainstream now, IBM hopes to catch an emerging market at the very early stages.

    IBM isn’t the first company to embrace Spark, and Spark isn’t the only tool that can be used to analyze real-time data. But given the scale of IBM and its customer base, its support for the technology, already one of the fastest-growing open-source software projects, could be significant.

    Spark, which emerged from the University of California at Berkeley in 2009, addresses some of the limitations of Hadoop, an older open-source software framework that employs a distributed architecture to analyze large amounts of data. Hadoop is less suited to analyzing data in real time, and it has a reputation for being tricky for developers to use, according to Bob Picciano, senior vice president of IBM Analytics. Gartner Inc. has characterized the adoption of Hadoop as steady but slow, with 26% of surveyed companies saying they had a Hadoop project underway, and 46% saying they expected to within a few years. Some users have found it difficult to scale, as The Wall Street Journal has reported.

    “We think this will help take Big Data to a whole new space and unleash much more developer innovation, which has been somewhat encumbered by the limitations in Hadoop architecture and the limitations in how easy it is for developers to take their skills and apply them to things like real-time understanding of customer sentiment or risk analysis,” Mr. Picciano said. “Hadoop is good as a collection point of large amounts of historical information, and it can use data that is both structured and unstructured. One thing that is missing is the ease of use for developers and the speed to gain insights from data moving in and out of repositories.”

    He said Spark is less intricate for developers than MapReduce, the Hadoop component that allows it to scale. Spark can run on top of other software, such as the Hadoop Distributed File System (HDFS), or on its own, according to Mr. Picciano. That means developers don’t need to write different versions of Spark applications for different platforms, he said. And it picks up speed, he said, through more efficient clustering and in-memory operations.

    IBM is expected to formally announce Monday that it will embed Spark into its analytics and commerce platforms, and to offer Spark as a service on its Bluemix development platform. IBM also is expected to announce that it will assign more than 3,500 engineers and developers to work on Spark-related projects around the world, and that it will donate its IBM SystemML machine learning technology to the Spark open-source ecosystem. It also is expected to launch a Spark technology center in San Francisco and help educate data scientists and engineers in the use of Spark on a large scale.

    The announcements are expected to coincide with the opening of the Spark Summit on Monday in San Francisco.

    Under Armour Inc. is using Spark as it seeks an advantage in the burgeoning market for quantifying people’s health, CIO Journal reported. Data scientists at the company’s nutrition software unit, MyFitnessPal, are using the technology to comb through calorie data crowdsourced from its 80 million users. Chul Lee, the unit’s head of data engineering and science, says the team originally relied on Hadoop to process the 2.5 terabytes of data in its database. But it took days to churn through the data.

    Spark is part of a broader group of tools and companies that are pushing the frontiers of data analytics. Jeremy Burton, president of products and marketing at EMC Corp., told CIO Journal in March that the information infrastructure giant was launching a faster, more up-to-date, more scalable and open data platform.

    Several startups are working on analytics technology that approaches real time, too. “What we can do is know everyone who is on the call right now, and analyze whether they were at Starbucks yesterday,” said J. Andrew Rogers, the founder and CTO of a startup in Seattle called SpaceCurve Inc. Mr. Rogers, a database expert who said he worked on large-scale analytic and geospatial systems with Google Earth, developed new technology for SpaceCurve. He told CIO Journal it opens up new business applications such as targeting ads in real time to people within the vicinity of digital billboards.

    Bottlenose Inc., a startup based in Los Angeles, has developed technology that helps companies analyze what co-founder Nova Spivack calls “streaming data.” It can be used to spot real-time trends on social media, with applications in areas such as business, politics and government.

    IBM has bet on Spark from the very beginning. It was one of the founders of the Berkeley lab where Spark was first developed.

