Dataproc

1. Overview

  • Suitable for big data users who want control over cluster size and configuration.
  • Dataproc is a fast, easy, managed way to run Hadoop, Spark, Hive, and Pig on Google Cloud.
  • Apache Hadoop is an open source framework for Big Data.
  • Apache Hadoop is based on the MapReduce programming model.
  • In the MapReduce model, the map function processes input data in parallel to produce intermediate key/value results, and the reduce function builds a final result set from those intermediate results.
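
The shape of the model can be sketched with ordinary shell tools: the map step emits one record per word, the shuffle step (sort) groups identical words together, and the reduce step (uniq -c) folds each group into a count. This is only an illustration of the MapReduce idea, not Hadoop itself.

```shell
# Map: split input text into one word per line.
# Shuffle: sort brings identical words together.
# Reduce: uniq -c collapses each group into a (count, word) pair.
echo "to be or not to be" \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```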
  • Hadoop is commonly used to refer to Apache Hadoop and related projects such as Apache Spark, Apache Pig, and Apache Hive.
  • Dataproc can build a Hadoop cluster in 90 seconds or less on Compute Engine virtual machines.
  • Dataproc users have control over the number and type of virtual machines in the cluster.
  • Dataproc clusters can be scaled up or down while running, based on needs.
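
As a sketch of cluster control with the gcloud CLI: create a cluster with a chosen number and type of workers, then resize it while it is running. The cluster name, region, and machine type below are placeholder values.

```shell
# Create a cluster with 2 standard workers (name, region, and machine type are examples).
gcloud dataproc clusters create example-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --worker-machine-type=n1-standard-4

# Scale the running cluster up to 5 workers based on need.
gcloud dataproc clusters update example-cluster \
  --region=us-central1 \
  --num-workers=5
```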
  • Dataproc clusters can be monitored using Google Cloud's operations suite (Cloud Monitoring and Cloud Logging).
  • Running Hadoop jobs in Dataproc means users pay only for the hardware resources used during the life of the cluster.
  • Dataproc is billed in one-second clock-time increments, subject to a one-minute minimum.
  • Delete a Dataproc cluster when finished with it; billing continues until the cluster is deleted.
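
Because billing runs until the cluster is deleted, tearing a cluster down is a single command (cluster name and region are placeholder values):

```shell
# Delete the cluster to stop billing; --quiet skips the confirmation prompt.
gcloud dataproc clusters delete example-cluster \
  --region=us-central1 \
  --quiet
```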
  • Dataproc clusters can use preemptible Compute Engine instances for batch processing.
  • On Dataproc, Spark and Spark SQL can be used for data mining.
  • MLlib, Apache Spark's machine learning library, can be used to discover patterns through machine learning.
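
A sketch of submitting Spark work to a running cluster with gcloud. The cluster name and region are placeholders, and the jar path shown is where Dataproc images typically ship Spark's bundled examples.

```shell
# Submit a Spark job (the SparkPi example class) to a running cluster.
gcloud dataproc jobs submit spark \
  --cluster=example-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000

# Run an ad-hoc Spark SQL query against the same cluster.
gcloud dataproc jobs submit spark-sql \
  --cluster=example-cluster \
  --region=us-central1 \
  -e "SHOW DATABASES;"
```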
  • Preemptible VMs give Dataproc users a significant break in instance cost.
  • Preemptible instances are around 80 percent cheaper than standard instances.
  • To use preemptible VMs, jobs must be able to restart cleanly after an instance is reclaimed.
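
A sketch of adding preemptible capacity at creation time. Dataproc calls these secondary workers, and they default to preemptible instances; the cluster name and region are placeholder values.

```shell
# Create a cluster with 2 standard workers plus 4 secondary workers.
# Secondary workers are preemptible by default, which sharply cuts batch cost.
gcloud dataproc clusters create batch-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=4
```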