- Dataproc is suitable for big data users who want control over cluster size.
- Dataproc is a fast, easy, managed way to run Hadoop, Spark, Hive, and Pig on Google Cloud.
- Apache Hadoop is an open source framework for Big Data.
- Apache Hadoop is based on the MapReduce programming model.
- In the MapReduce model, a map function produces intermediate results from the input data, and the reduce function builds a final result set from all intermediate results.
- Hadoop is commonly used to refer to Apache Hadoop and related projects such as Apache Spark, Apache Pig, and Apache Hive.
- Dataproc can build a Hadoop cluster in 90 seconds or less on Compute Engine virtual machines.
- Dataproc users have control over the number and type of virtual machines in the cluster.
- Dataproc clusters can be scaled up or down while running, based on needs.
- Hadoop clusters can be monitored using Google Cloud Operations capabilities.
- Running Hadoop jobs in Dataproc lets users pay only for the hardware resources used during the life of the cluster.
- Dataproc is billed in one-second clock-time increments, subject to a one-minute minimum.
- When a Dataproc cluster is no longer needed, it must be deleted to stop billing.
- Dataproc clusters can use preemptible Compute Engine instances for batch processing.
- When using Dataproc, Spark and Spark SQL can be used for data mining.
- MLlib, Apache Spark's machine learning library, can be used to discover patterns through machine learning.
- Dataproc users get a significant break in the cost of instances with preemptible VMs.
- Preemptible instances are around 80 percent cheaper.
- To use preemptible virtual machines, jobs must be able to restart cleanly after interruption.
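
The cluster lifecycle the notes describe (create, scale while running, run jobs, delete to stop billing) might look like the following `gcloud` sketch. Cluster name, region, worker counts, and the example jar path are placeholders, and flag spellings may vary by gcloud version.

```shell
# Create a cluster (typically ready in about 90 seconds)
gcloud dataproc clusters create my-cluster --region=us-central1 --num-workers=2

# Scale the running cluster up when needed
gcloud dataproc clusters update my-cluster --region=us-central1 --num-workers=5

# Submit a Spark job to the cluster
gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

# Delete the cluster when finished, to stop billing
gcloud dataproc clusters delete my-cluster --region=us-central1
```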
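
The MapReduce model mentioned in the notes above can be sketched with a classic word count. This is an illustrative pure-Python sketch of the programming model, not the Hadoop API: the map phase emits intermediate (key, value) pairs, and the reduce phase builds the final result set from all intermediate results.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: combine all intermediate values for each key into a final result set."""
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)

docs = ["big data on dataproc", "big data with hadoop"]
result = reduce_phase(map_phase(docs))
# result: {'big': 2, 'data': 2, 'on': 1, 'dataproc': 1, 'with': 1, 'hadoop': 1}
```

In Hadoop proper, the map and reduce phases run in parallel across the cluster's worker nodes, with a shuffle step grouping intermediate pairs by key between them.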
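
The billing rule stated above (per-second increments with a one-minute minimum) can be expressed as a tiny helper. This is a hedged sketch of the stated rule only; it uses no real Dataproc prices.

```python
def billed_seconds(cluster_lifetime_seconds: int) -> int:
    """Per-second billing, subject to a one-minute (60 s) minimum."""
    return max(60, cluster_lifetime_seconds)

assert billed_seconds(45) == 60     # under a minute: the minimum applies
assert billed_seconds(300) == 300   # five minutes: billed exactly per second
```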