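Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing. Spark is a general-purpose parallel framework in the style of Hadoop MapReduce, open-sourced by the UC Berkeley AMPLab. It retains the advantages of Hadoop MapReduce, but unlike MapReduce, a job's intermediate output can be kept in memory, so it no longer has to be written to and read back from HDFS. This makes Spark better suited to the iterative algorithms common in data mining and machine learning.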
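The point about keeping intermediate results in memory can be illustrated with a toy model. The sketch below is plain Python, not Spark code: `CountingStore`, `iterate_mean`, `run_without_cache`, and `run_with_cache` are hypothetical names introduced here, and the store merely stands in for external storage such as HDFS. It counts how often an iterative computation has to reload its input.

```python
from typing import Callable, List


class CountingStore:
    """Hypothetical stand-in for slow external storage; counts full reads."""

    def __init__(self, values: List[float]) -> None:
        self._values = list(values)
        self.reads = 0

    def load(self) -> List[float]:
        # Every call models one full pass over the data on external storage.
        self.reads += 1
        return list(self._values)


def iterate_mean(load: Callable[[], List[float]], steps: int, lr: float = 0.5) -> float:
    """Gradient descent toward the mean; each step needs the whole dataset."""
    estimate = 0.0
    for _ in range(steps):
        data = load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= lr * gradient
    return estimate


def run_without_cache(store: CountingStore, steps: int) -> float:
    # Reloads the input from the store on every iteration.
    return iterate_mean(store.load, steps)


def run_with_cache(store: CountingStore, steps: int) -> float:
    # Loads once, then reuses the in-memory copy on every iteration.
    cached = store.load()
    return iterate_mean(lambda: cached, steps)
```

In this model, 20 iterations without caching trigger 20 full loads, while the cached variant loads once and arrives at the same estimate. In Spark itself, the comparable step is persisting a dataset with `cache()` or `persist()` before iterating over it.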


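This book gives you access to real-world documentation and examples for the Spark platform so that you can build large-scale, enterprise-grade machine learning applications.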

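Over the past decade, machine learning has made a series of remarkable advances. These breakthroughs are affecting our daily lives and having an impact on every industry. Next-Generation Machine Learning with Spark introduces Spark and Spark MLlib, then moves beyond the standard Spark MLlib library to more powerful third-party machine learning algorithms and libraries. By the end of the book, you will be able to apply your knowledge to real-world use cases through many practical examples and insightful explanations.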

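  • Get an introduction to machine learning, Spark, and Spark MLlib 2.4.x
  • Achieve lightning-fast gradient boosting on Spark with the XGBoost4J-Spark and LightGBM libraries
  • Detect anomalies with the Isolation Forest algorithm for Spark
  • Use the Spark NLP and Stanford CoreNLP libraries, which support multiple languages
  • Optimize ML workloads with the Alluxio in-memory data accelerator for Spark
  • Perform graph analysis with GraphX and GraphFrames
  • Perform image recognition with convolutional neural networks
  • Use the Keras framework and distributed deep learning libraries with Spark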


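This book is for data scientists and machine learning engineers who want to take their knowledge to the next level and use Spark with more powerful, next-generation algorithms and libraries than those in the standard Spark MLlib library. It also serves as a primer for aspiring data scientists and engineers who need an introduction to machine learning, Spark, and Spark MLlib.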



Research has long debated which is superior in predicting certain outcomes: statistical methods or the human brain. This debate has repeatedly been reignited by remarkable technological advances in the field of artificial intelligence (AI), such as solving tasks like object and speech recognition, achieving significant improvements in accuracy through deep-learning algorithms (Goodfellow et al. 2016), or combining various methods of computational intelligence, such as fuzzy logic, genetic algorithms, and case-based reasoning (Medsker 2012). One of the implicit promises underlying these advances is that machines will one day be capable of performing complex tasks, or may even supersede humans in performing them. This triggers new heated debates about when machines will ultimately replace humans (McAfee and Brynjolfsson 2017). While previous research has shown that AI performs well in some clearly defined tasks, such as playing chess, playing Go, or identifying objects in images, it is doubtful that an artificial general intelligence (AGI) able to solve multiple tasks at the same time can be achieved in the near future (e.g., Russell and Norvig 2016). Moreover, AI is rarely used to solve complex business problems in organizational contexts, and AI applications that solve complex problems remain mainly in laboratory settings rather than being implemented in practice. Since the road to AGI is still a long one, we argue that the most likely paradigm for the division of labor between humans and machines in the coming decades is Hybrid Intelligence. This concept aims to use the complementary strengths of human intelligence and AI so that together they perform better than either could separately (e.g., Kamar 2016).