The Big Data "Revolution"¶
- Sloan Digital Sky Survey https://www.sdss.org/
- So many images are generated that most of them will never be looked at...
- Genomic data: https://en.wikipedia.org/wiki/Genome_project
- Web crawling
- 20e9 web pages; storing just the pages (no images, etc.) takes roughly 400TB
- Social media data
- Twitter: about 500e6 tweets per day; YouTube: more than 300 hours of content uploaded per minute (and that figure is already several years old)
Three Aspects of Big Data¶
- Volume: data on the scale of terabytes or petabytes
- Requires new processing paradigms, e.g., distributed computing
- Velocity: data is generated at an unprecedented rate
- e.g., network traffic data, Twitter, climate/weather data
- Variety: data comes in many different formats
- Databases, but also unstructured text, audio, video... messy data calls for different tools
The MapReduce Framework¶
- Basic idea:
- Break the task into independent subtasks
- Specify how to combine the subtasks' results into your answer
- Independent subtasks are a key point:
- If we constantly need to share information, splitting up the task is inefficient,
- because we end up spending more time communicating than actually counting
Assumptions of MapReduce¶
- The task can be broken into pieces
- The pieces can be processed in parallel...
- ...with minimal communication between the processes.
- The results from the pieces can be combined to obtain the answer (a small sketch follows this list).
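Here is a minimal pure-Python sketch of these assumptions, using the standard-library multiprocessing module; the chunks and the count_words helper are made up for illustration:

from multiprocessing import Pool

def count_words(chunk):
    # an independent subtask: no communication with the other pieces
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["the cat sat", "on the mat", "the end"]    # hypothetical pieces of the data
    with Pool(processes=3) as pool:
        partial_results = pool.map(count_words, chunks)  # process the pieces in parallel
    print(sum(partial_results))                          # combine the results: 8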
MapReduce: the Workhorse of "Big Data"¶
- Hadoop, Google MapReduce, Spark, etc. are all based on this framework
- Specify a "map" operation to apply to every element of the dataset
- Specify a "reduce" operation that combines a list into the output
- Then we split the data across multiple machines and combine their results
MapReduce Is Not New to You¶
- You already know the map pattern:
- Python: [f(x) for x in mylist]
- ...and the reduce pattern:
- Python: sum([f(x) for x in mylist]) (a map followed by a reduce)
- The only new thing is the computational model (see the sketch after this list)
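As a minimal pure-Python sketch of the same two patterns using the built-in map() and functools.reduce() (mylist and f are made-up stand-ins):

from functools import reduce

def f(x):
    return x * x

mylist = [1, 2, 3, 4]
mapped = map(f, mylist)                      # the "map" pattern: apply f to every element
total = reduce(lambda a, b: a + b, mapped)   # the "reduce" pattern: combine into a single value
print(total)                                 # 1 + 4 + 9 + 16 = 30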
Word Count¶
- Suppose we have a big pile of books...
- e.g., Google ngrams: https://books.google.com/ngrams/info
- ...and we want to count how many times each word appears in the collection.
- Divide and conquer!
- Everyone takes one book and produces a list of (word, count) pairs.
- Merge the lists, adding up the counts that share the same word key (see the sketch after this list).
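A minimal pure-Python sketch of that strategy, where each "book" is just a string (the data here is made up for illustration):

from collections import Counter

books = ["the cat sat on the mat", "the dog sat"]        # hypothetical books

# Each "person" takes one book and produces (word, count) pairs
per_book = [Counter(book.split()) for book in books]

# Merge the lists, adding counts that share the same word key
total = Counter()
for counts in per_book:
    total.update(counts)

print(total)   # Counter({'the': 3, 'sat': 2, ...})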
The Basic Unit of MapReduce: the (key, value) Pair¶
- Examples:
- Language data: <word, count>
- Enrollment data: <student, major>
- Climate data: <location, wind speed>
- In some settings, the value can be a more complicated object
- e.g., lists, dictionaries, other data structures
- Social media data: <person, list_of_friends> (see the small example after this list)
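In Python terms, such pairs are just tuples, and the value may itself be a richer structure; the records below are made up for illustration:

enrollment = [("alice", "Statistics"), ("bob", "EE")]    # <student, major> pairs
winds = [("Ann Arbor", 12.5), ("Detroit", 8.0)]          # <location, wind speed> pairs
friends = [("carol", ["alice", "bob"])]                  # value is a more complex object: a list of friends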
A Typical MapReduce Program¶
- Read records (i.e., chunks of data) from a file
- Map: for each record, extract the information you care about, and emit it as <key, value> pairs
- Combine: sort and group the extracted <key, value> pairs by their keys
- Reduce: for each group, summarize, filter, aggregate, etc. to obtain some new value v2, and write the <key, v2> pair as one line of the result file (see the sketch after this list)
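A minimal pure-Python simulation of those stages; the record strings and variable names are made up for illustration:

from itertools import groupby

records = ["apple banana", "banana banana", "apple"]     # hypothetical input records

# Map: emit a <key, value> pair for every word in every record
pairs = [(word, 1) for record in records for word in record.split()]

# Combine: sort and group the pairs by key
pairs.sort(key=lambda kv: kv[0])

# Reduce: aggregate each group into a new value v2 and emit <key, v2>
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    v2 = sum(value for _, value in group)
    print((key, v2))          # ('apple', 2), ('banana', 3)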
Clarifying Terminology¶
- MapReduce: a framework for large-scale computation, originally developed at Google
- Later open-sourced through the Apache Foundation as Hadoop MapReduce
- Apache Hadoop: a collection of open-source tools from the Apache Foundation
- Includes Hadoop MapReduce, Hadoop HDFS, Hadoop YARN
- Hadoop MapReduce: implements the MapReduce framework
- Hadoop Distributed File System (HDFS): a distributed file system
- Designed for Hadoop MapReduce
- Runs on the same commodity hardware as MapReduce
MapReduce: Vocabulary¶
- Cluster: a collection of devices (i.e., computers)
- Networked for fast communication, typically used for distributed computing
- Jobs are scheduled by programs such as Sun/Oracle Grid Engine, Slurm, TORQUE, or YARN
- Node: a single computational "unit" in the cluster
- Roughly, computer == node, but each machine can host multiple nodes; usually commodity (i.e., non-specialized, cheap) hardware
- Step: a single map -> combine -> reduce "chain"
- A step need not include all three of map, combine, and reduce
So MapReduce Makes Things Simpler¶
- We don't have to worry about splitting the data, organizing communication between machines, etc.
- We only need to specify:
- Map
- Combine (optional)
- Reduce
- The Hadoop backend takes care of everything else.
Word Count in MapReduce: Version 1¶
Word Count in MapReduce: Version 2¶
Word Count in MapReduce: Version 3¶
MapReduce: How It Works Internally¶
- An MR job consists of:
- A job tracker or resource manager node
- Some number of worker nodes
- The resource manager:
- Schedules and assigns tasks to the worker nodes
- Monitors the nodes and reschedules tasks if a worker node fails
- The worker nodes:
- Carry out the computation as directed by the resource manager
- Communicate results to downstream nodes (e.g., Mapper -> Reducer)
Hadoop Distributed File System (HDFS)¶
- Hadoop's storage system
- The file system is distributed across multiple nodes on a network
- In contrast to having all of the files on a single computer
- Fault tolerant
- Multiple copies of each file are stored on different nodes
- If a node fails, the data can still be recovered
- High throughput
- Many large files that can be accessed by many readers and writers simultaneously
The mrjob Python Package¶
- Install: pip install mrjob (or conda, or from source...)
- Can run locally without a Hadoop instance...
- ...but can also run on Hadoop or Spark (a small example follows this list)
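A minimal mrjob word-count sketch, following the pattern in the mrjob documentation; the file name wordcount_mr.py is a made-up example:

# wordcount_mr.py  (hypothetical file name)
import re
from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Map: emit a <word, 1> pair for every word in the line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce: sum the counts for each word key
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run locally with, e.g., python wordcount_mr.py demo_file.txt; passing a different runner (e.g., -r hadoop) would send the same job to a cluster.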
Parallel Processing with Apache Spark¶
- Apache Spark is a computing framework for large-scale parallel processing
- Developed at UC Berkeley's AMPLab (now RISELab),
- now maintained by the Apache Foundation
- Provides implementations in Java, Scala, and Python
- And these can be run interactively!
Why Spark?¶
- "Wait, doesn't Hadoop/mrjob already do all of this?"
- Short answer: yes!
- Less short answer: Spark is faster and more flexible than Hadoop,
- and Spark appears to be displacing Hadoop in industry
- Spark still follows the MapReduce framework, but is better suited to:
- Interactive sessions
- Caching (i.e., data is kept in the RAM of the node that will process it, rather than on disk)
- Repeatedly updated computations (e.g., updating as new data arrives)
- Fault tolerance and recovery
Apache Spark: Overview¶
- Implemented in Scala
- A popular functional programming language (more or less...)
- Runs on top of the Java Virtual Machine (JVM): http://www.scala-lang.org/
- But Spark can be called from Scala, Java, and Python
- We will do all of our coding in Python
- PySpark: https://spark.apache.org/docs/0.9.0/python-programming-guide.html
- But almost everything you learn will carry over to the other supported languages with minimal changes
Running Spark¶
- Option 1: run in interactive mode
- Type pyspark at the command line
- PySpark provides an interface similar to the Python interpreter
- Scala, Java, and R also provide their own interactive modes
- Option 2: run on a cluster
- Write your code, then launch it through the scheduler with spark-submit
- https://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
- Two fundamental concepts
- SparkContext
- Corresponds to a connection to a Spark cluster
- Created automatically in interactive mode
- Must be created explicitly when running through the scheduler (we'll see an example shortly)
- Maintains information about where the data is stored
- Can be configured by supplying a SparkConf object (see the sketch after this list)
- Resilient Distributed Dataset (RDD)
- Represents a collection of data
- Stored distributed across the nodes, with fault tolerance (in the spirit of HDFS)
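A minimal sketch of creating a SparkContext explicitly via a SparkConf (the app name and local master are assumptions; in interactive mode, sc already exists and this is unnecessary):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo").setMaster("local[*]")   # configure the connection
sc = SparkContext(conf=conf)                                  # connect to the (here: local) "cluster"
print(sc.master)                                              # e.g., local[*]
sc.stop()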
More About RDDs¶
- The RDD is Spark's basic unit
- "a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel" (https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#overview)
- The elements of an RDD are analogous to the <key, value> pairs of MapReduce
- An RDD is roughly analogous to a dataframe in R
- RDD elements are a bit like the rows of a table
- Spark can also keep (persist, in Spark's terminology) an RDD in memory
- This allows reuse or additional processing later (see the sketch after this list)
- RDDs are immutable, like Python's tuples and strings.
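A minimal sketch of persisting an RDD in memory, assuming an existing SparkContext sc (the data is made up):

rdd = sc.parallelize(range(1000))               # build an RDD from a local collection
squares = rdd.map(lambda x: x * x)              # a transformation (nothing computed yet)
squares.persist()                               # ask Spark to keep this RDD in memory
print(squares.count())                          # first action computes and caches the RDD
print(squares.reduce(lambda a, b: a + b))       # later actions reuse the in-memory copy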
RDD Operations¶
- Think of an RDD as representing a dataset
- Two basic kinds of operations (see the sketch after this list):
- Transformations: result in another RDD
- (e.g., map applies some function to every RDD element)
- Actions: compute a value and report it back to the driver program
- (e.g., reduce combines all of the elements into some summary value)
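A minimal sketch of the distinction, assuming an existing SparkContext sc: transformations are lazy and just build a new RDD, while actions trigger the computation and return a value to the driver:

nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: 2 * x)            # transformation: returns a new RDD, nothing runs yet
total = doubled.reduce(lambda a, b: a + b)     # action: triggers the computation
print(total)                                   # 2 + 4 + 6 + 8 = 20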
Dev Container¶
- Create a new folder: .devcontainer
- Create a new file: devcontainer.json
- The core configuration file for the Dev Containers feature. It defines how the development environment is set up and how it is run using Docker or Docker Compose.
- Create a new file: docker-compose.yaml
- Defines and manages multi-container applications. For a dev container, it is typically used to coordinate several services (e.g., a database, a cache, an application service) or to extend the base container configuration.
- Create a new file: Dockerfile
- Docker's core file, which defines a single container image. It describes how the environment is built, including the base image, dependency installation, and environment configuration.
In [1]:
!cat .devcontainer/devcontainer.json
{ "name": "PySpark local cluster", "dockerComposeFile": ["./docker-compose.yaml"], "service": "spark", "workspaceFolder": "/home/jovyan/code", "customizations": { "vscode" : { "settings": { "terminal.integrated.profiles.linux": { "bash": { "path": "/bin/bash" } }, "terminal.integrated.defaultProfile.linux": "bash", "python.linting.enabled": true, "python.linting.pylintEnabled": true }, "extensions": [ "ms-python.python", "ms-toolsai.jupyter" ] } } }
In [2]:
!cat .devcontainer/docker-compose.yaml
version: '3'
services:
  spark:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - ..:/home/jovyan/code
    ports:
      - "8888:8888"
    command: start.sh jupyter notebook --NotebookApp.token='' --NotebookApp.disable_check_xsrf=true --NotebookApp.allow_origin='*' --NotebookApp.ip='0.0.0.0'
In [3]:
!cat .devcontainer/Dockerfile
# Choose your desired base image
FROM jupyter/pyspark-notebook:latest

# name your environment and choose the python version
ARG conda_env=vscode_pyspark
ARG py_ver=3.11

# you can add additional libraries you want mamba to install by listing them below the first line and ending with "&& \"
RUN mamba create --yes -p "${CONDA_DIR}/envs/${conda_env}" python=${py_ver} ipython ipykernel && \
    mamba clean --all -f -y

# alternatively, you can comment out the lines above and uncomment those below
# if you'd prefer to use a YAML file present in the docker build context
# COPY --chown=${NB_UID}:${NB_GID} environment.yml "/home/${NB_USER}/tmp/"
# RUN cd "/home/${NB_USER}/tmp/" && \
#     mamba env create -p "${CONDA_DIR}/envs/${conda_env}" -f environment.yml && \
#     mamba clean --all -f -y

# create Python kernel and link it to jupyter
RUN "${CONDA_DIR}/envs/${conda_env}/bin/python" -m ipykernel install --user --name="${conda_env}" && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"

# any additional pip installs can be added by uncommenting the following line
RUN "${CONDA_DIR}/envs/${conda_env}/bin/pip" install pyspark pandas --no-cache-dir

# if you want this environment to be the default one, uncomment the following line:
RUN echo "conda activate ${conda_env}" >> "${HOME}/.bashrc"
Dev Container¶
- PyCharm: https://www.jetbrains.com/help/pycharm/start-dev-container-inside-ide.html
In [49]:
!pyspark
Python 3.11.11 | packaged by conda-forge | (main, Dec 5 2024, 14:17:24) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/09 15:18:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/09 15:18:04 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.11.11 (main, Dec 5 2024 14:17:24)
Spark context Web UI available at http://703d76e82caa:4041
Spark context available as 'sc' (master = local[*], app id = local-1733757485024).
SparkSession available as 'spark'.
>>> Traceback (most recent call last):
  File "<stdin>", line 0, in <module>
  File "/usr/local/spark/python/pyspark/context.py", line 382, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
>>>
In [6]:
from pyspark.sql import SparkSession
# Create a SparkSession object. The SparkSession is the entry point for interacting with Spark.
spark = SparkSession.builder.master("local").getOrCreate()
# The SparkContext is the core of Spark: it connects to the Spark cluster, communicates with the cluster manager, and can be used to run distributed tasks.
sc = spark.sparkContext
sc
Out[6]:
In [7]:
# Create an RDD from the given file.
# PySpark assumes we are referring to a file on HDFS.
data = sc.textFile('demo_file.txt')
print(type(data))
# collect() gathers the RDD's elements into a (Python) list.
data.collect()
<class 'pyspark.rdd.RDD'>
Out[7]:
['This is just a demo file. ', 'Normally, a file this small would have no reaon to be on HDFS.']
In [8]:
# PySpark keeps track of where the original data lives.
# The repr shows the RDD's type (MapPartitionsRDD), its id, and the file/operation that created it.
data
Out[8]:
demo_file.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0
In [9]:
# Simple MapReduce task: Summations
data = sc.textFile('number.txt')
data.collect()
Out[9]:
['10', '23', '16', '7', '12', '0', '1', '1', '2', '3', '5', '8', '-1', '42', '64', '101', '-101', '3']
In [10]:
# A strip() call here would be redundant; this is just shown as an example.
intdata = data.map(lambda n: int(n))
intdata.collect()
Out[10]:
[10, 23, 16, 7, 12, 0, 1, 1, 2, 3, 5, 8, -1, 42, 64, 101, -101, 3]
In [12]:
intdata.reduce(lambda x,y: x+y)
Out[12]:
196
Example RDD Transformations¶
- map: apply a function to every element of the RDD
- filter: keep only the elements that satisfy a condition
- flatMap: apply a map, but "flatten" the structure (details in the next few slides)
- sample: take a random sample of the RDD's elements
- distinct: remove duplicate entries from the RDD
- reduceByKey: on an RDD of (K, V) pairs, return an RDD of (K, V) pairs
- in which the values for each key are aggregated using the given reduce function (see the sketch after this list).
- More: https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#transformations
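Most of these are demonstrated interactively below; reduceByKey is not, so here is a minimal sketch, assuming an existing SparkContext sc:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])
summed = pairs.reduceByKey(lambda x, y: x + y)    # aggregate the values that share a key
print(summed.collect())                           # [('a', 3), ('b', 4)] (order may vary)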
In [ ]:
# RDD.map()
def polynomialk(x):
    return x**2 + 1
data = sc.textFile('number.txt')
data.collect()
doubles = data.map(lambda n: int(n)).map(lambda n: 2*n)
doubles.collect()
data.map(lambda n: int(n)).map(polynomialk).collect()
Out[ ]:
[101, 530, 257, 50, 145, 1, 2, 2, 5, 10, 26, 65, 2, 1765, 4097, 10202, 10202, 10]
In [14]:
## RDD.filter()
data = sc.textFile('number.txt').map(lambda n: int(n))
evens = data.filter(lambda n: n%2==0)
print(evens.collect())
odds = data.filter(lambda n: n%2!=0)
print(odds.collect())
sc.addPyFile('prime.py')
from prime import is_prime
# filter() takes a boolean-valued function as an argument and keeps only the elements that evaluate to True.
primes = data.filter(is_prime)
print(primes.collect())
[10, 16, 12, 0, 2, 8, 42, 64]
[23, 7, 1, 1, 3, 5, -1, 101, -101, 3]
[23, 7, 2, 3, 5, 101, 3]
In [15]:
# RDD.sample()
# sample(withReplacement, fraction, [seed])
# RDD.sample() is mainly useful for testing on a small fraction of the data.
data = sc.textFile('number.txt').map(lambda n: int(n))
samp = data.sample(False, 0.5)
print(samp.collect())
samp = data.sample(True, 0.5)
print(samp.collect())
[7, 12, 2, 3, 5, 8, -101, 3]
[10, 23, 0, 0, 2, 3, 3, 5, -1, 42, 64, 3]
In [ ]:
# What if my RDD elements are more complicated than numbers?
## A database-like file
data = sc.textFile('scientists.txt')
data.collect()
Out[ ]:
['Claude Shannon 3.1 EE 1916', 'Eugene Wigner 3.2 Physics 1902', 'Albert Einstein 4.0 Physics 1879', 'Ronald Fisher 3.25 Statistics 1890', 'Max Planck 2.9 Physics 1858', 'Leonard Euler 3.9 Mathematics 1707', 'Jerzy Neyman 3.5 Statistics 1894', 'Ky Fan 3.55 Mathematics 1914']
In [ ]:
# When first read in, each line is a single element of the RDD.
# After splitting each element on whitespace, we get what we want: each element is a list of strings (the fields of that line).
# Note: RDD.collect() returns a Python list of these elements.
data = data.map(lambda line: line.split())
data.collect()
Out[ ]:
[['Claude', 'Shannon', '3.1', 'EE', '1916'], ['Eugene', 'Wigner', '3.2', 'Physics', '1902'], ['Albert', 'Einstein', '4.0', 'Physics', '1879'], ['Ronald', 'Fisher', '3.25', 'Statistics', '1890'], ['Max', 'Planck', '2.9', 'Physics', '1858'], ['Leonard', 'Euler', '3.9', 'Mathematics', '1707'], ['Jerzy', 'Neyman', '3.5', 'Statistics', '1894'], ['Ky', 'Fan', '3.55', 'Mathematics', '1914']]
In [33]:
## RDD.distinct()
data = sc.textFile('scientists.txt')
data = data.map(lambda line: line.split())
fields = data.map(lambda t: t[3]).distinct()
fields.collect()
Out[33]:
['EE', 'Physics', 'Statistics', 'Mathematics']
In [34]:
## RDD.flatMap()
data = sc.textFile('numbers_weird.txt')
data.collect()
# The same list of numbers, but now they are not one per line...
# From the PySpark docs: flatMap(func) is similar to map, but each input item can be mapped to 0 or more output items (so func should return a sequence rather than a single item). https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
Out[34]:
['10 23 16', '7 12', '0', '1 1 2 3 5 8', '-1 42', '64 101 -101', '3']
In [ ]:
## RDD.flatMap()
# So we can think of flatMap() as producing a list for each RDD element and then concatenating those lists. Crucially, though, the output is another RDD, not a list. This operation is called flattening, and it is a common pattern in functional programming.
flattened = data.flatMap(lambda line: [x for x in line.split()])
flattened.collect()
flattened.map(lambda n: int(n)).reduce(lambda x,y: x+y)
Out[ ]:
196
Example RDD Actions¶
- reduce: aggregate the elements of the RDD using a function
- collect: return all of the RDD's elements as an array at the driver program.
- count: return the number of elements in the RDD.
- countByKey: return <key, int> pairs counting the number of occurrences of each key.
- Only available for RDDs whose elements are of the form <key, value>.
- More: https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#actions
In [38]:
# RDD.count()
data = sc.textFile('demo_file.txt')
data = data.flatMap(lambda line: line.split())
data = data.map(lambda w: w.lower())
data.collect()
Out[38]:
['this', 'is', 'just', 'a', 'demo', 'file.', 'normally,', 'a', 'file', 'this', 'small', 'would', 'have', 'no', 'reaon', 'to', 'be', 'on', 'hdfs.']
In [39]:
uniqwords = data.distinct()
uniqwords.count()
Out[39]:
17
In [46]:
# RDD.countByKey()
# Note: in this example every word is paired with the value 0, yet in the dictionary produced by countByKey the value associated with each key is how many times that key appeared. That is because countByKey counts the occurrences of each key and ignores the values.
data = sc.textFile('demo_file.txt')
data = data.flatMap(lambda line: line.split())
data = data.map(lambda w: (w.lower(), 0))
data.countByKey()
Out[46]:
defaultdict(int, {'this': 2, 'is': 1, 'just': 1, 'a': 2, 'demo': 1, 'file.': 1, 'normally,': 1, 'file': 1, 'small': 1, 'would': 1, 'have': 1, 'no': 1, 'reaon': 1, 'to': 1, 'be': 1, 'on': 1, 'hdfs.': 1})
Running PySpark on a Cluster¶
- So far we have only run things in interactive mode.
- Problem:
- Interactive mode is fine for prototyping and testing...
- ...but not for running large jobs.
- Solution: PySpark can also be submitted to and run on a grid.
Submitting to the Queue: spark-submit¶
In [50]:
!cat ps_wordcount.py
from pyspark import SparkConf, SparkContext
import sys

if len(sys.argv) != 3:
    print("Usage: " + sys.argv[0] + " <in> <out>")
    sys.exit(1)

inputlocation = sys.argv[1]
outputlocation = sys.argv[2]

conf = SparkConf().setAppName("WordCount")
sc = SparkContext(conf=conf)

data = sc.textFile(inputlocation)
data = data.flatMap(lambda line: line.split())
data = data.map(lambda w: (w.lower(), 1))
data = data.reduceByKey(lambda a, b: a + b)
data.saveAsTextFile(outputlocation)

sc.stop()
In [53]:
!pwd
!spark-submit ps_wordcount.py demo_file.txt wc_demo
/home/jovyan/code 24/12/09 15:45:51 INFO SparkContext: Running Spark version 3.5.0 24/12/09 15:45:51 INFO SparkContext: OS info Linux, 5.15.153.1-microsoft-standard-WSL2, amd64 24/12/09 15:45:51 INFO SparkContext: Java version 17.0.8.1 24/12/09 15:45:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 24/12/09 15:45:52 INFO ResourceUtils: ============================================================== 24/12/09 15:45:52 INFO ResourceUtils: No custom resources configured for spark.driver. 24/12/09 15:45:52 INFO ResourceUtils: ============================================================== 24/12/09 15:45:52 INFO SparkContext: Submitted application: WordCount 24/12/09 15:45:52 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 24/12/09 15:45:52 INFO ResourceProfile: Limiting resource is cpu 24/12/09 15:45:52 INFO ResourceProfileManager: Added ResourceProfile id: 0 24/12/09 15:45:52 INFO SecurityManager: Changing view acls to: jovyan 24/12/09 15:45:52 INFO SecurityManager: Changing modify acls to: jovyan 24/12/09 15:45:52 INFO SecurityManager: Changing view acls groups to: 24/12/09 15:45:52 INFO SecurityManager: Changing modify acls groups to: 24/12/09 15:45:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: jovyan; groups with view permissions: EMPTY; users with modify permissions: jovyan; groups with modify permissions: EMPTY 24/12/09 15:45:52 INFO Utils: Successfully started service 'sparkDriver' on port 38469. 24/12/09 15:45:52 INFO SparkEnv: Registering MapOutputTracker 24/12/09 15:45:52 INFO SparkEnv: Registering BlockManagerMaster 24/12/09 15:45:52 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 24/12/09 15:45:52 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 24/12/09 15:45:52 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 24/12/09 15:45:52 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b755f455-0030-4702-b2c0-5382485271ed 24/12/09 15:45:52 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB 24/12/09 15:45:52 INFO SparkEnv: Registering OutputCommitCoordinator 24/12/09 15:45:52 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI 24/12/09 15:45:52 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 24/12/09 15:45:52 INFO Utils: Successfully started service 'SparkUI' on port 4041. 24/12/09 15:45:52 INFO Executor: Starting executor ID driver on host 703d76e82caa 24/12/09 15:45:52 INFO Executor: OS info Linux, 5.15.153.1-microsoft-standard-WSL2, amd64 24/12/09 15:45:52 INFO Executor: Java version 17.0.8.1 24/12/09 15:45:52 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): '' 24/12/09 15:45:52 INFO Executor: Created or updated repl class loader org.apache.spark.util.MutableURLClassLoader@4c1710d9 for default. 24/12/09 15:45:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36169. 
24/12/09 15:45:52 INFO NettyBlockTransferService: Server created on 703d76e82caa:36169 24/12/09 15:45:52 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 24/12/09 15:45:52 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 703d76e82caa, 36169, None) 24/12/09 15:45:52 INFO BlockManagerMasterEndpoint: Registering block manager 703d76e82caa:36169 with 434.4 MiB RAM, BlockManagerId(driver, 703d76e82caa, 36169, None) 24/12/09 15:45:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 703d76e82caa, 36169, None) 24/12/09 15:45:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 703d76e82caa, 36169, None) 24/12/09 15:45:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 221.5 KiB, free 434.2 MiB) 24/12/09 15:45:53 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 32.6 KiB, free 434.2 MiB) 24/12/09 15:45:53 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 703d76e82caa:36169 (size: 32.6 KiB, free: 434.4 MiB) 24/12/09 15:45:53 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:0 24/12/09 15:45:53 INFO FileInputFormat: Total input files to process : 1 Traceback (most recent call last): File "/home/jovyan/code/ps_wordcount.py", line 17, in <module> data.saveAsTextFile(outputlocation) File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 3425, in saveAsTextFile File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__ File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o46.saveAsTextFile. 
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/home/jovyan/code/wc_demo already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131) at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:299) at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1091) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:407) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1089) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1062) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:407) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1027) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1009) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:407) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1008) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:965) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:407) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:963) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1620) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:407) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1620) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1606) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:407) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1606) at org.apache.spark.api.java.JavaRDDLike.saveAsTextFile(JavaRDDLike.scala:564) at org.apache.spark.api.java.JavaRDDLike.saveAsTextFile$(JavaRDDLike.scala:563) at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Thread.java:833) 24/12/09 15:45:53 INFO SparkContext: Invoking stop() from shutdown hook 24/12/09 15:45:53 INFO SparkContext: SparkContext is stopping with exitCode 0. 24/12/09 15:45:53 INFO SparkUI: Stopped Spark web UI at http://703d76e82caa:4041 24/12/09 15:45:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 24/12/09 15:45:53 INFO MemoryStore: MemoryStore cleared 24/12/09 15:45:53 INFO BlockManager: BlockManager stopped 24/12/09 15:45:53 INFO BlockManagerMaster: BlockManagerMaster stopped 24/12/09 15:45:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 24/12/09 15:45:53 INFO SparkContext: Successfully stopped SparkContext 24/12/09 15:45:53 INFO ShutdownHookManager: Shutdown hook called 24/12/09 15:45:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-15e616e8-be48-4da9-adc6-6b19851903dd 24/12/09 15:45:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-8f8834e1-c012-40d9-b350-ddf928ff7107 24/12/09 15:45:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-8f8834e1-c012-40d9-b350-ddf928ff7107/pyspark-af91f98f-1d6e-4d46-97fa-befe34e8201d
In [54]:
!cat wc_demo/*
('this', 2)
('just', 1)
('demo', 1)
('file.', 1)
('normally,', 1)
('file', 1)
('small', 1)
('would', 1)
('have', 1)
('no', 1)
('to', 1)
('is', 1)
('a', 2)
('reaon', 1)
('be', 1)
('on', 1)
('hdfs.', 1)