Introduction to the Spark NLP Models Hub¶
- The Spark NLP Models Hub (https://sparknlp.org/models) is a central repository of pretrained NLP models that can be loaded and used directly in Spark NLP.
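- Features: covers tokenization, word embeddings, named entity recognition (NER), dependency/syntactic parsing, sentiment analysis, text classification, document explanation, and more; models come in multiple languages and in variants that trade accuracy against speed.
- Tokenization: splits text into words or tokens for downstream processing.
- Word embeddings: map words to vector representations for use in machine learning and deep learning models.
- Named entity recognition (NER): identifies entities in text such as proper names, locations, and organizations.
- Dependency/syntactic parsing: analyzes sentence structure and identifies the grammatical relations between words.
- Sentiment analysis: determines the polarity of a text, for example positive or negative.
- Text classification: assigns text to topics or categories.
- Document explanation: automatically produces structured annotations for a text to help in understanding its content.
- Use cases: quickly building production-grade NLP pipelines, comparing pretrained models, and running parallel inference over large datasets.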
- https://sparknlp.org/docs/en/quickstart
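Installing Spark NLP¶
Spark NLP supports Python 3.7.x and above, depending on the major version of PySpark you use.
Note: Python 3.6 has been deprecated since Spark 3.2. If you are still on Python 3.6, consider staying on an older Spark version.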
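GPU (optional):
Spark NLP 6.2.0 ships with the ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The following minimum NVIDIA® software versions are required only if you need GPU support:
- NVIDIA® GPU driver version 450.80.02 or higher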
- CUDA® Toolkit 11.2
- cuDNN SDK 8.1.0
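The "450.80.02 or higher" requirement is a dotted-version comparison, which is easy to get wrong with plain string comparison. The following is a minimal sketch of a numeric check; parse_version and driver_is_supported are hypothetical helper names, and the installed version string is assumed to be supplied by the caller (how you obtain it is outside this example).

```python
from typing import Tuple

# Minimum driver version taken from the requirement list above.
MIN_DRIVER = "450.80.02"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '450.80.02' into (450, 80, 2)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_is_supported(installed: str, minimum: str = MIN_DRIVER) -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


print(driver_is_supported("535.104.05"))  # newer driver
print(driver_is_supported("450.51.06"))   # older than the minimum
```

Comparing tuples of integers avoids the trap where "1000.1" sorts before "450.80.02" as a string.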
java -version
# should be Java 8 (Oracle or OpenJDK)
conda create -n sparknlp python=3.8 -y
conda activate sparknlp
pip install spark-nlp==6.2.0 pyspark==3.3.1
pip install johnsnowlabs
pip install jupyter
Starting a Spark NLP session from Python¶
Call sparknlp.start() to create (or retrieve) a Spark session configured for Spark NLP:
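import sparknlp
# Create (or retrieve) the Spark session configured for Spark NLP
spark = sparknlp.start()
# For a GPU-enabled session, use this instead:
# spark = sparknlp.start(gpu=True)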
25/11/13 11:17:24 WARN Utils: Your hostname, legion resolves to a loopback address: 127.0.1.1; using 192.168.1.2 instead (on interface enp3s0) 25/11/13 11:17:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/legion/miniconda3/envs/sparknlp/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/legion/.ivy2/cache The jars for the packages stored in: /home/legion/.ivy2/jars com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-5aace11f-4e53-4048-9842-108e0f27911c;1.0 confs: [default] found com.johnsnowlabs.nlp#spark-nlp_2.12;6.1.3 in central found com.typesafe#config;1.4.2 in central found org.rocksdb#rocksdbjni;6.29.5 in central found com.amazonaws#aws-java-sdk-s3;1.12.500 in central found com.amazonaws#aws-java-sdk-kms;1.12.500 in central found com.amazonaws#aws-java-sdk-core;1.12.500 in central found commons-logging#commons-logging;1.1.3 in central found commons-codec#commons-codec;1.15 in central found org.apache.httpcomponents#httpclient;4.5.13 in central found org.apache.httpcomponents#httpcore;4.4.13 in central found software.amazon.ion#ion-java;1.0.2 in central found joda-time#joda-time;2.8.1 in central found com.amazonaws#jmespath-java;1.12.500 in central found com.github.universal-automata#liblevenshtein;3.0.0 in central found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central found com.google.code.gson#gson;2.3 in central found it.unimi.dsi#fastutil;7.0.12 in central found org.projectlombok#lombok;1.16.8 in central found com.google.cloud#google-cloud-storage;2.20.1 in central found com.google.guava#guava;31.1-jre in central found com.google.guava#failureaccess;1.0.1 in central found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in central found com.google.errorprone#error_prone_annotations;2.18.0 in central found com.google.j2objc#j2objc-annotations;1.3 in central found com.google.http-client#google-http-client;1.43.0 in central found io.opencensus#opencensus-contrib-http-util;0.31.1 in central found com.google.http-client#google-http-client-jackson2;1.43.0 in central found com.google.http-client#google-http-client-gson;1.43.0 in central 
found com.google.api-client#google-api-client;2.2.0 in central found com.google.oauth-client#google-oauth-client;1.34.1 in central found com.google.http-client#google-http-client-apache-v2;1.43.0 in central found com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 in central found com.google.code.gson#gson;2.10.1 in central found com.google.cloud#google-cloud-core;2.12.0 in central found io.grpc#grpc-context;1.53.0 in central found com.google.auto.value#auto-value-annotations;1.10.1 in central found com.google.auto.value#auto-value;1.10.1 in central found javax.annotation#javax.annotation-api;1.3.2 in central found com.google.cloud#google-cloud-core-http;2.12.0 in central found com.google.http-client#google-http-client-appengine;1.43.0 in central found com.google.api#gax-httpjson;0.108.2 in central found com.google.cloud#google-cloud-core-grpc;2.12.0 in central found io.grpc#grpc-alts;1.53.0 in central found io.grpc#grpc-grpclb;1.53.0 in central found org.conscrypt#conscrypt-openjdk-uber;2.5.2 in central found io.grpc#grpc-auth;1.53.0 in central found io.grpc#grpc-protobuf;1.53.0 in central found io.grpc#grpc-protobuf-lite;1.53.0 in central found io.grpc#grpc-core;1.53.0 in central found com.google.api#gax;2.23.2 in central found com.google.api#gax-grpc;2.23.2 in central found com.google.auth#google-auth-library-credentials;1.16.0 in central found com.google.auth#google-auth-library-oauth2-http;1.16.0 in central found com.google.api#api-common;2.6.2 in central found io.opencensus#opencensus-api;0.31.1 in central found com.google.api.grpc#proto-google-iam-v1;1.9.2 in central found com.google.protobuf#protobuf-java;3.21.12 in central found com.google.protobuf#protobuf-java-util;3.21.12 in central found com.google.api.grpc#proto-google-common-protos;2.14.2 in central found org.threeten#threetenbp;1.6.5 in central found com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha in central found 
com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha in central found com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha in central found com.google.code.findbugs#jsr305;3.0.2 in central found io.grpc#grpc-api;1.53.0 in central found io.grpc#grpc-stub;1.53.0 in central found org.checkerframework#checker-qual;3.31.0 in central found io.perfmark#perfmark-api;0.26.0 in central found com.google.android#annotations;4.1.1.4 in central found org.codehaus.mojo#animal-sniffer-annotations;1.22 in central found io.opencensus#opencensus-proto;0.2.0 in central found io.grpc#grpc-services;1.53.0 in central found com.google.re2j#re2j;1.6 in central found io.grpc#grpc-netty-shaded;1.53.0 in central found io.grpc#grpc-googleapis;1.53.0 in central found io.grpc#grpc-xds;1.53.0 in central found com.navigamez#greex;1.0 in central found dk.brics.automaton#automaton;1.11-8 in central found org.jsoup#jsoup;1.18.2 in central found jakarta.mail#jakarta.mail-api;2.1.3 in central found jakarta.activation#jakarta.activation-api;2.1.3 in central found org.eclipse.angus#angus-mail;2.0.3 in central found org.eclipse.angus#angus-activation;2.0.2 in central found org.apache.poi#poi-ooxml;4.1.2 in central found org.apache.poi#poi;4.1.2 in central found org.apache.commons#commons-collections4;4.4 in central found org.apache.commons#commons-math3;3.6.1 in central found com.zaxxer#SparseBitSet;1.2 in central found org.apache.poi#poi-ooxml-schemas;4.1.2 in central found org.apache.xmlbeans#xmlbeans;3.1.0 in central found org.apache.commons#commons-compress;1.19 in central found com.github.virtuald#curvesapi;1.06 in central found org.apache.poi#poi-scratchpad;4.1.2 in central found org.apache.pdfbox#pdfbox;2.0.28 in central found org.apache.pdfbox#fontbox;2.0.28 in central found com.vladsch.flexmark#flexmark-all;0.61.34 in central found com.vladsch.flexmark#flexmark;0.61.34 in central found com.vladsch.flexmark#flexmark-util-ast;0.61.34 in central found 
com.vladsch.flexmark#flexmark-util-collection;0.61.34 in central found com.vladsch.flexmark#flexmark-util-misc;0.61.34 in central found org.jetbrains#annotations;15.0 in central found com.vladsch.flexmark#flexmark-util-data;0.61.34 in central found com.vladsch.flexmark#flexmark-util-sequence;0.61.34 in central found com.vladsch.flexmark#flexmark-util-visitor;0.61.34 in central found com.vladsch.flexmark#flexmark-util-builder;0.61.34 in central found com.vladsch.flexmark#flexmark-util-dependency;0.61.34 in central found com.vladsch.flexmark#flexmark-util-format;0.61.34 in central found com.vladsch.flexmark#flexmark-util-html;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-abbreviation;0.61.34 in central found com.vladsch.flexmark#flexmark-util;0.61.34 in central found com.vladsch.flexmark#flexmark-util-options;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-autolink;0.61.34 in central found org.nibor.autolink#autolink;0.6.0 in central found com.vladsch.flexmark#flexmark-ext-admonition;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-anchorlink;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-aside;0.61.34 in central found com.vladsch.flexmark#flexmark-jira-converter;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-gfm-strikethrough;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-tables;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-wikilink;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-ins;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-superscript;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-attributes;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-definition;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-emoji;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-enumerated-reference;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-escaped-character;0.61.34 in central found 
com.vladsch.flexmark#flexmark-ext-footnotes;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-gfm-issues;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-gfm-tasklist;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-gfm-users;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-gitlab;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-jekyll-front-matter;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-yaml-front-matter;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-jekyll-tag;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-media-tags;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-macros;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-xwiki-macros;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-toc;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-typographic;0.61.34 in central found com.vladsch.flexmark#flexmark-ext-youtube-embedded;0.61.34 in central found com.vladsch.flexmark#flexmark-html2md-converter;0.61.34 in central found com.vladsch.flexmark#flexmark-pdf-converter;0.61.34 in central found com.openhtmltopdf#openhtmltopdf-core;1.0.0 in central found com.openhtmltopdf#openhtmltopdf-pdfbox;1.0.0 in central found org.apache.pdfbox#xmpbox;2.0.16 in central found de.rototor.pdfbox#graphics2d;0.24 in central found com.openhtmltopdf#openhtmltopdf-rtl-support;1.0.0 in central found com.ibm.icu#icu4j;59.1 in central found com.openhtmltopdf#openhtmltopdf-jsoup-dom-converter;1.0.0 in central found com.vladsch.flexmark#flexmark-profile-pegdown;0.61.34 in central found com.vladsch.flexmark#flexmark-youtrack-converter;0.61.34 in central found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 in central found com.microsoft.onnxruntime#onnxruntime;1.19.2 in central found com.johnsnowlabs.nlp#jsl-llamacpp-cpu;1.0.2 in central found org.jetbrains#annotations;24.1.0 in central found com.johnsnowlabs.nlp#jsl-openvino-cpu_2.12;0.2.0 in central :: resolution report 
:: resolve 824ms :: artifacts dl 22ms :: modules in use: com.amazonaws#aws-java-sdk-core;1.12.500 from central in [default] com.amazonaws#aws-java-sdk-kms;1.12.500 from central in [default] com.amazonaws#aws-java-sdk-s3;1.12.500 from central in [default] com.amazonaws#jmespath-java;1.12.500 from central in [default] com.github.universal-automata#liblevenshtein;3.0.0 from central in [default] com.github.virtuald#curvesapi;1.06 from central in [default] com.google.android#annotations;4.1.1.4 from central in [default] com.google.api#api-common;2.6.2 from central in [default] com.google.api#gax;2.23.2 from central in [default] com.google.api#gax-grpc;2.23.2 from central in [default] com.google.api#gax-httpjson;0.108.2 from central in [default] com.google.api-client#google-api-client;2.2.0 from central in [default] com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha from central in [default] com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha from central in [default] com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha from central in [default] com.google.api.grpc#proto-google-common-protos;2.14.2 from central in [default] com.google.api.grpc#proto-google-iam-v1;1.9.2 from central in [default] com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 from central in [default] com.google.auth#google-auth-library-credentials;1.16.0 from central in [default] com.google.auth#google-auth-library-oauth2-http;1.16.0 from central in [default] com.google.auto.value#auto-value;1.10.1 from central in [default] com.google.auto.value#auto-value-annotations;1.10.1 from central in [default] com.google.cloud#google-cloud-core;2.12.0 from central in [default] com.google.cloud#google-cloud-core-grpc;2.12.0 from central in [default] com.google.cloud#google-cloud-core-http;2.12.0 from central in [default] com.google.cloud#google-cloud-storage;2.20.1 from central in [default] com.google.code.findbugs#jsr305;3.0.2 from central in [default] 
com.google.code.gson#gson;2.10.1 from central in [default] com.google.errorprone#error_prone_annotations;2.18.0 from central in [default] com.google.guava#failureaccess;1.0.1 from central in [default] com.google.guava#guava;31.1-jre from central in [default] com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from central in [default] com.google.http-client#google-http-client;1.43.0 from central in [default] com.google.http-client#google-http-client-apache-v2;1.43.0 from central in [default] com.google.http-client#google-http-client-appengine;1.43.0 from central in [default] com.google.http-client#google-http-client-gson;1.43.0 from central in [default] com.google.http-client#google-http-client-jackson2;1.43.0 from central in [default] com.google.j2objc#j2objc-annotations;1.3 from central in [default] com.google.oauth-client#google-oauth-client;1.34.1 from central in [default] com.google.protobuf#protobuf-java;3.21.12 from central in [default] com.google.protobuf#protobuf-java-util;3.21.12 from central in [default] com.google.re2j#re2j;1.6 from central in [default] com.ibm.icu#icu4j;59.1 from central in [default] com.johnsnowlabs.nlp#jsl-llamacpp-cpu;1.0.2 from central in [default] com.johnsnowlabs.nlp#jsl-openvino-cpu_2.12;0.2.0 from central in [default] com.johnsnowlabs.nlp#spark-nlp_2.12;6.1.3 from central in [default] com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 from central in [default] com.microsoft.onnxruntime#onnxruntime;1.19.2 from central in [default] com.navigamez#greex;1.0 from central in [default] com.openhtmltopdf#openhtmltopdf-core;1.0.0 from central in [default] com.openhtmltopdf#openhtmltopdf-jsoup-dom-converter;1.0.0 from central in [default] com.openhtmltopdf#openhtmltopdf-pdfbox;1.0.0 from central in [default] com.openhtmltopdf#openhtmltopdf-rtl-support;1.0.0 from central in [default] com.typesafe#config;1.4.2 from central in [default] com.vladsch.flexmark#flexmark;0.61.34 from central in [default] 
com.vladsch.flexmark#flexmark-all;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-abbreviation;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-admonition;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-anchorlink;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-aside;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-attributes;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-autolink;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-definition;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-emoji;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-enumerated-reference;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-escaped-character;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-footnotes;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gfm-issues;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gfm-strikethrough;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gfm-tasklist;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gfm-users;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gitlab;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-ins;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-jekyll-front-matter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-jekyll-tag;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-macros;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-media-tags;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-superscript;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-tables;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-toc;0.61.34 from central in [default] 
com.vladsch.flexmark#flexmark-ext-typographic;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-wikilink;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-xwiki-macros;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-yaml-front-matter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-youtube-embedded;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-html2md-converter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-jira-converter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-pdf-converter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-profile-pegdown;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-ast;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-builder;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-collection;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-data;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-dependency;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-format;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-html;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-misc;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-options;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-sequence;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-visitor;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-youtrack-converter;0.61.34 from central in [default] com.zaxxer#SparseBitSet;1.2 from central in [default] commons-codec#commons-codec;1.15 from central in [default] commons-logging#commons-logging;1.1.3 from central in [default] de.rototor.pdfbox#graphics2d;0.24 from central in [default] dk.brics.automaton#automaton;1.11-8 from 
central in [default] io.grpc#grpc-alts;1.53.0 from central in [default] io.grpc#grpc-api;1.53.0 from central in [default] io.grpc#grpc-auth;1.53.0 from central in [default] io.grpc#grpc-context;1.53.0 from central in [default] io.grpc#grpc-core;1.53.0 from central in [default] io.grpc#grpc-googleapis;1.53.0 from central in [default] io.grpc#grpc-grpclb;1.53.0 from central in [default] io.grpc#grpc-netty-shaded;1.53.0 from central in [default] io.grpc#grpc-protobuf;1.53.0 from central in [default] io.grpc#grpc-protobuf-lite;1.53.0 from central in [default] io.grpc#grpc-services;1.53.0 from central in [default] io.grpc#grpc-stub;1.53.0 from central in [default] io.grpc#grpc-xds;1.53.0 from central in [default] io.opencensus#opencensus-api;0.31.1 from central in [default] io.opencensus#opencensus-contrib-http-util;0.31.1 from central in [default] io.opencensus#opencensus-proto;0.2.0 from central in [default] io.perfmark#perfmark-api;0.26.0 from central in [default] it.unimi.dsi#fastutil;7.0.12 from central in [default] jakarta.activation#jakarta.activation-api;2.1.3 from central in [default] jakarta.mail#jakarta.mail-api;2.1.3 from central in [default] javax.annotation#javax.annotation-api;1.3.2 from central in [default] joda-time#joda-time;2.8.1 from central in [default] org.apache.commons#commons-collections4;4.4 from central in [default] org.apache.commons#commons-compress;1.19 from central in [default] org.apache.commons#commons-math3;3.6.1 from central in [default] org.apache.httpcomponents#httpclient;4.5.13 from central in [default] org.apache.httpcomponents#httpcore;4.4.13 from central in [default] org.apache.pdfbox#fontbox;2.0.28 from central in [default] org.apache.pdfbox#pdfbox;2.0.28 from central in [default] org.apache.pdfbox#xmpbox;2.0.16 from central in [default] org.apache.poi#poi;4.1.2 from central in [default] org.apache.poi#poi-ooxml;4.1.2 from central in [default] org.apache.poi#poi-ooxml-schemas;4.1.2 from central in [default] 
org.apache.poi#poi-scratchpad;4.1.2 from central in [default] org.apache.xmlbeans#xmlbeans;3.1.0 from central in [default] org.checkerframework#checker-qual;3.31.0 from central in [default] org.codehaus.mojo#animal-sniffer-annotations;1.22 from central in [default] org.conscrypt#conscrypt-openjdk-uber;2.5.2 from central in [default] org.eclipse.angus#angus-activation;2.0.2 from central in [default] org.eclipse.angus#angus-mail;2.0.3 from central in [default] org.jetbrains#annotations;24.1.0 from central in [default] org.jsoup#jsoup;1.18.2 from central in [default] org.nibor.autolink#autolink;0.6.0 from central in [default] org.projectlombok#lombok;1.16.8 from central in [default] org.rocksdb#rocksdbjni;6.29.5 from central in [default] org.threeten#threetenbp;1.6.5 from central in [default] software.amazon.ion#ion-java;1.0.2 from central in [default] :: evicted modules: commons-logging#commons-logging;1.2 by [commons-logging#commons-logging;1.1.3] in [default] commons-codec#commons-codec;1.11 by [commons-codec#commons-codec;1.15] in [default] com.google.protobuf#protobuf-java-util;3.0.0-beta-3 by [com.google.protobuf#protobuf-java-util;3.21.12] in [default] com.google.protobuf#protobuf-java;3.0.0-beta-3 by [com.google.protobuf#protobuf-java;3.21.12] in [default] com.google.code.gson#gson;2.3 by [com.google.code.gson#gson;2.10.1] in [default] commons-codec#commons-codec;1.13 by [commons-codec#commons-codec;1.15] in [default] org.jetbrains#annotations;15.0 by [org.jetbrains#annotations;24.1.0] in [default] org.jsoup#jsoup;1.11.3 by [org.jsoup#jsoup;1.18.2] in [default] org.apache.pdfbox#pdfbox;2.0.16 by [org.apache.pdfbox#pdfbox;2.0.28] in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 163 | 0 | 0 | 9 || 154 | 0 | 
--------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent-5aace11f-4e53-4048-9842-108e0f27911c confs: [default] 0 artifacts copied, 154 already retrieved (0kB/11ms) 25/11/13 11:17:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 25/11/13 11:17:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
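Concepts¶
Spark ML provides a set of machine learning applications built around two main components: Estimators and Transformers. An Estimator trains on data through its fit() method and produces a model; a Transformer is usually the result of that training and is used to transform a target dataset. Both components are built into Spark NLP and can be used directly for natural language processing tasks.
A Pipeline is a mechanism for combining multiple Estimators and Transformers into a single workflow, so that a machine learning task can chain several transformation steps.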
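To make the Estimator/Transformer/Pipeline relationship concrete, here is a minimal plain-Python sketch. It is not Spark ML code: the class names (Lowercaser, VocabularyEstimator, VocabularyModel, ToyPipeline, ToyPipelineModel) and the list-of-dicts stand-in for a DataFrame are made up for illustration, and only the fit()/transform() contract mirrors what the text describes.

```python
from typing import Dict, List

Row = Dict[str, object]


class Lowercaser:
    """Transformer: needs no training, only transform()."""

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**row, "lower": str(row["text"]).lower()} for row in rows]


class VocabularyModel:
    """Transformer produced by fitting VocabularyEstimator."""

    def __init__(self, vocabulary: List[str]) -> None:
        self.vocabulary = vocabulary

    def transform(self, rows: List[Row]) -> List[Row]:
        known = set(self.vocabulary)
        return [
            {**row, "known_words": [w for w in str(row["lower"]).split() if w in known]}
            for row in rows
        ]


class VocabularyEstimator:
    """Estimator: fit() learns from data and returns a Transformer."""

    def fit(self, rows: List[Row]) -> VocabularyModel:
        words = {w for row in rows for w in str(row["lower"]).split()}
        return VocabularyModel(sorted(words))


class ToyPipelineModel:
    """Fitted pipeline: every stage is a Transformer."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Chains stages; fit() replaces each Estimator with its fitted model."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: List[Row]) -> ToyPipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # Estimator -> Transformer
            rows = stage.transform(rows)  # feed the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [{"text": "Spark NLP runs on Spark"}]
model = ToyPipeline([Lowercaser(), VocabularyEstimator()]).fit(train)
result = model.transform([{"text": "Spark is fast"}])
print(result[0]["known_words"])
```

Fitting returns a separate model object, and that object alone is enough to transform new data.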
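Annotations¶
The basic result of a Spark NLP operation is an annotation. Its structure consists of:
- annotatorType: the type of annotator that produced this annotation
- begin: the start position of the matched content in the original text
- end: the end position of the matched content in the original text
- result: the main output of the annotation
- metadata: content of the matched result plus additional information
- embeddings: (new in 2.0) holds vector information when a vector mapping is needed
Annotators generate this object automatically during transform, so no manual work is needed. Still, understanding the structure of an annotation is important for using it efficiently.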
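The fields listed above can be pictured as a small record type. The sketch below is a plain-Python stand-in, not Spark NLP's own Annotation class; ToyAnnotation and whitespace_tokens are hypothetical names, and treating end as an inclusive character offset is an assumption of this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyAnnotation:
    annotator_type: str   # which kind of annotator produced this
    begin: int            # start offset in the original text
    end: int              # end offset (inclusive here, by assumption)
    result: str           # main output of the annotation
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


def whitespace_tokens(text: str) -> List[ToyAnnotation]:
    """Build one 'token' annotation per whitespace-separated word."""
    annotations = []
    position = 0
    for word in text.split():
        begin = text.index(word, position)
        end = begin + len(word) - 1
        annotations.append(ToyAnnotation("token", begin, end, word, {"sentence": "0"}))
        position = end + 1
    return annotations


tokens = whitespace_tokens("We are very happy")
for t in tokens:
    print(t.annotator_type, t.begin, t.end, t.result)
```

Because begin and end point back into the original text, the matched span can always be recovered as text[begin:end + 1] under the inclusive-end assumption.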
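Annotators¶
Annotators are the core components that implement NLP functionality in Spark NLP. There are two kinds:
- Annotator Approach: a Spark ML Estimator. It is trained with fit(data) and produces an Annotator Model (a Transformer).
- Annotator Model: a Transformer. Its transform(data) method adds a new annotation column to a DataFrame. All Transformers are additive: they only append information and never replace or delete existing data.
Both kinds of annotator can be added to a Pipeline, which runs its annotators in order and transforms the data accordingly. After fit(), a Pipeline becomes a PipelineModel, which can be saved and loaded at any time.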
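The Approach/Model split and the additive behaviour of transform() can be sketched in plain Python. This is an illustration only, not Spark NLP code: ToyLemmatizerApproach and ToyLemmatizerModel are hypothetical names, and a list of dicts stands in for a DataFrame.

```python
from typing import Dict, List, Tuple

Row = Dict[str, List[str]]


class ToyLemmatizerModel:
    """Plays the role of an Annotator Model (a Transformer)."""

    def __init__(self, lemma_of: Dict[str, str]) -> None:
        self.lemma_of = lemma_of

    def transform(self, rows: List[Row]) -> List[Row]:
        # Append a new "lemma" column; existing columns are copied unchanged.
        return [
            {**row, "lemma": [self.lemma_of.get(tok, tok) for tok in row["token"]]}
            for row in rows
        ]


class ToyLemmatizerApproach:
    """Plays the role of an Annotator Approach (an Estimator)."""

    def fit(self, pairs: List[Tuple[str, str]]) -> ToyLemmatizerModel:
        # "Training" here just memorises (word form, lemma) pairs.
        return ToyLemmatizerModel(dict(pairs))


model = ToyLemmatizerApproach().fit([("are", "be"), ("running", "run")])
before = [{"token": ["We", "are", "running"]}]
after = model.transform(before)

print(sorted(after[0]))    # column names after the transform
print(after[0]["lemma"])
```

The input rows are left untouched and the output keeps the token column alongside the new lemma column, which is the "append only" behaviour described above.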
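Annotating text quickly¶
Explain Document ML¶
Definition:
explain_document_ml is a general-purpose pretrained pipeline for document analysis provided by Spark NLP. It is based on classical machine learning models rather than deep learning, and is suited to quickly getting at the structure and content of a text.
Core functionality:
Converts raw text into a stream of NLP annotations
Runs common basic NLP tasks automatically: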
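Sentence detection: automatically finds sentence boundaries and splits long text into individual sentences for downstream processing.
Tokenization: splits each sentence further into words or tokens in preparation for word-level analysis.
Spell checking: proposes a corrected form for misspelled tokens.
Lemmatization: reduces each word to its dictionary base form (for example, "running" becomes "run") so that variants can be analyzed together.
Stemming: strips suffixes to reduce each word to a stem.
POS tagging: assigns each word a part-of-speech tag (noun, verb, and so on) to help in understanding syntactic structure.
Named entity recognition and stop word removal are not among this pipeline's stages; for NER, the deep-learning counterpart explain_document_dl can be used instead.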
Use cases:
Quick text preprocessing and structuring
Preliminary POS and lemmatization analysis
Teaching and debugging NLP pipelines
import sparknlp
spark = sparknlp.start()
from sparknlp.pretrained import PretrainedPipeline
explain_document_pipeline = PretrainedPipeline("explain_document_ml")
annotations = explain_document_pipeline.annotate("We are very happy about SparkNLP")
print(annotations)
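Called on a single string, annotate() returns a Python dictionary that maps each output column name to a list of result strings. The helper below is a sketch of one way to line those lists up token by token. align_by_token is a hypothetical name, the sample dictionary is hand-written for illustration rather than real pipeline output, and the key names token and pos are assumptions that may differ between pipeline versions.

```python
from typing import Dict, List


def align_by_token(annotations: Dict[str, List[str]], *columns: str) -> List[tuple]:
    """Zip the 'token' list with other per-token result lists."""
    tokens = annotations["token"]
    picked = [annotations[name] for name in columns]
    for name, values in zip(columns, picked):
        if len(values) != len(tokens):
            raise ValueError(f"column {name!r} is not token-aligned")
    return list(zip(tokens, *picked))


# Hand-written stand-in for a result dictionary; NOT real pipeline output.
sample = {
    "document": ["We are happy"],
    "token": ["We", "are", "happy"],
    "pos": ["PRP", "VBP", "JJ"],
}

rows = align_by_token(sample, "pos")
for token, pos in rows:
    print(token, pos)
```

With a real result, the same call would be align_by_token(annotations, ...) using whichever per-token keys the printed dictionary actually contains.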
:: resolve 831ms :: artifacts dl 43ms :: modules in use: com.amazonaws#aws-java-sdk-core;1.12.500 from central in [default] com.amazonaws#aws-java-sdk-kms;1.12.500 from central in [default] com.amazonaws#aws-java-sdk-s3;1.12.500 from central in [default] com.amazonaws#jmespath-java;1.12.500 from central in [default] com.github.universal-automata#liblevenshtein;3.0.0 from central in [default] com.github.virtuald#curvesapi;1.06 from central in [default] com.google.android#annotations;4.1.1.4 from central in [default] com.google.api#api-common;2.6.2 from central in [default] com.google.api#gax;2.23.2 from central in [default] com.google.api#gax-grpc;2.23.2 from central in [default] com.google.api#gax-httpjson;0.108.2 from central in [default] com.google.api-client#google-api-client;2.2.0 from central in [default] com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha from central in [default] com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha from central in [default] com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha from central in [default] com.google.api.grpc#proto-google-common-protos;2.14.2 from central in [default] com.google.api.grpc#proto-google-iam-v1;1.9.2 from central in [default] com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 from central in [default] com.google.auth#google-auth-library-credentials;1.16.0 from central in [default] com.google.auth#google-auth-library-oauth2-http;1.16.0 from central in [default] com.google.auto.value#auto-value;1.10.1 from central in [default] com.google.auto.value#auto-value-annotations;1.10.1 from central in [default] com.google.cloud#google-cloud-core;2.12.0 from central in [default] com.google.cloud#google-cloud-core-grpc;2.12.0 from central in [default] com.google.cloud#google-cloud-core-http;2.12.0 from central in [default] com.google.cloud#google-cloud-storage;2.20.1 from central in [default] com.google.code.findbugs#jsr305;3.0.2 from central in [default] 
com.google.code.gson#gson;2.10.1 from central in [default] com.google.errorprone#error_prone_annotations;2.18.0 from central in [default] com.google.guava#failureaccess;1.0.1 from central in [default] com.google.guava#guava;31.1-jre from central in [default] com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from central in [default] com.google.http-client#google-http-client;1.43.0 from central in [default] com.google.http-client#google-http-client-apache-v2;1.43.0 from central in [default] com.google.http-client#google-http-client-appengine;1.43.0 from central in [default] com.google.http-client#google-http-client-gson;1.43.0 from central in [default] com.google.http-client#google-http-client-jackson2;1.43.0 from central in [default] com.google.j2objc#j2objc-annotations;1.3 from central in [default] com.google.oauth-client#google-oauth-client;1.34.1 from central in [default] com.google.protobuf#protobuf-java;3.21.12 from central in [default] com.google.protobuf#protobuf-java-util;3.21.12 from central in [default] com.google.re2j#re2j;1.6 from central in [default] com.ibm.icu#icu4j;59.1 from central in [default] com.johnsnowlabs.nlp#jsl-llamacpp-cpu;1.0.2 from central in [default] com.johnsnowlabs.nlp#jsl-openvino-cpu_2.12;0.2.0 from central in [default] com.johnsnowlabs.nlp#spark-nlp_2.12;6.1.3 from central in [default] com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 from central in [default] com.microsoft.onnxruntime#onnxruntime;1.19.2 from central in [default] com.navigamez#greex;1.0 from central in [default] com.openhtmltopdf#openhtmltopdf-core;1.0.0 from central in [default] com.openhtmltopdf#openhtmltopdf-jsoup-dom-converter;1.0.0 from central in [default] com.openhtmltopdf#openhtmltopdf-pdfbox;1.0.0 from central in [default] com.openhtmltopdf#openhtmltopdf-rtl-support;1.0.0 from central in [default] com.typesafe#config;1.4.2 from central in [default] com.vladsch.flexmark#flexmark;0.61.34 from central in [default] 
com.vladsch.flexmark#flexmark-all;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-abbreviation;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-admonition;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-anchorlink;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-aside;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-attributes;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-autolink;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-definition;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-emoji;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-enumerated-reference;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-escaped-character;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-footnotes;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gfm-issues;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gfm-strikethrough;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gfm-tasklist;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gfm-users;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-gitlab;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-ins;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-jekyll-front-matter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-jekyll-tag;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-macros;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-media-tags;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-superscript;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-tables;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-toc;0.61.34 from central in [default] 
com.vladsch.flexmark#flexmark-ext-typographic;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-wikilink;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-xwiki-macros;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-yaml-front-matter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-ext-youtube-embedded;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-html2md-converter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-jira-converter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-pdf-converter;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-profile-pegdown;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-ast;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-builder;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-collection;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-data;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-dependency;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-format;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-html;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-misc;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-options;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-sequence;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-util-visitor;0.61.34 from central in [default] com.vladsch.flexmark#flexmark-youtrack-converter;0.61.34 from central in [default] com.zaxxer#SparseBitSet;1.2 from central in [default] commons-codec#commons-codec;1.15 from central in [default] commons-logging#commons-logging;1.1.3 from central in [default] de.rototor.pdfbox#graphics2d;0.24 from central in [default] dk.brics.automaton#automaton;1.11-8 from 
central in [default] io.grpc#grpc-alts;1.53.0 from central in [default] io.grpc#grpc-api;1.53.0 from central in [default] io.grpc#grpc-auth;1.53.0 from central in [default] io.grpc#grpc-context;1.53.0 from central in [default] io.grpc#grpc-core;1.53.0 from central in [default] io.grpc#grpc-googleapis;1.53.0 from central in [default] io.grpc#grpc-grpclb;1.53.0 from central in [default] io.grpc#grpc-netty-shaded;1.53.0 from central in [default] io.grpc#grpc-protobuf;1.53.0 from central in [default] io.grpc#grpc-protobuf-lite;1.53.0 from central in [default] io.grpc#grpc-services;1.53.0 from central in [default] io.grpc#grpc-stub;1.53.0 from central in [default] io.grpc#grpc-xds;1.53.0 from central in [default] io.opencensus#opencensus-api;0.31.1 from central in [default] io.opencensus#opencensus-contrib-http-util;0.31.1 from central in [default] io.opencensus#opencensus-proto;0.2.0 from central in [default] io.perfmark#perfmark-api;0.26.0 from central in [default] it.unimi.dsi#fastutil;7.0.12 from central in [default] jakarta.activation#jakarta.activation-api;2.1.3 from central in [default] jakarta.mail#jakarta.mail-api;2.1.3 from central in [default] javax.annotation#javax.annotation-api;1.3.2 from central in [default] joda-time#joda-time;2.8.1 from central in [default] org.apache.commons#commons-collections4;4.4 from central in [default] org.apache.commons#commons-compress;1.19 from central in [default] org.apache.commons#commons-math3;3.6.1 from central in [default] org.apache.httpcomponents#httpclient;4.5.13 from central in [default] org.apache.httpcomponents#httpcore;4.4.13 from central in [default] org.apache.pdfbox#fontbox;2.0.28 from central in [default] org.apache.pdfbox#pdfbox;2.0.28 from central in [default] org.apache.pdfbox#xmpbox;2.0.16 from central in [default] org.apache.poi#poi;4.1.2 from central in [default] org.apache.poi#poi-ooxml;4.1.2 from central in [default] org.apache.poi#poi-ooxml-schemas;4.1.2 from central in [default] 
org.apache.poi#poi-scratchpad;4.1.2 from central in [default] org.apache.xmlbeans#xmlbeans;3.1.0 from central in [default] org.checkerframework#checker-qual;3.31.0 from central in [default] org.codehaus.mojo#animal-sniffer-annotations;1.22 from central in [default] org.conscrypt#conscrypt-openjdk-uber;2.5.2 from central in [default] org.eclipse.angus#angus-activation;2.0.2 from central in [default] org.eclipse.angus#angus-mail;2.0.3 from central in [default] org.jetbrains#annotations;24.1.0 from central in [default] org.jsoup#jsoup;1.18.2 from central in [default] org.nibor.autolink#autolink;0.6.0 from central in [default] org.projectlombok#lombok;1.16.8 from central in [default] org.rocksdb#rocksdbjni;6.29.5 from central in [default] org.threeten#threetenbp;1.6.5 from central in [default] software.amazon.ion#ion-java;1.0.2 from central in [default] :: evicted modules: commons-logging#commons-logging;1.2 by [commons-logging#commons-logging;1.1.3] in [default] commons-codec#commons-codec;1.11 by [commons-codec#commons-codec;1.15] in [default] com.google.protobuf#protobuf-java-util;3.0.0-beta-3 by [com.google.protobuf#protobuf-java-util;3.21.12] in [default] com.google.protobuf#protobuf-java;3.0.0-beta-3 by [com.google.protobuf#protobuf-java;3.21.12] in [default] com.google.code.gson#gson;2.3 by [com.google.code.gson#gson;2.10.1] in [default] commons-codec#commons-codec;1.13 by [commons-codec#commons-codec;1.15] in [default] org.jetbrains#annotations;15.0 by [org.jetbrains#annotations;24.1.0] in [default] org.jsoup#jsoup;1.11.3 by [org.jsoup#jsoup;1.18.2] in [default] org.apache.pdfbox#pdfbox;2.0.16 by [org.apache.pdfbox#pdfbox;2.0.28] in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 163 | 0 | 0 | 9 || 154 | 0 | 
--------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent-73000933-a013-4409-92c5-17b3ebf12849 confs: [default] 0 artifacts copied, 154 already retrieved (0kB/13ms) 25/11/19 07:33:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Downloading and using a pretrained pipeline¶
Explain Document ML (explain_document_ml) is a pretrained pipeline that performs several common NLP tasks. Let's try it from Python. On the first run the pipeline is downloaded from the server, which may take a moment.
from sparknlp.pretrained import PretrainedPipeline
# Download the pretrained pipeline (later runs load it from the local cache)
explain_document_pipeline = PretrainedPipeline("explain_document_ml")
# annotate() takes a plain string and returns a dict of result lists
annotations = explain_document_pipeline.annotate("We are very happy about SparkNLP")
print(annotations)
# OUTPUT (key names vary with the pipeline version; the run below returns 'stems', 'spell', 'lemmas'):
# {
# 'stem': ['we', 'ar', 'veri', 'happi', 'about', 'sparknlp'],
# 'checked': ['We', 'are', 'very', 'happy', 'about', 'SparkNLP'],
# 'lemma': ['We', 'be', 'very', 'happy', 'about', 'SparkNLP'],
# 'document': ['We are very happy about SparkNLP'],
# 'pos': ['PRP', 'VBP', 'RB', 'JJ', 'IN', 'NNP'],
# 'token': ['We', 'are', 'very', 'happy', 'about', 'SparkNLP'],
# 'sentence': ['We are very happy about SparkNLP']
# }
25/11/19 07:33:45 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
explain_document_ml download started this may take some time.
Approximate size to download 9 MB
Download done! Loading the resource.
[OK!]
{'document': ['We are very happy about SparkNLP'], 'spell': ['We', 'are', 'very', 'happy', 'about', 'SparkNLP'], 'pos': ['PRP', 'VBP', 'RB', 'JJ', 'IN', 'NNP'], 'lemmas': ['We', 'be', 'very', 'happy', 'about', 'SparkNLP'], 'token': ['We', 'are', 'very', 'happy', 'about', 'SparkNLP'], 'stems': ['we', 'ar', 'veri', 'happi', 'about', 'sparknlp'], 'sentence': ['We are very happy about SparkNLP']}
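This result is the annotation output of the Spark NLP explain_document_ml pretrained pipeline for the input text "We are very happy about SparkNLP". The fields are explained below. Key names differ between pipeline versions: the actual run shown in this section returned spell, lemmas, and stems where the commented listing has checked, lemma, and stem.
- stem: stemming result (inflectional endings stripped to leave the stem)
- checked: tokens after spell checking
- lemma: lemmatization result (each token reduced to its dictionary form)
- document: the original text
- pos: part-of-speech tags (for example, PRP is a personal pronoun and VBP is a non-third-person singular present verb)
- token: tokenization result (the sentence split into words)
- sentence: sentence boundary detection result (here, the single original sentence)
These results can feed later NLP tasks such as text analysis and information extraction.
As the example shows, explain_document_ml annotates any "document" out of the box, producing stems, spell-checked tokens, lemmas, part-of-speech tags, tokens, and sentence boundaries.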
full = explain_document_pipeline.fullAnnotate("We are very happy about SparkNLP")
# full is a list with one dict per input text; the dict keys are the available annotation fields
print("fullAnnotate keys:", list(full[0].keys()))
print("Full structure example (token, pos, dependency, etc.):")
# "dependency" is not among the keys of this pipeline, so .get() returns None for it
for k in ["document", "token", "pos", "lemmas", "stems", "dependency"]:
    print(k, "->", full[0].get(k))
fullAnnotate keys: ['document', 'spell', 'pos', 'lemmas', 'token', 'stems', 'sentence']
Full structure example (token, pos, dependency, etc.):
document -> [Annotation(document, 0, 31, We are very happy about SparkNLP, {}, [])]
token -> [Annotation(token, 0, 1, We, {'sentence': '0'}, []), Annotation(token, 3, 5, are, {'sentence': '0'}, []), Annotation(token, 7, 10, very, {'sentence': '0'}, []), Annotation(token, 12, 16, happy, {'sentence': '0'}, []), Annotation(token, 18, 22, about, {'sentence': '0'}, []), Annotation(token, 24, 31, SparkNLP, {'sentence': '0'}, [])]
pos -> [Annotation(pos, 0, 1, PRP, {'word': 'We', 'sentence': '0'}, []), Annotation(pos, 3, 5, VBP, {'word': 'are', 'sentence': '0'}, []), Annotation(pos, 7, 10, RB, {'word': 'very', 'sentence': '0'}, []), Annotation(pos, 12, 16, JJ, {'word': 'happy', 'sentence': '0'}, []), Annotation(pos, 18, 22, IN, {'word': 'about', 'sentence': '0'}, []), Annotation(pos, 24, 31, NNP, {'word': 'SparkNLP', 'sentence': '0'}, [])]
lemmas -> [Annotation(token, 0, 1, We, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 3, 5, be, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 7, 10, very, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 12, 16, happy, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 18, 22, about, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 24, 31, SparkNLP, {'confidence': '0.0', 'sentence': '0'}, [])]
stems -> [Annotation(token, 0, 1, we, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 3, 5, ar, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 7, 10, veri, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 12, 16, happi, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 18, 22, about, {'confidence': '1.0', 'sentence': '0'}, []), Annotation(token, 24, 31, sparknlp, {'confidence': '0.0', 'sentence': '0'}, [])]
dependency -> None
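An Annotation has the form Annotation(annotatorType, begin, end, result, metadata, embeddings).
Field by field, for Annotation(document, 0, 31, We are very happy about SparkNLP, {}, []):
- annotatorType: document, meaning this annotation type was produced by DocumentAssembler.
- begin: 0, the 0-based index of the first character of the annotation in the original string.
- end: 31, the index of the last character of the annotation in the original string (inclusive).
- result: We are very happy about SparkNLP, the main output of the annotation (usually a span of the original text or a processed value).
- metadata: {}, a dictionary of extra information (such as sentence index, language, or rule source); empty here.
- embeddings: [], a list of numeric vectors when word or sentence embeddings are present; the empty list here means there are none.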
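The inclusive end index is easy to get wrong when slicing. The following is a minimal plain-Python sketch of that point. The Span tuple is a hypothetical stand-in introduced for illustration, not Spark NLP's Annotation class; the offsets are copied from the fullAnnotate output above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for the first four Annotation fields."""
    annotator_type: str
    begin: int   # 0-based index of the first character
    end: int     # index of the last character (inclusive)
    result: str


text = "We are very happy about SparkNLP"

# Offsets copied from the fullAnnotate output shown above.
spans = [
    Span("document", 0, 31, text),
    Span("token", 0, 1, "We"),
    Span("token", 3, 5, "are"),
    Span("token", 7, 10, "very"),
    Span("token", 12, 16, "happy"),
    Span("token", 18, 22, "about"),
    Span("token", 24, 31, "SparkNLP"),
]


def covered_text(source: str, span: Span) -> str:
    # `end` is inclusive, so the Python slice has to stop at end + 1.
    return source[span.begin:span.end + 1]


for span in spans:
    print(span.annotator_type, repr(covered_text(text, span)))
```

Slicing with end instead of end + 1 would drop the last character of every span, which is the usual symptom when offsets are treated as exclusive.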
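Using a pretrained pipeline with Spark DataFrames¶
You can also apply a pretrained pipeline to a Spark DataFrame. Create a Spark DataFrame with a "text" column as the pipeline input, then run the pipeline with .transform(); the output of every stage is stored in a new Spark DataFrame.
- transform is the Spark ML Transformer interface. It applies the pretrained pipeline to a Spark DataFrame and returns a distributed DataFrame with array-typed annotation columns, which suits batch and cluster processing.
- annotate (or fullAnnotate / LightPipeline.annotate) is a Python convenience method. It takes a single string or a list of strings and returns Python dicts/lists (flattened results or full annotations), which suits interactive use and quick inspection of small inputs.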
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
spark = sparknlp.start()
sentences = [
    ['Hello, this is an example sentence'],
    ['And this is a second sentence.']
]
# spark is the Spark session returned by sparknlp.start()
data = spark.createDataFrame(sentences).toDF("text")
# Download the pretrained pipeline from John Snow Labs' servers
explain_document_pipeline = PretrainedPipeline("explain_document_ml")
annotations_df = explain_document_pipeline.transform(data)
annotations_df.show()
Warning::Spark Session already created, some configs may not take. explain_document_ml download started this may take some time.
Approx size to download 9 MB [OK!]
25/11/13 11:19:47 WARN DAGScheduler: Broadcasting large task binary with size 6.1 MiB
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| text| document| sentence| token| spell| lemmas| stems| pos|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Hello, this is an...|[{document, 0, 33...|[{document, 0, 33...|[{token, 0, 4, He...|[{token, 0, 4, He...|[{token, 0, 4, He...|[{token, 0, 4, he...|[{pos, 0, 4, UH, ...|
|And this is a sec...|[{document, 0, 29...|[{document, 0, 29...|[{token, 0, 2, An...|[{token, 0, 2, An...|[{token, 0, 2, An...|[{token, 0, 2, an...|[{pos, 0, 2, CC, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
Manipulating pipelines¶
The DataFrame output above holds Annotation objects. In this form it is awkward to work with, as the following shows:
annotations_df.select("token").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|token |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{token, 0, 4, Hello, {sentence -> 0}, []}, {token, 5, 5, ,, {sentence -> 0}, []}, {token, 7, 10, this, {sentence -> 0}, []}, {token, 12, 13, is, {sentence -> 0}, []}, {token, 15, 16, an, {sentence -> 0}, []}, {token, 18, 24, example, {sentence -> 0}, []}, {token, 26, 33, sentence, {sentence -> 0}, []}]|
|[{token, 0, 2, And, {sentence -> 0}, []}, {token, 4, 7, this, {sentence -> 0}, []}, {token, 9, 10, is, {sentence -> 0}, []}, {token, 12, 12, a, {sentence -> 0}, []}, {token, 14, 19, second, {sentence -> 0}, []}, {token, 21, 28, sentence, {sentence -> 0}, []}, {token, 29, 29, ., {sentence -> 0}, []}] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
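- If we only want to work with the final annotation results, we can use the Finisher annotator.
- The Finisher converts the annotation results of a Spark NLP pipeline (token, lemma, pos, and so on) into plain string arrays (or delimited strings) that are easy to read and process. It drops the nested structure and metadata so the results can be analyzed or exported.
- Combine the Explain Document ML pretrained pipeline and a Finisher in a Spark ML Pipeline. Note that the pretrained pipeline expects the input column to be named "text".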
from sparknlp import Finisher
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
finisher = Finisher().setInputCols(["token", "lemmas", "pos"])
explain_pipeline_model = PretrainedPipeline("explain_document_ml").model
pipeline = Pipeline() \
    .setStages([
        explain_pipeline_model,
        finisher
    ])
sentences = [
    ['Hello, this is an example sentence'],
    ['And this is a second sentence.']
]
data = spark.createDataFrame(sentences).toDF("text")
model = pipeline.fit(data)
annotations_finished_df = model.transform(data)
annotations_finished_df.select('finished_token').show(truncate=False)
explain_document_ml download started this may take some time.
Approx size to download 9 MB [OK!]
+-------------------------------------------+
|finished_token                             |
+-------------------------------------------+
|[Hello, ,, this, is, an, example, sentence]|
|[And, this, is, a, second, sentence, .]    |
+-------------------------------------------+
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
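DocumentAssembler: getting data in¶
To run NLP processing, the raw data must first be annotated. DocumentAssembler is a special transformer that creates the first annotation, of type Document, for later annotators to consume.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")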
Sentence detection and tokenization¶
In this quick example we identify the sentences in the input document. SentenceDetector requires a Document annotation, which DocumentAssembler provides, and its own output is also of type Document. Tokenizer requires a Document annotation too, so it can consume the output of either DocumentAssembler or SentenceDetector. In the example below it consumes the sentence output.
sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
regexTokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
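Spark NLP also includes a special transformer, Finisher, that outputs annotation results such as tokens in human-readable form.
Finisher: getting data out¶
At the end of a pipeline, or of any stage in Spark NLP, you usually want to hand the results to another pipeline or write them to disk. The Finisher annotator outputs the results as arrays and, when setIncludeMetadata is set to True, also keeps the metadata in a separate column:
finisher = Finisher() \
    .setInputCols(["token"]) \
    .setIncludeMetadata(True)
If you need to flatten annotations other than struct-typed columns into a DataFrame (placing each sub-array in its own column), you can use Spark SQL's explode function. You can also use Apache Spark (SQL) functions to reshape the output DataFrame in whatever way you need. The same approach extends to combining tokens with NER results; the stages below finish only the token column: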
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
regexTokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
finisher = Finisher() \
    .setInputCols(["token"]) \
    .setIncludeMetadata(True)
Using a Spark ML Pipeline¶
Now we put all of the above together and retrieve the results with a Pipeline. None of these pipeline stages needs training, so fit() and transform() use the same data.
pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        sentenceDetector,
        regexTokenizer,
        finisher
    ])
data = spark.createDataFrame([("hello, this is an example sentence",)], ["text"])
model = pipeline.fit(data)
annotations = model.transform(data)
annotations.show(truncate=False)
+----------------------------------+-------------------------------------------+---------------------------------------------------------------------------------------------------------+
|text |finished_token |finished_token_metadata |
+----------------------------------+-------------------------------------------+---------------------------------------------------------------------------------------------------------+
|hello, this is an example sentence|[hello, ,, this, is, an, example, sentence]|[{sentence, 0}, {sentence, 0}, {sentence, 0}, {sentence, 0}, {sentence, 0}, {sentence, 0}, {sentence, 0}]|
+----------------------------------+-------------------------------------------+---------------------------------------------------------------------------------------------------------+
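To make the relationship between the nested token annotations and the two finished columns concrete, here is a plain-Python emulation. It is only a sketch of the flattening that the output above suggests, not the Finisher's implementation; the TokenAnnotation tuple and the finish function are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Optional, Tuple

# (annotatorType, begin, end, result, metadata): the fields visible in the
# token column printed earlier. A hypothetical stand-in, not a Spark NLP type.
TokenAnnotation = Tuple[str, int, int, str, Dict[str, str]]


def finish(
    annotations: List[TokenAnnotation],
    include_metadata: bool = False,
) -> Tuple[List[str], Optional[List[Tuple[str, str]]]]:
    """Flatten annotations to result strings, optionally with metadata pairs."""
    results = [result for (_, _, _, result, _) in annotations]
    if not include_metadata:
        return results, None
    metadata = [pair for (_, _, _, _, meta) in annotations for pair in meta.items()]
    return results, metadata


# Token annotations for the example sentence used in the pipeline above.
tokens: List[TokenAnnotation] = [
    ("token", 0, 4, "hello", {"sentence": "0"}),
    ("token", 5, 5, ",", {"sentence": "0"}),
    ("token", 7, 10, "this", {"sentence": "0"}),
    ("token", 12, 13, "is", {"sentence": "0"}),
    ("token", 15, 16, "an", {"sentence": "0"}),
    ("token", 18, 24, "example", {"sentence": "0"}),
    ("token", 26, 33, "sentence", {"sentence": "0"}),
]

finished_token, finished_token_metadata = finish(tokens, include_metadata=True)
print(finished_token)
print(finished_token_metadata)
```

The two printed lists correspond to the finished_token and finished_token_metadata columns in the table above: one result string per token, and one (key, value) metadata pair per token.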
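Using Spark NLP's LightPipeline¶
LightPipeline is a Spark NLP specific Pipeline class, the counterpart of a Spark ML Pipeline. The difference is that its execution does not follow Spark's distributed principles: everything is computed locally (but in parallel), which gives very fast results on small amounts of data. Instead of a Spark DataFrame, its input is a string or an array of strings to annotate. To create a LightPipeline, pass in an already trained (fitted) Spark ML Pipeline. Its transform() stage becomes the annotate() method.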
from sparknlp.base import LightPipeline
explain_document_pipeline = PretrainedPipeline("explain_document_ml")
lightPipeline = LightPipeline(explain_document_pipeline.model)
lightPipeline.annotate("Hello world, please annotate my text")
explain_document_ml download started this may take some time.
Approx size to download 9 MB [OK!]
{'document': ['Hello world, please annotate my text'],
'spell': ['Hello', 'world', ',', 'please', 'annotate', 'my', 'text'],
'pos': ['UH', 'NN', ',', 'VB', 'NN', 'PRP$', 'NN'],
'lemmas': ['Hello', 'world', ',', 'please', 'annotate', 'i', 'text'],
'token': ['Hello', 'world', ',', 'please', 'annotate', 'my', 'text'],
'stems': ['hello', 'world', ',', 'pleas', 'annot', 'my', 'text'],
'sentence': ['Hello world, please annotate my text']}
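Because annotate() returns ordinary Python lists, the per-token fields can be combined with plain Python. The sketch below copies the dictionary printed above and zips tokens, POS tags, and lemmas into rows. It assumes the token-level lists are aligned one-to-one, which holds for this output (seven entries each) but is an assumption in general; token_table is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Dictionary copied from the annotate() output shown above.
annotations: Dict[str, List[str]] = {
    "document": ["Hello world, please annotate my text"],
    "spell": ["Hello", "world", ",", "please", "annotate", "my", "text"],
    "pos": ["UH", "NN", ",", "VB", "NN", "PRP$", "NN"],
    "lemmas": ["Hello", "world", ",", "please", "annotate", "i", "text"],
    "token": ["Hello", "world", ",", "please", "annotate", "my", "text"],
    "stems": ["hello", "world", ",", "pleas", "annot", "my", "text"],
    "sentence": ["Hello world, please annotate my text"],
}


def token_table(result: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Build one (token, pos, lemma) row per token from the aligned lists."""
    return list(zip(result["token"], result["pos"], result["lemmas"]))


rows = token_table(annotations)
for token, pos, lemma in rows:
    print(f"{token:<10}{pos:<6}{lemma}")
```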
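Training annotators¶
Training methodology¶
In practice, training your own annotators is a key concept. The pretrained pipelines and models above can be applied directly to a given use case, but tuning them to your actual needs usually gives better results, and real problems often call for training your own models. Spark NLP supports three ways of training a custom annotator:
Training from a dataset: most annotators can be trained from a dataset with fit(), just as in Spark ML. Annotators with the Approach suffix are trainable. Training through fit() is standard Spark ML behavior. Each annotator has its own requirements for the schema of the training data; see the reference documentation for details.
Training from an external source: some annotators accept an external file or folder through parameter setters such as setCorpus() or setDictionary(). You can have Spark NLP read these resources as a Spark dataset or LINE_BY_LINE; the latter usually suits small files and is faster.
Training deep learning models: some annotators are based on deep learning. These models can be trained through the standard AnnotatorApproach API like any other annotator. For advanced users, importing custom computation graphs is also supported, as is training in Python and converting the result to an AnnotatorModel.
Spark NLP imports¶
- base contains the common Spark NLP transformers and concepts,
- annotator contains all currently available annotators,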
from sparknlp.base import *
from sparknlp.annotator import *
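Spark ML Pipelines¶
Spark ML Pipelines are a uniform structure that helps create and tune practical machine learning pipelines. Spark NLP integrates with them seamlessly, so it is important to understand the concept. Once a Pipeline has been trained with fit(), it becomes a PipelineModel.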
from pyspark.ml import Pipeline
pipeline = Pipeline().setStages([...])
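LightPipeline¶
A LightPipeline turns a Spark ML pipeline into a single-machine, multithreaded task, which can be more than 10 times faster for smaller amounts of data (roughly up to 50,000 sentences). To use one, pass in an already trained (fitted) pipeline.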
from sparknlp.base import LightPipeline
LightPipeline(someTrainedPipeline).annotate(someStringOrArray)
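- annotate(string or string[]): returns a dictionary of annotation results for a single string, or a list of dictionaries for a list of strings
- fullAnnotate(string or string[]): returns a list of dictionaries with the full annotation content
Spark NLP - Training¶
Training datasets¶
Reference repository: JohnSnowLabs/spark-nlp
Spark NLP provides dedicated classes for loading common NLP training datasets, supporting the training of annotators for tasks such as part-of-speech tagging, named entity recognition, and spell checking.
POS dataset¶
To train a Part of Speech Tagger annotator, a Spark NLP component reads the corpus text file and converts it into a Spark DataFrame for model training and processing.
A POS dataset usually has the following format: one sentence per line, with each word and its part-of-speech tag separated by a delimiter (such as |):
The|DT quick|JJ brown|JJ fox|NN jumps|VB over|IN the|DT lazy|JJ dog|NN
- Each word is followed by a part-of-speech tag (DT, JJ, NN, VB, and so on), with the delimiter between the two.
- Each sentence is on its own line.
- For training, such a dataset can be loaded with Spark NLP's POS().readDataset() method.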
from sparknlp.training import POS
pos = POS()
path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
posDf = pos.readDataset(spark, path, "|", "tags")
posDf.selectExpr("explode(tags) as tags").show(3, truncate=False)
+---------------------------------------+
|tags |
+---------------------------------------+
|{pos, 0, 5, NNP, {word -> Pierre}, []} |
|{pos, 7, 12, NNP, {word -> Vinken}, []}|
|{pos, 14, 14, ,, {word -> ,}, []} |
+---------------------------------------+
only showing top 3 rows
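To make the word|tag layout described above concrete, the following plain-Python sketch splits one corpus line into (word, tag) pairs. It only illustrates the file format and makes no claim about how POS().readDataset() is implemented; parse_pos_line is a hypothetical helper name.

```python
from typing import List, Tuple


def parse_pos_line(line: str, delimiter: str = "|") -> List[Tuple[str, str]]:
    """Split a 'word|TAG word|TAG ...' line into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST delimiter, so a word that itself
        # contains the delimiter keeps everything before the final one.
        word, sep, tag = item.rpartition(delimiter)
        if not sep:
            raise ValueError(f"missing delimiter {delimiter!r} in {item!r}")
        pairs.append((word, tag))
    return pairs


sentence = "The|DT quick|JJ brown|JJ fox|NN jumps|VB over|IN the|DT lazy|JJ dog|NN"
pairs = parse_pos_line(sentence)
print(pairs)
```

The delimiter argument plays the same role as the "|" passed to readDataset in the example above.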
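CoNLL dataset¶
To train a named entity recognition (NER DL) annotator, a dataset in CoNLL 2003 format must be read into a Spark DataFrame. Spark NLP provides a dedicated component that reads the plain text file directly and converts it into a structured dataset for model training.
- Data format: each line holds a word with its part-of-speech tag and entity tag, in the CoNLL 2003 IOB format.
- Loading: import the file with CoNLL().readDataset(), which yields a DataFrame with fields such as sentence, token, and label.
- Use cases: data preparation for tasks such as NER, POS tagging, and text classification.
Example (CoNLL 2003 format):
EU NNP B-ORG
rejects VBZ O
German JJ B-MISC
call NN O
to TO O
boycott VB O
British JJ B-MISC
lamb NN O
. . O
- Each line lists, in order: the word, its part-of-speech (POS) tag, and its entity (NER) tag. The original CoNLL 2003 files also carry a syntactic chunk tag between the POS and NER columns, which this simplified example omits.
- Sentences are separated by blank lines.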
from sparknlp.training import CoNLL
trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
trainingData.selectExpr(
    "text",
    "token.result as tokens",
    "pos.result as pos",
    "label.result as label"
).show(3, False)
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|text                                            |tokens                                                    |pos                                  |label                                    |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|EU rejects German call to boycott British lamb .|[EU, rejects, German, call, to, boycott, British, lamb, .]|[NNP, VBZ, JJ, NN, TO, VB, JJ, NN, .]|[B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]|
|Peter Blackburn                                 |[Peter, Blackburn]                                        |[NNP, NNP]                           |[B-PER, I-PER]                           |
|BRUSSELS 1996-08-22                             |[BRUSSELS, 1996-08-22]                                    |[NNP, CD]                            |[B-LOC, O]                               |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
only showing top 3 rows
25/11/13 11:39:42 WARN TaskSetManager: Stage 39 contains a task of very large size (4261 KiB). The maximum recommended task size is 1000 KiB.
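CoNLL-U Dataset¶
To train a dependency parser (DependencyParserApproach), a dataset in CoNLL-U format has to be loaded into a Spark DataFrame. Spark NLP provides a dedicated helper that reads the CoNLL-U file directly and converts it into structured data for training and further processing.
- Data format: each line holds a word together with its POS tags, dependency relation, and related fields, in the standard CoNLL-U format.
- Loading: call CoNLLU().readDataset() to import the file; the resulting DataFrame contains sentence, form, POS, and lemma columns, among others.
- Use cases: data preparation for dependency parsing, POS tagging, and lemmatization.
Example (CoNLL-U format):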
# sent_id = 1
# text = They buy and sell books.
1 They _ PRON PRP _ 2 nsubj _ _
2 buy _ VERB VBP _ 0 root _ _
3 and _ CONJ CC _ 4 cc _ _
4 sell _ VERB VBP _ 2 conj _ _
5 books _ NOUN NNS _ 2 obj _ _
6 . _ PUNCT . _ 2 punct _ _
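- Each line contains, in order: token index (ID), word form (FORM), lemma (LEMMA), universal POS tag (UPOS), language-specific POS tag (XPOS), morphological features (FEATS), head index (HEAD), dependency relation (DEPREL), enhanced dependencies (DEPS), and miscellaneous annotations (MISC).
- Sentences are separated by blank lines.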
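The ten fields can be read into named records with a few lines of plain Python. This is only a sketch of the file layout, not what CoNLLU().readDataset() does internally; `parse_conllu_sentence` and `CONLLU_FIELDS` are names invented for this example. Real CoNLL-U files separate fields with tabs, so the helper falls back to whitespace splitting only for samples like the one shown here.

```python
from typing import Dict, List

# Field names in CoNLL-U column order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_sentence(block: str) -> List[Dict[str, str]]:
    """Turn one CoNLL-U sentence block into a list of token records."""
    tokens = []
    for line in block.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # Comment lines carry metadata such as sent_id and text.
            continue
        values = line.split("\t") if "\t" in line else line.split()
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, values)))
    return tokens


sample = """# sent_id = 1
# text = They buy and sell books.
1 They _ PRON PRP _ 2 nsubj _ _
2 buy _ VERB VBP _ 0 root _ _
3 and _ CONJ CC _ 4 cc _ _
4 sell _ VERB VBP _ 2 conj _ _
5 books _ NOUN NNS _ 2 obj _ _
6 . _ PUNCT . _ 2 punct _ _
"""

tokens = parse_conllu_sentence(sample)
for token in tokens:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```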
from sparknlp.training import CoNLLU
conlluFile = "src/test/resources/conllu/en.test.conllu"
conllDataSet = CoNLLU(explodeSentences=False).readDataset(spark, conlluFile)
conllDataSet.selectExpr(
"text",
"form.result as form",
"upos.result as upos",
"xpos.result as xpos",
"lemma.result as lemma"
).show(1, False)
+-----------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|text                                     |form                                          |upos                                         |xpos                          |lemma                                       |
+-----------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|What if Google Morphed Into GoogleOS?\n\n|[What, if, Google, Morphed, Into, GoogleOS, ?]|[PRON, SCONJ, PROPN, VERB, ADP, PROPN, PUNCT]|[WP, IN, NNP, VBD, IN, NNP, .]|[what, if, Google, morph, into, GoogleOS, ?]|
+-----------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
only showing top 1 row
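PubTator Dataset¶
The PubTator format is widely used for annotated biomedical text. Each record carries a paper's title, its abstract, and entity annotations such as diseases, genes, and chemicals. Every line of a record starts with the PMID, followed by the title, the abstract, or one annotated span. A PubTator text file can be loaded straight into a Spark DataFrame for downstream biomedical entity extraction and analysis.
A simplified illustration of the layout:
12345678 Title Novel gene associated with disease.
12345678 Abstract This study identifies a new gene linked to the condition.
12345678 6 10 gene GeneName
12345678 27 34 disease DiseaseName
- The first two lines hold the PMID, the line type (Title/Abstract), and the text. Actual PubTator files mark these lines as PMID|t|... and PMID|a|... and separate the annotation fields with tabs.
- The remaining lines are entity annotations: PMID, start offset, end offset (exclusive), mention text, entity type.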
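Character-offset annotations like these end up as per-token IOB labels (compare the finished_ner column in the reader output further down). The sketch below shows one simple way to do that conversion in plain Python. It is an illustration under stated assumptions, not the logic of the Spark NLP PubTator reader: `spans_to_iob` is a hypothetical helper, tokens come from a naive regular expression, and end offsets are treated as exclusive.

```python
import re
from typing import List, Tuple


def spans_to_iob(text: str, spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Label regex tokens with IOB tags from (start, end, type) spans.

    Assumption: end offsets are exclusive and spans do not overlap.
    """
    labeled = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, entity_type in spans:
            # A token belongs to a span when it lies fully inside it.
            if start <= match.start() and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{entity_type}"
                break
        labeled.append((match.group(), tag))
    return labeled


title = "Novel gene associated with disease."
# Offsets computed for this title: "gene" is [6, 10), "disease" is [27, 34).
annotations = [(6, 10, "GeneName"), (27, 34, "DiseaseName")]

labeled_title = spans_to_iob(title, annotations)
for token, tag in labeled_title:
    print(token, tag)
```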
from sparknlp.training import PubTator
pubTatorFile = "./src/test/resources/corpus_pubtator_sample.txt"
pubTatorDataSet = PubTator().readDataset(spark, pubTatorFile)
pubTatorDataSet.show(1)
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
Download done! Loading the resource.
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
| doc_id| finished_token| finished_pos| finished_ner|finished_token_metadata|finished_pos_metadata|finished_label_metadata|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|25763772|[DCTN4, as, a, mo...|[NNP, IN, DT, NN,...|[B-T116, O, O, O,...| [{sentence, 0}, {...| [{word, DCTN4}, {...| [{word, DCTN4}, {...|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
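Text Processing¶
These annotators can be trained for text-processing tasks: dependency parsing, lemmatization, part-of-speech tagging, sentence detection, and word segmentation.
DependencyParserApproach (Dependency Parser)¶
Trains an unlabeled dependency parser that finds the grammatical relations between the words of a sentence.
A dependency parse shows, for example, which words are attached to a verb as its subject and object and which words modify which, which helps pin down sentence structure and meaning.
Training data is accepted in two formats (only one may be set per model):
- Penn Treebank format: set with setDependencyTreeBank, which takes a dependency treebank text file.
- CoNLL-U format: set with setConllU, which takes a dataset in standard CoNLL-U format.
Dependency parsing is widely used in NLP tasks such as information extraction, question answering, and text understanding.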
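# Dependency parser training example
# Shows how to train an unlabeled dependency parser with Spark NLP.
# Uses a dependency treebank in Penn Treebank format (dependency_treebank).
# Stages: document assembly, sentence detection, tokenization, POS tagging, dependency parser.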
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# DocumentAssembler turns raw text into Spark NLP DOCUMENT annotations.
# It is the entry point of every pipeline; all later stages build on its output.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols("document") \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols("sentence", "token") \
.setOutputCol("pos")
dependencyParserApproach = DependencyParserApproach() \
.setInputCols("sentence", "pos", "token") \
.setOutputCol("dependency") \
.setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")
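# Build the dependency parsing pipeline, fit it, and run inference.
# The parser trains on the Penn Treebank-style treebank set above, so fit() needs no extra training data.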
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParserApproach
])
# Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
data = spark.createDataFrame([("Hello, this is an example sentence.",)], ["text"])
pipelineModel.transform(data).selectExpr("token.result as tokens").show(truncate=False)
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
+----------------------------------------------+
|tokens                                        |
+----------------------------------------------+
|[Hello, ,, this, is, an, example, sentence, .]|
+----------------------------------------------+
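Lemmatizer¶
The Lemmatizer reduces each word to its dictionary base form (lemma) so that different inflections are treated uniformly in later analysis. Training requires a dictionary of predefined lemmas, supplied as a delimited text file through setDictionary. For standard use cases a pretrained model can be loaded with LemmatizerModel.pretrained instead.
Main uses:
- Normalizing text by unifying word forms (e.g. "running" → "run")
- Improving the accuracy of text analysis and information extraction
Input annotator types: TOKEN
Output annotator types: TOKEN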
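The dictionary format is easiest to see with a small lookup table. The following pure-Python sketch mimics the idea of a delimited lemma dictionary (a lemma, a key delimiter, then its inflected forms separated by a value delimiter). It is not Spark NLP code: `load_lemma_dictionary` and `lemmatize` are hypothetical helpers, and the dictionary lines are invented for illustration rather than taken from lemmas_small.txt.

```python
from typing import Dict, Iterable, List


def load_lemma_dictionary(lines: Iterable[str],
                          key_delimiter: str = "->",
                          value_delimiter: str = "\t") -> Dict[str, str]:
    """Build a form -> lemma lookup from 'lemma -> form1<TAB>form2' lines."""
    lookup: Dict[str, str] = {}
    for line in lines:
        if key_delimiter not in line:
            continue  # skip lines that do not follow the format
        lemma, forms = line.split(key_delimiter, 1)
        lemma = lemma.strip()
        lookup[lemma] = lemma
        for form in forms.strip().split(value_delimiter):
            form = form.strip()
            if form:
                lookup[form] = lemma
    return lookup


def lemmatize(tokens: List[str], lookup: Dict[str, str]) -> List[str]:
    """Replace known forms with their lemma; unknown tokens pass through."""
    return [lookup.get(token, token) for token in tokens]


# Invented sample entries, for illustration only.
dictionary_lines = [
    "pick -> pick\tpicks\tpicking\tpicked",
    "peck -> peck\tpecks\tpecking",
    "pickle -> pickle\tpickles\tpickled\tpickling",
    "pepper -> pepper\tpeppers",
]

lookup = load_lemma_dictionary(dictionary_lines)
tokens = ["Peter", "Pipers", "employees", "are", "picking",
          "pecks", "of", "pickled", "peppers", "."]
lemmas = lemmatize(tokens, lookup)
print(lemmas)
```

Tokens missing from the dictionary are returned unchanged, which is why proper names and function words survive as they are.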
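# Lemmatizer training example
# Shows how to train a custom Lemmatizer with Spark NLP.
# Uses a custom lemma dictionary (lemmas_small.txt) whose lines have the form
# key -> value1 value2 ..., with "->" as the key delimiter and "\t" as the value delimiter.
# Pipeline stages:
# 1. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 2. SentenceDetector: finds sentence boundaries and splits the text into sentences.
# 3. Tokenizer: splits each sentence into tokens.
# 4. Lemmatizer: maps each token to its lemma using the custom dictionary.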
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
lemmatizer = Lemmatizer() \
.setInputCols(["token"]) \
.setOutputCol("lemma") \
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
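# Build the lemmatization pipeline, fit it, and run inference.
# Stages: document assembly, sentence detection, tokenization, lemmatization.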
pipeline = Pipeline() \
.setStages([
documentAssembler,
sentenceDetector,
tokenizer,
lemmatizer
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
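PerceptronApproach (Perceptron POS Tagger)¶
Trains an averaged perceptron model for part-of-speech (POS) tagging, which assigns a POS tag to every word in a sentence.
Training data requirements:
- The input is a Spark DataFrame with a column of POS-type annotations.
- Each annotation's result field holds the POS tag, and its metadata field must map the key "word" to the corresponding word.
- The helper class POS creates such a training DataFrame from a text file.
Use cases:
- Automatic POS tagging
- Grammatical analysis and structuring of text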
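The POS helper reads text in which every token is written as word, delimiter, tag (the earlier readDataset call in this notebook uses "|" as the delimiter). The sketch below parses one such line in plain Python to show what ends up in each annotation. `parse_pos_line` is a hypothetical helper, not part of Spark NLP, and the sample line is illustrative.

```python
from typing import List, Tuple


def parse_pos_line(line: str, delimiter: str = "|") -> List[Tuple[str, str]]:
    """Split 'word|TAG word|TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST delimiter so a word containing "|" keeps its text.
        word, _, tag = item.rpartition(delimiter)
        if not word:
            raise ValueError(f"missing delimiter or word in {item!r}")
        pairs.append((word, tag))
    return pairs


line = "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ"
pairs = parse_pos_line(line)

# Each pair corresponds to one POS annotation: result = tag, metadata["word"] = word.
for word, tag in pairs:
    print({"result": tag, "metadata": {"word": word}})
```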
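# Perceptron POS tagger training example
# Shows how to train a custom POS tagger (PerceptronApproach) with Spark NLP.
# Steps:
# 1. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 2. SentenceDetector: finds sentence boundaries and splits the text into sentences.
# 3. Tokenizer: splits each sentence into tokens.
# 4. POS().readDataset: loads the POS training set into a DataFrame with a tags column.
# 5. PerceptronApproach: trains the POS model (trainedPos) on that DataFrame.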
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
trainingPerceptronDF = POS().readDataset(spark, datasetPath)
trainedPos = PerceptronApproach() \
.setInputCols(["document", "token"]) \
.setOutputCol("pos") \
.setPosColumn("tags") \
.fit(trainingPerceptronDF)
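# Build the POS tagging pipeline and run inference.
# Stages:
# 1. documentAssembler: turns raw text into DOCUMENT annotations
# 2. sentence: detects sentence boundaries
# 3. tokenizer: splits sentences into tokens
# 4. trainedPos: tags tokens with the perceptron model trained above
# Output:
# result.selectExpr("pos.result").show(truncate=False)
# prints one POS tag per token (e.g. 'NN' for a noun, 'VB' for a verb),
# as a list aligned with the tokens of the input text.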
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
trainedPos
])
data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
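SentenceDetectorDLApproach (Deep Learning Sentence Detector)¶
Trains a deep-learning model for sentence boundary detection.
- Supports a CNN architecture (the default model is "cnn", based on the paper "Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection").
- The architecture is selected with setModelArchitecture; more types may be supported in the future.
- Tuned to handle broken sentences and stray line breaks, which improves boundary accuracy.
- Returns the sentences of each document as an array, or as separate rows when explodeSentences is set to true.
- For pretrained models, see SentenceDetectorDLModel.
- For more usage examples, see the official Examples.
Input annotator types: DOCUMENT
Output annotator types: DOCUMENT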
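# SentenceDetectorDLApproach training example
# Shows how to train a custom sentence detection model with Spark NLP.
# Steps:
# 1. Read the training data, one sentence per line.
# 2. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 3. SentenceDetectorDLApproach: the deep-learning sentence detector; sets the input and output columns and the number of epochs.
# 4. Build the pipeline and train the model.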
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
trainingData = spark.read.text("train.txt").toDF("text")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLApproach() \
.setInputCols(["document"]) \
.setOutputCol("sentences") \
.setEpochsNumber(100)
pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])
model = pipeline.fit(trainingData)
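TypedDependencyParserApproach (Labeled Dependency Parser)¶
Trains a labeled dependency parser that assigns a specific grammatical relation (subject, modifier, object, and so on) to each dependency between words. Training data must be in CoNLL-U or CoNLL 2009 format, and the annotator takes TOKEN, POS, and DEPENDENCY annotations as input.
- The relation types it outputs help in understanding sentence structure and meaning.
- The dependencies themselves must be produced first (for example by a DependencyParser); the typed parser then labels them.
- The training file is set with setConllU or setConll2009; see the official documentation and API reference for format details.
- A pretrained model can be loaded directly with TypedDependencyParserModel.
Input annotator types: TOKEN, POS, DEPENDENCY
Output annotator types: LABELED_DEPENDENCY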
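Once both head indices (from the unlabeled parser) and relation labels (from the typed parser) are available, reading off a verb's arguments is a simple lookup. The sketch below uses the heads and relations of the CoNLL-U example sentence "They buy and sell books." as hand-written input. `find_arguments` is a hypothetical helper, and the relation names nsubj and obj follow that example rather than any particular model's output.

```python
from typing import List, Optional, Tuple


def find_arguments(tokens: List[str], heads: List[int], labels: List[str],
                   verb_index: int) -> Tuple[Optional[str], Optional[str]]:
    """Return (subject, object) of the verb at verb_index (0-based).

    heads uses 1-based token indices, with 0 marking the root.
    """
    subject = obj = None
    for i, (head, label) in enumerate(zip(heads, labels)):
        if head != verb_index + 1:
            continue  # this token is not attached to the verb
        if label == "nsubj":
            subject = tokens[i]
        elif label == "obj":
            obj = tokens[i]
    return subject, obj


tokens = ["They", "buy", "and", "sell", "books", "."]
heads = [2, 0, 4, 2, 2, 2]                      # unlabeled dependencies
labels = ["nsubj", "root", "cc", "conj", "obj", "punct"]  # relation types

root_index = heads.index(0)  # the root token has head 0
arguments = find_arguments(tokens, heads, labels, root_index)
print(tokens[root_index], arguments)
```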
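# Labeled dependency parser training example
# Shows how to train a labeled dependency parser (TypedDependencyParserApproach) with Spark NLP.
# Steps:
# 1. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 2. SentenceDetector: finds sentence boundaries and splits the text into sentences.
# 3. Tokenizer: splits each sentence into tokens.
# 4. PerceptronModel.pretrained: loads a pretrained perceptron POS tagger.
# 5. DependencyParserModel.pretrained: loads a pretrained dependency parser that produces the dependencies.
# 6. TypedDependencyParserApproach: trains the labeled parser on CoNLL-U data and
# outputs the specific relation between words (subject, modifier, object, and so on).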
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
dependencyParser = DependencyParserModel.pretrained() \
.setInputCols(["sentence", "pos", "token"]) \
.setOutputCol("dependency")
typedDependencyParser = TypedDependencyParserApproach() \
.setInputCols(["dependency", "pos", "token"]) \
.setOutputCol("dependency_type") \
.setConllU("src/test/resources/parser/labeled/train_small.conllu.txt") \
.setNumberOfIterations(1)
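# Build the labeled dependency parsing pipeline, fit it, and run inference.
# The typed parser trains on the CoNLL-U file set above, so fit() needs no extra training data.
# Stages: document assembly, sentence detection, tokenization, POS tagging, dependency parsing, labeled dependency parsing.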
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParser,
typedDependencyParser
])
# Additional training data is not needed, the dependency parser relies on CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
data = spark.createDataFrame([("Hello, this is an example sentence.",)], ["text"])
pipelineModel.transform(data).selectExpr("dependency_type.result as labeled_dependencies").show(truncate=False)
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
25/11/13 11:55:14 WARN TypedDependencyParser: Couldn't find coarse POS map for this language
25/11/13 11:55:15 WARN TypedDependencyParser: Power method didn't converge.rankFirstOrderTensor=%d sigma=%f%n
+----------------------------------------------------------------------------------------+
|labeled_dependencies                                                                    |
+----------------------------------------------------------------------------------------+
|[<no-type>, <no-type>, <no-type>, <no-type>, <no-type>, <no-type>, <no-type>, <no-type>]|
+----------------------------------------------------------------------------------------+
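WordSegmenterApproach (Word Segmenter)¶
WordSegmenterApproach trains a word segmentation model for non-English languages that do not separate words with whitespace (such as Chinese, Japanese, and Korean), splitting running text into meaningful units (tokens).
In many languages the words of a sentence are not delimited by spaces, so segmentation has to rely on knowledge of the language. The segmenter learns where to split from training data annotated with POS tags.
Training:
- The training data must contain POS tags; the helper class POS reads it into a DataFrame.
- The column holding the tags is set with setPosColumn.
Input annotator types: DOCUMENT
Output annotator types: TOKEN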
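The training file for a segmenter tags each character with its position inside a word, written in the same character|tag layout that the POS helper reads. The sketch below generates such a line from text that is already segmented. It is a hedged illustration: `words_to_training_line` is a hypothetical helper, and the tag scheme (LL for the first character of a word, MM for a middle character, RR for the last, LR for a single-character word) is an assumption based on publicly available Spark NLP training examples, so check it against your actual training file.

```python
from typing import List


def words_to_training_line(words: List[str], delimiter: str = "|") -> str:
    """Tag every character with its position inside its word.

    Assumed tag scheme: LL = first, MM = middle, RR = last, LR = single-character word.
    """
    items = []
    for word in words:
        if len(word) == 1:
            tags = ["LR"]
        else:
            tags = ["LL"] + ["MM"] * (len(word) - 2) + ["RR"]
        items.extend(f"{char}{delimiter}{tag}" for char, tag in zip(word, tags))
    return " ".join(items)


segmented = ["上海", "计划", "到", "本世纪末"]
training_line = words_to_training_line(segmented)
print(training_line)
```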
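# Chinese word segmenter training example
# Shows how to train a Chinese word segmenter (WordSegmenterApproach) with Spark NLP.
# Steps:
# 1. Import the Spark NLP modules.
# 2. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 3. WordSegmenterApproach: the POS-tag-based segmenter; sets the input and output columns, the POS column, and the number of iterations.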
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
wordSegmenter = WordSegmenterApproach() \
.setInputCols(["document"]) \
.setOutputCol("token") \
.setPosColumn("tags") \
.setNIterations(5)
pipeline = Pipeline().setStages([
documentAssembler,
wordSegmenter
])
trainingDataSet = POS().readDataset(
spark,
"src/test/resources/word-segmenter/chinese_train.utf8"
)
pipelineModel = pipeline.fit(trainingDataSet)
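Spell Checkers¶
These annotators can be trained for text-correction tasks.
ContextSpellCheckerApproach¶
ContextSpellChecker is a deep-learning, context-aware spelling correction algorithm that treats spell checking as a sequence-to-sequence mapping problem. It combines word-level, sentence-level, and character-level information to generate and rank correction candidates:
- Word level: generates several correction candidates for each word
- Sentence level: uses the sentence context to choose better corrections
- Character level: scores the cost of a correction by character edit distance
The approach is described in "Applying Context Aware Spell Checking in Spark NLP". For more usage, see the official Examples and "Training a Contextual Spell Checker for Italian Language".
Input annotator types: TOKEN
Output annotator types: TOKEN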
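The character-level cost can be pictured with plain Levenshtein distance, the minimum number of single-character insertions, deletions, and substitutions that turn one string into another. The sketch below ranks a few candidates for a misspelled word by that distance. It illustrates the general idea only: ContextSpellChecker combines its own weighted edit costs with a language model, and `levenshtein` and the candidate list are written for this example.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


misspelled = "speling"
candidates: List[str] = ["spilling", "spelling", "sapling"]

# Lower distance means a cheaper correction at the character level.
ranked = sorted(candidates, key=lambda word: levenshtein(misspelled, word))
for word in ranked:
    print(word, levenshtein(misspelled, word))
```

In a context-aware checker this character-level cost is only one signal; the sentence context decides between candidates that are equally close.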
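# Spell checker training example (ContextSpellCheckerApproach)
# Shows how to train a context-aware spell checker with Spark NLP.
# Steps:
# 1. Import the Spark NLP modules.
# 2. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 3. Tokenizer: splits the text into tokens.
# 4. ContextSpellCheckerApproach: the context-based spell checker; sets the input and output columns, maximum word distance, batch size, number of epochs, and number of language model classes.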
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
spellChecker = ContextSpellCheckerApproach() \
.setInputCols("token") \
.setOutputCol("corrected") \
.setWordMaxDistance(3) \
.setBatchSize(24) \
.setEpochs(8) \
.setLanguageModelClasses(1650) # dependent on vocabulary size
# .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this
# Build the pipeline and train the context-aware spell checker.
# Stages: document assembly, tokenization, ContextSpellCheckerApproach.
# Training uses plain text: each line of sherlockholmes.txt becomes one row of the "text" column.
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path) \
.toDF("text")
pipelineModel = pipeline.fit(dataset)
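NorvigSweeting Spell Checker¶
NorvigSweeting is an automatic spelling correction annotator: it looks up each token and corrects it when it is not found in the supplied dictionary (an English one in this example).
The algorithm works on Damerau-Levenshtein distance and uses the symmetric delete technique, which greatly reduces the cost of generating edit candidates and looking them up in the dictionary. Compared with the conventional approach that enumerates deletions, transpositions, replacements, and insertions, it is faster by several orders of magnitude, and it is language independent.
Usage: provide a text file of correctly spelled words, one word per line, through setDictionary; the file is tokenized with a configurable regular expression.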
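# NorvigSweeting spell checker training example
# Shows how to train a NorvigSweeting spell checker with Spark NLP.
# Steps:
# 1. Import the Spark NLP modules.
# 2. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 3. Tokenizer: splits the text into tokens.
# 4. NorvigSweetingApproach: the Norvig-style spell checker; sets the input and output columns and the dictionary file.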
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
spellChecker = NorvigSweetingApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
# trainingData is reused from an earlier notebook cell; it must be a DataFrame with a "text" column.
pipelineModel = pipeline.fit(trainingData)
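SymmetricDelete Spell Checker¶
SymmetricDelete is an efficient spelling correction algorithm that generates candidate words with the symmetric delete technique and ranks them with a distance measure. It suits automatic spelling correction for English and many other languages.
Usage:
- Provide a text file of correctly spelled words, one word per line, through setDictionary; the file is tokenized with a configurable regular expression.
- The training flow is: text input, tokenization, then spell checker training.
The algorithm can significantly improve the speed and accuracy of spelling correction, which makes it a good fit for large-scale text processing.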
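The core trick of symmetric delete is to generate only deletions, both for dictionary words (once, ahead of time) and for the input token (at lookup time), and to match the two sets. The sketch below is a small pure-Python illustration of that idea under simplifying assumptions: `deletes`, `build_index`, and `lookup_candidates` are hypothetical helpers, the maximum distance is 1, and candidates are not re-scored by a true edit distance as a full implementation would do.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def deletes(word: str, max_distance: int = 1) -> Set[str]:
    """The word itself plus every string made by deleting up to max_distance characters."""
    results = {word}
    for k in range(1, max_distance + 1):
        for removed in combinations(range(len(word)), k):
            results.add("".join(c for i, c in enumerate(word) if i not in removed))
    return results


def build_index(dictionary: Iterable[str], max_distance: int = 1) -> Dict[str, Set[str]]:
    """Map each delete variant to the dictionary words that produce it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def lookup_candidates(token: str, index: Dict[str, Set[str]], max_distance: int = 1) -> Set[str]:
    """Dictionary words sharing at least one delete variant with the token."""
    found: Set[str] = set()
    for variant in deletes(token, max_distance):
        found |= index.get(variant, set())
    return found


index = build_index(["spelling", "spilling", "checker"])
print(lookup_candidates("speling", index))   # missing letter
print(lookup_candidates("chekcer", index))   # transposed letters
```

Because deletions are applied on both sides, insertions, substitutions, and transpositions in the input are all caught without ever enumerating them, which is where the speed-up comes from.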
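# SymmetricDelete spell checker training example
# Shows how to train a SymmetricDelete spell checker with Spark NLP.
# Steps:
# 1. Import the Spark NLP modules.
# 2. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 3. Tokenizer: splits the text into tokens.
# 4. SymmetricDeleteApproach: the symmetric-delete spell checker; sets the input and output columns and the dictionary file.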
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
spellChecker = SymmetricDeleteApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
# trainingData is reused from an earlier notebook cell; it must be a DataFrame with a "text" column.
pipelineModel = pipeline.fit(trainingData)
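Token Classification¶
The following trainable annotators train models that recognize named entities (NER) in text.
NerCrfApproach¶
A named entity recognition annotator based on the Conditional Random Fields (CRF) algorithm.
- Trains a general-purpose NER model with the CRF machine-learning algorithm.
- Training data must be a labeled Spark DataFrame (e.g. read from CoNLL 2003 IOB format) with columns of the following annotation types:
  - DOCUMENT
  - TOKEN
  - POS
  - WORD_EMBEDDINGS
  - NAMED_ENTITY (the label column)
Input annotator types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
Output annotator types: NAMED_ENTITY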
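Both the label column used for training and the NAMED_ENTITY output hold one IOB tag per token. To see what those tags encode, the following pure-Python sketch groups tagged tokens back into entity spans. `iob_to_entities` is a hypothetical helper for illustration (Spark NLP ships its own converter annotator for this purpose); the sample uses the first sentence of the CoNLL example shown earlier.

```python
from typing import List, Tuple


def iob_to_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then open a new one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-"):
            current_tokens.append(token)  # continue the open entity
        else:
            # "O" closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
labels = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
entities = iob_to_entities(tokens, labels)
print(entities)
```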
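# Named entity recognition (NER) training example
# Shows how to train a CRF-based named entity recognizer (NerCrfApproach) with Spark NLP.
# Preprocessing stages:
# 1. DocumentAssembler: turns raw text into DOCUMENT annotations, the pipeline input.
# 2. SentenceDetector: finds sentence boundaries and splits the text into sentences.
# 3. Tokenizer: splits each sentence into tokens.
# 4. PerceptronModel.pretrained: loads a pretrained perceptron POS tagger.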
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
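# The annotators below are only needed when building CoNLL-style training data from raw text.
# If the training data already has sentence, token, pos, and label columns, they need not be defined again.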
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
# Training stages:
# 1. WordEmbeddingsModel: loads pretrained word embeddings and maps each token to a vector (embeddings).
# 2. NerCrfApproach: the CRF-based NER trainer; sets the input columns (sentence, token, pos, embeddings), the label column (label), the epoch limits, and the output column (ner).
# 3. Pipeline: combines the embeddings and the NER trainer.
# 4. CoNLL().readDataset: loads the CoNLL 2003 training set with sentence, token, POS, and label columns.
# 5. pipeline.fit(trainingData): trains the NER model on that dataset.
# Then training can start:
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
nerTagger = NerCrfApproach() \
.setInputCols(["sentence", "token", "pos", "embeddings"]) \
.setLabelColumn("label") \
.setMinEpochs(1) \
.setMaxEpochs(3) \
.setOutputCol("ner")
pipeline = Pipeline().setStages([
embeddings,
nerTagger
])
# We use the sentences, tokens, POS tags and labels from the CoNLL dataset.
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
NerDLApproach¶
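NerDLApproach is a deep-learning named entity recognition (NER) annotator. It uses a Char CNNs - BiLSTM - CRF neural network architecture and performs strongly on widely used benchmark datasets.
Training data requirements:
- A Spark DataFrame in CoNLL 2003 IOB format
- Columns of the following annotation types are required:
- DOCUMENT (raw text)
- TOKEN (tokenization result)
- WORD_EMBEDDINGS (word vectors, for example from BertEmbeddings)
- NAMED_ENTITY (entity labels)
Commonly used components:
- SentenceDetector: sentence boundary detection
- Tokenizer: tokenization
- WordEmbeddingsModel or BertEmbeddings: word embeddings
Input annotator types: DOCUMENT, TOKEN, WORD_EMBEDDINGS
Output annotator type: NAMED_ENTITY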
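The IOB labels referred to above mark where each entity begins (`B-`) and continues (`I-`), with `O` for tokens outside any entity. As a rough illustration of how such labels map back to entity spans, here is a minimal pure-Python sketch. The `iob_to_spans` helper and the example sentence are assumptions made for illustration; inside a Spark NLP pipeline this grouping is normally handled by an annotator such as NerConverter rather than by hand-written code.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-labelled tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens = [token]
            current_type = label[2:]
        elif label.startswith("I-"):
            # Same entity type as the open span: extend it.
            current_tokens.append(token)
        else:
            # "O" label: close any open span.
            flush()
            current_tokens = []
            current_type = None
    flush()
    return spans


tokens = ["Peter", "Blackburn", "visited", "New", "York", "."]
labels = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
spans = iob_to_spans(tokens, labels)
print(spans)
```

An `I-` label that does not continue the open span is treated as the start of a new entity, which keeps the sketch tolerant of files that omit `B-` on the first token.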
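# Deep-learning NER training example (NerDLApproach)
# This block shows how to train a deep-learning NER tagger (NerDLApproach) with Spark NLP.
# The stages are:
# 1. DocumentAssembler: converts raw text into DOCUMENT annotations, the entry point of the pipeline.
# 2. SentenceDetector: detects sentence boundaries and splits the text into sentences.
# 3. Tokenizer: splits each sentence into tokens.
# 4. WordEmbeddings/BertEmbeddings: maps tokens to word vectors (embeddings).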
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# This CoNLL dataset already includes a sentence, token and label
# column with their respective annotator types. If a custom dataset is used,
# these need to be defined with for example:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
# 5. NerDLApproach: the deep-learning NER tagger; sets the input columns
#    (sentence, token, embeddings), the label column (label), the number of epochs, and the output column (ner).
# 6. Pipeline: combines the embeddings and the NER tagger into a pipeline for training and inference.
embeddings = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(1) \
    .setRandomSeed(0) \
    .setVerbose(0)
pipeline = Pipeline().setStages([
    embeddings,
    nerTagger
])
# We use the sentences, tokens, and labels from the CoNLL dataset.
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
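# Text classifier training example (ClassifierDLApproach)
# This block shows how to train a deep-learning text classifier (ClassifierDLApproach) with Spark NLP.
# The stages are:
# 1. Read the training set with text and label columns (CSV format).
# 2. DocumentAssembler: converts raw text into DOCUMENT annotations, the entry point of the pipeline.
# 3. UniversalSentenceEncoder: converts DOCUMENT annotations into sentence vectors (sentence_embeddings).
# 4. ClassifierDLApproach: the deep-learning text classifier; sets the input and output columns, label column, batch size, number of epochs, learning rate, and dropout.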
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
smallCorpus = spark.read.option("header","True").csv("src/test/resources/classifier/sentiment.csv")
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")
docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)
pipeline = Pipeline() \
    .setStages(
        [
            documentAssembler,
            useEmbeddings,
            docClassifier
        ]
    )
pipelineModel = pipeline.fit(smallCorpus)
ViveknSentimentApproach¶
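ViveknSentimentApproach is a sentiment analyzer inspired by Vivek Narayanan's algorithm (GitHub) and based on the paper "Fast and accurate sentiment classification using an enhanced Naive Bayes model".
Key characteristics:
- Uses sentence boundaries and tokenization so that scores are computed in context.
- Supports transitivity analysis, which suits sentiment judgments on more complex text.
- Training data must contain a normalized text column and a label column ("positive" or "negative").
Input annotator types: TOKEN, DOCUMENT
Output annotator type: SENTIMENT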
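To give a feel for the Naive Bayes idea behind this annotator, the sketch below implements a plain multinomial Naive Bayes classifier with add-one smoothing in pure Python. It is a simplified illustration only: it leaves out the enhancements described in the paper and does not reproduce Spark NLP's implementation. The `tokenize`, `train`, and `predict` helpers are hypothetical names, and the toy sentences are the ones used in the training example of this section.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a crude normalizer)."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Count word frequencies per label and documents per label."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior under Naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(word_counts):
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("I really liked this movie!", "positive"),
    ("The cast was horrible", "negative"),
    ("Never going to watch this again or recommend it to anyone", "negative"),
    ("It's a waste of time", "negative"),
    ("I loved the protagonist", "positive"),
    ("The music was really really good", "positive"),
]
model = train(training)
print(predict("I recommend this movie", model))
print(predict("Dont waste your time!!!", model))
```

With so little data the word counts decide everything, so the sketch is meant to show the scoring mechanics rather than realistic accuracy.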
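# ViveknSentimentApproach sentiment analysis training example
# This block shows how to train a sentiment analyzer based on the Vivekn algorithm with Spark NLP.
# The stages are:
# 1. DocumentAssembler: converts raw text into DOCUMENT annotations, the entry point of the pipeline.
# 2. Tokenizer: splits DOCUMENT annotations into tokens.
# 3. Normalizer: normalizes the tokens (for example, lowercasing and removing punctuation).
# 4. ViveknSentimentApproach: the Vivekn-based sentiment analyzer; sets the input columns, the label column, and the output column.
# 5. Finisher: converts the sentiment annotations into an easy-to-read format and outputs the final sentiment labels.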
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
token = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")
vivekn = ViveknSentimentApproach() \
    .setInputCols(["document", "normal"]) \
    .setSentimentCol("train_sentiment") \
    .setOutputCol("result_sentiment")
finisher = Finisher() \
    .setInputCols(["result_sentiment"]) \
    .setOutputCols("final_sentiment")
# Build the sentiment analysis pipeline, then train and run inference
# 1. Define the pipeline stages: text input, tokenization, normalization, sentiment analysis, result output.
# 2. Create the training set with text and sentiment labels (positive/negative).
# 3. Train the sentiment model with Pipeline.fit().
# 4. Create a test set and predict its sentiment.
# 5. Show the final sentiment results.
pipeline = Pipeline().setStages([document, token, normalizer, vivekn, finisher])
training = spark.createDataFrame([
    ("I really liked this movie!", "positive"),
    ("The cast was horrible", "negative"),
    ("Never going to watch this again or recommend it to anyone", "negative"),
    ("It's a waste of time", "negative"),
    ("I loved the protagonist", "positive"),
    ("The music was really really good", "positive")
]).toDF("text", "train_sentiment")
pipelineModel = pipeline.fit(training)
data = spark.createDataFrame([
    ["I recommend this movie"],
    ["Dont waste your time!!!"]
]).toDF("text")
result = pipelineModel.transform(data)
result.select("final_sentiment").show(truncate=False)
+---------------+
|final_sentiment|
+---------------+
|[positive]     |
|[negative]     |
+---------------+
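# Document vector training example (Doc2VecApproach)
# This block shows how to train a Doc2Vec document-vector model with Spark NLP.
# The stages are:
# 1. DocumentAssembler: converts raw text into DOCUMENT annotations, the entry point of the pipeline.
# 2. Tokenizer: splits DOCUMENT annotations into tokens.
# 3. Doc2VecApproach: trains the document-vector model on the tokens and writes the embeddings column.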
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
embeddings = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")
pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        embeddings
    ])
path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
Word2VecApproach¶
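Word2Vec trains a word-embedding model that maps every word in the corpus to a dense vector. It is widely used in NLP and machine learning tasks.
- Training: the vocabulary is built from the corpus first, then each word's vector is learned with the skip-gram model and hierarchical softmax.
- The word vectors can serve as features for downstream models, for example in text classification, clustering, and similarity computation.
- The Word2Vec implemented in Spark ML supports custom parameters, trains efficiently, and is easy to integrate.
- To load a pretrained model or work with an already instantiated one, see Word2VecModel.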
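The skip-gram step mentioned above trains the model to predict the words around each center word. The pure-Python sketch below shows only how (center, context) training pairs could be generated from a token list; it does not implement hierarchical softmax or any actual training, and the `skipgram_pairs` helper and its `window` parameter are illustrative assumptions rather than Spark NLP or Spark ML APIs.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger window yields more pairs per center word, which trades training time for broader context.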
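# Word vector training example (Word2VecApproach)
# This block shows how to train a Word2Vec word-embedding model with Spark NLP.
# The stages are:
# 1. DocumentAssembler: converts raw text into DOCUMENT annotations, the entry point of the pipeline.
# 2. Tokenizer: splits DOCUMENT annotations into tokens.
# 3. Word2VecApproach: trains the word-embedding model on the tokens and writes the embeddings column.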
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
embeddings = Word2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")
pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        embeddings
    ])
path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
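Once word vectors are available, the similarity computation mentioned in this section is commonly done with cosine similarity. The following is a minimal pure-Python sketch under stated assumptions: the three-dimensional vectors are made-up values for illustration, not output of the pipeline above, and `cosine_similarity` is a hypothetical helper rather than a Spark NLP function.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for learned embeddings.
vectors = {
    "holmes": [0.9, 0.1, 0.3],
    "watson": [0.8, 0.2, 0.4],
    "violin": [0.1, 0.9, 0.2],
}

print(cosine_similarity(vectors["holmes"], vectors["watson"]))
print(cosine_similarity(vectors["holmes"], vectors["violin"]))
```

Values close to 1 indicate vectors pointing in nearly the same direction, which is usually read as the words appearing in similar contexts.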