Introduction to HanLP, a Java Word Segmentation Tool

Updated: 2021-10-26 10:06:40  Source: 赢咖4

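HanLP is a Java toolkit made up of a series of models and algorithms. Its goal is to bring natural language processing into wider use in production environments. It goes beyond word segmentation and also provides lexical analysis, syntactic parsing, semantic understanding and other functions. HanLP is feature-complete, efficient, cleanly structured, built on up-to-date corpora, and customizable.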

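HanLP is fully open source, including its dictionaries. It does not depend on any other JAR. Under the hood it uses a set of high-speed data structures such as the double-array trie, DAWG and AhoCorasickDoubleArrayTrie, and these base components are open source as well.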
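To give a feel for why a trie suits dictionary-based segmentation, here is a minimal sketch. It is not HanLP's implementation: it uses a plain HashMap-based trie instead of a double-array trie, and the class name `TrieSegmenterSketch` and the three-word dictionary are made up for illustration. It applies forward maximum matching, taking the longest dictionary word at each position and falling back to a single character.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy dictionary segmenter: a HashMap-based trie plus forward maximum matching. */
class TrieSegmenterSketch {

    /** One trie node; children are keyed by the next character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a dictionary word
    }

    private final Node root = new Node();

    /** Adds a word to the dictionary. */
    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** At each position, emits the longest dictionary word, or one character if none matches. */
    public List<String> segment(String text) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int longestEnd = i + 1; // fallback: a single character
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) break;             // no dictionary word continues this way
                if (node.isWord) longestEnd = j + 1; // remember the longest match so far
            }
            result.add(text.substring(i, longestEnd));
            i = longestEnd;
        }
        return result;
    }

    public static void main(String[] args) {
        TrieSegmenterSketch trie = new TrieSegmenterSketch();
        for (String w : new String[]{"中国", "首都", "北京"}) trie.add(w);
        System.out.println(trie.segment("中国的首都是北京")); // prints [中国, 的, 首都, 是, 北京]
    }
}
```

A double-array trie stores the same transitions in two flat integer arrays, which is generally much more compact and cache-friendly than one map per node; the lookup logic stays the same in spirit.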

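Through the utility class HanLP, every function can be called in a single statement; the documentation is detailed and the library works out of the box. The underlying algorithms have been carefully optimized: in its fastest segmentation mode it can reach 20 million characters per second while needing only 120 MB of memory. On the I/O side, dictionaries load very quickly, and a fast start takes only 500 ms.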
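Throughput figures like the one above depend on hardware, JVM and input, so it can be worth measuring on your own text. The following is a rough sketch of such a measurement, not a rigorous benchmark. The class `ThroughputSketch` and the whitespace splitter are stand-ins invented for this example; in a project with HanLP on the classpath, a method reference such as `HanLP::segment` could presumably be passed in its place.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Rough characters-per-second measurement for any String-to-list segmenter. */
class ThroughputSketch {

    // Written after the timed loop so the JIT cannot discard the segmenter's work.
    static volatile long sink;

    /** Runs `rounds` warm-up passes, then times `rounds` passes over `text`. */
    public static double charsPerSecond(Function<String, ? extends List<?>> segmenter,
                                        String text, int rounds) {
        for (int i = 0; i < rounds; i++) {
            segmenter.apply(text); // warm-up, gives the JIT a chance to compile the hot path
        }
        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            total += segmenter.apply(text).size();
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        sink = total;
        return (double) text.length() * rounds * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        // Stand-in segmenter (assumption): splits on whitespace.
        Function<String, List<String>> whitespace = s -> Arrays.asList(s.split("\\s+"));
        String text = "a short sample sentence repeated many times";
        System.out.printf("%.0f chars/s%n", charsPerSecond(whitespace, text, 100_000));
    }
}
```

For numbers you intend to publish, a dedicated harness such as JMH is a safer choice than a hand-written loop.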

POM file

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.iqilu</groupId>
  <artifactId>Segment</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>Hello</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.hankcs</groupId>
      <artifactId>hanlp</artifactId>
      <version>portable-1.3.2</version>
    </dependency>
  </dependencies>
</project>

DemoSegment.java

package com.iqilu;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class DemoSegment {
    public static void main(String[] args) {
        // Chinese test sentences chosen for their segmentation ambiguity (English glosses in comments).
        String[] testCase = new String[]{
                "商品和服务", // Goods and services
                "结婚的和尚未结婚的确实在干扰分词啊", // The married and the not-yet-married really do interfere with segmentation
                "买水果然后来世博园最后去世博园", // Buy fruit, then come to the Expo park, and finally go to the Expo park
                "中国的首都是北京", // China's capital is Beijing
                "欢迎新老师生前来就餐", // New and returning teachers and students are welcome to come and dine
                "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作", // Each month, when passing through the subordinate departments, the female officer of the industry and information office personally gives instructions on installing technical devices such as 24-port switches
                "随着页游兴起到现在的页游繁盛,依赖于存档进行逻辑判断的设计减少了,但这块也不能完全忽略掉。", // From the rise of web games to their current boom, designs that rely on saved data for logic checks have become rarer, but this part cannot be ignored entirely
        };
        for (String sentence : testCase) {
            // Segment the sentence and print each term as word/part-of-speech tag.
            List<Term> termList = HanLP.segment(sentence);
            System.out.println(termList);
        }
    }
}

Result

The tokens below are shown as English glosses of the Chinese words; the part-of-speech tag after each slash is unchanged.

[Products/n, and/c, services/vn]
[Married/v, of/uj, and/c, not yet/d, married/v, of/uj, indeed/ad, at/p, interference/v, participle/n, ah/y]
[Buy/v, fruit/n, then/c, come/v, Expo/j, finally/f, go/v, Expo/j]
[China/ns, of/uj, capital/n, yes/v, Beijing/ns]
[Welcome/v, new/a, teacher/n, before death/t, come/v, dinner/v]
[Industry and Information Office/n, female/b, secretary/n, monthly/r, passing/p, subordinate/v, department/n, all/nr, personally/d, explain/v, 24/m, port/q, switch/n, etc/u, technical/n, device/n, of/uj, installation/v, work/vn]
[With/p, page/q, youxing/n, from/v, to/v, now/t, of/uj, page tour/nz, flourishing/an, ,/w, depend on/v, archive/vn, proceed/v, logic/n, judge/v, of/uj, design/vn, reduce/v, up/ul, ,/w, but/c, this piece of/r, also/d, cannot/v, completely/ad, ignore/v, drop/v, ./w]
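Each item in the printed lists has the form word/tag, where the tag is a part-of-speech label (for example n for noun, ns for place name, v for verb). As a small illustration of working with that format, the sketch below parses such items and keeps the words whose tag starts with a given prefix. The `Token` class and the method names are hypothetical stand-ins written for this example, not part of HanLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Parses "word/tag" items and filters them by part-of-speech tag prefix. */
class TaggedTokenSketch {

    /** Hypothetical stand-in for a segmented term: surface word plus tag. */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Splits at the last '/', so a word that itself contains '/' stays intact. */
    public static Token parse(String item) {
        int slash = item.lastIndexOf('/');
        if (slash < 0) {
            return new Token(item, ""); // no tag present
        }
        return new Token(item.substring(0, slash), item.substring(slash + 1));
    }

    /** Returns the words whose tag starts with `prefix`, e.g. "n" for the noun family. */
    public static List<String> wordsWithTagPrefix(List<String> items, String prefix) {
        List<String> words = new ArrayList<>();
        for (String item : items) {
            Token token = parse(item);
            if (token.tag.startsWith(prefix)) {
                words.add(token.word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("China/ns", "of/uj", "capital/n", "yes/v", "Beijing/ns");
        System.out.println(wordsWithTagPrefix(items, "n")); // prints [China, capital, Beijing]
    }
}
```

When HanLP itself is available, reading the fields of each Term object directly is likely preferable to parsing printed text; the string form is used here only to keep the sketch free of dependencies.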

A Java word segmentation tool is only one of many Java development tools, and you will come across more related topics later on.
