Using the Paoding (庖丁解牛) Toolkit for Chinese Word Segmentation

Blog category: Java
When working with the Lucene toolkit, the two general-purpose analyzers it ships with, SimpleAnalyzer and StandardAnalyzer, both have very weak support for Chinese, so Chinese-specific segmentation rules need to be added. I came across Qieqie's Paoding (庖丁解牛) analyzer and downloaded it to solve my Chinese word segmentation problem. In my tests, however, it kept failing to find the dictionary directory. Fortunately the source is available, so I read through the class that loads the resource files and found the likely cause: the dictionary path was not configured correctly. After fixing the configuration I repackaged the JAR and everything worked. I hope anyone who hits the same problem can correct it just as quickly.
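Before repackaging, it can help to confirm which paoding-dic-home.properties the classloader actually resolves, since a misplaced copy is a common cause of the missing-dictionary error. A minimal diagnostic sketch (the resource name is the one paoding ships with; the class name here is my own):

import java.net.URL;

public class PaodingConfigCheck {
    public static void main(String[] args) {
        // Ask the classloader where paoding-dic-home.properties comes from.
        // If this prints null, the file is not on the classpath at all.
        URL url = Thread.currentThread().getContextClassLoader()
                .getResource("paoding-dic-home.properties");
        System.out.println("paoding-dic-home.properties resolved to: " + url);
    }
}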
Step 1: download the Paoding toolkit, paoding-analysis-2.0.4-alpha2.
Download: http://code.google.com/p/paoding/downloads/list
After unpacking it, create a new project, copy the src files into the project directory, and edit paoding-dic-home.properties:
#values are "system-env" or "this";
#if value is "this" , using the paoding.dic.home as dicHome if configed!
paoding.dic.home.config-fisrt=this

#dictionary home (directory)
#"classpath:xxx" means dictionary home is in classpath.
#e.g "classpath:dic" means dictionaries are in "classes/dic" directory or any other classpath directory
paoding.dic.home=D:/data/paoding/dic

#seconds for dic modification detection
#paoding.dic.detector.interval=60

This property specifies the path to the dictionary directory; here it is given as an absolute path (written with forward slashes, since backslashes act as escape characters in .properties files). Then rebuild paoding-analysis.jar using the bundled build.xml.
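Alternatively, as the comments in the file itself note, the dictionaries can be bundled on the classpath instead of referenced by an absolute path; a sketch of that variant (assuming the dic directory has been copied under classes/):

#dictionaries live in classes/dic (or any other classpath directory)
paoding.dic.home=classpath:dic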

Step 2: put this paoding-analysis.jar into my project's lib directory, alongside lucene-core-2.3.0.jar, lucene-demos-2.3.0.jar, and lucene-highlighter-2.2.0.jar.
I then wrote two simple methods to test whether Chinese text can be segmented successfully, and how accurate the search results are:

import java.io.File;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import com.biaoqi.ibs.base.BaseDbop;

public class LuceneDemo {

    // Instantiate the Paoding Chinese analyzer
    Analyzer analyzer = new PaodingAnalyzer();

    public void createIndex() throws Exception {
        File indexDir = new File("D:\\PgmInfIndex");
        List result = getSearchData();
        IndexWriter writer = new IndexWriter(indexDir, analyzer);
        long start = new Date().getTime();

        if (result != null && result.size() > 0) {
            Iterator it = result.iterator();
            while (it.hasNext()) {
                HashMap map = (HashMap) it.next();
                Document doc = new Document();
                // PGM_ID and PGM_PLAY_DURATION are stored but not indexed;
                // PGM_NAME and PGM_LOGO are tokenized for full-text search
                doc.add(new Field("PGM_ID", map.get("PGM_ID").toString(),
                        Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
                doc.add(new Field("PGM_NAME", map.get("PGM_NAME").toString(),
                        Field.Store.YES, Field.Index.TOKENIZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS));
                doc.add(new Field("PGM_PLAY_DURATION",
                        map.get("PGM_PLAY_DURATION").toString(),
                        Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
                doc.add(new Field("PGM_LOGO", map.get("PGM_LOGO").toString(),
                        Field.Store.YES, Field.Index.TOKENIZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS));

                writer.addDocument(doc);
            }
        }

        long end = new Date().getTime();

        System.out.println("Indexing all rows of IBS_PGM_INF took: "
                + (end - start) + " milliseconds");

        writer.optimize();
        writer.close();
    }

    public void search() throws Exception {
        File indexDir = new File("D:\\PgmInfIndex");
        String q = "凶铃";
        FSDirectory directory = FSDirectory.getDirectory(indexDir);
        Directory fsDir = directory;
        IndexSearcher is = new IndexSearcher(fsDir); // open the index
        // Term term = new Term("PGM_LOGO", q);
        // TermQuery luceneQuery = new TermQuery(term);
        QueryParser parser = new QueryParser("PGM_LOGO", analyzer);
        Query query = parser.parse(q);
        long start = new Date().getTime();
        Hits hits = is.search(query); // search the index
        long end = new Date().getTime();
        System.err.println("Found " + hits.length() + " document(s) (in "
                + (end - start) + " milliseconds) that matched query '" + q
                + "':");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i); // fetch the matching document
            System.out.println(doc.get("PGM_ID") + " --- "
                    + doc.get("PGM_LOGO"));
        }
        is.close();
    }

    public List getSearchData() {
        StringBuffer sqlBuf = new StringBuffer();
        sqlBuf.append("select PGM_ID,PGM_NAME,PGM_PLAY_DURATION,PGM_LOGO "
                + "from IBS_PGM_INF order by pgm_id");
        return BaseDbop.dbSearch(sqlBuf.toString());
    }
}
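To exercise the two methods end to end, a minimal driver like the following can be added to LuceneDemo (this main method is my own sketch; it assumes BaseDbop and its database are reachable):

public static void main(String[] args) throws Exception {
    LuceneDemo demo = new LuceneDemo();
    demo.createIndex(); // build the index from the IBS_PGM_INF table
    demo.search();      // run the sample query "凶铃" against PGM_LOGO
}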