Using the Paoding (庖丁解牛) Toolkit for Chinese Word Segmentation

Blog category: Java
When working with the Lucene toolkit, the two general-purpose analyzers it ships with, SimpleAnalyzer and StandardAnalyzer, both have very weak support for Chinese, so Chinese-specific segmentation rules need to be added. I came across Qieqie's Paoding (庖丁解牛) analyzer and downloaded it to solve my Chinese word segmentation problem. In my tests, however, it kept failing to find the dictionary directory. Fortunately the source is available, so I read through the class that loads the resource files and found the likely cause: the dictionary path was not configured correctly. After fixing the configuration I repackaged the JAR and everything worked. I hope anyone who hits the same problem can correct it just as quickly.
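Before repackaging, it can help to confirm which paoding-dic-home.properties the classloader actually resolves, since a misplaced copy is a common cause of the missing-dictionary error. A minimal diagnostic sketch (the resource name is the one paoding ships with; the class name here is my own):

import java.net.URL;

public class PaodingConfigCheck {
    public static void main(String[] args) {
        // Ask the classloader where paoding-dic-home.properties comes from.
        // If this prints null, the file is not on the classpath at all.
        URL url = Thread.currentThread().getContextClassLoader()
                .getResource("paoding-dic-home.properties");
        System.out.println("paoding-dic-home.properties resolved to: " + url);
    }
}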
Step 1: download the Paoding toolkit, paoding-analysis-2.0.4-alpha2.
Download: http://code.google.com/p/paoding/downloads/list
After unpacking it, create a new project, copy the src files into the project directory, and edit paoding-dic-home.properties:
#values are "system-env" or "this";
#if value is "this" , using the paoding.dic.home as dicHome if configed!
paoding.dic.home.config-fisrt=this

#dictionary home (directory)
#"classpath:xxx" means dictionary home is in classpath.
#e.g "classpath:dic" means dictionaries are in "classes/dic" directory or any other classpath directory
paoding.dic.home=D:/data/paoding/dic

#seconds for dic modification detection
#paoding.dic.detector.interval=60

This property specifies the path to the dictionary directory; here it is given as an absolute path (written with forward slashes, since backslashes act as escape characters in .properties files). Then rebuild paoding-analysis.jar using the bundled build.xml.
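Alternatively, as the comments in the file itself note, the dictionaries can be bundled on the classpath instead of referenced by an absolute path; a sketch of that variant (assuming the dic directory has been copied under classes/):

#dictionaries live in classes/dic (or any other classpath directory)
paoding.dic.home=classpath:dic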

Step 2: put this paoding-analysis.jar into my project's lib directory, alongside lucene-core-2.3.0.jar, lucene-demos-2.3.0.jar, and lucene-highlighter-2.2.0.jar.
I then wrote two simple methods to test whether Chinese text can be segmented successfully, and how accurate the search results are:

import java.io.File;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import com.biaoqi.ibs.base.BaseDbop;

public class LuceneDemo {

    // Instantiate the Paoding Chinese analyzer
    Analyzer analyzer = new PaodingAnalyzer();

    public void createIndex() throws Exception {
        File indexDir = new File("D:\\PgmInfIndex");
        List result = getSearchData();
        IndexWriter writer = new IndexWriter(indexDir, analyzer);
        long start = new Date().getTime();

        if (result != null && result.size() > 0) {
            Iterator it = result.iterator();
            while (it.hasNext()) {
                HashMap map = (HashMap) it.next();
                Document doc = new Document();
                // PGM_ID and PGM_PLAY_DURATION are stored but not indexed;
                // PGM_NAME and PGM_LOGO are tokenized for full-text search
                doc.add(new Field("PGM_ID", map.get("PGM_ID").toString(),
                        Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
                doc.add(new Field("PGM_NAME", map.get("PGM_NAME").toString(),
                        Field.Store.YES, Field.Index.TOKENIZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS));
                doc.add(new Field("PGM_PLAY_DURATION",
                        map.get("PGM_PLAY_DURATION").toString(),
                        Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
                doc.add(new Field("PGM_LOGO", map.get("PGM_LOGO").toString(),
                        Field.Store.YES, Field.Index.TOKENIZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS));

                writer.addDocument(doc);
            }
        }

        long end = new Date().getTime();

        System.out.println("Indexing all rows of IBS_PGM_INF took: "
                + (end - start) + " milliseconds");

        writer.optimize();
        writer.close();
    }

    public void search() throws Exception {
        File indexDir = new File("D:\\PgmInfIndex");
        String q = "凶铃";
        FSDirectory directory = FSDirectory.getDirectory(indexDir);
        Directory fsDir = directory;
        IndexSearcher is = new IndexSearcher(fsDir); // open the index
        // Term term = new Term("PGM_LOGO", q);
        // TermQuery luceneQuery = new TermQuery(term);
        QueryParser parser = new QueryParser("PGM_LOGO", analyzer);
        Query query = parser.parse(q);
        long start = new Date().getTime();
        Hits hits = is.search(query); // search the index
        long end = new Date().getTime();
        System.err.println("Found " + hits.length() + " document(s) (in "
                + (end - start) + " milliseconds) that matched query '" + q
                + "':");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i); // fetch the matching document
            System.out.println(doc.get("PGM_ID") + " --- "
                    + doc.get("PGM_LOGO"));
        }
        is.close();
    }

    public List getSearchData() {
        StringBuffer sqlBuf = new StringBuffer();
        sqlBuf.append("select PGM_ID,PGM_NAME,PGM_PLAY_DURATION,PGM_LOGO "
                + "from IBS_PGM_INF order by pgm_id");
        return BaseDbop.dbSearch(sqlBuf.toString());
    }
}
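To exercise the two methods end to end, a minimal driver like the following can be added to LuceneDemo (this main method is my own sketch; it assumes BaseDbop and its database are reachable):

public static void main(String[] args) throws Exception {
    LuceneDemo demo = new LuceneDemo();
    demo.createIndex(); // build the index from the IBS_PGM_INF table
    demo.search();      // run the sample query "凶铃" against PGM_LOGO
}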