sitepeople.blogg.se - Document similarity apache lucene

DOCUMENT SIMILARITY APACHE LUCENE CODE

RetrieveInterestingTerms(System.IO.TextReader r)

Convenience routine to make it easy to return the most interesting words in a document. The result is a priority queue of arrays with one entry for every word in the document.
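To make "most interesting words" concrete, here is a minimal, dependency-free sketch of the idea: count how often each word occurs in the text, weight it by how rare it is across the index, and pull the best candidates off a priority queue. The class name InterestingTermsSketch, the method topTerms, the whitespace/punctuation tokenizer, and the idf formula are all assumptions made for illustration; this is not the Lucene implementation, which works on an IndexReader and an Analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of Lucene or Lucene.NET.
class InterestingTermsSketch {

    /** Pairs a term with its score; a higher score means "more interesting". */
    static final class ScoredTerm {
        final String term;
        final double score;

        ScoredTerm(String term, double score) {
            this.term = term;
            this.score = score;
        }
    }

    /**
     * Returns up to maxTerms words from text, best first.
     * docFreq maps a word to the number of indexed documents containing it
     * (assumed to be supplied by the caller); numDocs is the index size.
     */
    static List<String> topTerms(String text, Map<String, Integer> docFreq,
                                 int numDocs, int maxTerms) {
        // Term frequency: how often each word occurs in the input text.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                termFreq.merge(token, 1, Integer::sum);
            }
        }

        // One queue entry per distinct word, ordered by descending score,
        // with ties broken alphabetically so the output is deterministic.
        PriorityQueue<ScoredTerm> queue = new PriorityQueue<>((a, b) -> {
            int byScore = Double.compare(b.score, a.score);
            return byScore != 0 ? byScore : a.term.compareTo(b.term);
        });
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Assumed idf formula: rare words get a larger weight.
            double idf = Math.log(numDocs / (double) (df + 1)) + 1.0;
            queue.add(new ScoredTerm(e.getKey(), e.getValue() * idf));
        }

        // Drain the best entries from the queue.
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty() && result.size() < maxTerms) {
            result.add(queue.poll().term);
        }
        return result;
    }
}
```

A word that is frequent in the text but rare in the index ends up at the head of the queue, while a word that occurs in nearly every document scores low even if it is repeated many times.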

Return a query that will return docs like the passed Reader.

Describe the parameters that control how the "more like this" query is formed.

Find words for a more-like-this query former.

Return a query that will return docs like the passed stream.

Return a query that will return docs like the passed URL.

Return a query that will return docs like the passed file.

Return a query that will return docs like the passed Lucene document ID.


Set the set of stop words. Any word in this set is considered "uninteresting" and ignored. Even if your Analyzer allows stop words, you might want to tell the MoreLikeThis code to ignore them, since for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".

Returns the field names that will be used when generating the 'More Like This' query. The default field names are DEFAULT_FIELD_NAMES.

Sets the field names that will be used when generating the 'More Like This' query. Set this to null for the field names to be determined at runtime from the IndexReader provided in the constructor.

Set the maximum percentage of documents in which a word may still appear. Words that appear in more than this percentage of all docs will be ignored.
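The stop-word set and the maximum document percentage both act as filters on candidate words. A minimal sketch of that filtering, assuming the percentage is first converted into an absolute document count and that a word is dropped only when it appears in more documents than that count; the class TermFilterSketch and the method isIgnored are hypothetical names used for illustration, not Lucene API:

```java
import java.util.Set;

// Hypothetical illustration only; not part of Lucene or Lucene.NET.
class TermFilterSketch {

    /**
     * Returns true when a word should be left out of the similarity query.
     *
     * stopWords     words always treated as uninteresting (may be null)
     * docFreq       number of indexed documents containing the word
     * numDocs       total number of documents in the index
     * maxDocFreqPct highest percentage of documents a word may appear in
     */
    static boolean isIgnored(String term, Set<String> stopWords,
                             int docFreq, int numDocs, int maxDocFreqPct) {
        // A stop word is never interesting, whatever its frequency.
        if (stopWords != null && stopWords.contains(term)) {
            return true;
        }
        // Convert the percentage into an absolute document count (assumed rounding: truncation).
        int maxDocFreq = (int) ((long) maxDocFreqPct * numDocs / 100);
        // Ignore words that appear in more documents than the limit allows.
        return docFreq > maxDocFreq;
    }
}
```

With 100 documents and a limit of 50 percent, a word found in 60 documents is ignored, while a word found in exactly 50 documents is still kept.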

Generate "more like this" similarity queries.