Class QueryAutoStopWordAnalyzer

java.lang.Object
org.apache.lucene.analysis.Analyzer
org.apache.lucene.analysis.AnalyzerWrapper
org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer
All Implemented Interfaces:
Closeable, AutoCloseable

public final class QueryAutoStopWordAnalyzer extends AnalyzerWrapper
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.

For very large indexes, the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38-million-document index in which one term appeared in around 50% of the documents, causing TermQueries for that term to take 2 seconds.

Since:
3.1
  • Field Details

  • Constructor Details

    • QueryAutoStopWordAnalyzer

      public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader) throws IOException
      Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
      Parameters:
      delegate - Analyzer whose TokenStream will be filtered
      indexReader - IndexReader to identify the stopwords from
      Throws:
      IOException - Can be thrown while reading from the IndexReader
    • QueryAutoStopWordAnalyzer

      public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, int maxDocFreq) throws IOException
      Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
      Parameters:
      delegate - Analyzer whose TokenStream will be filtered
      indexReader - IndexReader to identify the stopwords from
      maxDocFreq - Document frequency threshold; a term whose document frequency is greater than this value is treated as a stopword
      Throws:
      IOException - Can be thrown while reading from the IndexReader
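      The cutoff described above can be illustrated with a minimal plain-Java sketch. It is a simplified model, not Lucene's implementation: the class name DocFreqStopWords and the in-memory map of document frequencies are hypothetical stand-ins for what the analyzer reads from the IndexReader. The comparison is strict, matching the "greater than" wording above.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the maxDocFreq cutoff; not part of Lucene.
final class DocFreqStopWords {

    /** Returns the terms whose document frequency is strictly greater than maxDocFreq. */
    static Set<String> select(Map<String, Integer> docFreqByTerm, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<>(); // sorted for stable output
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = Map.of("the", 950, "lucene", 120, "analyzer", 40);
        System.out.println(select(docFreq, 500)); // prints [the]
    }
}
```

      A term whose document frequency equals maxDocFreq exactly is kept, since only values above the threshold qualify.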
    • QueryAutoStopWordAnalyzer

      public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, float maxPercentDocs) throws IOException
      Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
      Parameters:
      delegate - Analyzer whose TokenStream will be filtered
      indexReader - IndexReader to identify the stopwords from
      maxPercentDocs - The maximum fraction (between 0.0 and 1.0) of index documents that may contain a term; a term found in a larger share of documents is considered a stop word
      Throws:
      IOException - Can be thrown while reading from the IndexReader
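      Because maxPercentDocs is expressed on a 0.0 to 1.0 scale, it has to be related to the number of documents in the index before it can be compared with a term's document frequency. The sketch below shows one way to do that. It assumes the product is simply truncated to an int; that rounding choice, and the name PercentThreshold, are assumptions of this example rather than documented behavior.

```java
// Hypothetical helper; not part of Lucene.
final class PercentThreshold {

    /**
     * Converts a fraction of documents (0.0 to 1.0) into an absolute
     * document-frequency cutoff. Assumption: the product is truncated toward zero.
     */
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }

    public static void main(String[] args) {
        System.out.println(toMaxDocFreq(1000, 0.5f)); // prints 500
        System.out.println(toMaxDocFreq(1001, 0.5f)); // prints 500 (500.5 truncated)
    }
}
```

      Under this model, a value of 0.5f on a 1000-document index treats any term found in more than 500 documents as a stop word.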
    • QueryAutoStopWordAnalyzer

      public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, float maxPercentDocs) throws IOException
      Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
      Parameters:
      delegate - Analyzer whose TokenStream will be filtered
      indexReader - IndexReader to identify the stopwords from
      fields - Selection of fields to calculate stopwords for
      maxPercentDocs - The maximum fraction (between 0.0 and 1.0) of index documents that may contain a term; a term found in a larger share of documents is considered a stop word
      Throws:
      IOException - Can be thrown while reading from the IndexReader
    • QueryAutoStopWordAnalyzer

      public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, int maxDocFreq) throws IOException
      Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
      Parameters:
      delegate - Analyzer whose TokenStream will be filtered
      indexReader - IndexReader to identify the stopwords from
      fields - Selection of fields to calculate stopwords for
      maxDocFreq - Document frequency threshold; a term whose document frequency is greater than this value is treated as a stopword
      Throws:
      IOException - Can be thrown while reading from the IndexReader
  • Method Details

    • getWrappedAnalyzer

      protected Analyzer getWrappedAnalyzer(String fieldName)
      Description copied from class: AnalyzerWrapper
      Retrieves the wrapped Analyzer appropriate for analyzing the field with the given name
      Specified by:
      getWrappedAnalyzer in class AnalyzerWrapper
      Parameters:
      fieldName - Name of the field which is to be analyzed
      Returns:
      Analyzer for the field with the given name. Assumed to be non-null
    • wrapComponents

      protected Analyzer.TokenStreamComponents wrapComponents(String fieldName, Analyzer.TokenStreamComponents components)
      Description copied from class: AnalyzerWrapper
      Wraps / alters the given TokenStreamComponents, taken from the wrapped Analyzer, to form new components. It is through this method that new TokenFilters can be added by AnalyzerWrappers. By default, the given components are returned.
      Overrides:
      wrapComponents in class AnalyzerWrapper
      Parameters:
      fieldName - Name of the field which is to be analyzed
      components - TokenStreamComponents taken from the wrapped Analyzer
      Returns:
      Wrapped / altered TokenStreamComponents.
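      For this analyzer, the natural use of the hook is to drop the stop words identified for the field being analyzed. The plain-Java sketch below models that idea on lists of strings instead of a TokenStream. The class PerFieldStopFilter is hypothetical, and passing tokens through unchanged for a field with no stop words is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of per-field stop word filtering; not part of Lucene.
final class PerFieldStopFilter {
    private final Map<String, Set<String>> stopWordsPerField;

    PerFieldStopFilter(Map<String, Set<String>> stopWordsPerField) {
        this.stopWordsPerField = stopWordsPerField;
    }

    /** Removes the field's stop words from tokens; fields without stop words are left untouched. */
    List<String> filter(String fieldName, List<String> tokens) {
        Set<String> stopWords = stopWordsPerField.get(fieldName);
        if (stopWords == null) {
            return tokens; // nothing identified for this field
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        PerFieldStopFilter filter = new PerFieldStopFilter(Map.of("body", Set.of("the")));
        System.out.println(filter.filter("body", List.of("the", "quick", "fox")));  // prints [quick, fox]
        System.out.println(filter.filter("title", List.of("the", "quick", "fox"))); // prints [the, quick, fox]
    }
}
```

      Keeping the stop sets per field means a word that is very common in one field can still be searched in another field where it is rare.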
    • getStopWords

      public String[] getStopWords(String fieldName)
      Provides information on which stop words have been identified for a field
      Parameters:
      fieldName - The field whose identified stop words will be returned
      Returns:
      the stop words identified for a field
    • getStopWords

      public Term[] getStopWords()
      Provides information on which stop words have been identified for all fields
      Returns:
      the stop words (as terms)
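      The two getStopWords variants differ in shape: one returns plain strings for a single field, the other returns every stop word paired with its field. The sketch below models the second form by flattening a per-field map into "field:text" strings. StopWordListing and the "field:text" rendering are hypothetical stand-ins for Term objects, and the sorted field order is a choice made here for predictable output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of listing stop words across all fields; not part of Lucene.
final class StopWordListing {

    /** Flattens per-field stop words into "field:text" entries, ordered by field name. */
    static List<String> allStopWords(Map<String, List<String>> stopWordsPerField) {
        List<String> all = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : new TreeMap<>(stopWordsPerField).entrySet()) {
            for (String text : entry.getValue()) {
                all.add(entry.getKey() + ":" + text);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, List<String>> perField =
                Map.of("body", List.of("the", "of"), "title", List.of("a"));
        System.out.println(allStopWords(perField)); // prints [body:the, body:of, title:a]
    }
}
```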