DeeAnalyzer

DeeAnalyzer — Primary gateway for data indexing

Synopsis

#include <dee.h>

struct              DeeAnalyzer;
struct              DeeAnalyzerClass;
gchar *             (*DeeCollatorFunc)                  (const gchar *input,
                                                         gpointer data);
void                (*DeeTermFilterFunc)                (DeeTermList *terms_in,
                                                         DeeTermList *terms_out,
                                                         gpointer filter_data);
void                dee_analyzer_add_term_filter        (DeeAnalyzer *self,
                                                         DeeTermFilterFunc filter_func,
                                                         gpointer filter_data,
                                                         GDestroyNotify filter_destroy);
void                dee_analyzer_analyze                (DeeAnalyzer *self,
                                                         const gchar *data,
                                                         DeeTermList *terms_out,
                                                         DeeTermList *colkeys_out);
gint                dee_analyzer_collate_cmp            (DeeAnalyzer *self,
                                                         const gchar *key1,
                                                         const gchar *key2);
gint                dee_analyzer_collate_cmp_func       (const gchar *key1,
                                                         const gchar *key2,
                                                         gpointer analyzer);
gchar *             dee_analyzer_collate_key            (DeeAnalyzer *self,
                                                         const gchar *data);
DeeAnalyzer *       dee_analyzer_new                    (void);
void                dee_analyzer_tokenize               (DeeAnalyzer *self,
                                                         const gchar *data,
                                                         DeeTermList *terms_out);

Object Hierarchy

  GObject
   +----DeeAnalyzer
         +----DeeTextAnalyzer

Description

A DeeAnalyzer takes a text stream, splits it into tokens, and runs the tokens through a series of filtering steps. It can optionally output collation keys for the terms.

One of the important use cases of analyzers in Dee is as a vessel for the indexing logic used to create a DeeIndex from a DeeModel.

The recommended way to implement a custom analyzer is to add term filters to a DeeAnalyzer or DeeTextAnalyzer instance with dee_analyzer_add_term_filter(), to derive your own subclass that overrides the dee_analyzer_tokenize() method, or both. If you have very special requirements, it is also possible to reimplement all aspects of the analyzer class.

Details

struct DeeAnalyzer

struct DeeAnalyzer;

All fields in the DeeAnalyzer structure are private and should never be accessed directly.


struct DeeAnalyzerClass

struct DeeAnalyzerClass {
};


DeeCollatorFunc ()

gchar *             (*DeeCollatorFunc)                  (const gchar *input,
                                                         gpointer data);

A collator takes an input string, most often a term produced from a DeeAnalyzer, and outputs a collation key.

input :

The string to produce a collation key for

data :

User data set when registering the collator. [closure]

Returns :

The collation key. Free with g_free() when done using it. [transfer full]
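
As a rough illustration, a collator of this shape could be sketched as follows. The example uses plain C types and the standard library only (gchar is char and gpointer is void *), so ascii_casefold_collator is a hypothetical name and malloc()/free() stand in for the GLib allocator. A collator actually registered with Dee would need to return memory that can be released with g_free(), for instance by building the key with g_ascii_strdown() or g_utf8_collate_key().

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical collator with the same shape as DeeCollatorFunc.
 * It folds ASCII letters to lower case, so "Apple" and "apple"
 * produce the same collation key. */
static char *
ascii_casefold_collator (const char *input, void *data)
{
  (void) data;                      /* user data is unused in this sketch */
  assert (input != NULL);

  size_t len = strlen (input);
  char *key = malloc (len + 1);
  if (key == NULL)
    return NULL;

  for (size_t i = 0; i < len; i++)
    key[i] = (char) tolower ((unsigned char) input[i]);
  key[len] = '\0';

  /* The caller owns the key. This sketch pairs malloc() with free();
   * GLib code would allocate with GLib and release with g_free(). */
  return key;
}
```

The caller owns the returned key, which matches the [transfer full] annotation above.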

DeeTermFilterFunc ()

void                (*DeeTermFilterFunc)                (DeeTermList *terms_in,
                                                         DeeTermList *terms_out,
                                                         gpointer filter_data);

A term filter takes a list of terms, runs it through a filtering step and/or a set of transformations, and stores the output in a DeeTermList.

You can register term filters on a DeeAnalyzer with dee_analyzer_add_term_filter().

terms_in :

A DeeTermList with the terms to filter

terms_out :

A DeeTermList to write the filtered terms to

filter_data :

User data set when registering the filter. [closure]

dee_analyzer_add_term_filter ()

void                dee_analyzer_add_term_filter        (DeeAnalyzer *self,
                                                         DeeTermFilterFunc filter_func,
                                                         gpointer filter_data,
                                                         GDestroyNotify filter_destroy);

Register a DeeTermFilterFunc to be called whenever dee_analyzer_analyze() is called.

Term filters can be used to normalize, add, or remove terms from an input data stream.

self :

The analyzer to add a term filter to

filter_func :

Function to call. [scope notified]

filter_data :

Data to pass to filter_func when it is invoked. [closure]

filter_destroy :

Called on filter_data when the DeeAnalyzer owning the filter is destroyed. [allow-none]

dee_analyzer_analyze ()

void                dee_analyzer_analyze                (DeeAnalyzer *self,
                                                         const gchar *data,
                                                         DeeTermList *terms_out,
                                                         DeeTermList *colkeys_out);

Extract terms and/or collation keys from some input data (which is normally, but not necessarily, a UTF-8 string).

The terms and corresponding collation keys will be written in order to the provided DeeTermLists.

Implementation notes for subclasses: The analysis process must call dee_analyzer_tokenize() and run the tokens through all term filters added with dee_analyzer_add_term_filter(). Collation keys must be generated with dee_analyzer_collate_key().

self :

The analyzer to use

data :

The input data to analyze

terms_out :

A DeeTermList to place the generated terms in. If NULL no terms are generated. [allow-none]

colkeys_out :

A DeeTermList to place generated collation keys in. If NULL no collation keys are generated. [allow-none]
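
The order of operations in the implementation notes above (tokenize, run every term filter in registration order, then derive collation keys) can be modelled in plain C. This is only a sketch under stated assumptions: SketchTermList, SketchFilterFunc and all sketch_* functions are hypothetical stand-ins, not part of Dee, and the fixed-size list and whitespace tokenizer are simplifications chosen to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TERMS 16
#define MAX_TERM_LEN 32

/* Hypothetical stand-in for DeeTermList: a small fixed-capacity list. */
typedef struct {
  char   terms[MAX_TERMS][MAX_TERM_LEN];
  size_t count;
} SketchTermList;

static void
sketch_list_add (SketchTermList *list, const char *term, size_t len)
{
  if (list->count == MAX_TERMS || len == 0)
    return;                         /* sketch: drop overflow and empty terms */
  if (len >= MAX_TERM_LEN)
    len = MAX_TERM_LEN - 1;         /* sketch: truncate over-long terms */
  memcpy (list->terms[list->count], term, len);
  list->terms[list->count][len] = '\0';
  list->count++;
}

/* Step 1: split on ASCII whitespace without normalizing the tokens. */
static void
sketch_tokenize (const char *data, SketchTermList *terms_out)
{
  const char *p = data;
  while (*p != '\0') {
    while (*p != '\0' && isspace ((unsigned char) *p))
      p++;
    const char *start = p;
    while (*p != '\0' && !isspace ((unsigned char) *p))
      p++;
    sketch_list_add (terms_out, start, (size_t) (p - start));
  }
}

/* Step 2: a filter with the same argument order as DeeTermFilterFunc. */
typedef void (*SketchFilterFunc) (const SketchTermList *terms_in,
                                  SketchTermList       *terms_out,
                                  void                 *filter_data);

static void
lowercase_filter (const SketchTermList *terms_in,
                  SketchTermList       *terms_out,
                  void                 *filter_data)
{
  (void) filter_data;
  for (size_t i = 0; i < terms_in->count; i++) {
    char buf[MAX_TERM_LEN];
    size_t len = strlen (terms_in->terms[i]);
    for (size_t j = 0; j < len; j++)
      buf[j] = (char) tolower ((unsigned char) terms_in->terms[i][j]);
    sketch_list_add (terms_out, buf, len);
  }
}

/* Step 3: tokenize, run the filters in order, then emit terms and keys.
 * Either output list may be NULL, in which case it is skipped. */
static void
sketch_analyze (const char *data, SketchFilterFunc *filters, size_t n_filters,
                SketchTermList *terms_out, SketchTermList *colkeys_out)
{
  assert (data != NULL);

  SketchTermList current = { .count = 0 };
  sketch_tokenize (data, &current);

  for (size_t f = 0; f < n_filters; f++) {
    SketchTermList next = { .count = 0 };
    filters[f] (&current, &next, NULL);
    current = next;                 /* output of one filter feeds the next */
  }

  for (size_t i = 0; i < current.count; i++) {
    size_t len = strlen (current.terms[i]);
    if (terms_out != NULL)
      sketch_list_add (terms_out, current.terms[i], len);
    if (colkeys_out != NULL)        /* models a default key that copies the term */
      sketch_list_add (colkeys_out, current.terms[i], len);
  }
}
```

In this model, passing NULL for either output list skips that output, mirroring the [allow-none] parameters described above.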

dee_analyzer_collate_cmp ()

gint                dee_analyzer_collate_cmp            (DeeAnalyzer *self,
                                                         const gchar *key1,
                                                         const gchar *key2);

Compare collation keys generated by dee_analyzer_collate_key(), with semantics similar to strcmp(). See also dee_analyzer_collate_cmp_func() if you need a version of this function that works as a GCompareDataFunc.

The default implementation in DeeAnalyzer just uses strcmp().

self :

The analyzer to use when comparing collation keys

key1 :

The first collation key to compare

key2 :

The second collation key to compare

Returns :

-1, 0 or 1 if key1 is less than, equal to, or greater than key2, respectively.

dee_analyzer_collate_cmp_func ()

gint                dee_analyzer_collate_cmp_func       (const gchar *key1,
                                                         const gchar *key2,
                                                         gpointer analyzer);

A GCompareDataFunc using a DeeAnalyzer to compare the keys. This is just a convenience wrapper around dee_analyzer_collate_cmp().

key1 :

The first key to compare

key2 :

The second key to compare

analyzer :

The DeeAnalyzer to use for the comparison

Returns :

-1, 0 or 1 if key1 is less than, equal to, or greater than key2, respectively.
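
The relationship between the two comparison functions can be sketched with standard C only. Everything named sketch_* or Sketch* below is a hypothetical stand-in rather than Dee API: the comparison clamps strcmp() to -1, 0 or 1, the wrapper receives the analyzer as untyped user data in the way a GCompareDataFunc does, and a small insertion sort stands in for a GLib sort routine such as g_qsort_with_data().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an analyzer; it only counts comparisons here. */
typedef struct {
  unsigned long n_comparisons;
} SketchAnalyzer;

/* Same shape as dee_analyzer_collate_cmp(): strcmp() clamped to -1, 0 or 1. */
static int
sketch_collate_cmp (SketchAnalyzer *self, const char *key1, const char *key2)
{
  assert (self != NULL && key1 != NULL && key2 != NULL);
  self->n_comparisons++;
  int c = strcmp (key1, key2);
  return (c > 0) - (c < 0);
}

/* Same shape as a GCompareDataFunc: the analyzer is passed as user data. */
static int
sketch_collate_cmp_func (const void *key1, const void *key2, void *analyzer)
{
  return sketch_collate_cmp (analyzer, key1, key2);
}

/* Insertion sort that takes a comparator plus user data. */
static void
sketch_sort_keys (const char **keys, size_t n,
                  int (*cmp) (const void *, const void *, void *),
                  void *user_data)
{
  for (size_t i = 1; i < n; i++) {
    const char *key = keys[i];
    size_t j = i;
    while (j > 0 && cmp (keys[j - 1], key, user_data) > 0) {
      keys[j] = keys[j - 1];
      j--;
    }
    keys[j] = key;
  }
}
```

Passing the analyzer as user data lets a generic sort routine use analyzer-specific collation rules without any global state.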

dee_analyzer_collate_key ()

gchar *             dee_analyzer_collate_key            (DeeAnalyzer *self,
                                                         const gchar *data);

Generate a collation key for a set of input data (usually a UTF-8 string passed through tokenization and term filters of the analyzer).

The default implementation just calls g_strdup().

self :

The analyzer to generate a collation key with

data :

The input data to generate a collation key for

Returns :

A newly allocated collation key. Use dee_analyzer_collate_cmp() or dee_analyzer_collate_cmp_func() to compare collation keys. Free with g_free().

dee_analyzer_new ()

DeeAnalyzer *       dee_analyzer_new                    (void);


dee_analyzer_tokenize ()

void                dee_analyzer_tokenize               (DeeAnalyzer *self,
                                                         const gchar *data,
                                                         DeeTermList *terms_out);

Tokenize some input data (which is normally, but not necessarily, a UTF-8 string).

Tokenization splits the input data into constituents (in most cases words), but does not run it through any of the term filters set for the analyzer. It is undefined if the tokenization process itself does any normalization.

self :

The analyzer to use

data :

The input data to analyze

terms_out :

A DeeTermList to place the generated tokens in.