| WARNING: BIBINDEX ADMIN GUIDE IS UNDER DEVELOPMENT |
|---|
| BibIndex Admin Guide is not yet completed. Most of the admin-level functionality for BibIndex exists only in command-line mode. We are developing both the guide and the web admin interface. If you would like to see specific features implemented with high priority, please contact us at ddd.bib@uab.cat. Thanks for your interest! |
To define a new index you must first give the index an internal name. An empty index is then created by preparing the database tables.
Before the index can be used for searching, the fields that should be included in the index must be selected.
To fill the index based on the selected fields, schedule an update by running bibindex -w indexname together with any other desired parameters.
How words are broken up can be configured by changing CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS and CFG_BIBINDEX_CHARS_PUNCTUATION in the general config file.
How the words are broken up defines what is added to the index. Should only "director-general" be added, or should "director", "general" and "director-general" all be added? The index can vary between 300 000 and 3 000 000 terms depending on the policy for breaking words.
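The effect of the word-breaking policy can be illustrated with a minimal Python sketch. The separator set below is an illustrative stand-in, not the shipped value of CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, and `break_words` is a hypothetical helper rather than BibIndex's actual tokenizer.

```python
import re

# Illustrative separator set; an assumption, not the shipped configuration value.
ALPHANUMERIC_SEPARATORS = re.compile(r"[-_/.]")


def break_words(phrase, keep_compounds=True):
    """Return the set of terms that one phrase contributes to the index."""
    terms = set()
    for word in phrase.lower().split():
        # Split the word on the separators and drop empty fragments.
        parts = [p for p in ALPHANUMERIC_SEPARATORS.split(word) if p]
        terms.update(parts)
        # Optionally keep the unbroken compound next to its parts.
        if keep_compounds and len(parts) > 1:
            terms.add(word)
    return terms


print(sorted(break_words("Director-General")))
print(sorted(break_words("Director-General", keep_compounds=False)))
```

Keeping the compound alongside its parts yields three terms for this one word instead of two, which is the kind of difference that makes the index size vary with the policy.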
BibIndex supports stopword removal: words that exist in a given stopword list are not added to the index. Stopword removal makes the index smaller by leaving out frequently used words.
The stopword list to use can be configured in the general config file by changing the value of the variable CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE. If no stopword list should be used, the value should be 0.
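As a rough illustration of stopword removal, the Python sketch below filters a word list against a stopword set. The helper names are hypothetical, the one-word-per-line list format is an assumption, and a small in-memory list stands in for the file named by CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(words, stopwords):
    """Keep only the words that are not in the stopword set."""
    return [w for w in words if w.lower() not in stopwords]


# In-memory stand-in for the stopword file (assumed one word per line).
stopwords = load_stopwords(["the", "of", "and", ""])
print(remove_stopwords("the history of the index".split(), stopwords))
```

Passing an empty set to `remove_stopwords` leaves the word list unchanged, which mirrors the case where no stopword list is configured.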
The BibIndex indexer supports stemming, that is, removing the endings of words, which creates a smaller index. For example, using English, the word "information" will be stemmed to "inform", and "looking", "looks", and "looked" will all be stemmed to "look", thus giving more hits for each word.
Currently you can configure the stemming language on a per-index basis. All searches referring to a stemmed index will also be stemmed based on the same language.
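The idea behind stemming can be shown with a deliberately naive Python sketch. This is a toy suffix stripper written for this guide, not the stemmer BibIndex uses; real stemmers apply many more language-specific rules. The suffix list and the `toy_stem` name are assumptions.

```python
# Toy English suffix list, longest first; purely illustrative.
SUFFIXES = ("ation", "ing", "ed", "s")


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for w in ("information", "looking", "looks", "looked"):
    print(w, "->", toy_stem(w))
```

Because the same function is applied both to indexed words and to query terms, a search for "looked" and an indexed "looking" meet at the common stem "look".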
If CFG_BIBINDEX_MIN_WORD_LENGTH in the general config file is set to a value higher than 0, only words with more characters than this value will be added to the index.
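A minimal Python sketch of this filter follows. It takes the wording above literally (strictly more characters than the configured value); whether the real comparison is strict is an assumption, and `filter_short_words` is a hypothetical helper.

```python
def filter_short_words(words, min_word_length):
    """Keep words with more characters than min_word_length.

    A value of 0 keeps every non-empty word. The strict comparison
    follows the guide's wording and is an assumption.
    """
    return [w for w in words if len(w) > min_word_length]


print(filter_short_words(["a", "of", "ion", "beam"], 2))
```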
By setting the value of CFG_BIBINDEX_REMOVE_HTML_MARKUP in the general config file, the indexer may try to remove all HTML code from documents before indexing, and index only the text left. (HTML code is defined as everything between '<' and '>' in a text.)
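Under the definition given above (everything between '<' and '>'), HTML removal can be sketched in Python with a single regular expression. This is an illustration of the definition, not the indexer's own code; `remove_html_markup` here is a hypothetical helper.

```python
import re

# Anything from '<' up to the next '>' counts as HTML code.
HTML_TAG = re.compile(r"<[^>]*>")


def remove_html_markup(text):
    """Replace each '<...>' span with a space, then collapse whitespace."""
    return " ".join(HTML_TAG.sub(" ", text).split())


print(remove_html_markup("<p>Higgs <b>boson</b> search</p>"))
```

Replacing tags with a space rather than an empty string keeps words on either side of a tag from being glued together.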
By setting the value of CFG_BIBINDEX_REMOVE_LATEX_MARKUP in the general config file, the indexer may try to remove all LaTeX code from documents before indexing, and index only the text left. (LaTeX code is defined as everything between '\command{' and '}' in a text, or '{\command ' and '}').
The metadata tags are usually indexed by their content. There are
special cases, however, such as fulltext indexing. In this case
the tag contains a URL to the fulltext material, and we would like to
fetch this material and index the words found in it rather than
those in the metadata itself. This is possible via a special tag
assignment in the tagToWordsFunctions variable.
The default setup is configured so that when the indexer sees
that it has to index the tag 8564_u, it switches into the
fulltext indexing mode described above. It can index locally stored
files or even fetch them from external URLs, depending on the value of
the CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY configuration
variable. When fetching a file from a remote URL, if the indexer lands
on a splash page (an intermediate page before the fulltext file
itself), it can find and follow any further links to fulltext files.
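The tag-based switch into fulltext mode can be pictured as a dictionary dispatch. The Python sketch below is written in the spirit of tagToWordsFunctions but is not the actual BibIndex code: the function names, the `fetch` callable, and the example tag 245__a are all assumptions made for illustration.

```python
def words_from_metadata(value):
    """Default case: index the words of the tag's own content."""
    return value.lower().split()


def words_from_fulltext(url, fetch):
    """Special case: index the words of the document behind the URL.

    `fetch` is a hypothetical callable that returns the document text.
    """
    return fetch(url).lower().split()


# Hypothetical mapping in the spirit of tagToWordsFunctions.
TAG_TO_WORDS_FUNCTIONS = {"8564_u": words_from_fulltext}


def words_for_tag(tag, value, fetch):
    """Use the special function if one is registered, metadata otherwise."""
    special = TAG_TO_WORDS_FUNCTIONS.get(tag)
    if special is None:
        return words_from_metadata(value)
    return special(value, fetch)


def fake_fetch(url):
    """Stand-in for a real download; returns canned text."""
    return "Measurement of the top quark mass"


print(words_for_tag("245__a", "Top quark paper", fake_fetch))
print(words_for_tag("8564_u", "http://example.org/paper.pdf", fake_fetch))
```

For the ordinary tag the words come from the value itself; for 8564_u the value is treated as a URL and the words come from the fetched document.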
The default setup also differentiates between metadata and fulltext
indexing, so that the "any field" index processes only
metadata, not fulltext. If you want to have the fulltext indexed
together with the metadata, so that both are searched by default, you
can go to the BibIndex Admin interface and, under Manage Logical Fields,
explicitly add the tag 8564_u to the "any field"
field.
To index your newly created or modified documents, bibindex must be run periodically via bibsched. This is achieved by the sleep option (-s) to bibindex. For more information please see HOWTO Run admin guide.
Upon each indexing run, bibindex checks and reports any inconsistencies in the indexes. You can also check for index corruption manually by using the check (-k) option to bibindex.
If a problem is found during the check, bibindex suggests running the repair option (-r). During the repair, bibindex tries to correct the problems automatically, and it usually succeeds.
When the automatic repair does not succeed, manual intervention is required. The easiest way to get the indexes back into shape is to run commands like the following (assuming the problem is with the index ID 1):
$ echo "DELETE FROM idxWORD01R WHERE type='TEMPORARY' or type='FUTURE';" | \
/opt/invenio/bin/dbexec
to leave only the 'CURRENT' reverse index. After that you can rerun
the index checking procedure (-k) and, if successful, continue with
the normal web site operation. However, a full reindexing should be
scheduled for the forthcoming night or weekend.
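If the same cleanup is needed for another index, the statement can be generated rather than typed by hand. The Python sketch below assumes that table names follow the idxWORD01R pattern with a zero-padded two-digit index ID, as inferred from the example for index ID 1; `cleanup_statement` is a hypothetical helper, and its output would still have to be piped to dbexec manually.

```python
def cleanup_statement(index_id):
    """Build the DELETE statement for the reverse table of one word index.

    Assumes the 'idxWORD01R' naming pattern with a zero-padded
    two-digit index ID; verify the table name before running it.
    """
    table = "idxWORD%02dR" % index_id
    return "DELETE FROM %s WHERE type='TEMPORARY' OR type='FUTURE';" % table


print(cleanup_statement(1))
```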
Reindexing takes place in the real indexes that are also used for searching, so end users will immediately notice any change in the indexes. If you need to reindex your records from scratch, the best procedure is the following: reindex the collection index only (a fast operation), recreate the collection cache, and only after that reindex all the other indexes (a slow operation). This ensures that the records in your system remain at least browsable while the indexes are being rebuilt. The steps to perform are:
First we reindex the collection index:
$ bibindex --reindex -f50000 -wcollection # reindex the collection index (fast)
$ echo "UPDATE collection SET reclist=NULL;" | \
/opt/invenio/bin/dbexec # clean collection cache
$ webcoll -f # recreate the collection cache
$ bibsched # run the two above-submitted tasks
$ sudo apachectl restart
Then we launch (slower) reindexing of the remaining indexes:
$ bibindex --reindex -f50000 # reindex other indexes (slow)
$ webcoll -f
$ bibsched # run the two above-submitted tasks, and put the queue back in auto mode
$ sudo apachectl restart
You may optionally want to reindex the word ranking tables:
$ bibsched # wait for all active tasks to finish, and put the queue into manual mode
$ cd invenio-0.92.1 # source dir
$ grep rnkWORD ./modules/miscutil/sql/tabbibclean.sql | \
/opt/invenio/bin/dbexec # truncate rank indexes
$ echo "UPDATE rnkMETHOD SET last_updated='0000-00-00 00:00:00';" | \
/opt/invenio/bin/dbexec # rewind the last ranking time
Secondly, if you have been using custom ranking methods using new rnkWORD* tables (most probably you have not), you would have to truncate them too:
# find out which custom ranking indexes were added:
$ echo "SELECT id FROM rnkMETHOD" | /opt/invenio/bin/dbexec
id
66
67
[...]
# for every ranking index id, truncate corresponding ranking tables:
$ echo "TRUNCATE rnkWORD66F" | /opt/invenio/bin/dbexec
$ echo "TRUNCATE rnkWORD66R" | /opt/invenio/bin/dbexec
$ echo "TRUNCATE rnkWORD67F" | /opt/invenio/bin/dbexec
$ echo "TRUNCATE rnkWORD67R" | /opt/invenio/bin/dbexec
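With more than a couple of custom ranking methods, the TRUNCATE statements can be generated from the list of IDs. The Python sketch below only builds the statement strings; it assumes the rnkWORD<id>F and rnkWORD<id>R naming seen in the commands for IDs 66 and 67, and the zero-padding of single-digit IDs is a further assumption that should be checked against the actual table names. `truncate_statements` is a hypothetical helper.

```python
def truncate_statements(ranking_ids):
    """Build TRUNCATE statements for the forward (F) and reverse (R) tables.

    Two-digit zero-padding for IDs below 10 is an assumption.
    """
    statements = []
    for rank_id in ranking_ids:
        for suffix in ("F", "R"):
            statements.append("TRUNCATE rnkWORD%02d%s" % (rank_id, suffix))
    return statements


for statement in truncate_statements([66, 67]):
    print(statement)
```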
Finally, we launch reindexing of the ranking indexes:
$ bibrank -f50000
$ bibsched # run the three above-submitted tasks, and put the queue back in auto mode
$ sudo apachectl restart
and we are done.
In the future Invenio should ideally run indexing into invisible tables that would be switched against the production ones once the indexing process is successfully over. For the time being, if reindexing takes several hours in your installation (e.g. if you have 1,000,000 records), you may want to mysqlhotcopy your tables and run reindexing on those copies yourself.