Tovek Tools Assist
Tovek Tools Assist (TT Assist) helps you learn more about the words stored
in collections. The utility is intended for advanced users, so this
explanation focuses on concepts rather than on what every button does. The
setup is also different: you have to unpack a ZIP archive and copy the files to the
correct folder. The installation archive can be downloaded from the
download page.
TT Assist has two functions. First, it helps you build a Verity Assist and use it
to search or view words in a collection. Second, it helps you dump all words and their
counts from a collection and then merge and/or extract from these lists. The word search function
is useful for getting to know what is in a collection, which
words and variants are used, and generally what to search for. It can also show
how a wildcard operator works and what it can find. The dump words feature
can be used to prepare a list of words that can then be analyzed further, for
example by computing all kinds of word statistics. Dumping words
is based on the DIDUMP command-line utility, and this is one of the reasons why
TT Assist is not considered to be a 'full' program with its own licensing etc.,
but more like a bonus for TT Analyst Pack users. Also the user interface is in
English only and there is currently no plan to localize it.
Installation
In order to use TT Assist your TT licence must include both Query Editor and local
indexing by VDK, i.e. you must be able to create and index files into a 'Normal
Files' type of source.
The downloaded ZIP file should contain these three files: ASSIST.EXE, _DIDUMP.BAT, and
DIDUMP.EXE. Unpack and place these three files into the _NTI31\BIN folder of the TT
installation. By default it is <PROGRAM FILES>\TVKTOOLS\_NTI31\BIN, and,
as a guide, this folder contains TVKAGT.EXE as well. ASSIST.EXE is the
TT Assist application to be started.
The main window
The main window contains a list of local sources (i.e. without Tovek Server and K2
sources). In the lower part there is a status window that shows the word dump progress,
indicates which word lists were loaded and merged, and gives some statistics about them.
Buttons next to this window are used for managing the merged word list.
To browse and search for words using Verity Assist, select a source in the main window
and choose the Assist button. Everything is done in the dialog box that is displayed.
The assist first has to be built, or rebuilt if the collection has been modified.
It's the user's responsibility to make sure that the assist is up to date. When
no word is entered, the Show Words button lists words from the
start of the word list; otherwise it displays words that begin with the entered
text or that match the wildcard, depending on the Wildcard check box.
The list on the right shows word variants from the collection; which
variants are shown is determined by the check boxes in the top right corner of the window.
Please note that other types of assists, well known to Topic v. 4 users (such
as the Suggest), are not supported by Verity at this time. To copy words from
either list, use the Copy button. The words are copied from the list that
was used last.
When you move through the list on the left and approach the end of the list box
contents, new adjacent words are added to the list. When you want to select
and copy more words at once, it's necessary to select those words twice. There
might be problems with some languages, such as Russian, where sometimes
no adjacent words are returned if they start with a different first letter.
However, when you type the beginning of a word, i.e. when you search for a word,
everything seems to work well.
First, it is necessary to create a word dump list for a collection. To do so,
select a source from the list in the main window and choose the Get words
button. A command prompt window running the DIDUMP utility is then displayed; temporary
results are loaded, sorted, and merged, and this process is repeated for every DID
file in the collection, i.e. partition by partition. Partitions that are no longer
active, marked by MRG files, are skipped. Progress is displayed in the text box in the lower
part of the main window. At the end you will be prompted to save the LST file containing
all the collection's words and their counts, and a short summary will be shown.
The LST file contains a short header followed by the list of all words. Every line of the
actual list contains the word text, then "the number of unique documents in which
the word appears", and finally "the total number of occurrences of a word" for all
the partitions. Entries on a line are separated by tabs; the quoted descriptions
are taken from the documentation of the DIDUMP utility. For those familiar with
DIDUMP, the Size column is dropped and is not used in TT Assist. Another difference
in the resulting LST file is that the numbers are totals for all the partitions,
i.e. they should be valid for the complete collection. One note about the order
of the words (lines): they are sorted, but not according to the collection's
locale or the sort order. Sorting is performed only to make sure that the words
from all partitions are merged correctly.
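As an illustration of the line format described above, here is a minimal Python sketch
that parses one data line of an LST file. It is not part of TT Assist. The names Entry
and parse_line are made up for this example, the counts are assumed to be plain decimal
integers, and the header is not handled because its layout is not described here.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    word: str
    doc_count: int    # number of unique documents in which the word appears
    occurrences: int  # total number of occurrences of the word


def parse_line(line: str) -> Entry:
    """Parse one tab-separated data line: word, document count, occurrences."""
    word, docs, total = line.rstrip("\r\n").split("\t")
    return Entry(word, int(docs), int(total))


# Hypothetical sample line, not taken from a real collection.
entry = parse_line("tovek\t12\t57\n")
print(entry.word, entry.doc_count, entry.occurrences)
```

Header lines would have to be skipped before calling parse_line; how to recognize them
depends on the actual header written by TT Assist.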
As noted above, partitions marked by MRG files are skipped, but there is still
a possibility that the counts will not match. For example, if only a few documents
are deleted from a collection, those documents are marked as deleted, but their
contents (i.e. the words) are not removed from the collection. That happens only
after a full optimization, so if you really want to be sure that there are
no extra words, the easiest way might be to purge and reindex the collection.
In the real world, however, under most scenarios only new documents are added to
the fulltext and no deletions are performed, and so this is quite straightforward.
The only way we know of to get the word dump is to use the
DIDUMP utility. This is unfortunate, because it is a command-line utility
intended for experienced administrators, and there are some problems when calling
it from a GUI application that can run on a number of versions of MS Windows.
The best way seems to be to start it from a batch file, and so this is how
it is called now. The batch file _DIDUMP.BAT receives three parameters: the full path
to the Common directory with Verity's settings, the full path to the DID partition
file to be dumped, and the full path to a temporary output file. All path
names are in the short form (8.3), which should reduce problems with spaces and
accented characters in paths. (This is just one example
of the problems with an admin-like utility: administrators normally solve it very
quickly, and they would not use accented characters in paths in the first
place.) There are, however, differences in how short path names are generated on Windows NT/2000
and on Windows 9x/ME, for example. If something goes wrong and
it seems that there are no words in a collection, try uncommenting the PAUSE
commands in the batch file to see what is happening inside and what the parameters are,
and check whether it can be fixed somehow. One possibility might be to move the collection
temporarily from its original place to a folder with a 'simple' name, such as C:\COL1.
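When a dump comes back empty, one quick check is whether any of the paths involved
contains folder names with spaces or accented characters. The following Python sketch
is only a diagnostic aid and is not part of TT Assist; the function name risky_parts
and the example paths are made up for this illustration.

```python
def risky_parts(path: str) -> list:
    """Return the path components that contain spaces or non-ASCII characters."""
    parts = [p for p in path.replace("/", "\\").split("\\") if p]
    return [p for p in parts if " " in p or not p.isascii()]


# Hypothetical collection locations.
print(risky_parts("C:\\COL1"))
print(risky_parts("C:\\Program Files\\Sb\u00edrka\\col"))
```

An empty result suggests that the path itself is probably not the cause; a non-empty
result lists the folder names that might be worth renaming or avoiding.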
Once you have created LST files for all the collections you want to analyze, you
can start merging these files and extracting ranges of words based on your criteria.
Merging of LST files is similar to merging words from partitions, but this time
you are merging the contents of collections. It's also possible to merge extracted
ranges of words; range extraction is described below. To merge LST files, choose
the Add file button in the main window and then select an LST file to be merged
into the shared list. The list of merged files, the name of the last loaded file,
and some statistics are shown in the status window. Use the Reset button to clear
the merged list. Please note that you can add the same file multiple times - it
is not forbidden because there might be valid reasons for doing so, such as
boosting the counts of some words in your analysis.
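The effect of merging can be pictured with a short Python sketch. It assumes that
merging simply sums both counts per word, which matches the boosting effect mentioned
above but is not confirmed as the exact method used by TT Assist. The function name
merge_counts and the sample data are made up for this illustration.

```python
from collections import Counter


def merge_counts(*lists):
    """Sum (document count, occurrences) per word across several word lists."""
    docs, occ = Counter(), Counter()
    for entries in lists:
        for word, doc_count, occurrences in entries:
            docs[word] += doc_count
            occ[word] += occurrences
    # Plain code-point sort, used only to keep the output order deterministic.
    return [(w, docs[w], occ[w]) for w in sorted(docs)]


# Hypothetical word lists of two collections: (word, documents, occurrences).
col_a = [("index", 3, 9), ("query", 1, 2)]
col_b = [("query", 4, 6), ("verity", 2, 2)]

merged = merge_counts(col_a, col_b, col_b)  # col_b added twice boosts its words
print(merged)
```

Under this assumption, adding col_b twice doubles its contribution, so the words it
contains weigh more in any later statistics.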
If you have one or more LST files loaded, you can extract some or all of the words
from that list into a new file. This is done in the dialog box displayed after pressing
the Extract button. All the options are saved and restored between extracts, and
they should be quite straightforward to use. Only two notes here. First, if you check
the "Do not write header" option, the header is not written to the output file, which might
make it easier to load the file into MS Excel, a database, or any other application you
want to use to analyze your data. But keep in mind that files without headers cannot be
reloaded into TT Assist for further merging. Second, if you decide to limit the output
by the number of occurrences in either of the two ways (if in both, the conditions are
combined with a logical AND), entries with the Min and Max values are also included in the
output list. For example, if the Word Count Min is 2 and the Word Count Max is 5,
words will be written out if they occur 2, 3, 4, or 5 times in the merged list.
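The inclusive limits and the AND combination can be sketched in Python as follows.
This is an illustration only, not the code of TT Assist. The parameter names are
made up, and it is an assumption here that the second kind of limit applies to the
document count.

```python
def in_range(value, lo=None, hi=None):
    """Inclusive range test; None means no limit on that side."""
    return (lo is None or value >= lo) and (hi is None or value <= hi)


def extract(entries, word_min=None, word_max=None, doc_min=None, doc_max=None):
    """Keep entries whose counts satisfy both limits (logical AND)."""
    return [
        (word, docs, occ)
        for word, docs, occ in entries
        if in_range(occ, word_min, word_max) and in_range(docs, doc_min, doc_max)
    ]


# Hypothetical merged list: (word, documents, occurrences).
words = [("a", 1, 1), ("b", 2, 2), ("c", 3, 5), ("d", 4, 6)]
print(extract(words, word_min=2, word_max=5))
print(extract(words, word_min=2, word_max=5, doc_min=3))
```

With Word Count Min 2 and Max 5, the words occurring exactly 2 and exactly 5 times are
kept; adding a document count limit can only narrow the result further.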
And one warning at the end: LST files can be big. Maybe not in terms of file size,
but hundreds of thousands of entries are quite common. Not every application
can handle these numbers easily, so please be careful and/or patient.