  support    tovek tools technical notes   002
 

Tovek Tools Assist

Tovek Tools Assist (TT Assist) helps you learn more about the words that are stored in collections. This utility is intended for advanced users, so this note explains concepts rather than what every button does. The setup is also different: you have to unpack a ZIP archive and copy the files to the correct folder. The installation archive can be downloaded from the download page.

TT Assist has two functions. First, it helps you build a Verity Assist and use it to search or view the words in a collection. Second, it helps you dump all words and their counts from a collection and then merge these lists and/or extract from them. The search words functionality is useful for getting to know what is in a collection, which words and variants are used, and generally what to search for. It can also show how a wildcard operator works and what it can find. The dump words feature can be used to prepare a list of words that can then be analyzed further, for example by computing all kinds of word statistics. Dumping words is based on the DIDUMP command-line utility, and this is one of the reasons why TT Assist is not considered a 'full' program with its own licensing etc., but more a bonus for TT Analyst Pack users. The user interface is in English only, and there is currently no plan to localize it.

Installation

To use TT Assist, your TT licence must include both Query Editor and local indexing by VDK, i.e. you must be able to create a 'Normal Files' type of source and index files into it.

The downloaded ZIP file should contain these three files: ASSIST.EXE, _DIDUMP.BAT, and DIDUMP.EXE. Unpack and place these three files into the _NTI31\BIN folder of the TT installation. By default it is <PROGRAM FILES>\TVKTOOLS\_NTI31\BIN, and, as a guide, this folder contains TVKAGT.EXE as well. ASSIST.EXE is the TT Assist application to be started.

The main window

The main window contains a list of local sources (i.e. without Tovek Server and K2 sources). In its lower part there is a status window that shows the word dump progress, indicates which word lists were loaded and merged, and displays some statistics about them. The buttons next to this window are used for managing the merged word list.

Search words

To browse and search for words using Verity Assist, select a source in the main window and choose the Assist button. Everything else is done in the dialog box that opens.

The assist first has to be built, or rebuilt if the collection has been modified. It is the user's responsibility to make sure that the assist is up to date. When no word is entered, the Show Words button lists words from the start of the word list; otherwise it displays words that begin with the text entered or that match it as a wildcard, depending on the Wildcard check box. The list on the right shows word variants from the collection; which variants are shown is specified by the check boxes in the top right corner of the window. Please note that other types of assists, well known to Topic v. 4 users (such as the Suggest), are not supported by Verity at this time. To copy words from either list, use the Copy button. The words are copied from the list that was used last.

When you move through the list on the left and approach the end of the list box contents, new adjacent words are added to the list. When you want to select and copy several words at once, you have to select those words twice. There might be problems with some languages, such as Russian, where sometimes no adjacent words are returned if they start with a different first letter. However, when you type the beginning of a word, i.e. when you search for a word, everything seems to work well.
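The difference between the two Show Words modes described above can be illustrated with a small Python sketch. It is only an illustration on an in-memory list: the function name `show_words` and its `limit` parameter are hypothetical, and Python's `fnmatchcase` stands in for the Verity wildcard operator, whose exact syntax is not covered in this note.

```python
from fnmatch import fnmatchcase
from typing import List


def show_words(words: List[str], text: str = "",
               wildcard: bool = False, limit: int = 50) -> List[str]:
    """Return up to `limit` words from a sorted word list.

    - no text entered: words from the start of the list
    - wildcard off:    words beginning with `text`
    - wildcard on:     words matching `text` as a pattern
      (fnmatch-style '*' and '?', used here only as a stand-in)
    """
    if not text:
        return words[:limit]
    if wildcard:
        matches = [w for w in words if fnmatchcase(w, text)]
    else:
        matches = [w for w in words if w.startswith(text)]
    return matches[:limit]


words = ["index", "indexed", "indexing", "search", "searched"]
prefix_hits = show_words(words, "index")
wildcard_hits = show_words(words, "*ed", wildcard=True)
```

Here `prefix_hits` holds the three words starting with "index", while `wildcard_hits` holds the two words ending in "ed".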

Dump words

First you have to create a word dump list for a collection. To do so, select a source from the list in the main window and choose the Get words button. A command prompt window with the DIDUMP utility is then displayed, and the temporary results are loaded, sorted, and merged. This process is repeated for every DID file in the collection, i.e. partition by partition. Partitions that are no longer active, marked by MRG files, are skipped. Progress is displayed in the text box in the lower part of the main window. At the end you are prompted to save the LST file containing all the collection's words and their counts, and a short summary is shown.

An LST file contains a short header followed by the list of all words. Every line of the actual list contains the word text, then "the number of unique documents in which the word appears", and finally "the total number of occurrences of a word". The entries on a line are separated by tabs; the quoted descriptions come from the documentation of the DIDUMP utility. For those familiar with DIDUMP: the Size column is dropped and is not used in TT Assist. Another difference is that the numbers in the resulting LST file are totals over all the partitions, i.e. they should be valid for the complete collection. One note about the order of the words (lines): they are sorted, but not according to the collection's locale or sort order. Sorting is performed only to make sure that the words from all partitions are merged correctly.
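As a minimal sketch of reading the word lines described above, the following Python function parses lines of the form word, tab, document count, tab, total occurrences. The header layout is not described in this note, so the sketch assumes the header lines have already been removed; the names `WordEntry` and `parse_lst_lines` are illustrative and not part of TT Assist.

```python
from typing import Iterable, List, NamedTuple


class WordEntry(NamedTuple):
    word: str
    doc_count: int    # number of unique documents in which the word appears
    total_count: int  # total number of occurrences of the word


def parse_lst_lines(lines: Iterable[str]) -> List[WordEntry]:
    """Parse header-less, tab-separated word lines into WordEntry records."""
    entries = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        word, doc_count, total_count = line.split("\t")
        entries.append(WordEntry(word, int(doc_count), int(total_count)))
    return entries


sample = ["alpha\t2\t5\n", "\n", "beta\t1\t1\r\n"]
parsed = parse_lst_lines(sample)
```

The function takes any iterable of text lines, so the caller decides how to open the file and which character encoding to use.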

As noted above, partitions marked by MRG files are skipped, but it is still possible that the counts will not match. For example, if only a few documents are deleted from a collection, those documents are marked as deleted, but their contents (i.e. the words) are not removed from the collection. That happens only after a full optimization, so if you really want to be sure that there are no extra words, the easiest way might be to purge and reindex the collection. In the real world, however, in most scenarios only new documents are added to the fulltext and no deletions are performed, so this is quite straightforward.

The only way we know of to get a word dump is to use the DIDUMP utility. This is unfortunate, because it is a command-line utility intended for experienced administrators, and there are some problems with calling it from a GUI application that can run on a number of versions of MS Windows. The best way seems to be to start it from a batch file, so this is how it is called now. The batch file _DIDUMP.BAT receives three parameters: the full path to the Common directory with Verity's settings, the full path of the DID partition file to be dumped, and the full path of a temporary output file. All path names are passed in the short form (8+3), which should reduce problems with spaces and accented characters in paths. (This is just one example of the problems with an admin-like utility: administrators normally solve it very quickly, and they would not use accented characters in paths in the first place.) However, short path names are generated differently on Windows NT/2000 and on Windows 9x/ME, for example. If something goes wrong and it seems that there are no words in a collection, try uncommenting the PAUSE commands in the batch file to see what is happening inside and what the parameters are, and check whether it can be fixed somehow. One possibility is to move the collection temporarily from its original place to a folder with a 'simple' name, such as C:\COL1.
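Before moving a collection, it may help to check whether its path contains the kinds of characters mentioned above. The following Python helper is a hypothetical sketch and not part of TT Assist; it only tests for spaces and non-ASCII (for example accented) characters and knows nothing about how Windows generates short path names.

```python
def is_simple_path(path: str) -> bool:
    """Return True if the path has no spaces and only ASCII characters."""
    return " " not in path and path.isascii()


simple = is_simple_path(r"C:\COL1")
with_space = is_simple_path(r"C:\Program Files\TVKTOOLS")
with_accent = is_simple_path("C:\\Sbírka")
```

A path for which the function returns False is a candidate for the temporary move described above.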

Merge and extract

Once you have created LST files for all the collections you want to analyze, you can start merging these files and extracting ranges of words based on your criteria.

Merging LST files is similar to merging words from partitions, but this time you are merging the contents of collections. It is also possible to merge extracted ranges of words; range extraction is described below. To merge LST files, choose the Add file button in the main window and then select an LST file to be merged into the shared list. The list of merged files, the name of the last loaded file, and some statistics are shown in the status window. Use the Reset button to clear the merged list. Please note that you can add the same file multiple times. This is not forbidden, because there might be valid reasons for doing so, such as boosting the counts of some words in your analysis.
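The idea of merging can be sketched in a few lines of Python. This is an illustration under an assumption, not a description of TT Assist internals: it assumes that merging adds up both counts of a word that appears in more than one list, which is consistent with the remark above that adding the same file twice boosts counts. The name `merge_word_lists` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, int, int]  # (word, document count, total occurrences)


def merge_word_lists(lists: Iterable[Iterable[Entry]]) -> List[Entry]:
    """Merge word lists by summing both counts per word (assumed behaviour)."""
    docs: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for entries in lists:
        for word, doc_count, total_count in entries:
            docs[word] = docs.get(word, 0) + doc_count
            totals[word] = totals.get(word, 0) + total_count
    # Plain code-point sort, not a locale-aware one
    return [(w, docs[w], totals[w]) for w in sorted(docs)]


list_a = [("apple", 1, 2), ("pear", 1, 1)]
list_b = [("apple", 2, 3)]
merged = merge_word_lists([list_a, list_b])
doubled = merge_word_lists([list_a, list_a])
```

In this sketch, passing the same list twice doubles its counts, mirroring the effect of adding the same file multiple times.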

If you have one or more LST files loaded, you can extract some or all of the words from the merged list into a new file. This is done in the dialog box displayed after pressing the Extract button. All the options are saved and restored between extracts, and they should be quite straightforward to use. Only two notes here. First, if you check "Do not write header", the header is not written to the output file, which might make it easier to load the file into MS Excel, a database, or any other application you want to use to analyze your data. But keep in mind that files without headers cannot be reloaded into TT Assist for further merging. Second, if you decide to limit the output by the number of occurrences in either of the two ways (if both are used, the conditions are combined with a logical AND), entries equal to the Min and Max values are also included in the output list. For example, if the Word Count Min is 2 and the Word Count Max is 5, words are written out if they occur 2, 3, 4, or 5 times in the merged list.
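The inclusive Min/Max rule and the logical AND between the two limits can be expressed as a short Python sketch. It assumes that the two limits apply to the document count and to the total word count; the parameter names are hypothetical, and `None` stands for a limit that is not set.

```python
from typing import Iterable, List, Optional, Tuple

Entry = Tuple[str, int, int]  # (word, document count, total occurrences)


def in_range(value: int, lo: Optional[int], hi: Optional[int]) -> bool:
    """Inclusive range check; None means the bound is not set."""
    return (lo is None or value >= lo) and (hi is None or value <= hi)


def extract(entries: Iterable[Entry],
            doc_min: Optional[int] = None, doc_max: Optional[int] = None,
            word_min: Optional[int] = None, word_max: Optional[int] = None
            ) -> List[Entry]:
    """Keep entries that satisfy both limits (logical AND)."""
    return [e for e in entries
            if in_range(e[1], doc_min, doc_max)
            and in_range(e[2], word_min, word_max)]


entries = [("w%d" % n, 1, n) for n in range(1, 7)]  # totals 1..6
kept = extract(entries, word_min=2, word_max=5)
kept_totals = [e[2] for e in kept]
```

With a word count range of 2 to 5, `kept_totals` contains 2, 3, 4, and 5, matching the example above.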

One warning at the end: LST files can be big. Maybe not in terms of file size, but hundreds of thousands of entries are quite common. Not every application can handle such numbers easily, so please be careful and/or patient.

 
 
Page Contents Modified: 20-Aug-2003 21:34:42