NotaBene Mailing List 2003

Re: Umlauts in Orbis - SWS 6.1



Joachim Linder wrote:

> thanks for your answer. Following it I tried to _maintain_ the existing
> database (checking left box, right box, both boxes, none - each resulting in
> a full indexing cycle) with no other result than to learn that maintaining
> seems not to change this property of an existing database.

In a superficial way, my results would be consistent with this: the
DISPLAY for the properties of the textbase does not change. The left box
for "Include All Accents/Modifiers etc." and only that is ALWAYS
checked. So by that signal, the property does not seem to change.
However, judging by actual searches, it DOES change. This is also
reliably reflected in the textbase configuration file
TEXTBASEFILENAME.CFG.

This is what a little testing revealed:

There are three possible values for how accents are handled by Orbis
indexing and searching. The value is either 0, 1, or 2. It is recorded in
TEXTBASEFILENAME.CFG as a line similar to this:

ACCENTS=1

This code was probably introduced only with NB 6. At any rate it was
absent in the first incarnations of NB5 -- I cannot see the line in my
older textbases whose properties I have not edited since NB5.

The way Orbis indexes accents can be changed through Orbis's Maintain,
Edit Properties, Keyword Type, the two boxes on the line starting with
"Accents". In my experience, one can also directly edit
TEXTBASEFILENAME.CFG.
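
For anyone who prefers to check or change the value outside Orbis, here is a
minimal sketch in Python. It assumes only what is quoted above: that the
setting appears as a single line of the form ACCENTS=0, ACCENTS=1 or
ACCENTS=2. The function names are made up for this illustration, nothing
else about the .CFG layout is assumed, and it works on the file's text
rather than on the file itself, so try it on a copy first.

```python
import re

# Assumption: the setting is a single "ACCENTS=<digit>" line, as quoted in
# the post. The trailing group keeps any spaces or a CR so that the line
# ending of the original file is preserved.
_ACCENTS_LINE = re.compile(r"^(ACCENTS=)([012])([ \t]*\r?)$", re.MULTILINE)


def read_accents(cfg_text):
    """Return the ACCENTS value (0, 1 or 2), or None if no such line exists."""
    match = _ACCENTS_LINE.search(cfg_text)
    return int(match.group(2)) if match else None


def set_accents(cfg_text, value):
    """Return cfg_text with the existing ACCENTS line changed to value."""
    if value not in (0, 1, 2):
        raise ValueError("ACCENTS must be 0, 1 or 2")
    if read_accents(cfg_text) is None:
        # Older textbases may lack the line; this sketch does not guess
        # where a new line would belong.
        raise LookupError("no ACCENTS line found")
    return _ACCENTS_LINE.sub(
        lambda m: m.group(1) + str(value) + m.group(3), cfg_text, count=1
    )
```

Every other line is left untouched, and a textbase without the line (as with
older NB5 textbases) is reported rather than modified.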

The following explains how the different values are set using the check
boxes and how searches work vis-a-vis accented characters and their
non-accented base versions. Let the textbase include an entry containing
"Künstler" and another one containing "Kunstwerke."

Left Box="Include All Accents/Modifiers, etc."
Right Box="(Also) Strip All Accents/Modifiers, etc."

Accents=0 
Check only the Right Box 
Kunst* retrieves Kunstwerke and Künstler
Künst* retrieves nothing

Accents=1 
Check only the Left Box or leave both unchecked 
Kunst* retrieves Kunstwerke but not Künstler 
Künst* retrieves Künstler but not Kunstwerke

Accents=2 
Check both boxes
Kunst* retrieves Kunstwerke and Künstler
Künst* retrieves Künstler but not Kunstwerke
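
The three cases above are consistent with a simple model: the setting
decides which forms of a word go into the index, and the search term is then
matched literally against those forms. The Python sketch below is a toy
model of that idea only. It is not Orbis's actual implementation (which is
not known here), the function names are invented, and case handling is
ignored.

```python
import unicodedata


def strip_accents(word):
    """Remove combining marks, so 'Künstler' becomes 'Kunstler'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_keys(word, accents):
    """Forms under which a word is indexed in this toy model."""
    if accents == 0:   # right box only: stripped form only
        return {strip_accents(word)}
    if accents == 1:   # left box only (or neither): word as written
        return {word}
    if accents == 2:   # both boxes: word as written plus stripped form
        return {word, strip_accents(word)}
    raise ValueError("accents must be 0, 1 or 2")


def search(prefix, words, accents):
    """Model of a 'prefix*' search: literal match against the index keys."""
    return [
        w for w in words
        if any(key.startswith(prefix) for key in index_keys(w, accents))
    ]


words = ["Kunstwerke", "K\u00fcnstler"]   # "Künstler" with a precomposed ü
for mode in (0, 1, 2):
    print(mode, search("Kunst", words, mode), search("K\u00fcnst", words, mode))
```

In this model the search term itself is never stripped, which is why
"Künst*" finds nothing under Accents=0.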

Further notes:

"Accents=1" seems to be the default value on my system but I don't know
if this is a factory setting or something I introduced. I don't know
where the default value is stored.

Highlighting is not altogether consistent with the index mode. At least,
under "Accents=1" "Kunst" does not RETRIEVE entries based on "Künstler"
but it will also HIGHLIGHT "Künstler" if it happens to be in a retrieved
entry.

My results apply with any confidence only to the 1-byte versions of
accented characters. (I try to stick to 1-byte versions as best I can --
other versions do not travel well to other programs -- this is still
true at least for taking them to MS Word both via the Windows clipboard
copy/paste and RTF conversion.) I did deliberately put the word "lálá"
with the larger (4-bytes they seem to be) versions of a-acute in a
textbase file. This textbase has ACCENTS=2.  Search for "lala" retrieved
the entry (as it should have in this mode), as did a search for "lálá"
with the same 4-byte á's. However, a search for "lálá" with 1-byte á's
retrieved nothing. So there is another reason to avoid those multibyte
accented characters. (I'm sure NB has some reasons to have them in the
first place, and it would be nice to learn of those.) 
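
The "lálá" result is what one would expect if the two encodings of á are
simply different character sequences to the index. Nota Bene's internal
multibyte encoding is not known here, so the Python sketch below uses an
analogy from Unicode instead: a precomposed á (one code point) versus a plus
a combining acute (two code points). It illustrates the mismatch only and
makes no claim about NB's own byte sequences.

```python
import unicodedata


def strip_marks(text):
    """Drop combining marks after decomposing, so both forms give 'lala'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Stand-ins for the two encodings (an analogy, not NB's real encodings):
short_form = "l\u00e1l\u00e1"                          # precomposed á
long_form = unicodedata.normalize("NFD", short_form)   # a + combining acute

# Toy index in the spirit of ACCENTS=2: the word as stored (long form)
# plus its accent-stripped form.
keys = {long_form, strip_marks(long_form)}


def found(query):
    """Literal lookup: the query must equal one of the indexed forms."""
    return query in keys


print(found("lala"))       # stripped form is indexed
print(found(long_form))    # same encoding as the stored word
print(found(short_form))   # looks identical on screen, different sequence
```

Under these assumptions the short-form query fails even though it looks
identical on screen, which matches the behaviour described above.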

j-p takala


