I checked the KWE output for all languages with respect to patterns of unwanted keywords.
Here are some suggestions. Please remove:
0. All words which appear more than once
1. All strings of lower-case characters and/or punctuation signs up to a length of three. Remark: I doubt that a good keyword has only three (lower-case) characters or fewer. Upper-case strings should be excluded from this rule, cf. XP, XML etc. Any other counterexamples?
2. All strings which contain digits. Narrower version: remove only strings which include digits but do not end in one (in case we would like to keep Win2000, Win98 etc.)
3. Strings which contain one of the following punctuation characters --> . & , / ( ) [ ] % * < > + : " _ Remark: the following punctuation signs can be part of words, and words containing them should be kept: §, ' (single quote), - (dash)
These three rules should do a decent job
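The rules above could be sketched roughly as follows. This is only an illustration, not the actual KWE code: the function names, the `allow_trailing_digit` flag and the reading of rule 0 as "keep the first occurrence, drop the repeats" are my assumptions.

```python
# Characters from rule 3; §, single quote and dash are deliberately absent.
FORBIDDEN_PUNCTUATION = set('.&,/()[]%*<>+:"_')


def is_short_lowercase(word):
    """Rule 1: at most three characters, each a lower-case letter or a
    non-alphanumeric sign. Upper-case strings such as XP or XML pass."""
    return len(word) <= 3 and all(c.islower() or not c.isalnum() for c in word)


def has_unwanted_digits(word, allow_trailing_digit=False):
    """Rule 2: the word contains a digit. With allow_trailing_digit=True
    (hypothetical flag), words ending in a digit, e.g. Win2000, are tolerated."""
    if not any(c.isdigit() for c in word):
        return False
    return not (allow_trailing_digit and word[-1].isdigit())


def has_forbidden_punctuation(word):
    """Rule 3: the word contains one of the listed punctuation characters."""
    return any(c in FORBIDDEN_PUNCTUATION for c in word)


def filter_keywords(keywords, allow_trailing_digit=False):
    """Apply rules 0-3 and return the surviving keywords in input order."""
    seen = set()
    kept = []
    for word in keywords:
        # Rule 0 (assumed reading): skip repeats of an already seen word.
        if word in seen:
            continue
        seen.add(word)
        if (is_short_lowercase(word)
                or has_unwanted_digits(word, allow_trailing_digit)
                or has_forbidden_punctuation(word)):
            continue
        kept.append(word)
    return kept
```

For example, `filter_keywords(["XML", "the", "XML", "Win2000", "e-mail", "foo_bar"])` would keep only XML and e-mail; with `allow_trailing_digit=True`, Win2000 would be kept as well.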
Logged In: YES
user_id=1604795
Originator: YES
Alex, I read in your WP4 progress report that you filter out duplicated keywords in ILIAS.
Do you consider this the definitive solution? There have been reasons for having keywords duplicated,
which have to do with the fact that we keep track of the inflected / attested forms for each
keyword. Even if we do not use this information currently, we might do so later or in another
environment. The other steps, like filtering out short and / or malformed strings, as mentioned above,
should be done in the keyword extractor, because such strings are generally of no use.
Lukasz, what do you think?
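To illustrate the concern: duplicates could be collapsed without losing the attested forms if the forms are collected per keyword. A minimal sketch, assuming the extractor emits (keyword, attested form) pairs; that input shape and the name `merge_duplicates` are hypothetical and not taken from the KWE or ILIAS code.

```python
def merge_duplicates(entries):
    """Collapse repeated keywords into one entry each, while keeping every
    distinct attested form in order of first appearance.

    entries: iterable of (keyword, attested_form) pairs (assumed format).
    """
    merged = {}
    for keyword, form in entries:
        forms = merged.setdefault(keyword, [])
        if form not in forms:
            forms.append(form)
    return merged
```

With this shape, ILIAS could display `merged.keys()` as the deduplicated list, while the inflected forms stay available for later use.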
Logged In: YES
user_id=520433
Originator: NO
Lukasz implemented Lothar's suggestion. KWE output should be further improved, but I am decreasing the priority.