
#32 Fine tuning the output of KWE

open
4
2007-08-30
2007-06-27
No

I checked the KWE output for all languages with respect to patterns of unwanted keywords.
Here are some suggestions. Please remove:
0. All words which appear more than once.
1. All strings of lower-case characters and/or punctuation signs up to a length of three. Remark: I doubt that a good keyword has only three (lower-case) characters or fewer. Upper-case strings should be exempt here, cf. XP, XML etc. Any other counterexamples?
2. All strings which contain digits. Narrower version: remove only strings which include digits but do not end in one (in case we would like to keep Win2000, Win98 etc.).
3. Strings which contain one of the following punctuation characters: ., &, ,, /, (, ), [, ], %, *, <, >, +, :, ", _ Remark: the following punctuation signs can be part of words, and words containing them should be kept: §, ' (single quote), - (dash).

These rules should do a decent job.
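
The rules above could be sketched roughly as follows. This is only an illustration in Python, not the actual KWE code: the names `filter_keywords`, `FORBIDDEN` and `keep_trailing_digit` are hypothetical, and rule 0 is read here as "keep the first occurrence and drop later repeats", which is an assumption.

```python
# Hypothetical sketch of the proposed keyword filter (not the KWE implementation).

# Rule 3: punctuation characters that disqualify a keyword.
# The characters §, ' and - are deliberately absent, so words with them are kept.
FORBIDDEN = set('.&,/()[]%*<>+:"_')


def is_short_lowercase(word: str) -> bool:
    """Rule 1: at most three characters, all lower case and/or punctuation."""
    return len(word) <= 3 and all(ch.islower() or not ch.isalnum() for ch in word)


def has_unwanted_digits(word: str, keep_trailing_digit: bool = False) -> bool:
    """Rule 2: contains a digit; optionally spare words that end in a digit."""
    if not any(ch.isdigit() for ch in word):
        return False
    if keep_trailing_digit and word[-1].isdigit():
        return False  # narrower version: Win2000, Win98 etc. survive
    return True


def filter_keywords(keywords, keep_trailing_digit: bool = False):
    """Apply rules 0-3 and return the surviving keywords in original order."""
    seen = set()
    kept = []
    for word in keywords:
        # Rule 0 (assumed reading): drop repeats, keep the first occurrence.
        if word in seen:
            continue
        seen.add(word)
        if is_short_lowercase(word):
            continue
        if has_unwanted_digits(word, keep_trailing_digit):
            continue
        if any(ch in FORBIDDEN for ch in word):
            continue
        kept.append(word)
    return kept


if __name__ == "__main__":
    sample = ["corpus", "corpus", "xml", "XML", "Win2000",
              "3rd-party", "e-learning", "foo/bar", "user's"]
    print(filter_keywords(sample))
    print(filter_keywords(sample, keep_trailing_digit=True))
```

With `keep_trailing_digit=True` the narrower version of rule 2 applies, so a string like Win2000 is kept while one like 3rd-party is still removed.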

Discussion

  • Lothar Lemnitzer

    • priority: 5 --> 8
     
  • Lothar Lemnitzer

    Logged In: YES
    user_id=1604795
    Originator: YES

    Alex, I read in your WP4 progress report that you filter out duplicated keywords in ILIAS.
    Do you consider this the definitive solution? There have been reasons for having keywords duplicated,
    which have to do with the fact that we keep track of the inflected / attested forms for each
    keyword. Even if we do not use this information currently, we might do so later or in another
    environment. The other steps mentioned above, like filtering out short and/or malformed strings,
    should be done in the keyword extractor, because such strings are generally of no use.

    Lukasz, what do you think?

     
  • Miroslav Spousta

    • priority: 8 --> 5
     
  • Miroslav Spousta

    Logged In: YES
    user_id=520433
    Originator: NO

    Lukasz implemented Lothar's suggestion. KWE output should be further improved, but I am decreasing the priority.

     
  • Alex Killing

    Alex Killing - 2007-08-30
    • priority: 5 --> 4
     

