<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en" xmlns="http://www.w3.org/2005/Atom"><title>Recent changes to bugs</title><link href="https://sourceforge.net/p/simmetrics/bugs/" rel="alternate"/><link href="https://sourceforge.net/p/simmetrics/bugs/feed.atom" rel="self"/><id>https://sourceforge.net/p/simmetrics/bugs/</id><updated>2014-12-07T17:12:31.851000Z</updated><subtitle>Recent changes to bugs</subtitle><entry><title>#7 Loop in TokeniserWhitespace.tokenizeToArrayList</title><link href="https://sourceforge.net/p/simmetrics/bugs/7/?limit=25#0a82" rel="alternate"/><published>2014-12-07T17:12:31.851000Z</published><updated>2014-12-07T17:12:31.851000Z</updated><author><name>mpkorstanje</name><uri>https://sourceforge.net/u/mpkorstanje/</uri></author><id>https://sourceforge.net7e70c80474ae78a3449c85a9f21d6c6c98e3e372</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;I have a fixed version here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/mpkorstanje/simmetrics/blob/master/src/main/java/uk/ac/shef/wit/simmetrics/tokenisers/TokeniserCSVBasic.java" rel="nofollow"&gt;https://github.com/mpkorstanje/simmetrics/blob/master/src/main/java/uk/ac/shef/wit/simmetrics/tokenisers/TokeniserCSVBasic.java&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;But you'll have to build from source. I'm doing an overhaul of the whole thing.&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>Loop in TokeniserWhitespace.tokenizeToArrayList</title><link href="https://sourceforge.net/p/simmetrics/bugs/7/" rel="alternate"/><published>2014-02-12T17:13:58.663000Z</published><updated>2014-02-12T17:13:58.663000Z</updated><author><name>Mitch Claborn</name><uri>https://sourceforge.net/u/mclaborn/</uri></author><id>https://sourceforge.net5a993f8cfc9c0853ed070416777b11a2ed21e2db</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;Never ending loop with specific inputs in uk.ac.shef.wit.simmetrics.tokenisers.TokeniserWhitespace.tokenizeToArrayList&lt;/p&gt;
&lt;p&gt;sample program:&lt;br /&gt;
      public static void main(String[] args) {&lt;br /&gt;
        System.out.println("start");&lt;br /&gt;
        InterfaceStringMetric l_metric = new MongeElkan();&lt;br /&gt;
        String l_address_a = "POST OFFICE HOLD &amp;amp; PHONE";&lt;br /&gt;
        String l_address_b = "2665 - C   N HIGHLAND AVE";&lt;br /&gt;
        float l_score = l_metric.getSimilarity(l_address_a, l_address_b);&lt;br /&gt;
        System.out.println("end, score=" + l_score);&lt;br /&gt;
      }&lt;/p&gt;
&lt;p&gt;stack trace:&lt;br /&gt;
    "main" prio=10 tid=0x00007ff70800d800 nid=0x5451 runnable &lt;span&gt;[0x00007ff70ead2000]&lt;/span&gt;&lt;br /&gt;
       java.lang.Thread.State: RUNNABLE&lt;br /&gt;
        at uk.ac.shef.wit.simmetrics.tokenisers.TokeniserWhitespace.tokenizeToArrayList(TokeniserWhitespace.java:121)&lt;br /&gt;
        at uk.ac.shef.wit.simmetrics.similaritymetrics.MongeElkan.getSimilarity(MongeElkan.java:170)&lt;br /&gt;
        at com.mm.server.inventory.app.MergeWorklistFunctions.main(MergeWorklistFunctions.java:213)&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>String tokeniser break down method runs into deadlock</title><link href="https://sourceforge.net/p/simmetrics/bugs/6/" rel="alternate"/><published>2011-01-07T11:02:39Z</published><updated>2011-01-07T11:02:39Z</updated><author><name>Zubair Ahmed</name><uri>https://sourceforge.net/u/mzubairahmed/</uri></author><id>https://sourceforge.netfba415302606a50cba7eb289693e61ca322e42c6</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;TokeniserWhitespace.tokenizeToArrayList(String input) runs into a deadlock when the input string contains two or more consecutive whitespace characters.&lt;br /&gt;
I've fixed it in the version 1.6 source code, but I can't commit to the SVN repository you provide.&lt;/p&gt;
&lt;p&gt;Thanks&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>TagLink constructor message massive performance impact</title><link href="https://sourceforge.net/p/simmetrics/bugs/5/" rel="alternate"/><published>2010-04-20T13:47:48Z</published><updated>2010-04-20T13:47:48Z</updated><author><name>Anonymous</name><uri>https://sourceforge.net/u/userid-None/</uri></author><id>https://sourceforge.net7848087cbc5feb21a3f56a85ad69a4e30e6f7567</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;The (misspelled) performance message "WARNING - this metric is not recomended for fast processing..." is causing a massive performance issue when used in a multithreaded environment.&lt;/p&gt;
&lt;p&gt;If several threads each create a TagLink algorithm instance, they contend for the System.out PrintStream, which queues all the threads behind each other until the performance message is output.&lt;/p&gt;
&lt;p&gt;I have moved (and spellchecked!) the warning to a static block so it is only output when the class is loaded, not on every construction, and I see a tenfold performance improvement. I have also made the code stricter around the use of generics to avoid unnecessary casts.&lt;/p&gt;
&lt;p&gt;The irony is not lost on me that a message warning of poor performance is such a massive bottleneck!&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>Non-breaking space causes infinite loop</title><link href="https://sourceforge.net/p/simmetrics/bugs/4/" rel="alternate"/><published>2009-09-21T13:59:22Z</published><updated>2009-09-21T13:59:22Z</updated><author><name>Craig</name><uri>https://sourceforge.net/u/scytayl/</uri></author><id>https://sourceforge.net68e648776949f60b93111547bff13945128adf9d</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;The method tokenizeToArrayList in TokeniserWhitespace.java uses two methods to look at whitespace characters: Character.isWhitespace() and the characters in its delimiters field: "\r\n\t \u00A0". Unfortunately, isWhitespace() does not regard a non-breaking space (\u00A0) as a whitespace character, which ends up causing an infinite loop when tokenising a string.&lt;/p&gt;
&lt;p&gt;The fix is to test for a non-breaking space character when testing isWhitespace() on line 116:&lt;br /&gt;
if (Character.isWhitespace(ch) || (int)ch == 160) {&lt;br /&gt;
curPos++;&lt;br /&gt;
}&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>Bug with character encoding</title><link href="https://sourceforge.net/p/simmetrics/bugs/3/" rel="alternate"/><published>2008-07-07T16:28:07Z</published><updated>2008-07-07T16:28:07Z</updated><author><name>Anonymous</name><uri>https://sourceforge.net/u/userid-None/</uri></author><id>https://sourceforge.net2eeaa3d28fdfcb9b0c3cad43a5822f6e3d069c39</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;Some of the metrics (for example BlockDistance) fail if one of the strings contains Unicode character 160 (non-breaking space).&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>Euclidean Distance always returns 0.0</title><link href="https://sourceforge.net/p/simmetrics/bugs/2/" rel="alternate"/><published>2007-06-28T21:18:19Z</published><updated>2007-06-28T21:18:19Z</updated><author><name>Anonymous</name><uri>https://sourceforge.net/u/userid-None/</uri></author><id>https://sourceforge.nete9e7fa17c129507913cd5b54a868eee2b2186701</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;I recently installed SimMetrics v1.6 and I'm getting a strange result from the Euclidean distance function.&lt;/p&gt;
&lt;p&gt;I just downloaded the jar and the source and started playing with the SimpleExample.java file. With the Euclidean distance, it always returns 0.0 unless the two input strings are equal.&lt;/p&gt;
&lt;p&gt;abb aba return 0.0&lt;/p&gt;
&lt;p&gt;abc abd return 0.0&lt;/p&gt;
&lt;p&gt;abc abc return 1.0&lt;/p&gt;
&lt;p&gt;I tried with a lot of Strings of different sizes, and had the same result.&lt;/p&gt;
&lt;p&gt;Luis Ibáñez&lt;br /&gt;
ldibanyez@gmail.com&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>Jaro implementation</title><link href="https://sourceforge.net/p/simmetrics/bugs/1/" rel="alternate"/><published>2007-01-02T16:18:14Z</published><updated>2007-01-02T16:18:14Z</updated><author><name>Anonymous</name><uri>https://sourceforge.net/u/userid-None/</uri></author><id>https://sourceforge.net86b803325080a992d0b66f098c5c0c3222237957</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;I’ve found two things wrong in the implementation of the Jaro algorithm, and since I am not familiar with CVS I thought I should post them here.&lt;/p&gt;
&lt;p&gt;1) In the computation of the distance, the line should read as follows so that the value is rounded properly:&lt;br /&gt;
this.Distance = Math.Min(string1.Length, string2.Length) / 2 + Math.Min(string1.Length, string2.Length) % 2;&lt;/p&gt;
&lt;p&gt;2) To avoid the left vs right distance difference that sometimes shows up, we have to edit the following line:&lt;br /&gt;
//compare char with range of characters to either side&lt;br /&gt;
for (int j = Math.Max (0, i - distance); !foundIt &amp;amp;&amp;amp; j &amp;lt;= Math.Min(i + distance, string2.Length - 1 ); j++)&lt;/p&gt;
&lt;p&gt;Keep up the good work!&lt;/p&gt;&lt;/div&gt;</summary></entry></feed>