<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent posts to Discussion</title><link>https://sourceforge.net/p/web-harvest/discussion/</link><description>Recent posts to Discussion</description><atom:link href="https://sourceforge.net/p/web-harvest/discussion/feed.rss" rel="self"/><language>en</language><lastBuildDate>Wed, 13 Sep 2023 09:02:20 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/web-harvest/discussion/feed.rss" rel="self" type="application/rss+xml"/><item><title>User manual page is not available anymore </title><link>https://sourceforge.net/p/web-harvest/discussion/694022/thread/25cb36021f/?limit=25#72ee</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;The user manual page is not available anymore: &lt;a href="https://web-harvest.sourceforge.net/usage.php" rel="nofollow"&gt;https://web-harvest.sourceforge.net/usage.php&lt;/a&gt; &lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Amalia Criznic</dc:creator><pubDate>Wed, 13 Sep 2023 09:02:20 -0000</pubDate><guid>https://sourceforge.net614b99f613331b37f6160b1e662053b5aeabde28</guid></item><item><title>Saving list in a var-def</title><link>https://sourceforge.net/p/web-harvest/discussion/591299/thread/ab05f703a0/?limit=25#cf10</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;I have the following code to retrieve data from a table in a web page:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;config&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;var-def&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"webpage"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;html-to-xml&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;http&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"http://somepage.www/tablepage"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/html-to-xml&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/var-def&amp;gt;&lt;/span&gt; 
    &lt;span class="nt"&gt;&amp;lt;loop&lt;/span&gt; &lt;span class="na"&gt;item=&lt;/span&gt;&lt;span class="s"&gt;"currPro"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;list&amp;gt;&lt;/span&gt;
           &lt;span class="nt"&gt;&amp;lt;xpath&lt;/span&gt; &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"//tr/td"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
             &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"webpage"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
           &lt;span class="nt"&gt;&amp;lt;/xpath&amp;gt;&lt;/span&gt;     
        &lt;span class="nt"&gt;&amp;lt;/list&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"currPro"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/var&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;/loop&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/config&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;I need to know how to write the loop logic so that all the elements of the list are saved in a single var-def variable.&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Adin Israel López Muñoz</dc:creator><pubDate>Mon, 08 Mar 2021 21:57:10 -0000</pubDate><guid>https://sourceforge.netc1859ad6fdd5e21aff7cc9aee6404b100505b41d</guid></item><item><title>I need to save all the elements of a list in a variable var-def</title><link>https://sourceforge.net/p/web-harvest/discussion/694022/thread/1d9776c9b2/?limit=25#916d</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;I have the following code to retrieve data from a table in a web page:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;config&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;var-def&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"webpage"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;html-to-xml&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;http&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"http://somepage.www/tablepage"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/html-to-xml&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/var-def&amp;gt;&lt;/span&gt; 
    &lt;span class="nt"&gt;&amp;lt;loop&lt;/span&gt; &lt;span class="na"&gt;item=&lt;/span&gt;&lt;span class="s"&gt;"currPro"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;list&amp;gt;&lt;/span&gt;
           &lt;span class="nt"&gt;&amp;lt;xpath&lt;/span&gt; &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"//tr/td"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
             &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"webpage"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
           &lt;span class="nt"&gt;&amp;lt;/xpath&amp;gt;&lt;/span&gt;     
        &lt;span class="nt"&gt;&amp;lt;/list&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"currPro"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/var&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;/loop&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/config&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;I need to know how to write the loop logic so that all the elements of the list are saved in the same var-def variable.&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Adin Israel López Muñoz</dc:creator><pubDate>Mon, 08 Mar 2021 21:53:19 -0000</pubDate><guid>https://sourceforge.net2ed2d44de5de56f27da663057fb03cac4541270e</guid></item><item><title>&lt;html-to-xml&gt; gives java heap space</title><link>https://sourceforge.net/p/web-harvest/discussion/694022/thread/374c0e83e8/?limit=25#7ec2</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hello Guys, &lt;/p&gt;
&lt;p&gt;I have a simple request to a web service that returns an XML document (about 100 MB), which is not that much from my point of view. When I try to clean the content with &amp;lt;html-to-xml&amp;gt; in order to apply the XSLT, it stops with a Java heap space error.&lt;br/&gt;
 I even set the JVM options -Xms750m -Xmx4048m, but the result is still the same.&lt;/p&gt;
&lt;p&gt;Do you have a solution for this?&lt;/p&gt;
&lt;p&gt;BR,&lt;br/&gt;
Amalia&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Amalia Criznic</dc:creator><pubDate>Sun, 10 Mar 2019 20:17:26 -0000</pubDate><guid>https://sourceforge.netaaf4d99225396414ee6ca7a4b00deb61f57a7806</guid></item><item><title>Cannot web scrape this url </title><link>https://sourceforge.net/p/web-harvest/discussion/591299/thread/10726e3a6d/?limit=25#3f96</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi all&lt;br/&gt;
I need to crawl the page at this URL:&lt;br/&gt;
&lt;a href="https://stomp.straitstimes.com/singapore-seen/moe-to-take-action-against-lewd-instagram-account-targeting-junior-college-girls" rel="nofollow"&gt;https://stomp.straitstimes.com/singapore-seen/moe-to-take-action-against-lewd-instagram-account-targeting-junior-college-girls&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We get an error at the line scraper.execute();&lt;/p&gt;
&lt;p&gt;private Map&amp;lt;String, Object&amp;gt; crawlArticleData(String sourceUrl, Resource configFile) throws IOException {&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;    InputSource configIn = new InputSource(configFile.getInputStream());
    //Execute Web Harvest process to extract content
    ScraperConfiguration config = new ScraperConfiguration(configIn);

    Scraper scraper = null;
    String articleContent = "";
    List&amp;lt;String&amp;gt; imageUrls=null;
    try {

        scraper = new Scraper(config, "");
        //Config timeout for httpClient used in Scraper
        org.apache.commons.httpclient.HttpClient scraperHttpClient = scraper.getHttpClientManager().getHttpClient();
        scraperHttpClient.getParams().setParameter("http.socket.timeout", new Integer(SO_TIMEOUT));
        scraperHttpClient.getParams().setParameter("http.connection.timeout", new Integer(CONNECTION_TIMEOUT));

        scraper.addVariableToContext("articleUrl", sourceUrl);
        scraper.execute();

        //get article content
        Variable articleContentVariable = scraper.getContext().getVar("articleContent");
        if (articleContentVariable == null || articleContentVariable.toBinary() == null) {
            logger.debug("Fail to extract body content for " + sourceUrl);
            return null;
        }

        articleContent = new String(articleContentVariable.toBinary(), "UTF-8");
        if ( !org.apache.commons.lang.StringUtils.isBlank( articleContent ) &amp;amp;&amp;amp; articleContent.length() &amp;gt; 2000000 )
        {
            logger.info( "Reject article because content is too large. Article size: " + articleContent.length() );
            throw new IOException();
        }

        Variable imageUrlVariable = scraper.getContext().getVar("imageUrl");

        imageUrls = new ArrayList&amp;lt;String&amp;gt;();

        if (imageUrlVariable != null) {
            String imageUrlsString = imageUrlVariable.toString();
            if (!StringUtils.isEmpty(imageUrlsString)) {
                imageUrls.addAll(Arrays.asList(imageUrlsString.split("\n")));
            }
        }
    }catch (Exception  ex ){
          logger.error( "Crawler error sourceUrl="+sourceUrl, ex);
    }finally{
        //clean up
        scraper.dispose();
        configIn.getByteStream().close();
    }

    Map&amp;lt;String, Object&amp;gt; articleData = new HashMap&amp;lt;String, Object&amp;gt;();
    articleData.put("articleContent", articleContent);
    articleData.put("imageUrls", imageUrls);

    if ( !org.apache.commons.lang.StringUtils.isBlank( articleContent ) &amp;amp;&amp;amp; articleContent.length() &amp;gt; 0 )
        logger.info( "Crawl completed. Article content size: " + articleContent.length() );
    else
        logger.info( "Crawl completed. " );

    return articleData;
}
&lt;/pre&gt;&lt;/div&gt;

&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">WEETAT</dc:creator><pubDate>Mon, 14 Jan 2019 09:14:07 -0000</pubDate><guid>https://sourceforge.net818e13c07a85893654a316c76be327bb9d2efaac</guid></item><item><title>Help parsing a javascript/JSON variable</title><link>https://sourceforge.net/p/web-harvest/discussion/591299/thread/59fbcda9b4/?limit=25#a1ba</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi, &lt;/p&gt;
&lt;p&gt;I'm new to Web-Harvest; any help you can provide would be much appreciated.&lt;/p&gt;
&lt;p&gt;I have a page that outputs JavaScript containing a JS variable, domainList. The variable's value is JSON, so I assume I can convert the JSON to XML and then use XPath to select the items/name values. I'm looking to build a list of the name values.&lt;/p&gt;
&lt;p&gt;Here is the actual JavaScript text I need to parse. Thanks in advance.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;lt;script type="text/javascript"&amp;gt;
             if (self != top) {
                 top.location = self.location;
             }
             require(["dp/layers/loginlayer","dojo/domReady!"],function(){
                require(["dojo/parser", "dp/Login", "dojo/ready"], function(parser, Login, ready){
                  ready(function(){

                    parser.parse();

                    var domainList = {
                        "identifier": "id",
                        "items": [
                        { "name": "default", "id": "default" }, { "name": "dp_Common", "id": "dp_Common" }, { "name": "hr_etime", "id": "hr_etime" }, { "name": "hr_us_ew2", "id": "hr_us_ew2" }, { "name": "isc_tools_mfg_omcs", "id": "isc_tools_mfg_omcs" }, { "name": "mq_Adapter", "id": "mq_Adapter" }, { "name": "partnerworld_websvc_bpw", "id": "partnerworld_websvc_bpw" }, { "name": "sc_prod", "id": "sc_prod" }
                        ]
                    };

                    login = new Login();
                    login.startup(domainList,"XI52",
                        "b03eixmlapp005",
                        "",
                        "",
                        "XI52.7.6.0.9");
                  });
                });
             });
            &amp;lt;/script&amp;gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Rick Goncalves</dc:creator><pubDate>Mon, 01 Oct 2018 22:37:43 -0000</pubDate><guid>https://sourceforge.net4a430833e8bb70f42f0b6eafa35f0da5588cd4f4</guid></item><item><title>Invoke WebHarvest function from Java function</title><link>https://sourceforge.net/p/web-harvest/discussion/694022/thread/169dd48e/?limit=25#c16f</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;I have created a Web-Harvest function, and I am able to invoke it from Web-Harvest code. My challenge is that I need to invoke that Web-Harvest function from a Java function. Is it possible? For example, consider this:&lt;br/&gt;
&lt;strong&gt;Web-Harvest method&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt; &lt;span class="nt"&gt;&amp;lt;function&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"testing"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
 &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;![CDATA[&lt;/span&gt;
&lt;span class="cp"&gt;   // code block&lt;/span&gt;
&lt;span class="cp"&gt;   ]]&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
 &lt;span class="nt"&gt;&amp;lt;/function&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Java method&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;![CDATA[&lt;/span&gt;
&lt;span class="cp"&gt;  function jMethod(){&lt;/span&gt;
&lt;span class="cp"&gt;    testing(); // need to call that web harvest method here&lt;/span&gt;
&lt;span class="cp"&gt;  }&lt;/span&gt;
&lt;span class="cp"&gt;  ]]&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">sunilprabakar</dc:creator><pubDate>Tue, 07 Feb 2017 11:58:08 -0000</pubDate><guid>https://sourceforge.net2d3c27868c0c0ac75bd1bbe7e8c719cbba638750</guid></item><item><title>Orchid Tor Proxy</title><link>https://sourceforge.net/p/web-harvest/discussion/694022/thread/f0c39a3e/?limit=25#2eff</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;TWIMC: &lt;/p&gt;
&lt;p&gt;I'm trying to use WebHarvest with Orchid (https://subgraph.com/orchid/index.en.html), a Tor proxy.&lt;/p&gt;
&lt;p&gt;I started Orchid on my local machine, and it's supposed to create a proxy at port 9150.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;D:\Programs\webharvest&amp;gt;java -jar orchid-1.0.0.jar
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.TorClient start
INFO: Starting Orchid (version: 1.0.0.8c6b26d)
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.directory.DirectoryImpl loadFromStore
INFO: Loading cached network information from disk
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.directory.DirectoryImpl loadFromStore
INFO: Loading certificates
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.circuits.CircuitCreationTask checkCircuitsForCreation
INFO: Cannot build circuits because we don't have enough directory information
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.directory.DirectoryImpl loadFromStore
INFO: Loading consensus
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.directory.consensus.ConsensusDocumentImpl verifySingleAuthority
WARNING: Consensus signed by unrecognized directory authority: 0232af901c31a04ee9848595af9bb7620d4c5b2e
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.directory.consensus.ConsensusDocumentImpl verifySingleAuthority
WARNING: Consensus signed by unrecognized directory authority: 23d15d965bc35114467363c165c4f724b64b4f66
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.directory.DirectoryImpl loadFromStore
INFO: Loading microdescriptor cache
Jan 14, 2017 11:58:52 AM com.subgraph.orchid.directory.DirectoryImpl loadFromStore
INFO: loading state file
&amp;gt;&amp;gt;&amp;gt; [ 80% ]: Connecting to the Tor network
&amp;gt;&amp;gt;&amp;gt; [ 85% ]: Finished Handshake with first hop
&amp;gt;&amp;gt;&amp;gt; [ 90% ]: Establishing a Tor circuit
&amp;gt;&amp;gt;&amp;gt; [ 100% ]: Done
Tor is ready to go!
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;I tried adding the proxy settings through WebHarvest &amp;gt; Settings and at the command line with parameters proxyhost=127.0.0.1 proxyport=9050. When running the application, the connection doesn't seem to go through the proxy. &lt;/p&gt;
&lt;p&gt;Any help would be greatly appreciated.&lt;/p&gt;
&lt;p&gt;Regards,&lt;/p&gt;
&lt;p&gt;M.&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">parker20121</dc:creator><pubDate>Sat, 14 Jan 2017 17:15:45 -0000</pubDate><guid>https://sourceforge.net14d1680000920cce76c029106e1351875956a7b9</guid></item><item><title>Cannot include path=functions.xml</title><link>https://sourceforge.net/p/web-harvest/discussion/591298/thread/0bd49a70/</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;function.xml&lt;br/&gt;
Please download and extract this file&lt;br/&gt;
&lt;a href="http://web-harvest.sourceforge.net/download/webharvest2b1-project.zip"&gt;http://web-harvest.sourceforge.net/download/webharvest2b1-project.zip&lt;/a&gt;&lt;br/&gt;
or use this code:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;

&amp;lt;config&amp;gt;
 &amp;lt;!--
    Download multi-page list of items.

    @param pageUrl       - URL of starting page
    @param itemXPath     - XPath expression to obtain single item in the list
    @param nextXPath     - XPath expression to URL for the next page
    @param maxloops      - maximum number of pages downloaded

    @return list of all downloaded items
 --&amp;gt;
&lt;span class="nt"&gt;&amp;lt;function&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"download-multipage-list"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;return&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;while&lt;/span&gt; &lt;span class="na"&gt;condition=&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="cp"&gt;${&lt;/span&gt;&lt;span class="n"&gt;pageUrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="cp"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="na"&gt;maxloops=&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="cp"&gt;${&lt;/span&gt;&lt;span class="n"&gt;maxloops&lt;/span&gt;&lt;span class="cp"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"i"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;empty&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt;&lt;span class="err"&gt;-def&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;html&lt;/span&gt;&lt;span class="err"&gt;-to-xml&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;http&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="cp"&gt;${&lt;/span&gt;&lt;span class="n"&gt;pageUrl&lt;/span&gt;&lt;span class="cp"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"/&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;/html-to-xml&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;/var-def&amp;gt;&lt;/span&gt;

                &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt;&lt;span class="err"&gt;-def&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"nextLinkUrl"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;xpath&lt;/span&gt; &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="cp"&gt;${&lt;/span&gt;&lt;span class="n"&gt;nextXPath&lt;/span&gt;&lt;span class="cp"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"content"/&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;/xpath&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;/var-def&amp;gt;&lt;/span&gt;

                &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt;&lt;span class="err"&gt;-def&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"pageUrl"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;template&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;${&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fullUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pageUrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nextLinkUrl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="cp"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/template&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;/var-def&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/empty&amp;gt;&lt;/span&gt;

            &lt;span class="nt"&gt;&amp;lt;xpath&lt;/span&gt; &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="cp"&gt;${&lt;/span&gt;&lt;span class="n"&gt;itemXPath&lt;/span&gt;&lt;span class="cp"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;var&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"content"/&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/xpath&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/while&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/return&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/function&amp;gt;&lt;/span&gt;
&amp;lt;/config&amp;gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">jomy nn</dc:creator><pubDate>Fri, 15 Jul 2016 09:13:02 -0000</pubDate><guid>https://sourceforge.net5092f4db387f94c61ffc84f35390bb25051ef750</guid></item><item><title>Latest Build for WH</title><link>https://sourceforge.net/p/web-harvest/discussion/591298/thread/7d4d4625/</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Team,&lt;/p&gt;
&lt;p&gt;I have a scenario where I need to log in (by sending a POST) to a particular website and download a file, but the site uses redirection (a 302 return code).&lt;/p&gt;
&lt;p&gt;When I tried the WH 2.0 build from &lt;a href="http://web-harvest.sourceforge.net/download.php"&gt;http://web-harvest.sourceforge.net/download.php&lt;/a&gt;, it wasn't working; it returned the same login page as the response.&lt;/p&gt;
&lt;p&gt;So I checked out the trunk code from &lt;a href="https://sourceforge.net/p/web-harvest/code/HEAD/tree/trunk/"&gt;https://sourceforge.net/p/web-harvest/code/HEAD/tree/trunk/&lt;/a&gt; and created a local build, and there it executed correctly.&lt;/p&gt;
&lt;p&gt;I just wanted to know: where can I find the latest build of the code on trunk? Or when are you planning to release the next build?&lt;/p&gt;
&lt;p&gt;Regards,&lt;br/&gt;
Malay&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Malay Shah</dc:creator><pubDate>Thu, 14 Jul 2016 08:14:47 -0000</pubDate><guid>https://sourceforge.netd3a56737ca35eb5e0caa2b48d90af322ca4bbdde</guid></item></channel></rss>