Recent changes to feature-requests

#7 a way to get complete path

Sebastian Pipping — Sun, 04 Oct 2015 21:34:28 -0000

Ticket moved from /p/uriparser/bugs/26/

#6 URL segments and UTF8 support for REST API

Sebastian Pipping — Sat, 07 Feb 2015 22:06:26 -0000

Hi!

For UTF-8 maybe check feature request #5 (http://sourceforge.net/p/uriparser/feature-requests/5/).

About a helper accessing a path segment by index: I have limited time and this is too low a priority for me right now, to be honest.

About the simple example of a test: I'm not sure yet what exactly you're asking for. If there are open questions about usage, I can offer support via voice chat.

Best, Sebastian

#6 URL segments and UTF8 support for REST API

Mircho — Sat, 07 Feb 2015 18:54:44 -0000

First of all, thank you for all your efforts in creating uriparser.
I am trying to use it in an application, a server based on Cesanta Mongoose. I want to create a meaningful REST API. For this I need to handle URLs of the type
http://server/api/mpd/playlists/интернетрадио
There are two things that would make it easier to implement the REST API:
1. Having a facility to access a path segment by index (just a convenience).
2. Having a way to process the UTF-8 parts of the URL without going in and out of multibyte, if possible.
Even a simple example of a test in the test suite would be enough.
Thanks
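
To make the first request concrete, here is a minimal sketch of a "segment by index" helper. It is standalone C working on a plain path string, not part of uriparser; the names SegmentRange and path_segment_at are hypothetical. As far as I know, uriparser exposes the parsed path as a linked list of segments, so an equivalent helper there would walk that list instead. Segments are returned as raw byte ranges, so percent-escapes and UTF-8 bytes pass through untouched.

```c
#include <stddef.h>

/* Hypothetical helper type: a half-open byte range [first, afterLast). */
typedef struct {
    const char *first;
    const char *afterLast;
} SegmentRange;

/*
 * Finds the index-th segment (0-based) of a path such as "/api/mpd/playlists/x".
 * Returns 1 and fills *out on success, 0 if the path has fewer segments.
 * The path is expected to hold only the path part (no query or fragment).
 */
static int path_segment_at(const char *path, size_t index, SegmentRange *out)
{
    const char *p = path;
    if (*p == '/')
        p++; /* skip the leading slash of an absolute path */
    for (;;) {
        const char *end = p;
        while (*end != '\0' && *end != '/')
            end++;
        if (index == 0) {
            out->first = p;
            out->afterLast = end;
            return 1;
        }
        if (*end == '\0')
            return 0; /* ran out of segments */
        index--;
        p = end + 1;
    }
}
```

For the example URL above, index 3 would yield the byte range of the last segment, which still has to be unescaped and interpreted as UTF-8 by the caller.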

#5 Add support for UTF8

Sebastian Pipping — Fri, 23 Jan 2015 20:18:30 -0000

If you are speaking of the length of the verbatim content in field UriQueryListStructA.value, that's stored implicitly (as mentioned before).

If you are speaking of the length of content in field UriQueryListStructA.value after decoding to UTF-8, that would need internal UTF-8 decoding from uriparser, knowledge of the encoding in there etc.

Also, please note that adding fields to structures breaks ABI compatibility with prior releases so that's something library authors need to think twice about.

If you aim at storing the length to know space requirements up front, what might help is a (safe) heuristic:

A character in UTF-8 may take 1 to 4 bytes
A character in UTF-16 may take 2 to 4 bytes (see https://en.wikipedia.org/wiki/UTF-16#Examples for four-byte examples)
If the input was four-byte UTF-8 characters only, UTF-16 output would take x1/2 to x1 space in bytes, or "strlen(...) / 2 + 1" wchar_t elements at worst.
If the input was all single-byte UTF-8 characters, UTF-16 output would take x2 to 4x (worsened on purpose) space in bytes or "strlen(...) * 2 + 1" wchar_t elements at worst.
So "strlen(...) * 2 + 1" makes a safe worst case wchar_t character space calculation for later conversion to UTF-16.
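
As a rough illustration of the heuristic, the sketch below counts how many UTF-16 code units a valid UTF-8 string actually needs and compares that with the generous "strlen(...) * 2 + 1" bound. The function names are hypothetical, and the code assumes 16-bit code units (as with wchar_t on Windows) and well-formed input. A valid UTF-8 string never needs more UTF-16 code units than it has bytes, so the doubled bound leaves plenty of headroom.

```c
#include <stddef.h>
#include <string.h>

/*
 * Number of UTF-16 code units needed for a valid UTF-8 string,
 * terminator excluded. Input validity is assumed, not checked.
 */
static size_t utf16_units_needed(const char *utf8)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t units = 0;
    for (; *p != 0; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                  /* continuation byte, already counted */
        units += (*p >= 0xF0) ? 2 : 1; /* four-byte sequences need a surrogate pair */
    }
    return units;
}

/* The worst-case element count from the heuristic, terminator included. */
static size_t utf16_worst_case(const char *utf8)
{
    return strlen(utf8) * 2 + 1;
}
```

A caller could allocate utf16_worst_case(value) elements up front and convert afterwards without a second sizing pass.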

Is that what you are looking for?

#5 Add support for UTF8

Sebastian Pipping — Fri, 23 Jan 2015 19:44:08 -0000

> The only null byte in a UTF-8 string possible is an actual null character.
> Please check the table at https://en.wikipedia.org/wiki/UTF-8#Description .
> So UTF-8 can contain null bytes to the very same degree as ASCII.

OK. But to rule out any potential error while converting or processing, it would be very useful to have the size of the buffer that I will convert to UTF-16 or treat as a UTF-8 string. It's simple to add, isn't it? You parse the query anyway, so adding the size is not very difficult but would be very useful.

#5 Add support for UTF8

Sebastian Pipping — Thu, 22 Jan 2015 13:47:25 -0000

Hello again,

> UTF-8 String can contain zeroes! That's why the field "size" is needed.

The only null byte in a UTF-8 string possible is an actual null character.
Please check the table at https://en.wikipedia.org/wiki/UTF-8#Description .
So UTF-8 can contain null bytes to the very same degree as ASCII.
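
The point about null bytes can be checked with a small encoder that follows the bit layout in that table. This is only an illustrative sketch (the name utf8_encode is hypothetical): every lead byte of a multi-byte sequence has its top bits set and every continuation byte starts with binary 10, so the only way to produce a zero byte is to encode U+0000 itself.

```c
#include <stdint.h>

/*
 * Encodes one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written, or 0 if cp is above U+10FFFF.
 * Surrogate code points are not rejected in this sketch.
 */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                 /* 0xxxxxxx */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6)); /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12)); /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18)); /* 11110xxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+0440 (the Cyrillic letter "р") becomes the two bytes D1 80, neither of which is zero, so strlen keeps working on such strings.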

More importantly, the string in field "value" is not a /full/ UTF-8 string but uses single byte characters shared with ASCII, only. UTF-8 is what you have after the conversion in another buffer.

> So, now you haven't support for UTF8, at least for Windows.

It's the same for Linux.

> And I must manually convert from UTF8-bytes to UTF-16 as I did.

One way or another, an additional call to a converter function is needed. uriparser could ship with a UTF-8 to UTF-16 function, but I do not consider that to be uriparser's job. There are other libraries to do that, that you can easily use together with uriparser with more or less the same level of convenience.
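
For a sense of what such an additional converter call involves, here is a minimal portable sketch. It is not uriparser API and not a replacement for a real conversion library; the name utf8_to_utf16 is hypothetical, and the sketch does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Converts a zero-terminated UTF-8 string to zero-terminated UTF-16.
 * cap is the capacity of dst in 16-bit units, terminator included.
 * Returns the number of units written (terminator excluded), or
 * (size_t)-1 on malformed input or insufficient space.
 */
static size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    while (*p != 0) {
        uint32_t cp;
        int extra;
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        p++;
        for (; extra > 0; extra--, p++) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;    /* truncated sequence */
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;        /* not a Unicode scalar value */
        if (cp < 0x10000) {
            if (n + 1 >= cap)
                return (size_t)-1;    /* no room for unit plus terminator */
            dst[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= cap)
                return (size_t)-1;    /* no room for pair plus terminator */
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
        }
    }
    if (n >= cap)
        return (size_t)-1;
    dst[n] = 0;
    return n;
}
```

The input would be the unescaped query value in a char buffer; the output can be handed to APIs that expect 16-bit units.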

I'm happy to have a quick Skype/mumble/Phone/Jitsi about it some time, if you feel that could help. In that case, contact me offlist about a time and the medium of choice, please.

Best, Sebastian

#5 Add support for UTF8

rasjv — Thu, 22 Jan 2015 08:27:51 -0000

> Your options are...

So right now there is no UTF-8 support, at least on Windows, and I have to convert to UTF-8 manually, as I did.

> About adding length to UriQueryList: member "value" is a zero terminated string so it is carrying its length around implicitly.

"Member "value" is a zero terminated string" will not work for a UTF-8 string.
A UTF-8 string can contain zeroes! That's why the field "size" is needed.

#5 Add support for UTF8

Sebastian Pipping — Tue, 20 Jan 2015 20:45:46 -0000

My understanding is that you have a string in a wchar_t array that holds a percent-encoded UTF-8 string.

Your options are:

a) Convert the string into a char array, picking every second byte. If the URI is valid, that's a lossless operation. You then parse the URI using uriParseUriA, run uriDissectQueryMallocExA to dissect the query and run uriUnescapeInPlaceExA on the query parts. That should give valid UTF-8 if it was valid initially.

b) Keep the string in the wchar_t array, use uriParseUriW, then use uriDissectQueryMallocExW, copy the query parts into a char array picking every second byte (again lossless), run uriUnescapeInPlaceExA on those, again valid UTF-8 if it was initially.
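
The "picking every second byte" step in both options can also be written without relying on byte order. The following is a minimal sketch with a hypothetical helper name (narrow_ascii); it assumes the wide string holds only ASCII characters, which is the case for a valid percent-encoded URI, and refuses anything else instead of silently truncating it.

```c
#include <stddef.h>

/*
 * Copies a wide string into a char buffer, one element per character.
 * cap is the capacity of dst in bytes, terminator included.
 * Returns 0 on success, -1 if a character is outside ASCII or if
 * dst is too small.
 */
static int narrow_ascii(const wchar_t *src, char *dst, size_t cap)
{
    size_t i;
    for (i = 0; src[i] != L'\0'; i++) {
        if (i + 1 >= cap)
            return -1; /* no room for this character plus terminator */
        if ((unsigned long)src[i] > 0x7F)
            return -1; /* not lossless: character is outside ASCII */
        dst[i] = (char)src[i];
    }
    if (i >= cap)
        return -1;
    dst[i] = '\0';
    return 0;
}
```

The resulting char buffer can then go through the A-suffixed functions named above.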

About adding length to UriQueryList: member "value" is a zero terminated string so it is carrying its length around implicitly.

Best, Sebastian

#5 Add support for UTF8

rasjv — Tue, 20 Jan 2015 07:28:24 -0000

Hello, Sebastian.

I use uriParseUriW. The whole code is:

#include <wchar.h>
#include <uriparser/Uri.h>

int main(void)
{
    UriParserStateW state = {0};
    UriUriW uri = {0};
    UriQueryListW *queryList = NULL;
    int itemCount = 0;

    state.uri = &uri;
    if (uriParseUriW(&state, L"https://www.google.com/search?q=%D1%80%D0%B0%D0%B7%D0%B1%D0%BE%D1%80+URL+%D0%BD%D0%B0+%D0%BF%D0%B0%D1%80%D0%B0%D0%BC%D0%B5%D1%82%D1%80%D1%8B+C%2B%2B&ie=utf-8&oe=utf-8#q=parse+URL+C%2B%2B") != URI_SUCCESS)
    {
        /* Failure */
        uriFreeUriMembersW(&uri);
        return 1;
    }
    /* success: do something with uri */

    if (uriDissectQueryMallocW(&queryList, &itemCount, uri.query.first,
            uri.query.afterLast) != URI_SUCCESS)
    {
        /* Failure */
        uriFreeUriMembersW(&uri);
        return 1;
    }
    /* success: do something with queryList */
    if (queryList != NULL)
    {
        /* value of the first query item, "q" */
        const wchar_t *query1 = queryList->value;
        (void)query1;
    }

    uriFreeQueryListW(queryList);
    uriFreeUriMembersW(&uri);
    return 0;
}

UTF-8 characters are not double-byte characters; they can take from 1 to 4 bytes (up to 6 in the original UTF-8 design, before RFC 3629 restricted it).
UTF-16 (wchar_t on Windows) characters are indeed double-byte in most cases (in rare cases a UTF-16 character takes more than two bytes, but that is off-topic here).

I'll show you a screenshot so that this is clearer.
I have a query in UTF-8 but escaped in URL:
%D1%80%D0%B0%D0%B7%D0%B1%D0%BE%D1%80+URL+%D0%BD%D0%B0+%D0%BF%D0%B0%D1%80%D0%B0%D0%BC%D0%B5%D1%82%D1%80%D1%8B+C%2B%2B
This is a UTF-8 string not UTF-16! It means: разбор URL на параметры C++
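
To see how the escaped form maps back to UTF-8 bytes, here is a minimal hand-rolled percent-decoder. It is an illustration only, with hypothetical names; inside uriparser this job belongs to the unescape functions such as uriUnescapeInPlaceExA mentioned elsewhere in this thread. The sketch treats "+" as a space, which matches query strings like the one above.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Decodes %XX escapes and '+' in place and returns the new length in bytes.
 * Malformed escapes are copied through unchanged. Note that "%00" would
 * produce an embedded zero byte, so the returned length is authoritative.
 */
static size_t percent_decode_in_place(char *s)
{
    char *out = s;
    const char *in = s;
    while (*in != '\0') {
        int hi = -1, lo = -1;
        if (in[0] == '%'
                && (hi = hex_value((unsigned char)in[1])) >= 0
                && (lo = hex_value((unsigned char)in[2])) >= 0) {
            *out++ = (char)(hi * 16 + lo); /* one raw byte per escape */
            in += 3;
        } else {
            *out++ = (*in == '+') ? ' ' : *in;
            in++;
        }
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Each "%D1%80"-style pair of escapes turns into two raw bytes, which together form one Cyrillic letter in UTF-8; nothing in this step involves UTF-16.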

If you treat it as UTF-16 you will get what you can see on the screenshot highlighted in red:

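So I need to manually turn the query1 elements back into UTF-8 bytes and then convert them to UTF-16, with this code (for Windows):

wchar_t query1_utf16[256];
char query1_corrected[256];
int len, rez, i;
len = (int)wcslen(query1);
if (len > 255)
    len = 255; /* stay within both buffers */
/* each wchar_t holds one unescaped byte; keep only its low byte */
for (i = 0; i < len; i++)
    query1_corrected[i] = (char)query1[i];
/* leave one element free for the terminator */
rez = MultiByteToWideChar(CP_UTF8, 0, query1_corrected, len, query1_utf16, 255);
query1_utf16[rez] = 0;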

And we see exactly what we must see:

If I use your char functions (the ones ending in A), the result is the same, except that I get multibyte characters in "query1" and must manually convert this char string to UTF-16 to work with it on Windows. If you say "UTF-8 is supported already", then the "query1" string must be treated as a UTF-8 string. And on Windows it must be converted to UTF-16, because that is the default Unicode format for this OS.

It would also be very useful if you provided the size in bytes of such a UTF-8 string (it can be calculated during parsing), so that no additional calculations are needed. So I think a new property named size should be added to the "UriQueryListA struct", next to the existing key, value and next:

int size;

#5 Add support for UTF8

Sebastian Pipping — Mon, 19 Jan 2015 19:58:11 -0000

Hello rasjv,

are you using uriParseUriA/char or uriParseUriW/wchar_t? All that uriparser knows about encoding is single-byte or double-byte characters. If I'm not mistaken, all characters I see in the URI above have the same single-byte encoding in both ASCII and UTF-8. In that sense, UTF-8 is supported already. Please help me understand what you are asking for.

Best, Sebastian