Thread: [PyIndexer] Thoughts on MySQL Implementation
Status: Pre-Alpha
Brought to you by: cduncan
From: Casey D. <cas...@ya...> - 2001-12-16 17:00:54
A couple of observations:

The index population seems pretty inefficient to me in sqlindexer. You are essentially sending one SQL INSERT per word when something is indexed. I think it would be worthwhile to test an implementation that uses a single LOAD DATA statement when the text is indexed. This would involve writing the rows to be inserted into textindex to a temporary file and then using LOAD DATA to batch-load it into MySQL. See the following for docs on LOAD DATA:

http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#LOAD_DATA

On the search end, it seems to me that app-side processing will be best for positional matches, such as for phrases. I'd imagine such processing should eventually be coded in C, but it should be acceptable in Python if done efficiently (using Python arrays, IISets or some such). You might want to check out this Guido essay for some ideas here:

http://www.python.org/doc/essays/list2str.html

I agree that storing a document count for each word could help with optimizing, since you could start with the smallest dataset first and prune it from there. Perhaps IISets could be used to get UNION/INTERSECT functionality efficiently if MySQL can't do it for you.

As for partial indexes, they may help, especially by reducing memory paging and improving cache usage. Here are the docs for that:

http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#CREATE_INDEX

It would probably be worthwhile to test times for a relatively large prefix size (32 bytes?) vs. a small one (4 or 8 bytes?). Sometimes less is more 8^). It's near impossible to tell which would be faster on a given architecture without real-world testing.

BTW: Have you seen MySQL's built-in full-text indexing support?

http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search

Not sure whether it does everything we need, but it's probably worth a look...

-Casey
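[ A minimal sketch of the LOAD DATA idea above, assuming a textindex(document_id, word_id, position) table and a DB-API cursor such as MySQLdb's. The function and column names here are illustrative, not from the actual sqlindexer code. ]

```python
import os
import tempfile

def write_index_rows(rows):
    """Write (document_id, word_id, position) rows to a tab-delimited
    temporary file and return its path, ready for LOAD DATA."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write("\t".join(str(col) for col in row) + "\n")
    return path

def bulk_load(cursor, path):
    """Batch-load the file into textindex with a single LOAD DATA.
    LOCAL means the file lives client-side; the server must permit
    local-infile for this to work."""
    cursor.execute(
        "LOAD DATA LOCAL INFILE %s INTO TABLE textindex "
        "(document_id, word_id, position)",
        (path,),
    )
    os.remove(path)
```

One LOAD DATA replaces thousands of per-word INSERTs, at the cost of needing somewhere writable to put the temporary file.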
From: Marcus C. <ma...@wr...> - 2001-12-16 22:54:03
On Sun, 16 Dec 2001 at 09:00:53 -0800, Casey Duncan wrote:

[ Data population and LOAD DATA ]

This would definitely be a lot faster than individual INSERTs. The main reason for the speed-up is that in this case MySQL treats the whole thing as one transaction, so it defers flushing the key buffers until all the data is loaded. The same can be accomplished (although not quite as efficiently) by wrapping the whole lot of inserts inside a transaction block (if using BDB or InnoDB tables) or LOCK/UNLOCK TABLES (if using non-transaction-safe tables). Use BEGIN/COMMIT in the former case.

[ App-side processing ]

> On the search end, it seems to me that app-side
> processing will be best for positional matches, such
> as for phrases. I'd imagine such processing should
> eventually be coded in C, but it should be acceptable in
> Python if done efficiently (using Python arrays,
> IISets or some such).

I'd tend to do more of the processing on the database side, as described, in which case use of C or Python becomes less of an issue.

> http://www.python.org/doc/essays/list2str.html

Very interesting and inspiring :-)

> I agree that storing a document count for each word
> could help with optimizing, since you could start with
> the smallest dataset first and prune it from there.
> Perhaps IISets could be used to get UNION/INTERSECT
> functionality efficiently if MySQL can't do it for
> you.

I reckon MySQL can do it pretty efficiently given a set of ids, but doing such processing as coalescing duplicates, etc. on the app side would likely be faster than letting the MySQL query parser and optimiser do it for you. That said, it would only be a matter of ms improvement; the main benefit of app-side processing is in eliminating joins.

[ prefix indexes ]

> Sometimes less is more 8^).

:-)

> impossible to tell which would be faster on a given
> architecture without real-world testing.

Indeed, although I don't see a knob to change the size of the key block, so AFAIK it's fixed at 1024 bytes. Caching will likely have different effects on different platforms, though, and the setting of key_buffer_size will affect MySQL's own caching.

[ MySQL's FULLTEXT index ]

> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search

AFAIK, it doesn't do phrase matching. Their relevance calculation looks very interesting, though.

Cheers

--
Marcus
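[ The transaction-wrapping alternative sketched in code, assuming the same hypothetical textindex schema. This takes any DB-API 2.0 connection; the placeholder argument accommodates the driver's paramstyle ('%s' for MySQLdb, '?' for others), and the MySQL-specific LOCK TABLES path is only taken for non-transaction-safe tables. ]

```python
def batch_insert(conn, rows, transactional=True, placeholder="%s"):
    """Insert many (document_id, word_id, position) rows with key-buffer
    flushing deferred to the end: one transaction for BDB/InnoDB tables,
    LOCK/UNLOCK TABLES for non-transaction-safe ones (e.g. MyISAM)."""
    sql = ("INSERT INTO textindex (document_id, word_id, position) "
           "VALUES (%s, %s, %s)" % (placeholder, placeholder, placeholder))
    cur = conn.cursor()
    if not transactional:
        # No transactions available, but a write lock defers key flushes
        # in much the same way.
        cur.execute("LOCK TABLES textindex WRITE")
    try:
        cur.executemany(sql, rows)
        if transactional:
            conn.commit()  # one implicit BEGIN ... COMMIT around the lot
    finally:
        if not transactional:
            cur.execute("UNLOCK TABLES")
```

Not as fast as LOAD DATA, but it avoids the temporary-file question entirely.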
From: Chris W. <ch...@ni...> - 2001-12-17 12:59:30
Casey Duncan wrote:
>
> This would involve writing the rows to be inserted
> into textindex to a temporary file and then using LOAD
> DATA to batch-load it into MySQL. See the following for
> docs on LOAD DATA:
>
> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#LOAD_DATA

Well, my reluctance on this is that it needs a temporary file. Where do we put this temporary file? How do we know we're going to be allowed to write to the filesystem?

That said, the only other option is to build a big

INSERT INTO tbl_name VALUES (expression,...),(...),...

...and then we have to worry about max SQL length, I guess? Anyone know the maximum length of SQL you can shove down a single c.execute()?

> On the search end, it seems to me that app-side
> processing will be best for positional matches, such
> as for phrases.

Can you elaborate?

> I agree that storing a document count for each word
> could help with optimizing, since you could start with
> the smallest dataset first and prune it from there.

Does this still hold true when you're OR'ing terms together?

> Perhaps IISets could be used to get UNION/INTERSECT
> functionality efficiently if MySQL can't do it for
> you.

I'd prefer not to require the BTrees module, as it's not part of the standard Python library, but if needs must ;-)

> As for partial indexes, they may help, especially
> by reducing memory paging and improving cache usage. Here
> are the docs for that:
>
> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#CREATE_INDEX

Thanks :-) Once I get the initial implementation and scalability testing package finalized, maybe you guys could try some tweakage?

> BTW: Have you seen MySQL's built-in full-text indexing
> support?
>
> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search

Yeah, sadly it doesn't do phrase matching :-S

We could use this and do the usual cheap hack that ZCatalog does, where matching the phrase "x y z" simply gets turned into "x" AND "y" AND "z", but that won't be good enough for the specific application I need the indexer for :-S

Thanks for the comments :-)

Chris
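[ The big-INSERT alternative could be kept under the length limit by chunking, along these lines. A sketch only: it assumes integer values, so it skips the escaping a real implementation would need from the driver, and the table/column names are again the hypothetical textindex schema. ]

```python
def chunked_inserts(rows, max_len=16 * 1024 * 1024):
    """Yield multi-row INSERT statements for textindex, each kept under
    max_len bytes (think MySQL's max_allowed_packet ceiling)."""
    prefix = "INSERT INTO textindex (document_id, word_id, position) VALUES "
    values = []
    length = len(prefix)
    for row in rows:
        tup = "(%d,%d,%d)" % row
        # +1 accounts for the comma joining value tuples
        if values and length + len(tup) + 1 > max_len:
            yield prefix + ",".join(values)
            values, length = [], len(prefix)
        values.append(tup)
        length += len(tup) + 1
    if values:
        yield prefix + ",".join(values)
```

Each yielded statement fits in one c.execute() call, however many rows there are in total.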
From: Casey D. <c.d...@nl...> - 2001-12-17 14:05:13
On Monday 17 December 2001 07:57 am, Chris Withers allegedly wrote:
[snip load data stuff]
> Well, my reluctance on this is that it needs a temporary file. Where do we
> put this temporary file? How do we know we're going to be allowed to write
> to the filesystem?

Yup, other than using mktemp or creating some sort of var directory in the install, dunno. That said, I don't think it's too much to ask for a program to have write access to the temp dir... 8^) I think you could potentially see a big speed improvement, and even more so if the MySQL server is on another box...

> That said, the only other option is to build a big
> INSERT INTO tbl_name VALUES (expression,...),(...),...
>
> ...and then we have to worry about max SQL length, I guess?
> Anyone know the maximum length of SQL you can shove down a single
> c.execute()?

I dunno. Perhaps the C API would allow you to do a LOAD DATA from an in-memory data structure? Just a thought; we can't be the only ones trying to stuff the proverbial goose here 8^)

> > On the search end, it seems to me that app-side
> > processing will be best for positional matches, such
> > as for phrases.
>
> Can you elaborate?

My thought was that you would treat the query just like A AND B AND C on the SQL side of things, but bring in word position information with the queries, so that you could determine on the application side which results actually satisfy the phrase match. I think there will be relatively few comparisons there for any meaningful search terms. I just think that trying to do that type of vertical comparison on the SQL side will be a pain.

Also, how are you planning to deal with stop words in phrase searches? I notice you have included a previous-word reference in the index. I'm assuming stop words are thrown out of both the word index and the search words, correct?

> > I agree that storing a document count for each word
> > could help with optimizing, since you could start with
> > the smallest dataset first and prune it from there.
>
> Does this still hold true when you're OR'ing terms together?

For simplicity, I would just treat each part of the OR as a separate query and combine the results on the application side. I think ORs are going to be expensive no matter what you do, so you might as well keep it simple. So I guess that would be a no 8^)

> I'd prefer not to require the BTrees module, as it's not part of the standard
> Python library, but if needs must ;-)

I hear you. Just thinking out loud.

> Thanks :-) Once I get the initial implementation and scalability testing
> package finalized, maybe you guys could try some tweakage?

What are friends for?

[snip full-text]
> Yeah, sadly it doesn't do phrase matching :-S

8^(

> We could use this and do the usual cheap hack that ZCatalog does, where matching
> the phrase "x y z" simply gets turned into "x" AND "y" AND "z", but that
> won't be good enough for the specific application I need the indexer for
> :-S

Bleah. Real phrase matching is a definite requirement for me too. I wonder if MySQL stores any positional info in its full-text index that you could get your hands on... I'd imagine it is using a table internally to store the index...

/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.d...@nl...
\---------------------------------------------------/
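[ The app-side phrase check described above might look like this, once the per-word position lists for a candidate document have been fetched. The representation (one list of positions per phrase word) is an assumption, not the actual indexer's data structure. ]

```python
def satisfies_phrase(position_lists):
    """Given one list of word positions per phrase term, all taken from
    the same document, return True if the terms appear consecutively.
    position_lists[i] holds the positions of the i-th phrase word."""
    if not position_lists:
        return False
    # Candidate start positions come from the first word; each later
    # word must appear exactly i positions further on.
    candidates = set(position_lists[0])
    for i, positions in enumerate(position_lists[1:], start=1):
        candidates &= {p - i for p in positions}
        if not candidates:
            return False
    return True
```

Since the SQL side has already narrowed the candidates to documents containing every phrase word, the number of position comparisons here should indeed stay small for meaningful search terms.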
From: Marcus C. <ma...@wr...> - 2001-12-17 15:51:27
On Mon, 17 Dec 2001 at 09:08:06 -0500, Casey Duncan wrote:

[ LOAD DATA and temporary files ]
> Yup, other than using mktemp or creating some sort of var directory in the
> install, dunno. That said, I don't think it's too much to ask for a program
> to have write access to the temp dir... 8^)

I think you can use a FIFO with LOAD DATA LOCAL INFILE... Anyone got experience with this? Probably something for the mysql list ;-)

[ snip separating app and db server ]

A good idea, particularly if the app is handling some of the grunt work :-) TCP/IP comms between app and server may be slower than communication over a Unix socket, though.

> > That said, the only other option is to build a big
> > INSERT INTO tbl_name VALUES (expression,...),(...),...
> >
> > ...and then we have to worry about max SQL length, I guess?
> > Anyone know the maximum length of SQL you can shove down a single
> > c.execute()?

Whatever you set it to ;-) The buffer starts off at net_buffer_length and grows to a maximum of max_allowed_packet (default 16MB, which should be enough ;-). See the manual entries for SHOW VARIABLES.

[ app-side processing for positional matches ]
> My thought was that you would treat the query just like A AND B AND C on the
> SQL side of things, but bring in word position information with the
> queries, so that you could determine on the application side which results
> actually satisfy the phrase match. I think there will be relatively few
> comparisons there for any meaningful search terms.
>
> I just think that trying to do that type of vertical comparison on the
> SQL side will be a pain.

IIRC a search for 'windows 2000' on the test data Chris has been using returns ~10K rows for each of 'windows' and '2000'. Doing that processing app-side would be costly (first you have to match doc ids, then compare each in O(n**2)), whereas using the previous-word id in textindex, when you already know what that word must be, should be cheaper.

> Also, how are you planning to deal with stop words in phrase searches? I
> notice you have included a previous-word reference in the index. I'm assuming
> stop words are thrown out of both the word index and the search words,
> correct?

The same splitter is applied to source docs and to the search string.

> For simplicity, I would just treat each part of the OR as a separate query
> and combine the results on the application side. I think ORs are going to be
> expensive no matter what you do, so you might as well keep it simple.

Fully agree. I'm not sure what the other DBMSs do with ORs in the WHERE clause, but this looks like the best way of doing it all round.

[ MySQL's FULLTEXT index ]
> Bleah. Real phrase matching is a definite requirement for me too. I wonder if
> MySQL stores any positional info in its full-text index that you could get
> your hands on... I'd imagine it is using a table internally to store the
> index...

No, MySQL doesn't store the position; it uses relative occurrences only in determining relevance, and does so at the row level.

Cheers

--
Marcus
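[ The smallest-dataset-first pruning and the app-side OR combination agreed on above could be sketched as follows, assuming the per-word doc-id sets have already been fetched (e.g. via the stored document counts and one query per word). ]

```python
def and_query(doc_sets):
    """Intersect per-word doc-id sets, starting from the smallest
    (fewest documents) so the working set shrinks as fast as possible
    and we can bail out early on an empty result."""
    ordered = sorted(doc_sets, key=len)
    result = set(ordered[0]) if ordered else set()
    for s in ordered[1:]:
        result &= s
        if not result:
            break
    return result

def or_query(doc_sets):
    """Treat each OR term as a separate query and combine the
    results on the application side."""
    result = set()
    for s in doc_sets:
        result |= s
    return result
```

This is the same UNION/INTERSECT functionality IISets would give, using only plain Python sets.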