|
From: Fred Y. <fr...@on...> - 2001-03-24 13:39:05
|
As I've mentioned before, I've got a VPython/Tk program that hangs on my
machine, a dual Pentium3 800 system running WinNT4 SP6. Others have
tried to recreate my problem, but none have seen it.
I've found that with one small change to the C++ code for VPython
itself, I can prevent the deadlock without seeming to cause any other
problems. Here's the patch:
--- displaylist.cpp.orig Sat Mar 24 07:09:58 2001
+++ displaylist.cpp Sat Mar 24 07:12:11 2001
@@ -143,7 +143,7 @@
void DisplayObject::remove() {
if (prev) {
- DisplayObject::read_lock P(prev->mtx);
+ //DisplayObject::read_lock P(prev->mtx);
prev->next = next;
}
if (next) {
--
I've simply disabled the first 'read_lock' in the
DisplayObject::remove() method. I came to this because my application
always seems to hang at the point where it sets the 'visible' attribute
of an object to false, and this function is called in response to that
action.
So, is there something wrong with the locking scheme in cvisual, or am I
just hacking around a deeper problem such as some glitch in display
drivers? I admit that I know nothing about OpenGL, much less about what
locking strategy is necessary. I don't even understand how there can be
contention in my application since all work would seem to be done in the
Tk thread in response to callback functions associated with the Slider
objects. But here too, I don't understand just what threads are
involved and how they can interleave.
It does bother me that the insert() and remove() methods of
DisplayObject both create two locks, not releasing the first one before
acquiring the second even though the first seems no longer to be needed
when the second is acquired. I tried hacking the code to explicitly
delete the first lock as soon as possible, but that version still
deadlocks.
My problematic application is still at
<http://www.ontosys.com/src/residue.py>.
Here's the output from glinfo.py, showing my private version of
cvisual.dll. My video card is a 3Dfx Voodoo-3 3000 16MB AGP, set at
1600 x 1200 x 32 at 85 Hz:
--
$ python glinfo.py
Visual-2001-02-27
e:\programs\python20\dlls\cvisual.dll Sat Mar 24 07:12:18 2001
e:\programs\python20\visual\__init__.py Tue Feb 27 21:58:28 2001
e:\programs\python20\visual\graph.py Tue Feb 20 20:35:42 2001
d:\winnt\system32\opengl32.dll Thu Nov 18 10:04:00 1999
OpenGL renderer active.
Vendor: Microsoft Corporation
Version: 1.1.0
Renderer: GDI Generic
Extensions:
GL_WIN_swap_hint
GL_EXT_bgra
GL_EXT_paletted_texture
--
Fred Yankowski fr...@On... tel: +1.630.879.1312
Principal Consultant www.OntoSys.com fax: +1.630.879.1370
OntoSys, Inc 38W242 Deerpath Rd, Batavia, IL 60510, USA
|
|
From: Fred Y. <fc...@ac...> - 2001-03-24 14:05:26
|
As I've mentioned before, I've got a VPython/Tk program that hangs on my
machine, a dual Pentium3 800 system running WinNT4 SP6. Others have
tried to recreate my problem, but none have seen it.
I've found that with one small change to the C++ code for VPython
itself, I can prevent the deadlock without seeming to cause any other
problems. Here's the patch:
--- displaylist.cpp.orig Sat Mar 24 07:09:58 2001
+++ displaylist.cpp Sat Mar 24 07:12:11 2001
@@ -143,7 +143,7 @@
void DisplayObject::remove() {
if (prev) {
- DisplayObject::read_lock P(prev->mtx);
+ //DisplayObject::read_lock P(prev->mtx);
prev->next = next;
}
if (next) {
--
I've simply disabled the first 'read_lock' in the
DisplayObject::remove() method. I came to this because my application
always seems to hang at the point where it sets the 'visible' attribute
of an object to false, and this function is called in response to that
action.
So, is there something wrong with the locking scheme in cvisual, or am I
just hacking around a deeper problem such as some glitch in display
drivers? I admit that I know nothing about OpenGL, much less about what
locking strategy is necessary. I don't even understand how there can be
contention in my application since all work would seem to be done in the
Tk thread in response to callback functions associated with the Slider
objects. But here too, I don't understand just what threads are
involved and how they can interleave.
It does bother me that the insert() and remove() methods of
DisplayObject both create two locks, not releasing the first one before
acquiring the second even though the first seems no longer to be needed
when the second is acquired. I tried hacking the code to explicitly
delete the first lock as soon as possible, but that version still
deadlocks.
My problematic application is still at
<http://www.ontosys.com/src/residue.py>.
Here's the output from glinfo.py, showing my private version of
cvisual.dll. My video card is a 3Dfx Voodoo-3 3000 16MB AGP, set at
1600 x 1200 x 32 at 85 Hz:
--
$ python glinfo.py
Visual-2001-02-27
e:\programs\python20\dlls\cvisual.dll Sat Mar 24 07:12:18 2001
e:\programs\python20\visual\__init__.py Tue Feb 27 21:58:28 2001
e:\programs\python20\visual\graph.py Tue Feb 20 20:35:42 2001
d:\winnt\system32\opengl32.dll Thu Nov 18 10:04:00 1999
OpenGL renderer active.
Vendor: Microsoft Corporation
Version: 1.1.0
Renderer: GDI Generic
Extensions:
GL_WIN_swap_hint
GL_EXT_bgra
GL_EXT_paletted_texture
--
Fred Yankowski fr...@On... tel: +1.630.879.1312
Principal Consultant www.OntoSys.com fax: +1.630.879.1370
OntoSys, Inc 38W242 Deerpath Rd, Batavia, IL 60510, USA
|
|
From: David S. <dsc...@vy...> - 2001-03-28 16:48:02
|
> I've simply disabled the first 'read_lock' in the
> DisplayObject::remove() method. I came to this because my application
> always seems to hang at the point where it sets the 'visible' attribute
> of an object to false, and this function is called in response to that
> action.
>
> So, is there something wrong with the locking scheme in cvisual
Yes, there is apparently something wrong.
First of all, some background on the situation: The code in question is
invoked to add or remove an object from the list of visible objects in a
display. The reason it employs locking is that the visual renderer, in a
separate thread, may be iterating through the same list and displaying the
objects. It needs to be protected from encountering deleted objects,
objects that are being modified, etc.
To understand what's happening, we need to look in several places
(1) The insert() and remove() functions themselves
(2) DisplayListIterator::next() in displaylist.h - this is the function used
by the renderer to walk through the list
(3) The code in prim.cpp and arrprim.cpp that calls SetVisible(), which
calls insert() and remove()
This last is important, because it turns out that that code takes *another*
write_lock, on "this". So the sequence for insert goes
lock this
lock head of list
lock first item in list
unlock first item
unlock head
unlock this
The sequence for remove goes
lock this
lock prev
unlock prev
lock next
unlock next
unlock this
The sequence for next is a little harder to see:
lock head
lock first
unlock head
lock second
unlock first
lock third
unlock second
lock fourth
And so on. We need to watch out for deadlocks between next() in the
rendering thread and either insert() or remove() in the other thread.
No deadlocks can occur between insert() and next(). Intuitively, this is
because both functions are walking down the list "with the grain".
Specifically, each one locks a node A only if
(1) It holds no other locks
or
(2) It holds a lock on A.prev and no other nodes
If (1) is the case, this thread has *no* locks and so there can't possibly
be a deadlock while this thread waits for A. If (2) is the case, and A is
locked by the other thread, the other thread cannot be waiting for A.prev
because it can't have both A.prev.prev and A locked.
However, this logic doesn't work for remove(), which locks this->prev
"against the grain". In fact, it's pretty easy to see that there CAN be a
deadlock. To look at just one case:
remove() next()
-------- ------
lock this
lock prev
lock prev unlock prev.prev
lock this (deadlock)
This is fairly embarrassing, and also rather puzzling - why is it working
for everyone except Fred?
Now, Fred's fix removes the lock on prev. It's easy to see that the
deadlock is eliminated, since remove() then runs with the grain of the list:
lock this
lock next
unlock next
unlock this
Alas, this version of remove(), while deadlock proof, is highly suspicious
in another way: it is modifying an object (prev) that isn't locked. To give
a simple example of how this could cause problems, imagine that next() was
written this way:
void DisplayListIterator::next() {
DisplayObject *old = ptr;
ptr = old->next; // (1)
if (old->next) // (2)
old->next->mtx.sync_lock();
old->mtx.sync_unlock();
}
This is a perfectly legitimate way to write it - ptr is locked going in, so
all the accesses are to a locked object. However, imagine this code running
alongside Fred's remove(). I won't give the details, but remove() could
change old->next in between (1) and (2) above, resulting in ptr being set to
one object, but a different object being locked!
Now, the agonizing thing is that Fred's version probably *does* work as
things are implemented now. The prev and next pointers are only used (thank
goodness) in a very small number of places, and from a cursory examination
all of them will probably work given the memory ordering semantics of Intel
SMP. However, I hesitate to leave this land mine around for the next person
who tries to work with (or port) the code.
I think the best solution is for the locking in remove() to work like this:
lock prev
unlock prev
lock this
lock next
unlock next
unlock this
This should be safe because the rendering thread never *modifies* the prev
or this pointers, and multiple Python threads are protected from one another
by the global interpreter lock. Therefore it isn't necessary to have this
locked until we need to actually modify its prev or next pointers.
For this to work, the write locks have to be *removed* from the code that
calls SetVisible (in prim.cpp and arrprim.cpp, and maybe elsewhere?)
insert() and remove() both need to be modified to lock "this" themselves. A
plausible implementation of remove() is something like
void DisplayObject::remove() {
if (prev) {
DisplayObject::read_lock P(prev->mtx);
prev->next = next;
}
DisplayObject::write_lock T(this->mtx);
if (next) {
DisplayObject::read_lock N(next->mtx);
next->prev = prev;
}
prev = next = 0;
getObject().decrement_reference_count();
}
Unfortunately, I have no more time for this today.
> It does bother me that the insert() and remove() methods of
> DisplayObject both create two locks, not releasing the first one before
> acquiring the second even though the first seems no longer to be needed
> when the second is acquired. I tried hacking the code to explicitly
> delete the first lock as soon as possible, but that version still
> deadlocks.
Maybe you are unaware of the syntactic trick being used to implement the
locks? Each lock is removed when it goes out of scope (roughly, when you
hit a } in the code ):
if (prev) {
DisplayObject::read_lock P(prev->mtx);
prev->next = next;
} // <---- unlock P
Dave
|
|
From: Fred Y. <fr...@on...> - 2001-03-28 17:45:49
|
David, Thank you for considering this problem. I look forward to getting an updated VPython that fixes the problem without requiring my hack. On Wed, Mar 28, 2001 at 11:52:04AM -0500, David Scherer wrote: > This is fairly embarrassing, and also rather puzzling - why is it working > for everyone except Fred? Such situations no longer surprise me. ;-) > Maybe you are unaware of the syntactic trick being used to implement the > locks? Each lock is removed when it goes out of scope (roughly, when you > hit a } in the code ): I thought that might be the case, but I wasn't sure if C++ guaranteed that behavior -- calling the destructor at exactly the end of the block -- and I didn't completely trust the compiler in any case. -- Fred Yankowski fr...@On... tel: +1.630.879.1312 Principal Consultant www.OntoSys.com fax: +1.630.879.1370 OntoSys, Inc 38W242 Deerpath Rd, Batavia, IL 60510, USA |
|
From: Markus G. <gr...@iu...> - 2001-04-03 10:42:23
|
Fred Yankowski wrote: > David, > > Thank you for considering this problem. I look forward to getting an > updated VPython that fixes the problem without requiring my hack. > > On Wed, Mar 28, 2001 at 11:52:04AM -0500, David Scherer wrote: > > This is fairly embarrassing, and also rather puzzling - why is it working > > for everyone except Fred? Fred's residue.py also hangs on my machine from time to time. I have a P200, Diamond Viper V550 16MB. Markus |