cygwin.com/git/newlib-cygwin.git
Diffstat (limited to 'winsup/cygwin/DevDocs/CgfNotes.OLD')
-rw-r--r-- winsup/cygwin/DevDocs/CgfNotes.OLD 562
1 files changed, 562 insertions, 0 deletions
diff --git a/winsup/cygwin/DevDocs/CgfNotes.OLD b/winsup/cygwin/DevDocs/CgfNotes.OLD
new file mode 100644
index 000000000..5d31ed72e
--- /dev/null
+++ b/winsup/cygwin/DevDocs/CgfNotes.OLD
@@ -0,0 +1,562 @@
+2014-04-26 cgf-000026
+
+Forgot to clear to the end of screen when restoring a screen buffer.
+That worked, for some reason, with Take Command but not with normal
+consoles. I don't remember why I didn't resize the screen like a Linux
+X terminal emulator but that might have made things work a little
+better. Right now, there is a scroll bar for apps like less or vi and
+that doesn't feel right.
+
+2014-03-29 cgf-000025
+
+Reorganized _cygtls::signal_debugger to avoid sending anything to the
+debugger if we've seen an exception. I think it used to work that way
+and I changed it without noting why. It sure seems like, if we don't do
+this, gdb will see two signals (and it really does) when there has been
+a Windows-recognized exception.
+
+2014-02-15 cgf-000024
+
+Wow. It's hard getting the screen handling stuff working correctly when
+there is a screen buffer larger than screen size and vice versa. These
+changes attempt to use SetConsoleWindowInfo whenever possible so that
+the contents of the screen buffer are never wiped out. They also fix
+some previously misbehaving "scroll the screen" commands.
+
+2013-06-07 cgf-000023
+
+Given the fact that the signal thread never exits there is no need
+for exit_thread to ever block. So, nuke this code.
+
+2013-01-31 cgf-000022
+
+While researching the lftp behavior reported here:
+
+http://cygwin.com/ml/cygwin/2013-01/msg00390.html
+
+after a frenzy of rewriting sigflush handling to avoid blocking in the
+signal thread (which is now and should always have been illegal), it
+dawned on me that we're not supposed to be flushing the tty input buffer
+every time a signal is received. We're supposed to do this only when
+the user hits a character (e.g., CTRL-C) which initiates a signal
+action. So, I removed sigflush from sigpacket::process and moved it to
+tc ()->kill_pgrp (). This function should only be called to send
+signals related to the tty so this should have the desired effect.
+
+2013-01-11 cgf-000021
+
+Apparently I got the signal handling semantics of select() wrong again
+even though I would have sworn that I tested this on Linux and Windows.
+
+select() is apparently *always* interrupted by a signal and *never*
+restarts. Hopefully, between the comment added to the code and this
+note, I'll not make this mistake again.
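The select() behavior noted above can be checked with a small C sketch. This is not Cygwin code: the function name select_errno_after_signal and the 50 ms timer are assumptions made for illustration. It installs a SIGALRM handler with SA_RESTART and then blocks in select(). On Linux, select() is documented as never being restarted after a handler runs, so the call is expected to fail with EINTR. POSIX leaves this choice to the implementation, so other systems may differ.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

static void on_alarm(int sig)
{
    (void)sig;  /* nothing to do: delivery alone interrupts select() */
}

/* Returns the errno left by an interrupted select(), or 0 if select()
   ran to its timeout without being interrupted. */
int select_errno_after_signal(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;          /* ask for restartable syscalls */
    int rc = sigaction(SIGALRM, &sa, NULL);
    assert(rc == 0);

    struct itimerval timer;
    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_usec = 50 * 1000; /* one-shot SIGALRM in 50 ms */
    rc = setitimer(ITIMER_REAL, &timer, NULL);
    assert(rc == 0);

    struct timeval timeout;
    timeout.tv_sec = 2;                 /* far longer than the timer */
    timeout.tv_usec = 0;
    rc = select(0, NULL, NULL, NULL, &timeout);
    return rc == -1 ? errno : 0;
}
```

Calling select_errno_after_signal() from a test program and comparing the result with EINTR shows whether the platform restarts select() or not.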
+
+2013-01-02 cgf-000020
+
+(This entry should have been checked in with the changes but... I forgot)
+
+This is a fairly big revamp of the way that Windows signals are handled.
+The intent is that all signal decisions should be made by the signal
+thread; not by the exception handler.
+
+This required the ability to pass information from the exception handler
+to the signal thread so, a si_cyg field was added to siginfo_t. This
+contains information needed to generate a "core dump". Hmm. Haven't
+checked to see if this breaks Cygwin's hardly-ever-used real core dump
+facility.
+
+Anyway, I moved signal_exit back into exceptions.cc and removed it from
+the sigpacket class. This function is now treated like a signal handler
+function - Cygwin will cause it to be dispatched in the context of
+whatever thread caught the signal. signal_exit also makes the
+determination about when to write a stackdump.
+
+The signal-handler thread will no longer ever attempt to exit. It will
+just keep processing signals (it will not process real signals after
+Cygwin starts shutting down, however). This should make it impossible
+for the signal thread to ever block waiting for the process lock since
+it now never grabs the process lock. The signal-handler thread will
+notify gdb when it gets a signal now but, in theory, gdb should see the
+context of the thread which received the signal, not the signal-handler
+thread.
+
+2012-12-28 cgf-000019
+
+(I forgot to mention that cgf-000018 was reverted. Although I never saw
+a hang from this, I couldn't convince myself that one wasn't possible.)
+
+This fix attempts to correct a deadlock where, when a true Windows
+signal arrives, Windows creates a thread which "does stuff" and attempts
+to exit. In the process of exiting, Cygwin grabs the process lock. If
+the signal thread has seen the signal and wants to exit, it can't
+because the newly-created thread now holds it. But, since the new
+thread is relying on the signal thread to release its process lock,
+it exits and the process lock is never released.
+
+To fix this, I removed calls to _cygtls::signal_exit in favor of
+flagging that we were exiting by setting signal_exit_code (almost forgot
+to mark that NO_COPY: that would have been fun). The new function
+setup_signal_exit() now handles setting things up so that ReadFile loop
+in wait_sig will do the right thing when it terminates. This function
+may just Sleep indefinitely if a signal is being sent from a thread
+other than the signal thread. wait_sig() was changed so that it will
+essentially drop into asynchronous-read-mode when a signal which exits
+has been detected. The ReadFile loop is exited when we know that the
+process is supposed to be exiting and there is nothing else in the
+signal queue.
+
+Although I never actually saw this happen, exit_thread() was also
+changed to release the process lock and just sleep indefinitely if it is
+detected that we are exiting.
+
+2012-12-21 cgf-000018
+
+Re: cgf-000017
+
+It occurred to me that just getting the process lock during
+DLL_THREAD_DETACH in dll_entry() might be adequate to fix this
+problem. It's certainly much less intrusive.
+
+There are potential deadlock problems with grabbing a lock in
+this code, though, so this check-in will be experimental.
+
+2012-12-21 cgf-000017
+
+The changes in this set are to work around the issue noted here:
+
+http://cygwin.com/ml/cygwin/2012-12/threads.html#00140
+
+The problem is, apparently, that the return value of an ExitThread()
+will take precedence over the return value of TerminateProcess/ExitProcess
+if the thread is the last one exiting. That's rather amazing...
+
+For the fix, I replaced all calls to ExitThread with exit_thread(). The
+exit_thread function creates a handle to the current thread and sends
+it in a packet via sig_send(__SIGTHREADEXIT). Then it acquires the
+process lock and calls ExitThread.
+
+wait_sig will then wait for the handle, indicating that the thread has
+exited, and, when that has happened, removes the process lock on behalf
+of the now-defunct thread. wait_sig will now also avoid actually
+exiting since it could trigger the same problem.
+
+Holding process_lock should prevent threads from exiting while a Cygwin
+process is shutting down. They will just block forever in that case -
+just like wait_sig.
+
+2012-08-17 cgf-000016
+
+While debugging another problem I finally noticed that
+sigpacket::process was unconditionally calling tls->set_siginfo prior to
+calling setup_handler even though setup_handler could fail. In the
+event of two successive signals, that would cause the second signal's
+info to overwrite the first even though the signal handler for the first
+would eventually be called. Doh.
+
+Fixing this required passing the sigpacket si field into setup_handler.
+Making setup_handler part of the sigpacket class seemed to make a lot of
+sense so that's what I did. Then I passed the si element into
+interrupt_setup so that the infodata structure could be filled out prior
+to arming the signal.
+
+The other changes checked in here eliminate the ResetEvent for
+signal_arrived since previous changes to cygwait should handle the
+case of spurious signal_arrived detection. Since signal_arrived is
+not a manual-reset event, we really should just let the appropriate
+WFMO handle it. Otherwise, there is a race where a signal comes in
+a "split second" after WFMO responds to some other event. Resetting
+the signal_arrived would cause any subsequent WFMO to never be
+triggered. My current theory is that this is what is causing:
+
+http://cygwin.com/ml/cygwin/2012-08/msg00310.html
+
+2012-08-15 cgf-000015
+
+RIP cancelable_wait. Yay.
+
+2012-08-09 cgf-000014
+
+So, apparently I got it somewhat right before wrt signal handling.
+Checking on Linux, it appears that signals will be sent to a thread
+which can accept the signal. So resurrecting and extending the
+"find_tls" function is in order. This function will return the tls
+of any thread which 1) is waiting for a signal with sigwait*() or
+2) has the signal unmasked.
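The selection rule described above can be sketched in portable C. This is not the actual find_tls code: struct thread_rec, its fields, and find_thread_for_signal are hypothetical names, and the choice to return the first matching thread in list order is an assumption, since the note does not say which candidate wins.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical per-thread record; the real _cygtls layout differs. */
struct thread_rec {
    sigset_t blocked;      /* signals this thread has masked */
    sigset_t sigwait_set;  /* signals it is waiting for in sigwait*() */
    int in_sigwait;        /* nonzero while blocked in sigwait*() */
};

/* Return the first thread that can accept `sig`: one waiting for it
   with sigwait*(), or one that does not have it masked.  NULL means
   no thread can take the signal right now. */
struct thread_rec *find_thread_for_signal(struct thread_rec *threads,
                                          size_t n, int sig)
{
    assert(threads != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct thread_rec *t = &threads[i];
        if (t->in_sigwait && sigismember(&t->sigwait_set, sig) == 1)
            return t;
        if (sigismember(&t->blocked, sig) == 0)
            return t;
    }
    return NULL;
}
```

A caller that gets NULL back would presumably leave the signal pending until some thread unmasks it or enters sigwait*().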
+
+In redoing this it became obvious that I had the class designation wrong
+for the threadlist handling so I moved the manipulation of the global
+threadlist into the cygheap where it logically belongs.
+
+2012-07-21 cgf-000013
+
+These changes reflect a revamp of the "wait for signal" functionality
+which has existed in Cygwin through several signal massages.
+
+We now create a signal event only when a thread is waiting for a signal
+and arm it only for that thread. The "set_signal_arrived" function is
+used to establish the event and set it in a location referenceable by
+the caller.
+
+I still do not handle all of the race conditions. What happens when
+a signal comes in just after a WF?O succeeds for some other purpose? I
+suspect that it will arm the next WF?O call and the subsequent call to
+call_signal_handler could cause a function to get an EINTR when possibly
+it shouldn't have.
+
+I haven't yet checked all of the test cases for the URL listed in the
+previous entry.
+
+Baby steps.
+
+2012-06-12 cgf-000012
+
+These changes are the preliminary for redoing the way threads wait for
+signals. The problems are shown by the test case mentioned here:
+
+http://cygwin.com/ml/cygwin/2012-05/msg00434.html
+
+I've known that the signal handling in threads wasn't quite right for
+some time. I lost all of my thread signal tests in the great "rm -r"
+debacle of a few years ago and have been less than enthusiastic about
+redoing everything (I had PCTS tests and everything). But it really is
+time to redo this signal handling to make it more like it is supposed to
+be.
+
+This change should not introduce any new behavior. Things should
+continue to behave as before. The major differences are a change in the
+arguments to cancelable_wait and cygwait now uses cancelable_wait and,
+so, the returns from cygwait now mirror cancelable_wait.
+
+The next change will consolidate cygwait and cancelable_wait into one
+cygwait function.
+
+2012-06-02 cgf-000011
+
+The refcnt handling was tricky to get right but I had convinced myself
+that the refcnt's were always incremented/decremented under a lock.
+Corinna's 2012-05-23 change to refcnt exposed a potential problem with
+dup handling where the fdtab could be updated while not locked.
+
+That should be fixed by this change but, on closer examination, it seems
+like there are many places where it is possible for the refcnt to be
+updated while the fdtab is not locked since the default for
+cygheap_fdget is to not lock the fdtab (and that should be the default -
+you can't have read holding a lock).
+
+Since refcnt was only ever called with 1 or -1, I broke it up into two
+functions but kept the Interlocked* operation. Incrementing a variable
+should not be as racy as adding an arbitrary number to it but we have
+InterlockedIncrement/InterlockedDecrement for a reason so I kept the
+Interlocked operation here.
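The split into separate increment and decrement operations can be illustrated as follows. This is a sketch only: it swaps in C11 <stdatomic.h> for the Windows InterlockedIncrement/InterlockedDecrement calls so that it builds anywhere, and struct handle_obj and the handle_* functions are hypothetical names, not Cygwin's fhandler API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an fhandler with a reference count. */
struct handle_obj {
    atomic_long refcnt;
    bool destroyed;   /* set when the last reference is dropped */
};

void handle_init(struct handle_obj *h)
{
    assert(h != NULL);
    atomic_init(&h->refcnt, 1);   /* the creator holds one reference */
    h->destroyed = false;
}

/* Atomically add one reference; returns the new count. */
long handle_inc_refcnt(struct handle_obj *h)
{
    assert(h != NULL);
    return atomic_fetch_add(&h->refcnt, 1) + 1;
}

/* Atomically drop one reference; returns the new count.  Only the
   caller that observes zero performs the teardown. */
long handle_dec_refcnt(struct handle_obj *h)
{
    assert(h != NULL);
    long now = atomic_fetch_sub(&h->refcnt, 1) - 1;
    if (now == 0)
        h->destroyed = true;
    return now;
}
```

The atomic update only guarantees that the count itself is never torn. Whether that is enough without holding the fdtab lock is exactly the open question in this entry.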
+
+In the meantime, I'll be mulling over whether the refcnt operations are
+actually safe as they are. Maybe just ensuring that they are atomically
+updated is enough since they control the destruction of an fh. If I got
+the ordering right with incrementing and decrementing then that should
+be adequate.
+
+2012-06-02 cgf-000010
+
+<1.7.16>
+- Fix emacs problem which exposed an issue with Cygwin's select() function.
+ If a signal arrives while select is blocking and the program longjmps
+ out of the signal handler then threads and memory may be left hanging.
+ Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275
+</1.7.16>
+
+This was try #4 or #5 to get select() signal handling working right.
+It's still not there but it should now at least not leak memory or
+threads.
+
+I mucked with the interface between cygwin_select and select_stuff::wait
+so that the "new" loop in select_stuff::wait() was essentially moved
+into the caller. cygwin_select now uses various enum states to decide
+what to do. It builds the select linked list at the beginning of the
+loop, allowing wait() to tear everything down and restart. This is
+necessary before calling a signal handler because the signal handler may
+longjmp away.
+
+I initially had this all coded up to use a special signal_cleanup
+callback which could be called when a longjmp is called in a signal
+handler. And cygwin_select() set up and tore down this callback. Once
+I got everything compiling, it, of course, dawned on me that just because
+you call a longjmp in a signal handler it doesn't mean that you are
+jumping *out* of the signal handler. So, if the signal handler invokes
+the callback and returns it will be very bad for select(). Hence, this
+slower, but hopefully more correct implementation.
+
+(I still wonder if some sort of signal cleanup callback might still
+be useful in the future)
+
+TODO: I need to do an audit of other places where this problem could be
+occurring.
+
+As alluded to above, select's signal handling is still not right. It
+still acts as if it could call a signal handler from something other
+than the main thread but, AFAICT, from my STC, this doesn't seem to be
+the case. It might be worthwhile to extend cygwait to just magically
+figure this out and not even bother using w4[0] for scenarios like this.
+
+2012-05-16 cgf-000009
+
+<1.7.16>
+- Fix broken console mouse handling. Reported here:
+ http://cygwin.com/ml/cygwin/2012-05/msg00360.html
+</1.7.16>
+
+I did a cvs annotate on smallprint.cc and see that the code to translate
+%characters > 127 to 0x notation was in the 1.1 revision. Then I
+checked the smallprint.c predecessor. It was in the 1.1 version of that
+program too, which means that this odd change has probably been around
+since <= 2000.
+
+Since __small_sprintf is supposed to emulate sprintf, I got rid of the
+special case handling. This may affect fhandler_socket::bind. If so, we
+should work around this problem there rather than keeping this strange
+hack in __small_sprintf.
+
+2012-05-14 cgf-000008
+
+<1.7.16>
+- Fix hang when zero bytes are written to a pty using
+ Windows WriteFile or equivalent. Fixes:
+ http://cygwin.com/ml/cygwin/2012-05/msg00323.html
+</1.7.16>
+
+cgf-000002, as usual, fixed one thing while breaking another. See
+Larry's predicament in: http://goo.gl/oGEr2 .
+
+The problem is that zero byte writes to the pty pipe caused the dread
+end-of-the-world-as-we-know-it problem reported on the mailing list
+where ReadFile reads zero bytes even though there is still more to read
+on the pipe. This is because that change caused a 'record' to be read
+and a record can be zero bytes.
+
+I was never really keen about using a throwaway buffer just to get a
+count of the number of characters available to be read in the pty pipe.
+On closer reading of the documentation for PeekNamedPipe it seemed like
+the sixth argument to PeekNamedPipe should return what I needed without
+using a buffer. And, amazingly, it did, except that the problem still
+remained - a zero byte message still screwed things up.
+
+So, we now detect the case where there are zero bytes available as a
+message but there are bytes available in the pipe. In that scenario,
+return the bytes available in the pipe rather than the message length of
+zero. This could conceivably cause problems with pty pipe handling in
+this scenario but since the only way this scenario could possibly happen
+is when someone is writing zero bytes using WriteFile to a pty pipe, I'm
+ok with that.
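The decision described above reduces to a small pure function. This is a sketch, not the code in fhandler_tty.cc: pty_bytes_available and its parameter names are hypothetical. The two inputs stand for the values PeekNamedPipe reports through its fifth and sixth arguments (total bytes available in the pipe, and bytes left in the current message).

```c
#include <assert.h>

/* Pick the byte count to report for the next pty read.  Normally this
   is the length of the current message, but a zero-length message with
   data still queued behind it must not be reported as "nothing to
   read". */
unsigned long pty_bytes_available(unsigned long total_in_pipe,
                                  unsigned long left_this_message)
{
    if (left_this_message == 0 && total_in_pipe > 0)
        return total_in_pipe;   /* zero-byte record ahead of real data */
    return left_this_message;
}

/* Usage example covering the three cases from the note. */
void pty_bytes_available_example(void)
{
    assert(pty_bytes_available(10, 4) == 4);   /* ordinary record */
    assert(pty_bytes_available(10, 0) == 10);  /* zero-byte record first */
    assert(pty_bytes_available(0, 0) == 0);    /* pipe really is empty */
}
```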
+
+2012-05-14 cgf-000007
+
+<1.7.16>
+- Fix invocation of strace from a cygwin process. Fixes:
+ http://cygwin.com/ml/cygwin/2012-05/msg00292.html
+</1.7.16>
+
+The change in cgf-000004 introduced a problem for processes which load
+cygwin1.dll dynamically. strace.exe is the most prominent example of
+this.
+
+Since the parent handle is now closed for "non-Cygwin" processes, when
+strace.exe tried to dynamically load cygwin1.dll, the handle was invalid
+and child_info_spawn::handle_spawn couldn't retrieve information
+from the parent. This eventually led to a strace_printf error due to an
+attempt to dereference an unavailable cygheap. Probably have to fix
+this someday. You shouldn't use the cygheap while attempting to print
+an error about the unavailability of said cygheap.
+
+This was fixed by saving the parent pid in child_info_spawn and calling
+OpenProcess for the parent pid and using that handle iff a process is
+dynamically loaded.
+
+2012-05-12 cgf-000006
+
+<1.7.16>
+- Fix hang when calling pthread_testcancel in a canceled thread.
+ Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html
+</1.7.16>
+
+This should fix the first part of the reported problem in the above
+message. The cancel seemed to actually be working but, the fprintf
+eventually ended up calling pthread_testcancel. Since we'd gotten here
+via a cancel, it tried to recursively call the cancel handler causing a
+recursive loop.
+
+2012-05-12 cgf-000005
+
+<1.7.16>
+- Fix pipe creation problem which manifested as a problem creating a
+fifo. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html
+</1.7.16>
+
+My change on 2012-04-28 introduced a problem with fifos. The passed
+in name was overwritten. This was because I wasn't properly keeping
+track of the length of the generated pipe name when there was a
+name passed in to fhandler_pipe::create.
+
+There was also another problem in fhandler_pipe::create. Since fifos
+use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of
+PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND
+as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong. So,
+fifo creation attempted to add the pid of a pipe to the name which is
+wrong for fifos.
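The flag collision described above is easy to reproduce with the bit values alone. In this sketch the SK_PIPE_ACCESS_* constants are local copies of the Windows SDK values (0x1, 0x2, 0x3), SK_ADD_PID_DISTINCT is an arbitrary bit picked for illustration and is not the value Cygwin uses, and wants_pid_in_name is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the Windows SDK values, so this builds anywhere. */
enum {
    SK_PIPE_ACCESS_INBOUND  = 0x1,
    SK_PIPE_ACCESS_OUTBOUND = 0x2,
    SK_PIPE_ACCESS_DUPLEX   = 0x3   /* INBOUND | OUTBOUND */
};

/* Two candidate encodings for a private "add the pid" option. */
enum {
    SK_ADD_PID_COLLIDING = SK_PIPE_ACCESS_OUTBOUND, /* the broken choice */
    SK_ADD_PID_DISTINCT  = 0x100                    /* illustrative only */
};

/* True when `mode` appears to request a pid in the pipe name. */
bool wants_pid_in_name(unsigned mode, unsigned add_pid_flag)
{
    return (mode & add_pid_flag) != 0;
}

/* Usage example: a duplex (fifo) open trips the colliding flag even
   though nobody asked for a pid, but not the distinct one. */
void add_pid_flag_example(void)
{
    assert(wants_pid_in_name(SK_PIPE_ACCESS_DUPLEX, SK_ADD_PID_COLLIDING));
    assert(!wants_pid_in_name(SK_PIPE_ACCESS_DUPLEX, SK_ADD_PID_DISTINCT));
}
```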
+
+2012-05-08 cgf-000004
+
+The change for cgf-000003 introduced a new problem:
+http://cygwin.com/ml/cygwin/2012-05/msg00154.html
+http://cygwin.com/ml/cygwin/2012-05/msg00157.html
+
+Since a handle associated with the parent is no longer being duplicated
+into a non-cygwin "execed child", Windows is free to reuse the pid of
+the parent when the parent exits. However, since we *did* duplicate a
+handle pointing to the pid's shared memory area into the "execed child",
+the shared memory for the pid was still active.
+
+Since the shared memory was still available, if a new process reuses the
+previous pid, Cygwin would detect that the shared memory was not created
+and had a "PID_REAPED" flag. That was considered an error, and, so, it
+would set procinfo to NULL and pinfo::thisproc would die since this
+situation is not supposed to occur.
+
+I fixed this in two ways:
+
+1) If a shared memory region has a PID_REAPED flag then zero it and
+reuse it. This should be safe since you are not really supposed to be
+querying the shared memory region for anything after PID_REAPED has been
+set.
+
+2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
+child for exec.
+
+It seems like 2) is a common theme and an audit of all of the handles
+that are being passed to non-cygwin children is in order for 1.7.16.
+
+The other minor modification that was made in this change was to add the
+pid of the failing process to fork error output. This helps slightly
+when looking at strace output, even though in this case it was easy to
+find what was failing by looking for '^---' when running the "stv"
+strace dumper. That found the offending exception quickly.
+
+2012-05-07 cgf-000003
+
+<1.7.15>
+Don't make Cygwin wait for all children of a non-cygwin child program.
+Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
+ http://cygwin.com/ml/cygwin/2012-05/msg00075.html
+</1.7.15>
+
+This problem is due to a recent change which added some robustness and
+speed to Cygwin's exec/spawn handling by not trying to force inheritance
+every time a process is started. See ChangeLog entries starting on
+2012-03-20, and multiple on 2012-03-21.
+
+Making the handle inheritable meant that, as usual, there were problems
+with non-Cygwin processes. When Cygwin "execs" a non-Cygwin process N,
+all of its N + 1, N + 2, ... children will also inherit the handle.
+That means that Cygwin will wait until all subprocesses have exited
+before it returns.
+
+I was willing to make this a restriction of starting non-Cygwin
+processes but the problem with allowing that is that it can cause the
+creation of a "limbo" pid when N exits and N + 1 and friends are still
+around. In this scenario, Cygwin dutifully notices that process N has
+died and sets the exit code to indicate that but N's parent will wait on
+rd_proc_pipe and will only return when every N + ... Windows process
+has exited.
+
+The removal of cygheap::pid_handle was not related to the initial
+problem that I set out to fix. The change came from the realization
+that we were duping the current process handle into the child twice and
+only needed to do it once. The current process handle is used by exec
+to keep the Windows pid "alive" so that it will not be reused. So, now
+we just close parent in child_info_spawn::handle_spawn iff we're not
+execing.
+
+In debugging this it bothered me that 'ps' identified a nonactive pid as
+active. Part of the reason for this was the 'parent' handle in
+child_info was opened in non-Cygwin processes, keeping the pid alive.
+That has been kluged around (more changes after 1.7.15) but that didn't
+fix the problem. On further investigation, this seems to be caused by
+the fact that the shared memory region pid handles were still being
+passed to non-cygwin children, keeping the pid alive in a limbo-like
+fashion. This was easily fixed by having pinfo::init() consider a
+memory region with PID_REAPED as not available. A more robust fix
+should be considered for 1.7.15+ where these handles are not passed
+to non-cygwin processes.
+
+This fixed the problem where a pid showed up in the list after a user
+does something like: "bash$ cmd /c start notepad" but, for some reason,
+it does not fix the problem where "bash$ setsid cmd /c start notepad".
+That bears investigation after 1.7.15 is released but it is not a
+regression and so is not a blocker for the release.
+
+2012-05-03 cgf-000002
+
+<1.7.15>
+Fix problem where too much input was attempted to be read from a
+pty slave. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
+</1.7.15>
+
+My change on 2012/04/05 reintroduced the problem first described by:
+http://cygwin.com/ml/cygwin/2011-10/threads.html#00445
+
+The problem then was, IIRC, due to the fact that bytes sent to the pty
+pipe were not written as records. Changing pipe to PIPE_TYPE_MESSAGE in
+pipe.cc fixed the problem since writing lines to one side of the pipe
+caused exactly that number of characters to be read on the other
+even if there were more characters in the pipe.
+
+To debug this, I first replaced fhandler_tty.cc with the 1.258,
+2012/04/05 version. The test case started working when I did that.
+
+So, then, I replaced individual functions, one at a time, in
+fhandler_tty.cc with their previous versions. I'd expected this to be a
+problem with fhandler_pty_master::process_slave_output since that had
+seen the most changes but was surprised to see that the culprit was
+fhandler_pty_slave::read().
+
+The reason was that I really needed the bytes_available() function to
+return the number of bytes which would be read in the next operation
+rather than the number of bytes available in the pipe. That's because
+there may be a number of lines available to be read but the number of
+bytes which will be read by ReadFile should reflect the mode of the pty
+and, if there is a line to read, only the number of bytes in the line
+should be seen as available for the next read.
+
+Having bytes_available() return the number of bytes which would be read
+seemed to fix the problem but it could subtly change the behavior of
+other callers of this function. However, I actually think this is
+probably a good thing since they probably should have been seeing the
+line behavior.
+
+2012-05-02 cgf-000001
+
+<1.7.15>
+Fix problem setting parent pid to 1 when process with children execs
+itself. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
+</1.7.15>
+
+Investigating this problem with strace showed that ssh-agent was
+checking the parent pid and getting a 1 when it shouldn't have. Other
+stuff looked ok so I chose to consider this a smoking gun.
+
+Going back to the version that the OP said did not have the problem, I
+worked forward until I found where the problem first occurred -
+somewhere around 2012-03-19. And, indeed, the getppid call returned the
+correct value in the working version. That means that this stopped
+working when I redid the way the process pipe was inherited around
+this time period.
+
+It isn't clear why (and I suspect I may have to debug this further at
+some point) this hasn't always been a problem but I made the obvious fix.
+We shouldn't have been setting ppid = 1 when we're about to pass off to
+an execed process.
+
+As I was writing this, I realized that it was necessary to add some
+additional checks. Just checking for "have_execed" isn't enough. If
+we've execed a non-cygwin process then it won't know how to deal with
+any inherited children. So, always set ppid = 1 if we've execed a
+non-cygwin process.
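The resulting rule can be written as a single predicate. This is a sketch of the decision only, not the Cygwin source: should_set_ppid_to_1 and its parameters are hypothetical names, and it assumes the check runs at the point where an exiting or exec'ing process decides what to do with its children.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether this process's children should have their ppid set
   to 1.  Children stay attached only when we are handing off to an
   exec'ed Cygwin process, which knows how to adopt them. */
bool should_set_ppid_to_1(bool have_execed, bool execed_is_cygwin)
{
    return !(have_execed && execed_is_cygwin);
}

/* Usage example covering the cases discussed in the entry. */
void ppid_rule_example(void)
{
    assert(should_set_ppid_to_1(false, false));  /* plain exit */
    assert(!should_set_ppid_to_1(true, true));   /* exec of Cygwin program */
    assert(should_set_ppid_to_1(true, false));   /* exec of non-Cygwin program */
}
```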