Buggy read-buffer!

Started by itistoday, January 12, 2010, 04:05:22 PM

Previous topic - Next topic

itistoday

EDIT: this file has been updated after the discovery of a new bug in 'read-buffer'.



Bug #1 - Bad usage of <int-size>



While troubleshooting a problem I took a look at source for the read-buffer function and found something that I wasn't expecting, namely that when you pass it the <int-size> parameter to specify the maximum number of bytes to read, it will actually allocate a temporary buffer of that size.



From the documentation this is not evident. I assumed that 'read-buffer' would read from <int-file> in smaller chunks, copying what it read into a smaller temporary byte array and then copying that, chunk-by-chunk, into <sym-buffer>. In my opinion this is what the function *should* do because it handles efficiently all scenarios. I.e. what if you're expecting to get data anywhere from 1KB to 100MB? I don't think it makes sense in that situation to always allocate 100MB of memory, but it would be nice to be able to tell 'read-buffer' that such an amount of data is possible.



Bug #2 (severe) - function 'read' used incorrectly



Related to Bug #1, the C read function is used incorrectly, and actually this turned out to be the cause of the problem I was having (referenced in passing above), which is that read-buffer cannot currently be used to read large amounts of data (over ~64KB, this value is system dependent). From the man page on read:


RETURN VALUES
     If successful, the number of bytes actually read is returned.  Upon read-
     ing end-of-file, zero is returned.  Otherwise, a -1 is returned and the
     global variable errno is set to indicate the error.


So the bug is that 'read-buffer' doesn't use chunked reads and perhaps consequently also does not keep reading until 0 is returned. Proper use of read is something along these lines:


int read-buffer(int fd, unsigned char *buff, int possibleSize) {
int chunk_size = possibleSize > 8192 ? 8192 : possibleSize;
unsigned char *p = buff;
int amnt;

while ( (amnt = read(fd, p, chunk_size)) != 0 && possibleSize > 0 ) {
if ( amnt == -1 ) return -1; // error!
p += amnt;
possibleSize -= amnt;
if ( possibleSize < chunk_size )
chunk_size = possibleSize;
}

return (int)(p - buff);
}
Get your Objective newLISP groove on.

Lutz

#1
The function 'read-buffer' without the wait option does a one buffer read(), like it should: on error it returns 'nil' and 'sys-error' is set, on EOF, it returns 'nil' as well and does not set 'sys-error'. If you expect to read size N bytes, it is only natural to allocate N bytes internally, if less bytes are read, memory is re-allocated to the smaller size or freed on error.  If you want to read in smaller chunks you have to write this, and a 'nil' return value on EOF is useful to exit your 'while' loop.



ps: your post seems to be related to the email you sent me about Apache and CGI processes. Consider that TCP/IP may transmit a big buffer in smaller chunks and translated by Apache into multiple std I/O events. Your CGI process receiving that data must handle that situation.

itistoday

#2
Quote from: "Lutz"The function 'read-buffer' without the wait option does a one buffer read(), like it should: on error it returns 'nil' and 'sys-error' is set, on EOF, it returns 'nil' as well and does not set 'sys-error'. If you expect to read size N bytes, it is only natural to allocate N bytes internally, if less bytes are read, memory is re-allocated to the smaller size or freed on error.  If you want to read in smaller chunks you have to write this, and a 'nil' return value on EOF is useful to exit your 'while' loop.


With all due respect, this doesn't make any sense.



It's still a fact that:


[*] The behavior of 'read-buffer' is not documented. The fact that it always allocates a buffer of <int-size> is not mentioned.
  • [*] Because of the way 'read-buffer' is written, it cannot be used to read a large amount of data. This is a bug, and the only way to fix it is by using 'read' in a manner similar to what I showed above.
  • [/list]


    The documentation states that a maximum of <int-size> will be read. This is not the same as expecting <int-size> to be read. Again, if you *expect* anywhere from 1KB to 100MB to be read (say with 1MB being the average), then it does not make sense for 'read-buffer' to always allocate 100MB. newLISP should be smarter and more flexible than that, allocating a small buffer initially, and expanding it with 'realloc' if necessary.



    I have updated the mercurial repository of Dragonfly to compensate for newLISP's poor 'read-buffer', but tell me what is better, this:


    (define (handle-binary-data , (chunk "") (chunk-size 8192) (max-bytes MAX_POST_LENGTH) read)
    (while (and (setf read (read-buffer (device) chunk chunk-size)) chunk (not (zero? max-bytes)))
    (write-buffer $BINARY chunk)
    (dec max-bytes read)
    (when (< max-bytes chunk-size) (setf chunk-size max-bytes))
    )
    )


    Or this:


    (read-buffer (device) $BINARY MAX_POST_LENGTH)

    I find it very odd that you would say that it is better for the newLISP scripter to write what essentially is C-style code to solve this problem in newLISP when it should instead be solved by newLISP for them. The current way that 'read-buffer' must be used is essentially converting that C-code I showed in the previous post into newLISP code, and this is neither necessary nor convenient nor efficient.


    Quote from: "Lutz"ps: your post seems to be related to the email you sent me about Apache and CGI processes. Consider that TCP/IP may transmit a big buffer in smaller chunks and translated by Apache into multiple std I/O events. Your CGI process receiving that data must handle that situation.


    UNIX programs and pipes do not work this way. Your C-program should call 'read' multiple times, to keep reading in chunks until it reaches <int-size> or EOF or error. Apache does the same, except with the standard function 'write'. Apache will call 'write' multiple times as it gets the data from the client, transferring the output to the CGI script. The fact that it calls 'write' multiple times does not mean that "multiple std I/O events" will happen. EOF is only sent once the file descriptor is closed, and that only happens once all of the data has been transferred.



    I really hope you will reconsider and fix 'read-buffer' so that it behaves in a standard and flexible way, the way that most newLISP functions behave. As it stands 'read-buffer' is an eyesore.
    Get your Objective newLISP groove on.