updated manual and XML test 1.1.0

author: steve donovan <steve.j.donovan@gmail.com> 2013-03-08 16:21:53 +0400
committer: steve donovan <steve.j.donovan@gmail.com> 2013-03-08 16:21:53 +0400
commit: 5c6078254c0f2ec81572d726c2027354d943dbfa
tree: 3d7f16930eb5e9fed9f84c8e4d434ce2afd86610
parent: 2c0de51c7f77deb8c396daf8448a6aa9dac6a96f
5 files changed, 139 insertions, 16 deletions
diff --git a/docs/config.ld b/docs/config.ld
index 1130a28..13d4c94 100644
--- a/docs/config.ld
+++ b/docs/config.ld
@@ -1,5 +1,5 @@
 project = 'Penlight'
-description = 'Penlight Lua Libraries 1.0.3'
+description = 'Penlight Lua Libraries 1.1.0'
 full_description = 'The documentation is available @{01-introduction.md|here}.'
 title = 'Penlight Documentation'
 dir = 'api'
diff --git a/docs/manual/01-introduction.md b/docs/manual/01-introduction.md
index 2006028..5b5deb1 100644
--- a/docs/manual/01-introduction.md
+++ b/docs/manual/01-introduction.md
@@ -93,7 +93,6 @@ formal need to keep the global table uncluttered and the informal need for
 convenience. `require'pl.import_into'` returns a function, which accepts a table
 for injecting Penlight into, or if no table is given, it passes back a new one.
 
-
     local pl = require'pl.import_into'()
 
 The table `pl` is a 'lazy table' which loads modules as needed, so we can then
@@ -194,6 +193,14 @@ For example,
 If you were to accidently type `mymod.Answer()`, then you would get a runtime
 error: "variable 'Answer' is not declared in 'mymod'".
 
+This can be applied to existing modules. You may desire to have the same level
+of checking for the Lua standard libraries:
+
+    strict.make_all_strict(_G)
+
+Thereafter a typo such as `math.cosine` will give you an explicit error, rather
+than merely returning a `nil` that will cause problems later.
+
 ### What are function arguments in Penlight?
 
 Many functions in Penlight themselves take function arguments, like `map` which
@@ -273,6 +280,8 @@ The function `printf` discussed earlier is included in `pl.utils` because it
 makes properly formatted output easier. (There is an equivalent `fprintf` which
 also takes a file object parameter, just like the C function.)
 
+Splitting a string using a delimiter is a fairly common operation, hence `split`.
+
 Utility functions like `is_callable` and `is_type` help with identifying what
 kind of animal you are dealing with. Obviously, a function is callable, but an
 object can be callable as well if it has overriden the `__call` metamethod. The
@@ -319,7 +328,7 @@ upfront, since in general you won't know what values are needed.
 
 Penlight is fully compatible with Lua 5.1, 5.2 and LuaJIT 2. To ensure this,
 `utils` also defines the global Lua 5.2
-[load](http://www.lua.org/work/doc/manual.html#pdf-load) function when needed.
+[load](http://www.lua.org/work/doc/manual.html#pdf-load) function as `utils.load`
 
  * the input (either a string or a function)
  * the source name used in debug information
@@ -327,9 +336,15 @@ Penlight is fully compatible with Lua 5.1, 5.2 and LuaJIT 2. To ensure this,
 whether the source is a binary chunk or text code (default is 'bt')
  * the environment for the compiled chunk
 
-Using `load` should reduce the need to call the deprecated function `setfenv`,
+Using `utils.load` should reduce the need to call the deprecated function `setfenv`,
 and make your Lua 5.1 code 5.2-friendly.
 
+Currently, the `utils` module does define a global `getfenv` and `setfenv` for
+Lua 5.2, based on code by Sergey Rozhenko.  Note that these functions can fail
+for functions which don't access any globals. (whether it's wise to directly
+inject these functions into global or not, I'll leave for a later version to
+decide)
+
 ### Application Support
 
 `app.parse_args` is a simple command-line argument parser. If called without any
diff --git a/docs/manual/03-strings.md b/docs/manual/03-strings.md
index 5312dc3..a408fc9 100644
--- a/docs/manual/03-strings.md
+++ b/docs/manual/03-strings.md
@@ -39,7 +39,8 @@ easily at hand. Note that can be injected into the `string` table if you use
 `stringx.import`, but a simple alias like `local stringx = require 'pl.stringx'`
 is preferrable. This is the recommended practice when writing modules for
 consumption by other people, since it is bad manners to change the global state
-of the rest of the system.
+of the rest of the system. Magic may be used for convenience, but there is always
+a cost.
 
 
 ### String Templates
diff --git a/docs/manual/06-data.md b/docs/manual/06-data.md
index ea69330..e64ee33 100644
--- a/docs/manual/06-data.md
+++ b/docs/manual/06-data.md
@@ -224,9 +224,15 @@ have to use `==` (this warning comes from experience.)
 
 For this to work, _field names must be Lua identifiers_. So `read` will massage
 fieldnames so that all non-alphanumeric chars are replaced with underscores.
+However, the `original_fieldnames` field always contains the original un-massaged
+fieldnames.
 
 `read` can handle standard CSV files fine, although doesn't try to be a
-full-blown CSV parser. Spreadsheet programs are not always the best tool to
+full-blown CSV parser.  With the `csv=true` option, it's possible to have
+double-quoted fields, which may contain commas; then trailing commas become
+significant as well.
+
+Spreadsheet programs are not always the best tool to
 process such data, strange as this might seem to some people. This is a toy CSV
 file; to appreciate the problem, imagine thousands of rows and dozens of columns
 like this:
@@ -263,6 +269,34 @@ condition (such as belonging to a specified set) then it is not generally
 possible to express such a condition as a query string, without resorting to
 hackery such as global variables.
 
+With 1.0.3, you can specify explicit conversion functions for selected columns.
+For instance, this is a log file with a Unix date stamp:
+
+    Time Message
+    1266840760 +# EE7C0600006F0D00C00F06010302054000000308010A00002B00407B00
+    1266840760 closure data 0.000000 1972 1972 0
+    1266840760 ++ 1266840760 EE 1
+    1266840760 +# EE7C0600006F0D00C00F06010302054000000408020A00002B00407B00
+    1266840764 closure data 0.000000 1972 1972 0
+
+We would like the first column as an actual date object, so the `convert`
+field sets an explicit conversion for column 1. (Note that we have to explicitly
+convert the string to a number first.)
+
+    Date = require 'pl.Date'
+
+    function date_convert (ds)
+        return Date(tonumber(ds))
+    end
+
+    d = data.read(f,{convert={[1]=date_convert},last_field_collect=true})
+
+This gives us a two-column dataset, where the first column contains `Date` objects
+and the second column contains the rest of the line. Queries can then easily
+pick out events on a day of the week:
+
+    q = d:select "Time,Message where Time:weekday_name()=='Sun'"
+
 Data does not have to come from files, nor does it necessarily come from the lab
 or the accounts department. On Linux, `ps aux` gives you a full listing of all
 processes running on your machine. It is straightforward to feed the output of
@@ -303,7 +337,7 @@ And it can be used generally as a filter command to extract columns from data.
 
 (As with AWK, please note the single-quotes used in this command; this prevents
 the shell trying to expand the column indexes. If you are on Windows, then you
-are fine, but it is still necessary to quote the expression in double-quotes so
+must quote the expression in double-quotes so
 it is passed as one argument to your batch file.)
 
 As a tutorial resource, have a look at `test-data.lua` in the PL tests directory
@@ -477,7 +511,8 @@ the following fields:
        list_delim = ',',
        trim_quotes = true,
        ignore_assign = false,
-       keysep = '='
+       keysep = '=',
+       smart = false,
     }
 
 `variablilize` is the option that converted `write.timeout` in the first example
@@ -554,7 +589,27 @@ That result is a string, since `tonumber` doesn't like it, but defining the
 `convert_numbers` option as `function(s) return tonumber((s:gsub(' kB$','')))
 end` will get the memory figures as actual numbers in the result. (The extra
 parentheses are necessary so that `tonumber` only gets the first result from
-`gsub`)
+`gsub`). From `tests/test-config.lua':
+
+    testconfig([[
+    MemTotal:        1024748 kB
+    MemFree:          220292 kB
+    ]],
+    { MemTotal = 1024748, MemFree = 220292 },
+    {
+     keysep = ':',
+     convert_numbers = function(s)
+        s = s:gsub(' kB$','')
+        return tonumber(s)
+      end
+     }
+    )
+
+
+The `smart` option lets `config.read` make a reasonable guess for you; there
+are examples in `tests/test-config.lua`, but basically these common file
+formats (and those following the same pattern) can be processed directly in
+smart mode: 'etc/fstab', '/proc/XXXX/status', 'ssh_config' and 'pdatedb.conf'.
 
 Please note that `config.read` can be passed a _file-like object_; if it's not a
 string and supports the `read` method, then that will be used. For instance, to
@@ -675,7 +730,7 @@ snippet from 'text-lexer.lua':
     test.asserteq(ls,List{'for','in','do','if','then','else','end','end'})
 
 Here is a useful little utility that identifies all common global variables found
-in a lua module:
+in a lua module (ignoring those declared locally for the moment):
 
     -- testglobal.lua
     require 'pl'
@@ -721,7 +776,8 @@ specialized library.
 
 #### Parsing and Pretty-Printing
 
-The semi-standard XML parser in the Lua universe is [lua-expat](). In particular,
+The semi-standard XML parser in the Lua universe is [lua-expat](http://matthewwild.co.uk/projects/luaexpat/).
+In particular,
 it has a function called `lxp.lom.parse` which will parse XML into the Lua Object
 Model (LOM) format. However, it does not provide a way to convert this data back
 into XML text.  `xml.parse` will use this function, _if_ `lua-expat` is
@@ -758,7 +814,7 @@ also as an array. It is always present.
 the first child of `d`, etc.
 
 It could be argued that having attributes also as the array part of `attr` is not
-essential (you generally cannot depend on attribute order in XML) but that's how
+essential (you cannot depend on attribute order in XML) but that's how
 it goes with this standard.
 
 `lua-expat` is another _soft dependency_ of Penlight; generally, the fallback
@@ -826,6 +882,7 @@ on Debian/Ubuntu Linux systems.
 
     d = xml.parse [[
     <serviceproviders format="2.0">
+    ...
     <country code="za">
         <provider>
             <name>Cell-c</name>
@@ -873,7 +930,7 @@ on Debian/Ubuntu Linux systems.
             </gsm>
         </provider>
     </country>
-
+    ....
     </serviceproviders>
     ]]
 
@@ -882,7 +939,7 @@ Getting the names of the providers per-country is straightforward:
     local t = {}
     for country in d:childtags() do
         local providers = {}
-        t[country.tag] = providers
+        t[country.attr.code] = providers
         for provider in country:childtags() do
             table.insert(providers,provider:child_with_name('name'):get_text())
         end
@@ -891,12 +948,13 @@ Getting the names of the providers per-country is straightforward:
     pretty.dump(t)
     -->
     {
-      country = {
+      za = {
         "Cell-c",
         "MTN",
         "Vodacom",
         "Virgin Mobile"
       }
+      ....
     }
 
 #### Generating XML with 'xmlification'
@@ -996,4 +1054,26 @@ The `match` method can be passed a LOM document or some text, which will be
 parsed first. Note that `$NUMBER` is treated specially as a numerical index, so
 that `$1` is the first element of the resulting array, etc.
 
+#### HTML Parsing
+
+HTML is an ususally slack dialect of XML, and Dennis Schridde has contributed
+a feature which makes parsing it easier.  For instance, from the tests:
+
+    xml.parsehtml = true
+
+    doc = xml.parse [[
+    <BODY>
+    Hello dolly<br>
+    HTML is <b>slack</b><br>
+    </BODY>
+    ]]
+
+    asserteq(xml.tostring(doc),[[
+    <body>
+    Hello dolly<br/>
+    HTML is <b>slack</b><br/></body>]])
+
+That is, all tags are converted to lowercase, and some elements like `br`
+are properly closed.
+
 
diff --git a/tests/test-xml.lua b/tests/test-xml.lua
index dde8ce8..14fe3c1 100644
--- a/tests/test-xml.lua
+++ b/tests/test-xml.lua
@@ -391,4 +391,31 @@ t = SP{country{code="$country",provider{
    name '$name', gsm{apn {value="$apn",dns '196.43.46.190'}}
    }}}
 
-print(xml.tostring(t,' ','  '))
+out = xml.tostring(t,' ','  ')
+asserteq(out,[[
+
+ <serviceprovider>
+   <country code='$country'>
+     <provider>
+       <name>$name</name>
+       <gsm>
+         <apn value='$apn'>
+           <dns>196.43.46.190</dns>
+         </apn>
+       </gsm>
+     </provider>
+   </country>
+ </serviceprovider>]])
+
+xml.parsehtml = true
+doc = parse [[
+<BODY>
+Hello dolly<br>
+HTML is <b>slack</b><br>
+</BODY>
+]]
+
+asserteq(xml.tostring(doc),[[
+<body>
+Hello dolly<br/>
+HTML is <b>slack</b><br/></body>]])