github.com/hxseven/htmlSQL.git
author: Jonas John <jonas@jonasjohn.de> 2012-02-09 14:49:06 +0400
committer: Jonas John <jonas@jonasjohn.de> 2012-02-09 14:49:06 +0400
commit: ce9c480292ec48b7bae27ac4e82057aebfbdceee
tree: 68d2ca39b84618fc8fc9781c7306c056e237766d
parent: 86ea60d5dabfa6bba23ad29d6daaf8c8d7c3a10a
Minor code cleanup
-rwxr-xr-x README-german.md 20
-rwxr-xr-x README.md 33
-rwxr-xr-x htmlsql.class.php 70
3 files changed, 57 insertions, 66 deletions
diff --git a/README-german.md b/README-german.md
index 297d195..662f1d0 100755
--- a/README-german.md
+++ b/README-german.md
@@ -16,7 +16,7 @@ wie diese ausführt:
Diese Abfrage gibt einen Array aller Links mit dem Attribut class="liste"
zurück.
-Alle HTTP Verbindungen in htmlSQL benützen die wunderbare Snoopy Klasse
+Alle HTTP Verbindungen in htmlSQL benützen die Snoopy Klasse
(Package Version 1.2.3 - URL: http://sourceforge.net/projects/snoopy/).
Allerdings wird Snoopy nicht für "file" oder "string" Queries benötigt.
Alle Snoopy betreffenden Dokumente (z.B: Copyright-Infos, Readme, usw.)
@@ -45,7 +45,7 @@ eine universelle Klasse dafür zu entwickeln.
Warnung
-------
-Für die Abfragen wird die eval()-Funktion benützt. Deshalb sollten alle
+Für die Abfragen wird die `eval()` Funktion benützt. Deshalb sollten alle
vom Besucher abhängige Daten wie z.b. IDs geprüft oder ggf. gefiltert
werden da es ansonsten möglich wäre schadhaften PHP Quelltext auszuführen.
Vertraue niemals Benutzereingaben!
@@ -55,10 +55,11 @@ Todo
----
- Den internen HTML Parser verbessern
-- Ein eigenes Query system entwickeln und nicht
- das PHP eigene nutzen ( Die eval()-Lösung gefällt mir nicht wirklich)
+- Ein eigenes Query system entwickeln und nicht das PHP eigene nutzen
+ (Die eval()-Lösung ist nicht wirklich schön)
- Mehr Fehlerprüfungen
-- LIMIT Funktion einbauen
+- Unit tests
+- LIMIT Funktion (wie in SQL)
Anwendungsgebiete von htmlSQL
@@ -80,12 +81,3 @@ Lizenz
htmlSQL benützt eine modifizierte BSD Lizenz, welche ziemlich offen ist.
Der Lizenztext befindet sich in der "htmlsql.class.php".
-Kurz zusammengefasst besagt er folgendes:
-
-- Die htmlSQL Klasse kann frei in kommerziellen und nicht-kommerziellen Projekten benützt werden
-- Die Klasse darf mit oder ohne Änderungen frei weitergegeben werden
-- Der Copyright-Hinweis darf nicht entfernt werden
-- Der Autor übernimmt keine Haftung für eventuelle Schäden
-- Der Name des Autors oder anderen beteiligten Autoren darf nur mit
- schriftlicher Genehmigung benützt werden um für Produkte, welche
- htmlSQL benützen, zu werben
diff --git a/README.md b/README.md
index 1b471bb..db2399c 100755
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
htmlSQL - Version 0.5
=====================
-htmlSQL is a experimental PHP class which allows you to access HTML
+htmlSQL is a experimental PHP library which allows you to access HTML
values by an SQL like syntax. This means that you don't have to write
complex functions or regular expressions to extract specific values.
@@ -20,28 +20,30 @@ The project has been abandoned
------------------------------
htmlSQL was a experiment I made in 2006. I'm **not** supporting or extending the library anymore, this repository is only for historical purposes.
-But feel free to fork, modify and study the source code. If you need a reliable library for data scraping I recommend using **other modules** (see below).
+But feel free to fork, modify and study the source code. If you need a reliable library for data scraping I recommend using **other modules**.
Related projects:
-* PHP: [SimpleXML](http://www.php.net/dom), [DOM](http://www.php.net/dom)
-* Perl: [pQuery](http://search.cpan.org/~ingy/pQuery-0.07/lib/pQuery.pm)
-* Python: [Scrapy](http://scrapy.org/)
+* PHP: [phpQuery](http://code.google.com/p/phpquery/), [SimpleXML](http://www.php.net/simplexml), [DOM](http://www.php.net/dom)
+* Perl: [WWW::Mechanize](http://search.cpan.org/dist/WWW-Mechanize/), [pQuery](http://search.cpan.org/~ingy/pQuery-0.07/lib/pQuery.pm)
+* Python: [Scrapy](http://scrapy.org/), [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)
* JavaScript: [node.js](http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs)
+* .NET: [Html Agility Pack](http://htmlagilitypack.codeplex.com/)
+Related links:
-Related Hacker News threads:
-
-* [PHP class to query the web by an SQL like language](http://news.ycombinator.com/item?id=2097008)
-* [Ask YC: What do you scrape? How do you scrape?](http://news.ycombinator.com/item?id=159025)
+* [Stack Overflow: Options for HTML scraping?](http://stackoverflow.com/questions/2861/options-for-html-scraping)
+* [Stack Overflow: HTML Scraping in PHP](http://stackoverflow.com/questions/34120/html-scraping-in-php)
+* [Hacker News: PHP class to query the web by an SQL like language](http://news.ycombinator.com/item?id=2097008)
+* [Hacker News: Ask YC: What do you scrape? How do you scrape?](http://news.ycombinator.com/item?id=159025)
Requirements
------------
- Any flavor of PHP4+ should do
-- [Snoopy PHP class - Version 1.2.3](http://sourceforge.net/projects/snoopy/) (optional - required for web transfers)
+- [Snoopy PHP class - Version 1.2.3](http://sourceforge.net/projects/snoopy/) (optional - required for web transfers)
You find all Snoopy related documents (copyright, readme, etc) in the snoopy_data/ subdirectory.
@@ -50,7 +52,7 @@ Usage
Just include the "snoopy.class.php" and the "htmlsql.class.php" files
into your PHP scripts and look at the examples to get an idea of how
-to use the htmlSQL class. It should be very simple :-)
+to use the htmlSQL library. It should be very simple :-)
Background / idea
@@ -59,9 +61,9 @@ Background / idea
I had this idea while extracting some data from a website. As I realized
that the algorithms and functions to extract links and other tags are
often the same - I had the idea to combine all functions to an universal
-usable class. While drinking a coffee and thinking on that problem, I
+usable library. While drinking a coffee and thinking about that, I
thought it would be cool to access HTML elements by using SQL. So I
-started creating this class...
+started creating this library...
Warning
@@ -78,8 +80,9 @@ Todo
* Enhance the HTML parser
* Test htmlSQL with invalid and bad HTML files
* Replace the ugly `eval()` method for the WHERE statement with an own method
-* More error checks
-* Include the LIMIT function/method like in SQL
+* Add more error checks
+* Add unit tests
+* Add a LIMIT function like in SQL
Author
diff --git a/htmlsql.class.php b/htmlsql.class.php
index 2292073..0a21f13 100755
--- a/htmlsql.class.php
+++ b/htmlsql.class.php
@@ -3,13 +3,13 @@
/*
htmlSQL - version 0.5
--------------------------------------------------------------------
-htmlSQL is a experimental class to query websites or HTML code with
+htmlSQL is a experimental library to query websites or HTML code with
an SQL-like language.
AUTHOR: Jonas John (http://www.jonasjohn.de/)
The latest version of htmlSQL can be obtained from:
-http://www.jonasjohn.de/lab/htmlsql.htm
+https://github.com/hxseven/htmlSQL
LICENSE:
--------------------------------------------------------------------
@@ -45,14 +45,12 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
CHANGELOG:
0.4 -> 0.5 (May 07, 2006):
-- Renamed the project from webSQL to htmlSQL, because webSQL
- is already existing... :-(
-- Added some error checks and error messages
-- Added the convert_tagname_to_key function and
- fixed a few issues
+- Renamed the project from webSQL to htmlSQL because webSQL already exists
+- Added more error checks
+- Added the convert_tagname_to_key function and fixed a few issues
0.1 -> 0.4 (April 2006):
-- Created main parts of the class
+- Created main parts of the library
*/
@@ -146,18 +144,25 @@ class htmlsql {
** connects to a data source (url, file or string)
*/
- function connect($type, $resource){
+ function connect($type, $resource){
+
if ($type == 'url'){
return $this->_fetch_url($resource);
}
- else if ($type == 'file') {
+ else if ($type == 'file') {
+
if (!file_exists($resource)){
$this->error = 'The given file "'.$resource.' does not exist!';
return false;
- }
- $this->page = file_get_contents($resource); return true;
+ }
+
+ $this->page = file_get_contents($resource);
+ return true;
+ }
+ else if ($type == 'string') {
+ $this->page = $resource;
+ return true;
}
- else if ($type == 'string') { $this->page = $resource; return true; }
return false;
}
@@ -201,7 +206,8 @@ class htmlsql {
else {
$this->error = 'Could not establish a connection to the given URL!';
return false;
- }
+ }
+
return true;
}
@@ -214,10 +220,11 @@ class htmlsql {
function _extract_all_tags($html, &$tag_names, &$tag_attributes, &$tag_values, $depth=0){
- // stop endless loops:
- if ($depth > 99999){ return; }
+ // stop endless loops -> ugly...
+ if ($depth > 99999) return;
- preg_match_all('/<([a-z0-9\-]+)(.*?)>((.*?)<\/\1>)?/is', $html, $m);
+ preg_match_all('/<([a-z0-9\-]+)(.*?)>((.*?)<\/\1>)?/is', $html, $m);
+
if (count($m[0]) != 0){
for ($t=0; $t < count($m[0]); $t++){
@@ -332,8 +339,7 @@ class htmlsql {
return false;
}
- return $r;
-
+ return $r;
}
@@ -433,9 +439,7 @@ class htmlsql {
$search_term = $last;
}
- /*
- ** find tags:
- */
+ // find tags
if ($search_term == '*'){
// search all
@@ -448,8 +452,7 @@ class htmlsql {
$this->_extract_all_tags($html, $tag_names, $tag_attributes, $tag_values);
- $this->_match_tags($results, $return_values, $where_term, $tag_attributes, $tag_values, $tag_names);
-
+ $this->_match_tags($results, $return_values, $where_term, $tag_attributes, $tag_values, $tag_names);
}
else {
@@ -474,12 +477,7 @@ class htmlsql {
$this->results = $results;
// was there a error during the search process?
- if ($this->error != ''){
- return false;
- }
-
- return true;
-
+ return ($this->error == '');
}
/*
@@ -490,7 +488,8 @@ class htmlsql {
function convert_tagname_to_key(){
- $new_array = array();
+ $new_array = array();
+ $tag_name = '';
while(list($key,$val) = each($this->results)){
@@ -559,13 +558,10 @@ class htmlsql {
$results[$key] = $this->_array2object($val);
}
- $this->results_objects = $results;
-
- return $this->results_objects;
- }
- else {
- return $this->results_objects;
+ $this->results_objects = $results;
}
+
+ return $this->results_objects;
}
/*
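Note on the `eval()` warning in the README hunks: htmlSQL evaluates the WHERE part of a query with `eval()`, so any visitor-supplied value has to be checked before it is placed in a query string. A minimal sketch of such a check in plain PHP follows. The helper name `build_id_query`, the allowed character set, and the query text are illustrative assumptions and are not part of htmlSQL or of this commit.

```php
<?php
// Hypothetical helper (not part of htmlSQL): returns a query string only
// when the visitor-supplied ID is a plain identifier, otherwise false.
function build_id_query($raw_id)
{
    // Accept letters, digits, underscore and hyphen only (1 to 64 chars),
    // so quotes, parentheses and semicolons can never reach eval().
    if (!is_string($raw_id) || !preg_match('/\A[A-Za-z0-9_-]{1,64}\z/', $raw_id)) {
        return false;
    }

    // Assumed query shape, modeled on the SQL-like syntax the README describes.
    // Single quotes keep "$id" literal instead of interpolating a PHP variable.
    return 'SELECT * FROM a WHERE $id == "' . $raw_id . '"';
}

$accepted = build_id_query('nav-main');           // a query string
$rejected = build_id_query('x"; phpinfo(); //');  // false, input is refused
```

A whitelist is used here rather than escaping because the value ends up inside evaluated PHP code, where missing a single special character would be enough to allow code injection.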