Commit 995b3518 authored by Bjørnar Hansen's avatar Bjørnar Hansen Committed by Bjørnar Hansen

Open source release.

*.swp
*.swo
*.err
*.log
crawl.json
.cabal-sandbox
cabal.sandbox.config
.python-sandbox
[submodule "hs-ttrpc"]
path = hs-ttrpc
url = git@gitlab.tingtun.no:eiii_source/hs-ttrpc.git
[submodule "lib/msgpack-haskell"]
path = lib/msgpack-haskell
url = https://github.com/tingtun/msgpack-haskell.git
[submodule "wam"]
path = wam
url = git@gitlab.tingtun.no:eiii_source/wam.git
[submodule "hs-checker-common"]
path = hs-checker-common
url = git@gitlab.tingtun.no:eiii_source/hs-checker-common.git
[submodule "lib/html5parser"]
path = lib/html5parser
url = git@gitlab.tingtun.no:eiii_source/html5parser.git
[submodule "lib/iso639-language-codes"]
path = lib/iso639-language-codes
url = git@gitlab.tingtun.no:eiii_source/iso639-language-codes.git
[submodule "databus"]
path = databus
url = git@gitlab.tingtun.no:eiii_source/databus.git
[submodule "sampler"]
path = sampler
url = git@gitlab.tingtun.no:eiii_source/sampler.git
[submodule "eiii-crawler"]
path = eiii-crawler
url = git@gitlab.tingtun.no:eiii_source/eiii_crawler.git
[submodule "py-ttrpc"]
path = py-ttrpc
url = git@gitlab.tingtun.no:eiii_source/py-ttrpc.git
[submodule "logging"]
path = logging
url = git@gitlab.tingtun.no:eiii_source/logging.git
# Installing the EIII checker suite
## Dependencies
First, make sure that you have the required dependencies installed. Different
operating system distributions provide these in packages with different names.
Here is the list of dependencies:
- ghc >=7.10
- cabal
- logrotate
- phantomjs 2
- postgresql 9.4
- python 2
- virtualenv 2
- pip 2
- zeromq 4
- python-psycopg 2
For Debian, most of these can be installed with `apt-get`:
sudo apt-get install python-virtualenv python-pip python-dev libzmq3-dev \
  libpq-dev gcc g++ happy python-psycopg2 postgresql
The GHC and Cabal packages available in the Debian repositories are too old; get the source
distributions from [haskell.org](http://www.haskell.org/) and install them manually
([see here for instructions](https://gist.github.com/yantonov/10083524); ignore the *stack* parts).
## Installation
Yes, this could be automated. That wouldn't be half as fun, now would it?
1. Clone the master repository and submodules
git clone --recursive git@gitlab.tingtun.no:eiii_source/checker-suite
cd checker-suite
2. Get Selenium-server-standalone
cd selenium
./getsel.sh
cd ..
3. Create and activate Python sandbox
virtualenv2 -p/usr/bin/python2 .python-sandbox
source .python-sandbox/bin/activate
4. Install Python dependencies
pip install superlance supervisor
5. Install crawler
pip install --allow-unverified sgmlop ./py-ttrpc ./eiii-crawler
6. Create Haskell sandbox
cabal update
cabal sandbox init --sandbox=.cabal-sandbox
cabal sandbox add-source ./hs-checker-common \
./lib/{html5parser,iso639-language-codes,msgpack-haskell/msgpack} \
./sampler ./logging ./hs-ttrpc ./wam ./databus
7. Install Haskell packages (grab a cup of tea)
cabal install ./databus ./wam
8. Create database and schema
cd databus
export username=$(whoami)
export dbname=eiii
sudo -u postgres createuser $username --createdb
sudo -u postgres createdb -U $username $dbname
sudo -u postgres psql -c "
create extension if not exists \"tablefunc\";
create extension if not exists \"uuid-ossp\";
create language plpythonu;
create role dba with superuser noinherit;
grant dba to $username;
"
sudo -u $username psql -d$dbname -f schema.sql
cd ..
If all the steps succeeded, the installation of the checker suite
is now complete.
## Running the checker suite
- Update configuration in `checker-suite.conf`.
- To start the checker suite, run `./checker-suite`.
Copyright (c) 2015, Tingtun AS, http://tingtun.no
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# EIII checker suite
This is the main source code repository for the EIII checker suite; it pulls in
the suite's various components as git submodules. A
[Supervisor](http://supervisord.org/) configuration for launching the suite is provided.
The checker suite has been tested on Linux only.
## Installation
See [INSTALL](INSTALL.md).
## Running
The checker suite is configured in [checker-suite.conf](checker-suite.conf), and started by running `./checker-suite`.
## Main capabilities and functions
There are three main capabilities available for use.
- Check a single web page
- Check a single web site
- Check multiple web sites
### Check a single web page
Using the command line interface, a check of a single web page can be accomplished with the `./checkerctl wam-check` command. For example,
./checkerctl wam-check http://www.example.com
will run a page check of [example.com](http://www.example.com) and return the results on stdout in YAML format.
This can be done through the HTTP interface as well. For instance, using cURL, if 'httpctl' is configured to listen on `localhost:9014` (the default):
curl 'localhost:9014/wam-check?url=http://www.example.com&type=raw'
This returns the checker result in JSON format.
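Since the query string mixes `:`, `/`, and `&`, which both the shell and HTTP treat specially, it can help to build the request URL programmatically. A minimal sketch, assuming the default `localhost:9014` endpoint from above (the helper name `wam_check_url` is our own):

```python
from urllib.parse import urlencode

def wam_check_url(page_url, base="http://localhost:9014", result_type="raw"):
    """Build the httpctl wam-check URL, percent-encoding the target page URL."""
    query = urlencode({"url": page_url, "type": result_type})
    return f"{base}/wam-check?{query}"

# The encoded URL can then be passed to curl or any HTTP client.
print(wam_check_url("http://www.example.com"))
# → http://localhost:9014/wam-check?url=http%3A%2F%2Fwww.example.com&type=raw
```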
### Check a single web site
When you start a check of a web site, it will first be crawled and then a selection of the found URLs will be checked.
Using the HTTP interface, submit a POST request like so
curl -XPOST localhost:9014/start-site-check?url=http://example.com
while the equivalent 'checkerctl' command is
./checkerctl start-site-check http://www.example.com
This generates some default crawler, sampler, and checker rules and starts the site check. It returns a UUID, the identifier for the eventual site result.
You can fetch the site result immediately, but it will not contain complete information until the site check has finished. The way to get the result is as follows:
#### HTTP:
curl localhost:9014/site-result/<UUID>
#### CMD
./checkerctl get-site-result <UUID>
where `<UUID>` is the UUID of the site result. These commands return only the site result itself; for the complete result set including page results, replace 'site-result' with 'site-page-results' in the commands above. This returns the tuple (site result, list of page results); be advised that the returned data may be very large.
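Because the result stays incomplete until the check finishes, a client will typically poll. Below is a hedged Python sketch of such a loop; the `fetch` callable (e.g. a wrapper around `GET /site-result/<UUID>`) and the `is_complete` predicate are assumptions, since the exact completion criterion depends on the site-result format:

```python
import time

def wait_for_site_result(fetch, uuid, is_complete, interval=5.0, max_tries=120):
    """Poll fetch(uuid) until is_complete(result) holds, or give up.

    fetch       -- callable returning the (possibly partial) site result
    is_complete -- callable deciding whether the result is final; the real
                   criterion depends on the site-result format (assumed here)
    """
    for _ in range(max_tries):
        result = fetch(uuid)
        if is_complete(result):
            return result
        time.sleep(interval)
    raise TimeoutError(f"site check {uuid} did not complete in time")
```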
### Check multiple web sites
It is also possible to check many sites in one go. This is termed a testrun.
Before a testrun can be started, it must be defined. Example definitions in JSON and YAML format are provided.
Once defined, the definition file is used to create a set of testrun rules in the database. This ruleset is assigned a unique identifier, which is then used to start a testrun with those rules.
Here is how to do this with the supplied example files. The CMD interface uses YAML while the HTTP interface uses JSON.
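In the example files, each site maps to a four-element list: site name, crawler rules, sampler rules, and checker rules. A sketch of that shape in Python, mirroring the YAML example shipped with this repository (the concrete values are illustrative only):

```python
import json

html_types = ["text/html", "application/xml", "application/xhtml+xml"]

# One site entry: [site name, crawler rules, sampler rules, checker rules]
site_rules = [
    "Tingtun",                                   # site name
    {                                            # crawler rules
        "seeds": [],
        "scoping-rules": [],
        "obey-robotstxt": True,
        "min-crawl-delay": 2,
        "max-pages": [[html_types, 50]],
        "size-limits": [[html_types, 2097152]],  # 2 MiB
    },
    {"max-pages": [[html_types, 600]]},          # sampler rules
    {"tools": [[html_types, ["webpage-wam"]]]},  # checker rules
]

# The same structure serializes to the JSON form used by the HTTP interface.
print(json.dumps(site_rules)[:40])
```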
#### CMD
# First we upload the testrun definition to the database.
# When successful, the command returns a UUID.
% ./checkerctl create-testrun example-testrun-rules.yml \
example-testrun-sites.yml
c1b69974-6839-11e5-82ef-2f129eb1698d # this is the UUID that was generated for us
# Using the UUID for the testrun ruleset, we can start the testrun.
# Again this returns a UUID, which identifies the testrun result.
% ./checkerctl start-stored-testrun c1b69974-6839-11e5-82ef-2f129eb1698d
cc0e0065-29b0-4eac-88dc-6ac8fc66c4b8
#### HTTP
% curl -XPOST 'localhost:9014/create-testrun' -d@example-testrun.json
6fba2dce-683a-11e5-82ef-b77119365d34
% curl -XPOST 'localhost:9014/start-testrun/6fba2dce-683a-11e5-82ef-b77119365d34'
fab0fbf2-c52a-4e1c-ba9d-b4751b0036ca
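The two-step HTTP flow above can be chained in a small script. A sketch, with the HTTP `post` callable injected so the flow can be exercised without a running 'httpctl'; the assumption, matching the transcript above, is that both endpoints return the bare UUID:

```python
def run_testrun(post, definition_path, base="http://localhost:9014"):
    """Create a testrun ruleset from a JSON definition file, then start it.

    post -- callable(url, data=None) -> response body as a string
            (e.g. a thin wrapper around urllib.request; assumed here)

    Returns (ruleset UUID, testrun result UUID).
    """
    with open(definition_path, "rb") as f:
        ruleset_uuid = post(f"{base}/create-testrun", data=f.read()).strip()
    result_uuid = post(f"{base}/start-testrun/{ruleset_uuid}").strip()
    return ruleset_uuid, result_uuid
```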
## What are the provided components and what do they do?
What follows are short descriptions of some of the components that make up the EIII suite.
- **Databus** -- The central hub of the checker suite. It orchestrates the work to be done and is backed by the PostgreSQL database.
- **httpctl** -- Provides an HTTP interface to the bus.
- **checkerctl** -- Provides a command-line interface to the bus.
- **EIII Crawler** -- This component performs the crawling of sites.
- **Sampler** -- Given a set of sampler rules and tuples (Content-Type,URL), this component selects which URLs to check.
- **WAM** -- Also known as webpage-wam, or simply “checker”, this is the component performing the actual accessibility checks.
- **{py-,hs-}TTRPC** -- The RPC implementation used for communication across the bus.
- **Supervisor** -- A process control system used to start and supervise the suite's processes.
- **Selenium and PhantomJS** -- To download web pages, the WAM uses the headless browser PhantomJS. The Selenium server is used to serve up multiple PhantomJS instances.
## Notes on working with sandboxes
### Haskell (cabal sandbox)
If you make changes to a Haskell component you can reinstall it by using
`cabal install`. For instance, if you've modified 'databus', then reinstall it
using `cabal install ./databus`.
### Python (virtualenv)
If you make changes to a Python component you can reinstall it by using
`pip install --upgrade` after having activated the virtualenv.
For instance, if you've modified 'eiii-crawler', then
reinstall it by issuing the following two commands
source .python-sandbox/bin/activate
pip install --upgrade ./eiii-crawler
## Licensing
The source code published here is subject to the BSD3 license. For more info, see [LICENSE](LICENSE).
#!/usr/bin/env sh
export PATH="${PWD}/.cabal-sandbox/bin:${PATH}"
. .python-sandbox/bin/activate
. ./checker-suite.conf
exec supervisord -c supervisor.ini
# Configuration file for checker-suite.
# … actually it's mostly a bunch of shell `export` statements.
# use it like `. checker-suite.conf` in a script.
export psql_connection_string="postgresql:///eiii"
export databus_url="tcp://127.0.0.1:9000"
export crawler_controller_url="tcp://127.0.0.1:9001"
export sampler_controller_url="tcp://127.0.0.1:9002"
export webpage_wam_controller_url="tcp://127.0.0.1:9003"
export httpctl_port=9014
export httpctl_url="http://127.0.0.1:$httpctl_port"
export httpctl_ttrpc_url="tcp://127.0.0.1:9005"
export wam_addr='127.0.0.1'
export wam_max_memory=1200MB
# These cannot be interpolated in supervisor.ini; you need to edit them in
# that file.
# export wam_start_port=9501
# export wam_count=1
export crawler_addr='127.0.0.1'
# These cannot be interpolated in supervisor.ini; you need to edit them in
# that file.
# export crawler_start_port=9501
# export crawler_count=1
export selenium_server_port=4444
export selenium_server_url="127.0.0.1:$selenium_server_port"
export accountability_proxy_cachetime=600
export email_address=""
# export accountability_proxy_url='tcp://127.0.0.1:9010'
#!/usr/bin/env sh
. ./checker-suite.conf
.cabal-sandbox/bin/checkerctl $databus_url "$@"
Subproject commit f73c9fad22e742c704c167b9f1e6a6ef27ec01c6
Subproject commit e2c6522e10b200cbf406ad48e88b4912ddee30c6
name: Tingtun Test
sites:
- http://tingtun.no
- http://eiii.eu
content-types:
- text/html
- application/xhtml+xml
- application/xml
tools:
- webpage-wam
default: &def
# Site name
-
# Crawler rules
- &defcr
# Seeds and scoping rules are initialized automatically from the
# domain name
loglevel: info
seeds: []
scoping-rules: []
obey-robotstxt: True
min-crawl-delay: 2
max-pages:
- - &html
["text/html", "application/xml", "application/xhtml+xml"]
- 50
size-limits:
- - *html
- 2097152 # 2 MiB
# Sampler rules
- &defsr
max-pages:
- - *html
- 600
# Checker rules
- &defchr
tools:
- - *html
- ["webpage-wam"]
http://tingtun.no:
- Tingtun
- *defcr
- *defsr
- *defchr
http://eiii.eu: *def
{
"testrun-rules": {
"content-types": [
"text/html",
"application/xhtml+xml",
"application/xml"
],
"name": null,
"sites": [
"http://www.tingtun.no",
"http://eiii.eu"
],
"tools": [
"webpage-wam"
]
},
"owner": null,
"site-rules": {
"http://eiii.eu": [
"EIII Project",
{
"loglevel": "debug",
"max-pages": [
[
[
"text/html",
"application/xml",
"application/xhtml+xml"
],
60
]
],
"min-crawl-delay": 1,
"obey-robotstxt": true,
"scoping-rules": [],
"seeds": [
"http://eiii.eu"
],
"size-limits": [
[
[
"text/html",
"application/xml",
"application/xhtml+xml"
],
2097152
]
]
},
{
"max-pages": [
[
[
"text/html",
"application/xml",
"application/xhtml+xml"
],
600
]
]
},
{
"tools": [
[
[
"text/html",
"application/xml",
"application/xhtml+xml"
],
[
"webpage-wam"
]
]
]
}
],
"http://www.tingtun.no": [
null,
{
"loglevel": "debug",
"max-pages": [
[
[
"text/html",
"application/xml",
"application/xhtml+xml"
],
60
]
],
"min-crawl-delay": 1,
"obey-robotstxt": true,
"scoping-rules": [],
"seeds": [
"http://www.tingtun.no"
],
"size-limits": [
[
[
"text/html",
"application/xml",
"application/xhtml+xml"
],
2097152
]
]
},
{
"max-pages": [
[
[
"text/html",
"application/xml",
"application/xhtml+xml"
],
600
]
]
},
{
"tools": [
[
[
"text/html",
"application/xml",
"application/xhtml+xml"
],
[
"webpage-wam"
]
]
]
}
]
}
}
Subproject commit 2d4dce2e2c981992f774c48bcb00d4918d6efed3
Subproject commit 35a94ed9199926c567e3b743c90813605931d8ed
Subproject commit 2d3960fd78111a987918e5eb9608f62d589ac36f
Subproject commit 9aef49e56ecdea1983588e5a97d3195b1f9fcffd
Subproject commit 601444e3ee0846e1336e41fae8a99d2dff665984
Subproject commit 2bc7602532416350899e1bcc502ef740e502df83
# see "man logrotate" for details
# rotate log files daily
daily
# Rotate if size of log file is > 256M
maxsize 256M
# keep 4 weeks worth of backlogs
rotate 28
# copy logs, then truncate the old file
copytruncate
# uncomment this if you want your log files compressed
compress
# don't touch empty files
notifempty
# add date to the filename of rotated files
dateext
# date format: -YYYY-MM-DD
dateformat -%Y-%m-%d
"logs/*.log" {
}
Subproject commit 44ace4c1b853a42a4ffcb387f905eccc832e874d
Subproject commit 7afc61bf39d53e6b6bc0e02caf51611e50c3ceff
#!/bin/bash
# Download selenium server jar file.
selenium_version=2.47
selenium_minor_version=1
file_name=selenium-server-standalone-${selenium_version}.${selenium_minor_version}.jar
wget -N http://selenium-release.storage.googleapis.com/$selenium_version/$file_name
ln -sf $file_name selenium-server-standalone.jar
[inet_http_server]
port=127.0.0.1:8999
[supervisord]
logfile=%(here)s/logs/supervisord.log
loglevel=info
nodaemon=true
childlogdir=logs
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=http://127.0.0.1:8999
[program:master]
command=master
--listen-address %(ENV_databus_url)s
--psql-connection-string %(ENV_psql_connection_string)s
+RTS
-N2
directory=%(here)s
priority=1 ; the relative start priority (default 999)
redirect_stderr=true
stdout_logfile=logs/master.stdout.log ; stdout log path, NONE for none; default AUTO
autorestart=true
stopsignal=INT
[program:crawler-controller]
command=crawler-controller
--master-address %(ENV_databus_url)s
--listen-address %(ENV_crawler_controller_url)s
--psql-connection-string %(ENV_psql_connection_string)s
+RTS
-N2
directory=%(here)s
priority=200 ; the relative start priority (default 999)
redirect_stderr=true ; redirect proc stderr to stdout (default false)
stdout_logfile=logs/%(program_name)s.stdout.log ; stdout log path, NONE for none; default AUTO
numprocs = 1
numprocs_start = 1
autostart=true
autorestart=true
stopsignal=INT
[program:sampler-controller]
command=sampler-controller
--master-address %(ENV_databus_url)s