Check for the existence of /favicon.ico and rate it the same as an icon linked in the HTML head (#115)
* Fix full JSON export
* Update ignore list
* Update README
* Check for /favicon.ico and rate it as icon available
* Remove broken cookies test
parent 9e5426ccde
commit 04a1e98b79
@@ -1,10 +1,8 @@
 venv
 cache
-webapp/node_modules
 secrets
 temp
 __pycache__
 .vscode/settings.json
-webapp/dist/bundle.js
-dev-shm
-/export-*
+kubernetes/green-spider-secret.yaml
+/volumes
README.md
@@ -6,13 +6,13 @@ For the evaluation results see: [https://green-spider.netzbegruenung.de/](https://green-spider.n

 ## Tools

-- Spider: Collects information about websites of B90/GRÜNE chapters
-- Screenshotter: Creates page screenshots. See [netzbegruenung/green-spider-screenshotter](https://github.com/netzbegruenung/green-spider-screenshotter/)
-- Webapp: Displays the spider results. See [netzbegruenung/green-spider-webapp](https://github.com/netzbegruenung/green-spider-webapp/)
-- Indexer: Loads result data into Elasticsearch. See [netzbegruenung/green-spider-indexer](https://github.com/netzbegruenung/green-spider-indexer)
+- **Spider:** Collects information about websites of B90/GRÜNE chapters
+- **Screenshotter:** Creates page screenshots. See [netzbegruenung/green-spider-screenshotter](https://github.com/netzbegruenung/green-spider-screenshotter/)
+- **Webapp:** Displays the spider results. See [netzbegruenung/green-spider-webapp](https://github.com/netzbegruenung/green-spider-webapp/). This includes
+  - **API**: [netzbegruenung/green-spider-api](https://github.com/netzbegruenung/green-spider-api)
+  - **Elasticsearch**
+- **Indexer:** Loads result data into Elasticsearch. See [netzbegruenung/green-spider-indexer](https://github.com/netzbegruenung/green-spider-indexer)
+- **Analysis**: R project for evaluating the results. See [netzbegruenung/green-spider-analysis](https://github.com/netzbegruenung/green-spider-analysis)

 ## Activities

@@ -24,40 +24,37 @@ Green Spider is a project of the [netzbegrünung](https://blog.netzbegruenung.de

 Communication happens via the Chatbegrünung channel [#green-spider](https://chatbegruenung.de/channel/green-spider) and the [Issues](https://github.com/netzbegruenung/green-spider/issues) in this repository.

-## Guide
+## Operations
+
+All information on running Green Spider in production is in the [devops](https://github.com/netzbegruenung/green-spider/tree/master/devops) directory.
+
+## Development
+
+Green Spider is written in Python 3 and is currently tested and run under Python 3.6.
+
+Because of the many dependencies, it is recommended to run the spider code locally in
+Docker.
+
+The image is built with the following command:
+
+```nohighlight
+make
+```
+
+The first run takes a while, because some Python modules require various libraries to be compiled.
+After the first successful run, a new invocation of `make` only takes a few seconds.
+
+### Running tests
+
+In short: `make test`

 ### Running the spider

-To run the spider on a server, see the [devops](https://github.com/netzbegruenung/green-spider/tree/master/devops) directory.
-Prerequisites for running locally:
-
-- Docker
-- A key with write access to the results database
-
-To spider all sites from [netzbegruenung/green-directory](https://github.com/netzbegruenung/green-directory):
+The spider can process individual URLs without writing the results to a database.
+The easiest way to do this is via the `make spider` command, like this:

 ```nohighlight
-make spiderjobs
-make spider
+make spider ARGS="--url http://www.example.com/"
 ```

-Alternatively, as shown in the example below, spidering of a single URL can be triggered. The URL does not necessarily have to be part of the `green-directory`.
+Called without `ARGS`, the spider works through a job list. This requires access to the corresponding database.

-```nohighlight
-docker run --rm -ti \
--v $PWD/secrets:/secrets
-quay.io/netzbegruenung/green-spider:latest \
---credentials-path /secrets/datastore-writer.json \
-jobs --url https://www.trittin.de/
-
-make spider
-```
-
-### Creating screenshots
-
-See the [devops](https://github.com/netzbegruenung/green-spider/tree/master/devops) directory.
-
-### Deploying the webapp
-
-See the [devops](https://github.com/netzbegruenung/green-spider/tree/master/devops) directory.

@@ -5,21 +5,22 @@ functionality of a site or individual pages.

 import logging

-from checks import charset
 from checks import certificate
+from checks import charset
 from checks import dns_resolution
-from checks import duplicate_content
 from checks import domain_variations
+from checks import duplicate_content
 from checks import frameset
 from checks import generator
 from checks import html_head
 from checks import http_and_https
 from checks import hyperlinks
-from checks import page_content
+from checks import load_favicons
 from checks import load_feeds
 from checks import load_in_browser
-from checks import url_reachability
+from checks import page_content
 from checks import url_canonicalization
+from checks import url_reachability

 from checks.config import Config

|
@ -46,6 +47,7 @@ def perform_checks(input_url):
|
||||||
('frameset', frameset),
|
('frameset', frameset),
|
||||||
('hyperlinks', hyperlinks),
|
('hyperlinks', hyperlinks),
|
||||||
('generator', generator),
|
('generator', generator),
|
||||||
|
('load_favicons', load_favicons),
|
||||||
('load_feeds', load_feeds),
|
('load_feeds', load_feeds),
|
||||||
('load_in_browser', load_in_browser),
|
('load_in_browser', load_in_browser),
|
||||||
]
|
]
|
||||||
|
|
|
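The hunk above only shows how the new check is registered. For orientation, here is a hedged sketch (not the actual `perform_checks` body) of how such a `(name, module)` registry is typically consumed, assuming each checks module exposes a `Checker` class with a `run()` method, as `checks.load_favicons` does; the variable name `check_modules` and the function name are illustrative only.

```python
# Hedged sketch, not the code from this commit: consuming the (name, module)
# registry, assuming each checks module exposes Checker(config, previous_results)
# with a run() method, as checks.load_favicons does.
from checks import load_favicons, load_feeds  # plus the other modules listed above
from checks.config import Config

def perform_checks_sketch(input_url):
    check_modules = [
        ('load_favicons', load_favicons),
        ('load_feeds', load_feeds),
    ]
    results = {}
    config = Config(urls=[input_url])
    for check_name, check_module in check_modules:
        # Each checker receives the results collected so far as previous_results.
        checker = check_module.Checker(config=config, previous_results=results)
        results[check_name] = checker.run()
    return results
```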
@@ -0,0 +1,35 @@
+"""
+Loads /favicon if no icon has been found otherwise
+"""
+
+import logging
+from time import mktime
+from datetime import datetime
+from urllib.parse import urlparse
+
+import requests
+
+from checks.abstract_checker import AbstractChecker
+
+class Checker(AbstractChecker):
+    def __init__(self, config, previous_results=None):
+        super().__init__(config, previous_results)
+        self.favicons = {}
+
+    def run(self):
+        for url in self.config.urls:
+            self.load_favicon(url)
+
+        return self.favicons
+
+    def load_favicon(self, url):
+        """
+        This loads /favicon.ico for the site's URL
+        """
+        parsed = urlparse(url)
+        ico_url = parsed.scheme + "://" + parsed.hostname + "/favicon.ico"
+        r = requests.head(ico_url)
+        if r.status_code == 200:
+            self.favicons[url] = {
+                'url': ico_url,
+            }
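The new checker records a favicon only when the HEAD request returns status 200; an unreachable host or a URL without a hostname would currently raise inside `requests`. A hedged sketch of a more defensive variant follows — it is not part of this commit, and the timeout value is an assumption.

```python
# Hedged variation on load_favicon(), not part of this commit: adds a timeout
# and swallows request errors so a single unreachable host cannot abort the run.
from urllib.parse import urlparse
import logging

import requests

def load_favicon_defensive(url, favicons, timeout=10):
    parsed = urlparse(url)
    if parsed.hostname is None:
        return
    ico_url = parsed.scheme + "://" + parsed.hostname + "/favicon.ico"
    try:
        r = requests.head(ico_url, timeout=timeout)
    except requests.RequestException as exc:
        logging.warning("favicon HEAD request for %s failed: %s", ico_url, exc)
        return
    if r.status_code == 200:
        favicons[url] = {'url': ico_url}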
@@ -0,0 +1,43 @@
+from pprint import pprint
+
+import httpretty
+from httpretty import httprettified
+import unittest
+
+from checks import load_favicons
+from checks.config import Config
+
+@httprettified
+class TestFavicons(unittest.TestCase):
+
+    def test_favicons(self):
+        # This site has a favicon
+        url1 = 'http://example1.com/favicon.ico'
+        httpretty.register_uri(httpretty.HEAD, url1,
+                               body='',
+                               adding_headers={
+                                   "Content-type": "image/x-ico",
+                               })
+
+        # This site has no favicon
+        url2 = 'http://example2.com/favicon.ico'
+        httpretty.register_uri(httpretty.HEAD, url2,
+                               status=404,
+                               body='Not found',
+                               adding_headers={
+                                   "Content-type": "text/plain",
+                               })
+
+
+        config = Config(urls=['http://example1.com/path/', 'http://example2.com/'])
+        checker = load_favicons.Checker(config=config)
+
+        result = checker.run()
+        pprint(result)
+
+        self.assertEqual(result, {
+            'http://example1.com/path/': {
+                'url': 'http://example1.com/favicon.ico'
+            }
+        })
+
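The test above mocks both HEAD requests with httpretty, so it needs no network access. Per the updated README, the whole suite runs with:

```nohighlight
make test
```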
@@ -26,21 +26,5 @@ class TestLoadInBrowser(unittest.TestCase):
         self.assertEqual(result[url]['font_families'], ['"times new roman"'])

-
-    def test_cookies(self):
-        """Loads a page that sets cookies"""
-        url = 'https://httpbin.org/cookies/set/cookiename/cookievalue'
-        config = Config(urls=[url])
-        checker = load_in_browser.Checker(config=config, previous_results={})
-        result = checker.run()
-
-        self.assertEqual(result[url]['cookies'], [{
-            'domain': 'httpbin.org',
-            'httpOnly': False,
-            'name': 'cookiename',
-            'path': '/',
-            'secure': False,
-            'value': 'cookievalue'
-        }])
-
 if __name__ == '__main__':
     unittest.main()

@@ -2,15 +2,27 @@
 Exports data from the database to JSON files for use in a static webapp
 """

-from hashlib import md5
-import json
+import datetime
 import logging
 import sys
 import os
+from hashlib import md5
+
+import json
 import requests


+class DateTimeEncoder(json.JSONEncoder):
+    def default(self, obj):
+        if isinstance(obj, datetime.datetime):
+            return obj.isoformat()
+        elif isinstance(obj, datetime.date):
+            return obj.isoformat()
+        elif isinstance(obj, datetime.timedelta):
+            return (datetime.datetime.min + obj).time().isoformat()
+        else:
+            return super(DateTimeEncoder, self).default(obj)
+
 def export_results(client, entity_kind):
     """
     Export of the main results data
@@ -31,6 +43,6 @@ def export_results(client, entity_kind):
             'score': entity.get('score'),
         })

-    output_filename = "spider_result.json"
+    output_filename = "/json-export/spider_result.json"
     with open(output_filename, 'w', encoding="utf8") as jsonfile:
-        json.dump(out, jsonfile, indent=2, sort_keys=True, ensure_ascii=False)
+        json.dump(out, jsonfile, indent=2, sort_keys=True, ensure_ascii=False, cls=DateTimeEncoder)

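The added `cls=DateTimeEncoder` argument is what fixes the full JSON export: entities containing datetime, date, or timedelta values are now serialized as ISO strings instead of raising a TypeError. A minimal usage sketch, with made-up sample data:

```python
# Minimal usage sketch of the DateTimeEncoder added above; the sample record is made up.
import datetime
import json

class DateTimeEncoder(json.JSONEncoder):
    # Same logic as in the diff above.
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        elif isinstance(obj, datetime.date):
            return obj.isoformat()
        elif isinstance(obj, datetime.timedelta):
            return (datetime.datetime.min + obj).time().isoformat()
        else:
            return super(DateTimeEncoder, self).default(obj)

record = {
    'created': datetime.datetime(2019, 5, 14, 12, 30, 0),
    'duration': datetime.timedelta(seconds=90),
    'score': 10.5,
}

# Without cls=DateTimeEncoder, json.dumps would raise a TypeError for the datetime values.
print(json.dumps(record, indent=2, sort_keys=True, cls=DateTimeEncoder))
# 'created' -> "2019-05-14T12:30:00", 'duration' -> "00:01:30"
```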
@@ -8,7 +8,7 @@ class Rater(AbstractRater):

     rating_type = 'boolean'
     default_value = False
-    depends_on_checks = ['html_head']
+    depends_on_checks = ['html_head', 'load_favicons']
     max_score = 1

     def __init__(self, check_results):
@@ -23,6 +23,12 @@ class Rater(AbstractRater):
                 value = True
                 score = self.max_score
                 break
+
+            # /favicon.ico as fall back
+            if url in self.check_results['load_favicons']:
+                value = True
+                score = self.max_score
+                break

         return {
             'type': self.rating_type,
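Only a fragment of the rater is visible in the diff. As a hedged sketch of how the fallback combines with the existing head-icon rating — the loop structure, the `html_head` lookup key, and the return fields other than `type` are assumptions; only the `load_favicons` fallback lines are taken from this commit:

```python
# Hedged sketch of the icon rating logic; the loop structure and the html_head
# lookup are assumptions, only the load_favicons fallback is from this commit.
def rate_icons_sketch(check_results, urls, max_score=1):
    value = False
    score = 0

    for url in urls:
        # Icon declared in the HTML head (pre-existing behaviour, assumed result shape).
        head = check_results.get('html_head', {}).get(url, {})
        if head.get('link_icon'):
            value = True
            score = max_score
            break

        # /favicon.ico as fall back (added in this commit).
        if url in check_results['load_favicons']:
            value = True
            score = max_score
            break

    return {
        'type': 'boolean',
        'value': value,
        'score': score,
        'max_score': max_score,
    }
```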