bfilter

Language: en

Version: January 2007 (debian - 07/07/09)

Section: 8 (Administrator commands)

NAME

BFilter - An ad-filtering web proxy using heuristic ad-detection algorithms

SYNOPSIS

bfilter [-c directory] [-C directory] [-r directory] [-u user] [-g group] [-n] [-p file] [-k] [-h] [-v]

DESCRIPTION

BFilter is a web proxy that uses effective heuristic ad-detection algorithms to remove banner adverts, popups and webbugs from web pages. The traditional blocklist-based approach is also implemented, but it is mostly used for dealing with false positives. Unlike other tools that require constant updates of their blocklists, bfilter manages to remove over 90% of adverts even with an empty blocklist.

All processing is done on the fly: bfilter doesn't load the whole page or image before processing it. It uses heuristic and regex-based approaches to detect adverts and webbugs. It also uses a JavaScript engine to combat JavaScript-generated adverts and popups.

The web proxy supports the following features:

o HTTP/0.9 - HTTP/1.1 support
o Persistent connections (HTTP/1.1 only)
o Pipelining (HTTP/1.1 only)
o HTTP compression
o Forwarding to another proxy

However, while it does support the CONNECT requests used for HTTPS, it does no filtering on those requests.

OPTIONS

-c, --confdir directory
Set custom config directory
-C, --cachedir directory
Set a directory to store cached external scripts.
-r, --chroot directory
Set chroot directory. The chroot directory must contain the config directory. If no config directory is specified, the chroot directory itself is used as the config directory.
-u, --user user
Set unprivileged user
-g, --group group
Set unprivileged group
-n, --nodaemon
Disable background daemon mode
-p, --pid file
Write process ID to a file
-k, --kill
Kill the running process specified with -p
-h, --help
Show help
-v, --version
Print version

FILES

The default configuration files for bfilter are located under the /etc/bfilter directory (and optionally under ~/.bfilter for the per-user GUI configuration).

For the base configuration, the config and (optionally) forwarding.xml files are used. For URL pattern matching, the urls and urls.local files are used. For content filtering, the filters/ directory may contain files specifying groups of filters and whether they are enabled.

PROXY CONFIGURATION

There are two configuration files: config.default, which is shipped with bfilter and is overwritten when upgrading, and config, which has a higher priority so it can override settings specified in config.default. The following parameters can be defined in these files.

The first section is the [global] section:

listen_address = host:port
The address and port to which the proxy binds. If host is unspecified, it binds to all interfaces. Multiple addresses separated by commas may be specified.

client_compression = yes | no
If set to yes, all the textual data with "Content-Type: text/*" will be compressed before sending it to the client. This option can be useful if you are on a slow connection and you set up bfilter somewhere on a fast connection. In other cases, setting this option to yes will just introduce additional latency to the loading process.

ad_border = rrggbb | none
The default behaviour is to draw borders around removed adverts. You may want to change the border color or turn the borders off.

page_cleanup = off | safe | maximum
Allows removing of adverts completely, as opposed to substituting them with a clickable replacement image. When set to "maximum" it worsens ad detection accuracy, which may result in both false-positives and false-negatives. When set to "safe" it doesn't worsen ad detection accuracy, but some ads won't be removed. For such ads, you can still use "ad_border = none" to make them invisible but still clickable.

try_icon_animation = yes | no
Enable or disable the tray icon animation that indicates traffic is passing through bfilter (GUI only).

max_script_fetch_size = size_in_kilobytes
Limits the size of external scripts that bfilter fetches. Large scripts are not likely to be used to serve ads.

max_script_eval_size = size_in_kilobytes
Protection against compressed scripts decompressing to very large sizes.

max_script_nest_level = number
Limits the nesting level of scripts. The reasoning is the same as for max_script_fetch_size. A smaller value like 3 will make bfilter faster, while a bigger value like 9 will make it detect more ads. The author has never seen an ad that is generated at levels higher than 6.

save_traffic_threshold = size_in_kilobytes
Sometimes bfilter needs to download an image or a flash file to determine if it's an advert or not. Since bfilter tries to do everything on the fly, it usually knows the answer before the whole file is downloaded. At that time it checks how much data is left to be downloaded and if it's more than the value of this parameter (or if the size is unknown), bfilter will drop the connection to the server in order to save some traffic. The default value of 15 is good for most people, but if you use a dialup or a GPRS connection you may want to lower it to maybe 8 and if you use a satellite connection you may want to raise it to maybe 40.
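
The check described above can be pictured with a minimal Python sketch. The function name and signature are hypothetical and only illustrate the rule as worded in this page; they are not taken from bfilter's source.

```python
from typing import Optional


def should_drop_connection(remaining_kb: Optional[float],
                           threshold_kb: float = 15) -> bool:
    """Illustrate the save_traffic_threshold rule (hypothetical helper).

    remaining_kb is the amount of data still to be downloaded at the
    moment the analyzer has its answer, or None if the size is unknown.
    """
    if remaining_kb is None:
        # Unknown size: the page says the connection is dropped.
        return True
    # Drop only if more than the threshold is still left to download.
    return remaining_kb > threshold_kb
```

With the default of 15, a file with 40 KB left would be cut off, while one with 10 KB left would be allowed to finish.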

report_client_ip = yes | no | fixed_ip
Enable reporting the client IP to servers using the X-Forwarded-For header.

allowed_tunnel_ports = port1, port2, from..to
Specifies the ports allowed for CONNECT requests. Ports allowed by default are 443 and 563, which are used for https and nntps respectively.
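
As an illustration of the "port1, port2, from..to" syntax, the following Python sketch expands such a value into a set of port numbers. The helper name is hypothetical, and treating both ends of a from..to range as included is an assumption, since this page does not say so explicitly.

```python
def parse_tunnel_ports(spec: str) -> set[int]:
    """Expand an allowed_tunnel_ports value into a set of ports.

    Assumption: a "from..to" range includes both of its end points.
    """
    ports: set[int] = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        if ".." in item:
            low, high = item.split("..", 1)
            ports.update(range(int(low), int(high) + 1))
        else:
            ports.add(int(item))
    return ports


# The documented defaults: https and nntps.
default_ports = parse_tunnel_ports("443, 563")
```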

cache_size = size_in_megabytes
BFilter can cache external scripts that it fetches for analyzing. This parameter sets the cache size (in megabytes). To disable caching, set cache_size to 0.
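
The way config takes priority over config.default can be pictured with the Python sketch below. It uses Python's configparser purely as a stand-in (bfilter has its own parser), and every value shown is a made-up example rather than a shipped default.

```python
import configparser

# Hypothetical contents of config.default (values are examples only).
CONFIG_DEFAULT = """\
[global]
listen_address = 127.0.0.1:8080
ad_border = 674f69
page_cleanup = safe
cache_size = 50
"""

# Hypothetical contents of the local config file.
CONFIG = """\
[global]
ad_border = none
cache_size = 0
"""


def effective_settings(default_text: str, override_text: str) -> dict:
    """Read the defaults first, then let the local file override them."""
    parser = configparser.ConfigParser()
    parser.read_string(default_text)
    parser.read_string(override_text)  # later values replace earlier ones
    return dict(parser["global"])


settings = effective_settings(CONFIG_DEFAULT, CONFIG)
print(settings)
```

Parameters that config does not mention keep the value from config.default, which is why local changes survive an upgrade that overwrites config.default.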

The second section is the [forwarding] section:

use_proxy = yes | no
When use_proxy is set to yes, you may specify a proxy for bfilter to forward requests to.

proxy_type = http | socks4 | socks4a | socks5
proxy_host = host
proxy_port = port
proxy_user = username
proxy_pass = password
Upstream proxy authentication is currently implemented only for SOCKS proxies. Note that socks4 and socks4a don't use a password, just the username.

no_proxy_for = host, host, host
When use_proxy is set to yes, you may specify some hosts to be contacted directly. The separator may be either a comma or a semicolon. If a host starts or ends with a dot, it is assumed that any prefix or suffix can be appended to it, for example "no_proxy_for = .mydomain.com, 192.168.". Note however that .mydomain.com won't cover mydomain.com itself but only its subdomains. When matching no_proxy_for hosts, no DNS queries are made, which means that 127.0.0.1 won't act as localhost or the other way around.
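
The matching rules for no_proxy_for can be sketched in Python as follows. Both function names are hypothetical, and the comparison is done on the strings exactly as given; whether bfilter normalizes letter case here is not stated in this page.

```python
import re


def parse_no_proxy(value: str) -> list[str]:
    """Split a no_proxy_for value on commas or semicolons."""
    return [h.strip() for h in re.split(r"[,;]", value) if h.strip()]


def contacted_directly(host: str, patterns: list[str]) -> bool:
    """Return True if host matches one of the no_proxy_for entries.

    Purely textual comparison: no DNS lookups are performed.
    """
    for pattern in patterns:
        if pattern.startswith(".") and host.endswith(pattern):
            return True  # leading dot: any prefix may be prepended
        if pattern.endswith(".") and host.startswith(pattern):
            return True  # trailing dot: any suffix may be appended
        if host == pattern:
            return True
    return False


patterns = parse_no_proxy(".mydomain.com; 192.168.")
```

Here www.mydomain.com and 192.168.1.10 match, while mydomain.com does not, because it lacks the leading dot that the first entry requires.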

URL PATTERNS

BFilter allows you to block an arbitrary URL (web address) and to assign hints to URLs in order to influence the heuristic analyzer. To do so, you assign a tag to a URL, which allows both blocking and hinting (and more).

There are two configuration files: urls, which is shipped with bfilter and is overwritten when upgrading, and urls.local, which has a higher priority so it can override rules specified in the urls file.

These files specify a number of rules. Each rule has the following syntax:

TAG url_pattern

Where TAG can be one of the following:

HTML      Output a blank page.
IMAGE     Output a transparent image.
FLASH     Output a blank Flash file.
JS        Output an empty JavaScript file.
AD        Output appropriate blank or transparent content.
FORBID    Output an error page.
ALLOW     Cancel any of the above tags.
NOFILTER  Don't filter a page or a script.
+++       Be more suspicious about the URL (any number of plus signs).
---       Be less suspicious about the URL (any number of minus signs).
+N -N     Alternative syntax for the above two (where N is a number).

The last three tags are special. They provide a hint to the heuristic analyzer and are only considered when we already have an ad suspect. For example, if we have a clickable image on a page, we are going to consider hints for:

o The image URL.
o The link URL.
o The page URL.

Sometimes an advert can't be blocked with hints. This can happen if bfilter doesn't see it (probably because of a problem interpreting a script) or doesn't support that kind of advert (text or hover adverts). In that case you may still block it using other tags. Note that hints don't interact with other tags: when we are looking for a hint we don't consider other tags, and vice versa.
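
To make the rule syntax and the two hint notations concrete, here is a minimal Python sketch that splits a rule line and turns a hint tag into a signed number. The helper names are hypothetical, and reading "+++" as equivalent to "+3" is an interpretation of the tag list above, not something taken from bfilter's source. How bfilter combines the hints for the image, link and page URLs is not described in this page and is not modelled here.

```python
import re
from typing import Optional, Tuple

ACTION_TAGS = {"HTML", "IMAGE", "FLASH", "JS", "AD",
               "FORBID", "ALLOW", "NOFILTER"}


def hint_value(tag: str) -> Optional[int]:
    """Return the signed weight of a hint tag, or None for other tags."""
    if re.fullmatch(r"\++", tag):
        return len(tag)          # "+++" -> 3
    if re.fullmatch(r"-+", tag):
        return -len(tag)         # "--" -> -2
    if re.fullmatch(r"[+-]\d+", tag):
        return int(tag)          # "+5" -> 5, "-4" -> -4
    return None


def parse_rule(line: str) -> Tuple[str, str]:
    """Split a "TAG url_pattern" line into its two parts."""
    tag, pattern = line.split(None, 1)
    if tag not in ACTION_TAGS and hint_value(tag) is None:
        raise ValueError(f"unknown tag: {tag}")
    return tag, pattern.strip()
```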

BFilter supports two types of patterns:

o Simple strings with wildcards.
o Regular expressions.

The simple string wildcards are ? and *, meaning respectively "any character" and "any number of any characters". For example:

FORBID http://ads.somehost.com/*

This will block any URL starting with "http://ads.somehost.com/". Note that for broad ad-blocking patterns like this, it is recommended to use IMAGE rather than FORBID. This sounds wrong as we don't exactly know the type of the object we are going to replace with an image, but it turns out that IMAGE produces better results than any other tag. Any other tag results in broken images and FORBID will additionally cause error pages in place of IFRAME ads. Browsers accept an image where html was expected just fine and are even smart enough not to interpret an image where a script was expected.
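
The meaning of the ? and * wildcards can be illustrated by translating a simple pattern into a regular expression, as in the Python sketch below. The function names are hypothetical. The sketch assumes that a simple pattern must cover the whole URL (which is why a trailing * is needed in the example above) and applies the case insensitivity that this page states for all patterns.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Translate a simple wildcard pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any number of any characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def wildcard_matches(pattern: str, url: str) -> bool:
    """Assumption: the pattern has to match the URL from start to end."""
    return wildcard_to_regex(pattern).fullmatch(url) is not None
```

Note that the dots in a simple pattern are literal, unlike in a regular expression pattern.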

Regular expression patterns must be enclosed within two slashes. For example:

JS /http://(www\.)?somehost\.com/ads/.*\.js/

This regex can be interpreted like this: match "http://", optionally match "www.", match "somehost.com/ads/", match any number of any characters, then match ".js".

As a quick summary, in regular expressions:

. means any character
\. means the "." character
\? means the "?" character
.* means any number of any characters including none
(this|that) means "this" or "that"
(something)? means "something" or nothing

You may find a tutorial and a complete reference on regular expressions at http://www.regular-expressions.info.

Note that both simple and regex patterns are case insensitive.
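
The difference between . and \. in the summary above is easy to get wrong, so here is a short Python sketch comparing an escaped and an unescaped version of the same pattern. Python's re module is used only for illustration; bfilter's regex engine may differ in details, and matching against the whole URL is an assumption.

```python
import re

# Dots escaped: they only match a literal "." character.
ESCAPED = r"http://(www\.)?somehost\.com/ads/.*\.js"
# Dots left unescaped: each one matches any character.
UNESCAPED = r"http://(www.)?somehost.com/ads/.*.js"


def url_matches(pattern: str, url: str) -> bool:
    """Case-insensitive match of the pattern against the whole URL."""
    return re.fullmatch(pattern, url, re.IGNORECASE) is not None


intended = "http://www.somehost.com/ads/top/banner.js"
lookalike = "http://wwwxsomehostxcom/ads/abcjs"

print(url_matches(ESCAPED, intended))     # the intended URL matches
print(url_matches(ESCAPED, lookalike))    # rejected by the escaped pattern
print(url_matches(UNESCAPED, lookalike))  # accepted by the unescaped one
```

The unescaped pattern still matches the intended URL, but it also accepts lookalike strings, which is why literal dots are best written as \. in regex rules.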

CONTENT FILTERS

BFilter allows you to apply regular expressions to page content. This can be used for things like removing portions of a page, altering scripts or injecting your own scripts. There are a couple of things that make bfilter's implementation of this feature unique:

o Applying a regex doesn't cause buffering of the whole page.
o Replacement expressions can contain JavaScript code.

Content filter configuration is not currently covered in this man page. Please view the bfilter web page at http://bfilter.sourceforge.net/doc/content-filters.php for further information.

EXAMPLES

Replace all images and Flash objects from known advert domains with a transparent GIF or an empty Flash file.

IMAGE /http://(.*\.)?(doubleclick|fastclick|tradedoubler)\..*/
FLASH /http://(.*\.)?(doubleclick|fastclick|tradedoubler)\..*/

Prevent hover adverts (DHTML pop-ups) from a known advert domain.

FORBID /http://([^/]+\.)?layer-ads\.de/.*/

Prevent tooltip adverts from known advert domains.

JS http://kona.kontera.com/javascript/*
FORBID /http://[^/]+\.intellitxt\.com/intellitxt/.*/

Allow images used to count page views for projects hosted on SourceForge.

ALLOW /(www\.)?sourceforge\.net/sflogo\.php\?.*/

Apply hints to suspicious URLs.

++++++ /http://ads[0-9]*\..*/
+++++ /.*/(ad[sv]?|advert|banners?)[^a-z].*/
++++ *banners*
+++ *banner*
+++ *click*

NOTES

If the HTML processor is in doubt about an image or a Flash file, it defers the decision until the browser has requested that file. The response is then analyzed (redirects, cookies) as well as the file itself. For an image, the analyzer checks its dimensions and whether it's animated or not. For a Flash file, the analyzer tries to find a button that covers most of the object's area and has a getURL action associated with it. Depending on the results, the object is either forwarded to the client or substituted with a generated replacement. (Unfortunately, analyzing objects that are placed with JavaScript doesn't work, as their URLs in JavaScript source cannot be altered.)

BUGS

Please report any bugs you may find to:

http://sourceforge.net/projects/bfilter

AUTHOR

Joseph Artsimovich <joseph_a@mail.ru>
http://bfilter.sourceforge.net

SEE ALSO

regex(7)