Roll your own search engine

linux howto docker searx docker-compose filtron

Searx is a metasearch engine that you can run yourself. If you do (please do), you should take some precautions to avoid getting blocked by the search engines it queries, such as Google.

Fortunately, the searx team has also developed filtron. Filtron acts as a proxy between the client and your searx installation and filters out malicious or abusive requests. To do so it needs a ruleset (rules.json):

[
    {
        "name": "search request",
        "filters": ["Param:q", "Path=^(/|/search)$"],
        "subrules": [
            {
                "name": "roboagent limit",
                "limit": 0,
                "filters": ["Header:User-Agent=(curl|cURL|Wget|python-requests|Scrapy|FeedFetcher|Go-http-client)"],
                "actions": [
                    {
                     "name": "block",
                     "params": {"message": "Rate limit exceeded"}}
                ]
            },
            {
                "name": "botlimit",
                "limit": 0,
                "stop": true,
                "filters": ["Header:User-Agent=(Googlebot|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT)"],
                "actions": [
                    {
                     "name": "block",
                     "params": {"message": "Rate limit exceeded"}}
                ]
            },
            {
                "name": "IP limit",
                "interval": 300,
                "limit": 256,
                "stop": true,
                "aggregations": ["Header:X-Forwarded-For"],
                "actions": [
                    {
                     "name": "block",
                     "params": {"message": "Rate limit exceeded, try again later."}}
                ]
            },
            {
                "name": "rss/json limit",
                "interval": 600,
                "limit": 4,
                "stop": true,
                "filters": ["Param:format=(csv|json|rss)"],
                "actions": [
                    {
                     "name": "block",
                     "params": {"message": "Rate limit exceeded, try again later."}}
                ]
            },
            {
                "name": "useragent limit",
                "interval": 300,
                "limit": 128,
                "aggregations": ["Header:User-Agent"],
                "actions": [
                    {
                     "name": "block",
                     "params": {"message": "Rate limit exceeded, try again later."}}
                ]
            }
        ]
    }
]

The first part of the ruleset (name: search request) ensures that only search queries are handled. The real filtering is done in the subrules section. The rule roboagent limit completely blocks tools like curl and wget (limit: 0), as they are often used to automate requests and can cause a huge load. The same goes for bots, so you should block them as well (botlimit). The rule IP limit restricts each IP address (taken from the X-Forwarded-For header) to 256 queries in 300 seconds. If a limit is exceeded, the client gets the message "Rate limit exceeded, try again later." The rule rss/json limit restricts access to the csv, json and rss versions of query results to 4 requests in 600 seconds. The rule useragent limit works like IP limit, but counts per User-Agent header: 128 requests in 300 seconds.
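To get a feeling for which requests the filters select and how a limit/interval pair behaves, here is a small Python sketch. It is only a toy model and not filtron's code: it assumes that a filter regex matches anywhere in the header value, that counters are reset once per interval (a fixed window), and that the request which exceeds the limit is the first one blocked. The bot list is shortened, and the IP address is a documentation placeholder.

```python
import re
from collections import defaultdict

# Patterns copied from rules.json (bot list shortened for readability).
SEARCH_PATH = re.compile(r"^(/|/search)$")
ROBO_AGENTS = re.compile(
    r"(curl|cURL|Wget|python-requests|Scrapy|FeedFetcher|Go-http-client)"
)
BOTS = re.compile(r"(Googlebot|bingbot|Baiduspider|YandexBot|AhrefsBot)")


def blocked_outright(path, params, user_agent):
    """Return the name of the limit-0 subrule that matches, or None."""
    # Top-level rule: only requests with a q parameter on / or /search.
    if "q" not in params or not SEARCH_PATH.search(path):
        return None
    if ROBO_AGENTS.search(user_agent):
        return "roboagent limit"
    if BOTS.search(user_agent):
        return "botlimit"
    return None


class FixedWindowLimit:
    """Toy model of a limit/interval pair, counted per aggregation key."""

    def __init__(self, limit, interval):
        self.limit = limit
        self.interval = interval
        self.window_start = 0.0
        self.counts = defaultdict(int)

    def exceeded(self, key, now):
        # Assumption: all counters reset once the interval has elapsed.
        if now - self.window_start >= self.interval:
            self.window_start = now
            self.counts.clear()
        self.counts[key] += 1
        return self.counts[key] > self.limit


# "IP limit": 257 requests from one address, one per second.
ip_limit = FixedWindowLimit(limit=256, interval=300)
results = [ip_limit.exceeded("203.0.113.7", now=1000.0 + i) for i in range(257)]
print(results.count(True), "request(s) over the limit in this burst")
```

In this model only the last request of the burst is over the limit; a request from another address, or one arriving after the 300 seconds have passed, starts counting from one again.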

I'm running searx as a Docker container, so I do the same with filtron, of course. The Dockerfile looks like this:

FROM golang:alpine

ENV APP_PORT 8888
ENV RULES_FILE /etc/filtron/rules.json

RUN apk add --no-cache git
RUN go get github.com/asciimoo/filtron
RUN mkdir /etc/filtron
EXPOSE 4004 4005

ADD rules.json /etc/filtron/rules.json
ADD entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
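Because the Dockerfile above copies rules.json into the image, a typo in the ruleset would presumably only show up once the container starts. A quick check before building can save a rebuild. The following Python sketch is one way to do that; the checks are my own heuristics (check_rules and check_file are hypothetical helper names), not filtron's own validation.

```python
import json


def check_rules(rules, path="rules"):
    """Return a list of human-readable problems found in a rule list."""
    problems = []
    for index, rule in enumerate(rules):
        where = "{}[{}] {!r}".format(path, index, rule.get("name"))
        if "name" not in rule:
            problems.append(where + ": missing name")
        # Heuristic: a non-zero limit is only meaningful with an interval.
        if rule.get("limit", 0) > 0 and "interval" not in rule:
            problems.append(where + ": limit without interval")
        # Heuristic: a rule should either act itself or delegate to subrules.
        if not rule.get("actions") and not rule.get("subrules"):
            problems.append(where + ": neither actions nor subrules")
        problems.extend(check_rules(rule.get("subrules", []), where + ".subrules"))
    return problems


def check_file(filename):
    """Parse a rules file and check it; a syntax error raises ValueError."""
    with open(filename) as handle:
        return check_rules(json.load(handle))


# A shortened copy of the ruleset, so the sketch runs without any file.
SAMPLE = json.loads("""
[{"name": "search request",
  "filters": ["Param:q", "Path=^(/|/search)$"],
  "subrules": [{"name": "IP limit", "interval": 300, "limit": 256,
                "aggregations": ["Header:X-Forwarded-For"],
                "actions": [{"name": "block",
                             "params": {"message": "Rate limit exceeded, try again later."}}]}]}]
""")
print(check_rules(SAMPLE))
```

For the real file, call check_file("rules.json") in the directory that holds the filtron Dockerfile; an empty list means none of these checks complained.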

The searx Dockerfile itself looks as follows:

FROM alpine:3.5
MAINTAINER searx https://github.com/asciimoo/searx
LABEL description "A privacy-respecting, hackable metasearch engine."
ENV BASE_URL=False IMAGE_PROXY=True
EXPOSE 8888
WORKDIR /usr/local/searx
CMD ["/sbin/tini","--","/usr/local/searx/run.sh"]
RUN adduser -D -h /usr/local/searx -s /bin/sh searx searx \
 && echo '#!/bin/sh' >> run.sh \
 && echo 'sed -i "s|base_url : False|base_url : $BASE_URL|g" searx/settings.yml' >> run.sh \
 && echo 'sed -i "s/image_proxy : False/image_proxy : $IMAGE_PROXY/g" searx/settings.yml' >> run.sh \
 && echo 'sed -i "s/ultrasecretkey/$(openssl rand -hex 16)/g" searx/settings.yml' >> run.sh \
 && echo 'python searx/webapp.py' >> run.sh \
 && chmod +x run.sh
COPY requirements.txt ./requirements.txt
RUN echo "@commuedge http://nl.alpinelinux.org/alpine/edge/community" >> /etc/apk/repositories \
 && apk -U add \
    build-base \
    python \
    python-dev \
    py-pip \
    libxml2 \
    libxml2-dev \
    libxslt \
    libxslt-dev \
    libffi-dev \
    openssl \
    openssl-dev \
    ca-certificates \
    tini@commuedge \
 && pip install --no-cache -r requirements.txt \
 && apk del \
    build-base \
    python-dev \
    libffi-dev \
    openssl-dev \
    libxslt-dev \
    libxml2-dev \
    openssl-dev \
    ca-certificates \
 && rm -f /var/cache/apk/*
COPY . .
RUN chown -R searx:searx *
USER searx

RUN sed -i "s/127.0.0.1/0.0.0.0/g" searx/settings.yml

To build the images, run docker build . -t localhost:5000/filtron in the directory with the filtron Dockerfile and docker build . -t localhost:5000/searx in the directory with the searx Dockerfile. As you can see, I tag them (-t) for use with my local registry, which isn't required but sometimes handy. I push the newly built images to the registry by running docker push localhost:5000/filtron and docker push localhost:5000/searx.

To combine searx and filtron easily I use docker-compose, so we need a docker-compose.yml file:

version: '2'

services:
  searx:
    image: localhost:5000/searx
    depends_on:
      - filtron
    networks:
      - searx
    restart: always

  filtron:
    image: localhost:5000/filtron
    ports:
      - "127.0.0.1:8888:4004"
    environment:
      - APP_PORT=8888
      - RULES_FILE=/etc/filtron/rules.json
    networks:
      - searx
    restart: always

networks:
  searx:
    driver: bridge

As I have already built the images and pushed them to my registry, I simply pull them from there (image: localhost:5000/*). If you don't want to use a local registry, replace each image: line with a build: line pointing to the directory that contains the respective Dockerfile (build: /path/to/your/searx or build: /path/to/your/filtron).

Create and start your containers by executing docker-compose up or docker-compose create && docker-compose start. Afterwards you should be able to access filtron at 127.0.0.1:8888.
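To check that the stack answers and that the ruleset is active, you can send one request with a browser-like User-Agent and one that looks like curl. The Python sketch below does this with the standard library only. The address is the one published in docker-compose.yml; the helper names (build_request, probe) and the example User-Agent strings are made up for illustration. I don't assert a particular status code for blocked requests here, the script just prints what comes back.

```python
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8888"  # host port published in docker-compose.yml


def build_request(query, user_agent):
    """Build a search request with the given User-Agent header."""
    url = BASE_URL + "/search?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(query, user_agent):
    """Return (status, first 80 bytes of the body) for one request."""
    try:
        with urllib.request.urlopen(build_request(query, user_agent), timeout=10) as resp:
            return resp.status, resp.read(80)
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example a block message) arrive as HTTPError.
        return err.code, err.read(80)
    except OSError as err:
        # Connection refused, timeout, etc.: the stack is probably not up.
        return None, str(err).encode()


if __name__ == "__main__":
    agents = ("Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0", "curl/7.58.0")
    for agent in agents:
        print(agent, "->", probe("docker", agent))
```

With the ruleset from above, the browser-like request should return the start of a result page, while the curl-like one should come back with the "Rate limit exceeded" message from the roboagent limit rule.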

