FrCrawler Scripts
This repository provides scripts to use the FrCrawler results as well as a more complete database schema for Clickhouse.
Getting started: initialize the database
The dev folder contains a docker-compose.yml to start a Clickhouse cluster, useful for development:
- default credentials: user test with password test,
- cluster name: dev_cluster.
# Start Clickhouse cluster
docker-compose -f dev/docker-compose.yml up -d
# Install additional Python modules
# The frcrawler and frcrawler_clustering packages are assumed to be already installed
pip install -r requirements.txt
# Initialise database
python scripts/create-db.py --cluster dev_cluster --database crawler --schemas ./schemas --init-db --dictionnary-database external
The scripts use environment variables for configuration. These variables can also be loaded from the crawler.env file; this default environment file can be overridden with the FRCRAWLER_SCRIPT_ENV_FILE environment variable.
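As an illustration of the env-file mechanism described above, here is a minimal sketch using only the standard library. It is not the scripts' actual loader: the helper names `load_env_file` and `resolve_config` are hypothetical, the KEY=VALUE file syntax is assumed, and the precedence (process environment wins over the file) is an assumption as well.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from an env file; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_config(environ=None):
    """Merge env-file values with the process environment.

    Assumption: variables already set in the process environment take
    precedence over those found in the env file.
    """
    environ = dict(os.environ if environ is None else environ)
    # FRCRAWLER_SCRIPT_ENV_FILE overrides the default crawler.env location.
    env_file = environ.get("FRCRAWLER_SCRIPT_ENV_FILE", "crawler.env")
    merged = load_env_file(env_file)
    merged.update(environ)
    return merged
```

Calling `resolve_config()` with no argument reads the real process environment; passing a dict makes the behaviour easy to check in isolation.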
Configuration
- CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
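To make the structure of the CH_DB_URL value explicit, the following sketch splits the default URL into its components with the standard library. The helper name `describe_db_url` is hypothetical, and the reading of the URL path as the database name is an assumption; the scripts may hand the URL directly to their Clickhouse client instead.

```python
from urllib.parse import urlsplit

# Default value documented above.
DEFAULT_CH_DB_URL = "clickhouse://test:test@localhost:9001/test"


def describe_db_url(url=DEFAULT_CH_DB_URL):
    """Split a clickhouse:// URL into user, password, host, port and database."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Assumption: the path component names the database.
        "database": parts.path.lstrip("/"),
    }
```

With the default value, the user and password match the development credentials listed in the getting-started section.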
Run the clustering script over the computed hashes of HTML pages
# Run the clustering scripts
python scripts/run-clustering.py run
# Cleanup remaining temporary data
python scripts/run-clustering.py clean
Configuration
The following environment variables can be provided to configure the script.
- CLUSTERING_NB_THREAD: (default: 1) number of threads used to run the similarity computation.
- CLUSTERING_MAX_MEMORY: (default: 2GB per thread) maximum memory usage for the similarity computation; the given amount of memory is allocated on startup.
- CLUSTERING_THRESHOLD: (default: 80) similarity threshold, from 0 to 100, for two items to be considered in the same cluster.
- CLUSTER_MIN_SIZE: (default: 10) minimum cluster size.
- CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
- CH_HTTP_URL: (default: http://test:test@localhost:8124/?database=test)
- CH_CLUSTER: (default: dev_cluster)
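The clustering variables listed above could be read and validated along these lines. This is only a sketch: `ClusteringConfig` and `read_clustering_config` are hypothetical names, the validation rules are assumptions derived from the documented ranges, and CLUSTERING_MAX_MEMORY is kept as a raw string because its expected unit and format are not stated here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ClusteringConfig:
    nb_thread: int
    max_memory: Optional[str]  # raw value; None means "use the script's default"
    threshold: int
    min_size: int


def read_clustering_config(environ: Optional[Mapping[str, str]] = None) -> ClusteringConfig:
    """Read the clustering settings, falling back to the documented defaults."""
    env = os.environ if environ is None else environ

    nb_thread = int(env.get("CLUSTERING_NB_THREAD", "1"))
    if nb_thread < 1:
        raise ValueError("CLUSTERING_NB_THREAD must be at least 1")

    threshold = int(env.get("CLUSTERING_THRESHOLD", "80"))
    if not 0 <= threshold <= 100:
        raise ValueError("CLUSTERING_THRESHOLD must be between 0 and 100")

    min_size = int(env.get("CLUSTER_MIN_SIZE", "10"))

    return ClusteringConfig(
        nb_thread=nb_thread,
        max_memory=env.get("CLUSTERING_MAX_MEMORY"),
        threshold=threshold,
        min_size=min_size,
    )
```

Rejecting an out-of-range threshold early avoids starting a long similarity computation with a value the 0 to 100 scale cannot represent.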
Name a cluster and create hints for following runs
# Create a label
python scripts/label-cluster.py create-label --label-name error:not_found
# Label the given cluster (identified by its ID, see table clustering_results)
# * The cluster ID will be replaced by the label ID after this operation
# * A hundred hashes will be randomly copied from the cluster and used as hints to label clusters in subsequent runs of the clustering scripts
python scripts/label-cluster.py label-cluster --cluster-id e3812caa-597c-4f96-9956-0165c28861ed --label-name error:not_found
# List existing labels
python scripts/label-cluster.py list-labels
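The hint selection mentioned in the comments above (a hundred hashes copied at random from the labelled cluster) can be pictured with the sketch below. It is not the code of label-cluster.py: `pick_hints` is a hypothetical helper, the sampling is done in memory rather than in the database, and taking every hash when the cluster holds fewer than a hundred is an assumption.

```python
import random

# Number of hashes copied as hints, per the comment above.
HINT_SAMPLE_SIZE = 100


def pick_hints(cluster_hashes, sample_size=HINT_SAMPLE_SIZE, rng=None):
    """Return up to sample_size distinct hashes drawn at random from a cluster."""
    rng = rng or random.Random()
    # Deduplicate and sort so that a seeded generator gives reproducible picks.
    unique_hashes = sorted(set(cluster_hashes))
    # Assumption: a cluster smaller than sample_size contributes all its hashes.
    return rng.sample(unique_hashes, min(sample_size, len(unique_hashes)))
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can help when comparing labelling results between runs.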
Configuration
- CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
- CH_CLUSTER: (default: dev_cluster)
License
Copyright (C) 2023 Afnic
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.