mirror of https://gitlab.rd.nic.fr/labs/frcrawler/scripts.git synced 2025-07-18 18:38:40 +02:00

Gaël Berthaud-Müller 8c5c866df3 add license		2024-03-06 14:53:48 +01:00
dev	add all files	2024-03-06 14:49:11 +01:00
schemas	add all files	2024-03-06 14:49:11 +01:00
scripts	add license mention	2024-03-06 14:51:03 +01:00
LICENSE	add license	2024-03-06 14:53:48 +01:00
README.md	add all files	2024-03-06 14:49:11 +01:00
requirements.txt	add all files	2024-03-06 14:49:11 +01:00

FrCrawler Scripts

This repository provides scripts for working with the FrCrawler results, as well as a more complete database schema for ClickHouse.

Getting started: initialize the database

The dev folder contains a docker-compose.yml that starts a ClickHouse cluster, useful for development:

default credentials: user test with password test,
cluster name: dev_cluster.

# Start Clickhouse cluster
docker-compose -f dev/docker-compose.yml up -d

# Install additional Python modules
# The frcrawler and frcrawler_clustering packages are assumed to be installed already
pip install -r requirements.txt

# Initialise database
python scripts/create-db.py --cluster dev_cluster --database crawler --schemas ./schemas --init-db --dictionnary-database external

The scripts read their configuration from environment variables. These variables can also be loaded from the crawler.env file; this default environment file can be overridden with the environment variable FRCRAWLER_SCRIPT_ENV_FILE.
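The lookup order described above can be sketched as follows. This is a minimal illustration, not the code used by the scripts: the function name load_env_file, the KEY=VALUE file format, and the rule that variables already present in the environment take precedence are all assumptions.

```python
import os
from pathlib import Path


def load_env_file(default_path="crawler.env", environ=None):
    """Load KEY=VALUE pairs from an env file into `environ` (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # FRCRAWLER_SCRIPT_ENV_FILE, when set, replaces the default file name.
    path = Path(environ.get("FRCRAWLER_SCRIPT_ENV_FILE", default_path))
    if not path.is_file():
        return environ
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: values already set in the environment win over the file.
        environ.setdefault(key.strip(), value.strip())
    return environ
```

Calling load_env_file() before reading any setting would make the values from the file visible through os.environ for the rest of the process.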

Configuration

CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
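To show what the default URL encodes, here is a small sketch that splits it into connection parameters using only the standard library. The helper name parse_db_url and the returned dictionary keys are illustrative; the scripts may hand the URL directly to their ClickHouse client instead.

```python
import os
from urllib.parse import urlsplit

DEFAULT_CH_DB_URL = "clickhouse://test:test@localhost:9001/test"


def parse_db_url(url):
    """Split a clickhouse:// URL into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path component, without its leading slash, names the database.
        "database": parts.path.lstrip("/"),
    }


# Fall back to the documented default when CH_DB_URL is not set.
settings = parse_db_url(os.environ.get("CH_DB_URL", DEFAULT_CH_DB_URL))
```

With the default value, this yields user test, password test, host localhost, port 9001 and database test.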

Run the clustering script over the computed hashes of HTML pages

# Run the clustering scripts
python scripts/run-clustering.py run

# Cleanup remaining temporary data
python scripts/run-clustering.py clean

Configuration

The following environment variables can be provided to configure the script.

CLUSTERING_NB_THREAD: (default: 1) number of threads used to run the similarity computation
CLUSTERING_MAX_MEMORY: (default: 2GB per thread) maximum memory usage for the similarity computation; this amount of memory is allocated on startup.
CLUSTERING_THRESHOLD: (default: 80) similarity threshold for two items to be considered in the same cluster, from 0 to 100.
CLUSTER_MIN_SIZE: (default: 10) minimum cluster size
CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
CH_HTTP_URL: (default: http://test:test@localhost:8124/?database=test)
CH_CLUSTER: (default: dev_cluster)
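As an illustration of how these settings and their defaults fit together, the sketch below reads the numeric variables and checks their ranges. It is not taken from run-clustering.py: the function name, the returned keys, and the choice to parse the threshold as a float are assumptions. CLUSTERING_MAX_MEMORY is left out because its value format is not specified here.

```python
import os


def read_clustering_config(environ=None):
    """Read clustering settings with the documented defaults (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    nb_thread = int(environ.get("CLUSTERING_NB_THREAD", "1"))
    # Assumption: the threshold may be fractional, so parse it as a float.
    threshold = float(environ.get("CLUSTERING_THRESHOLD", "80"))
    min_size = int(environ.get("CLUSTER_MIN_SIZE", "10"))

    if nb_thread < 1:
        raise ValueError("CLUSTERING_NB_THREAD must be at least 1")
    if not 0 <= threshold <= 100:
        raise ValueError("CLUSTERING_THRESHOLD must be between 0 and 100")
    if min_size < 1:
        raise ValueError("CLUSTER_MIN_SIZE must be at least 1")

    return {"nb_thread": nb_thread, "threshold": threshold, "min_size": min_size}
```

Validating the values once at startup surfaces a mistyped threshold before any memory is allocated for the similarity computation.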

Name a cluster and create hints for subsequent runs

# Create a label
python scripts/label-cluster.py create-label --label-name error:not_found

# Label the given cluster (identified by its ID, see table clustering_results)
# * The cluster ID will be replaced by the label ID after this operation
# * A hundred hashes will be randomly copied from the cluster and used as hints, so that clusters can be labeled in subsequent runs of the clustering scripts
python scripts/label-cluster.py label-cluster --cluster-id e3812caa-597c-4f96-9956-0165c28861ed --label-name error:not_found

# List existing labels
python scripts/label-cluster.py list-labels
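The hint selection mentioned in the comments above (a hundred hashes copied at random from the labeled cluster) can be pictured with the sketch below. It is an in-memory illustration only: the function name is hypothetical, and the actual script may well perform the sampling inside ClickHouse rather than in Python.

```python
import random

HINT_SAMPLE_SIZE = 100  # "a hundred hashes", per the comments above


def pick_hint_hashes(cluster_hashes, sample_size=HINT_SAMPLE_SIZE, rng=None):
    """Pick up to `sample_size` distinct hashes at random (hypothetical helper)."""
    rng = rng or random.Random()
    # Deduplicate first so the same hash is never stored twice as a hint.
    unique = sorted(set(cluster_hashes))
    if len(unique) <= sample_size:
        # Small clusters contribute all of their hashes.
        return unique
    return rng.sample(unique, sample_size)
```

Passing a seeded random.Random instance as rng makes the selection reproducible, which can help when comparing labeling results between runs.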

Configuration

CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
CH_CLUSTER: (default: dev_cluster)

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.