Gaël Berthaud-Müller 8c5c866df3 add license
2024-03-06 14:53:48 +01:00

FrCrawler Scripts

This repository provides scripts to use the FrCrawler results as well as a more complete database schema for Clickhouse.

Getting started: initialize the database

The dev folder contains a docker-compose.yml that starts a Clickhouse cluster, which is useful for development:

  • default credentials: user test with password test,
  • cluster name: dev_cluster.
# Start Clickhouse cluster
docker-compose -f dev/docker-compose.yml up -d

# Install additional Python modules
# The frcrawler and frcrawler_clustering packages are assumed to be installed already
pip install -r requirements.txt

# Initialise database
python scripts/create-db.py --cluster dev_cluster --database crawler --schemas ./schemas --init-db --dictionnary-database external

The scripts use environment variables for configuration. These variables can also be loaded from the crawler.env file. This default environment file can be overridden with the environment variable FRCRAWLER_SCRIPT_ENV_FILE.
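
The lookup described above can be pictured with the following minimal sketch. It is not the code used by the scripts: the helper names `load_env_file` and `resolve_settings` are hypothetical, the KEY=VALUE file format is an assumption, and so is the choice to let variables from the process environment take precedence over the file.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines; blank lines and # comments are skipped (assumed format)."""
    values = {}
    file = Path(path)
    if not file.is_file():
        return values  # a missing env file is treated as empty
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(environ=None):
    """Merge the env file with the environment (environment wins: an assumption)."""
    environ = os.environ if environ is None else environ
    # FRCRAWLER_SCRIPT_ENV_FILE overrides the default crawler.env location
    env_file = environ.get("FRCRAWLER_SCRIPT_ENV_FILE", "crawler.env")
    settings = load_env_file(env_file)
    settings.update(environ)
    return settings
```

Calling `resolve_settings()` with no argument reads the real process environment; passing a plain dictionary makes the behaviour easy to check in isolation.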

Configuration

  • CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
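
To make the shape of the default CH_DB_URL easier to read, the sketch below splits it into its components with the Python standard library. The function name `describe_db_url` is hypothetical, and reading the URL path as the database name is an assumption; the scripts may interpret the URL differently.

```python
from urllib.parse import urlsplit

# Default value documented above; it matches the dev cluster credentials
DEFAULT_CH_DB_URL = "clickhouse://test:test@localhost:9001/test"


def describe_db_url(url=DEFAULT_CH_DB_URL):
    """Return the components of a clickhouse:// URL as a dictionary."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Assumption: the path segment names the database
        "database": parts.path.lstrip("/"),
    }
```

When pointing the scripts at another server, a quick call to `describe_db_url("clickhouse://user:secret@db.example.org:9000/crawler")` (a made-up URL) helps confirm that each part lands where expected before exporting the variable.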

Run the clustering script over the computed hashes of HTML pages

# Run the clustering scripts
python scripts/run-clustering.py run

# Cleanup remaining temporary data
python scripts/run-clustering.py clean

Configuration

The following environment variables can be provided to configure the script.

  • CLUSTERING_NB_THREAD: (default: 1) number of threads used to run the similarity computation.
  • CLUSTERING_MAX_MEMORY: (default: 2GB per thread) maximum memory usage for the similarity computation. The given amount of memory is allocated on startup.
  • CLUSTERING_THRESHOLD: (default: 80) similarity threshold, from 0 to 100, for two items to be considered part of the same cluster.
  • CLUSTER_MIN_SIZE: (default: 10) minimum cluster size.
  • CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
  • CH_HTTP_URL: (default: http://test:test@localhost:8124/?database=test)
  • CH_CLUSTER: (default: dev_cluster)
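
As a rough illustration of how CLUSTERING_THRESHOLD and CLUSTER_MIN_SIZE interact, here is a toy sketch. It is not the algorithm implemented by frcrawler_clustering: the single-linkage grouping, the inclusive comparison (`>=`), the integer parsing of both variables, and the caller-supplied `similarity` function are all assumptions made for the example.

```python
import os
from itertools import combinations


def settings_from_env(environ=None):
    """Read the two thresholds, falling back to the documented defaults."""
    environ = os.environ if environ is None else environ
    return {
        "threshold": int(environ.get("CLUSTERING_THRESHOLD", "80")),
        "min_size": int(environ.get("CLUSTER_MIN_SIZE", "10")),
    }


def cluster_by_threshold(items, similarity, threshold=80, min_size=10):
    """Link items whose similarity reaches the threshold, then drop small groups."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; similar items end up under the same root
    for a, b in combinations(range(len(items)), 2):
        if similarity(items[a], items[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for index, item in enumerate(items):
        groups.setdefault(find(index), []).append(item)
    # Groups below the minimum size are not reported as clusters
    return [group for group in groups.values() if len(group) >= min_size]
```

In this sketch a higher threshold splits the data into more, tighter groups, and a higher minimum size then discards more of them. The pairwise loop is quadratic and only suitable for small examples.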

Name a cluster and create hints for following runs

# Create a label
python scripts/label-cluster.py create-label --label-name error:not_found

# Label the given cluster (identified by its ID, see table clustering_results)
# * The cluster ID will be replaced by the label ID after this operation
# * A hundred hashes will be randomly copied from the cluster and used as hints to label clusters in subsequent runs of the clustering scripts
python scripts/label-cluster.py label-cluster --cluster-id e3812caa-597c-4f96-9956-0165c28861ed --label-name error:not_found

# List existing labels
python scripts/label-cluster.py list-labels

Configuration

  • CH_DB_URL: (default: clickhouse://test:test@localhost:9001/test)
  • CH_CLUSTER: (default: dev_cluster)
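
The hint selection mentioned in the label-cluster comments (a hundred hashes randomly copied from the cluster) can be sketched as follows. This is only an illustration: the function name `sample_hints`, the de-duplication step, and the optional `seed` parameter are assumptions, and the actual script may select hints differently, for example directly in the database.

```python
import random


def sample_hints(cluster_hashes, count=100, seed=None):
    """Pick up to `count` distinct hashes at random from a cluster."""
    rng = random.Random(seed)  # a fixed seed gives a repeatable selection
    unique = sorted(set(cluster_hashes))  # assumption: duplicates count once
    # Never ask for more hashes than the cluster contains
    return rng.sample(unique, min(count, len(unique)))
```

A cluster smaller than the requested count simply contributes all of its hashes.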

License

Copyright (C) 2023 Afnic

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.