frcrawler-scripts/README.md
Gaël Berthaud-Müller 38679a009a add all files
2024-03-06 14:49:11 +01:00


# FrCrawler Scripts
This repository provides scripts for working with FrCrawler results, along with a more complete database schema for ClickHouse.
## Getting started: initialize the database
The `dev` folder contains a `docker-compose.yml` file that starts a ClickHouse cluster, useful for development:
* default credentials: user `test` with password `test`,
* cluster name: `dev_cluster`.
```sh
# Start the ClickHouse cluster
docker-compose -f dev/docker-compose.yml up -d
# Install additional Python modules
# The frcrawler and frcrawler_clustering packages are assumed to be already installed
pip install -r requirements.txt
# Initialise database
python scripts/create-db.py --cluster dev_cluster --database crawler --schemas ./schemas --init-db --dictionnary-database external
```
The scripts are configured through environment variables. These can also be loaded from the `crawler.env` file; this default environment file can be overridden with the `FRCRAWLER_SCRIPT_ENV_FILE` environment variable.
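To make the lookup order concrete, here is a minimal sketch of how such an environment file could be loaded. The helper name `load_env_file`, the `KEY=VALUE` file format, and the rule that variables already set in the environment win over the file are all assumptions made for this example; they do not describe how the scripts are actually implemented.

```python
import os
from pathlib import Path


def load_env_file(default_path="crawler.env", environ=None):
    """Hypothetical helper: merge KEY=VALUE lines from an env file into environ."""
    environ = os.environ if environ is None else environ
    # FRCRAWLER_SCRIPT_ENV_FILE, when set, replaces the default file path.
    path = Path(environ.get("FRCRAWLER_SCRIPT_ENV_FILE", default_path))
    if not path.is_file():
        return environ  # no env file: rely on the environment only
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: variables already present in the environment take precedence.
        environ.setdefault(key.strip(), value.strip())
    return environ
```

Passing a plain dictionary as `environ` makes it possible to try the helper without touching the real process environment.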
### Configuration
* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`) ClickHouse database connection URL
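The default value packs the user, password, host, port and database into a single URL. The sketch below shows one way to split such a value with the Python standard library. The helper `parse_ch_db_url` and the keys of the returned dictionary are hypothetical names chosen for this example, and the scripts themselves may interpret the URL differently.

```python
import os
from urllib.parse import urlsplit

# Default taken from the configuration list above.
DEFAULT_CH_DB_URL = "clickhouse://test:test@localhost:9001/test"


def parse_ch_db_url(url=None):
    """Hypothetical helper: split a CH_DB_URL value into connection parameters."""
    # Fall back to the environment, then to the documented default.
    url = url or os.environ.get("CH_DB_URL", DEFAULT_CH_DB_URL)
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The database name is the URL path without its leading slash.
        "database": parts.path.lstrip("/"),
    }
```

With the default URL, this yields the user `test`, the password `test`, the host `localhost`, the port `9001` and the database `test`.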
## Run the clustering script over the computed hashes of HTML pages
```sh
# Run the clustering scripts
python scripts/run-clustering.py run
# Clean up remaining temporary data
python scripts/run-clustering.py clean
```
### Configuration
The following environment variables can be provided to configure the script.
* `CLUSTERING_NB_THREAD`: (default: 1) number of threads used to run the similarity computation
* `CLUSTERING_MAX_MEMORY`: (default: 2GB per thread) maximum memory usage for the similarity computation; this amount of memory is allocated on startup
* `CLUSTERING_THRESHOLD`: (default: 80) similarity threshold for two items to be considered in the same cluster, from 0 to 100.
* `CLUSTER_MIN_SIZE`: (default: 10) minimum cluster size
* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`) ClickHouse database connection URL
* `CH_HTTP_URL`: (default: `http://test:test@localhost:8124/?database=test`) ClickHouse HTTP interface URL
* `CH_CLUSTER`: (default: `dev_cluster`) name of the ClickHouse cluster
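To illustrate what `CLUSTERING_THRESHOLD` and `CLUSTER_MIN_SIZE` control, here is a toy sketch that groups items from a list of pairwise similarity scores. It is not the algorithm implemented by `frcrawler_clustering`: the function `group_by_similarity`, the single-linkage grouping strategy, and the choice to treat a score equal to the threshold as a match are assumptions made purely for illustration.

```python
from collections import defaultdict


def group_by_similarity(n_items, scored_pairs, threshold=80, min_size=10):
    """Toy single-linkage grouping; not the frcrawler_clustering algorithm."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, score in scored_pairs:
        # Link two items only when their similarity reaches the threshold.
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in range(n_items):
        groups[find(i)].append(i)
    # Drop groups smaller than the minimum cluster size.
    return sorted(g for g in groups.values() if len(g) >= min_size)
```

In this sketch, raising the threshold splits loosely related items into separate groups, while raising the minimum size discards small groups from the result.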
## Name a cluster and create hints for following runs
```sh
# Create a label
python scripts/label-cluster.py create-label --label-name error:not_found
# Label the given cluster (identified by its ID, see table clustering_results)
# * The cluster ID will be replaced by the label ID after this operation
# * A hundred hashes will be randomly copied from the cluster and used as hints to label clusters in subsequent runs of the clustering scripts
python scripts/label-cluster.py label-cluster --cluster-id e3812caa-597c-4f96-9956-0165c28861ed --label-name error:not_found
# List existing labels
python scripts/label-cluster.py list-labels
```
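The hint selection mentioned in the comments above amounts to drawing a random sample of hashes from the labelled cluster. The following sketch shows that idea only. The helper `pick_hints`, its `seed` parameter, and the handling of clusters with fewer than a hundred hashes are assumptions, and how the real script stores the hints or matches them in later runs is not covered here.

```python
import random


def pick_hints(cluster_hashes, count=100, seed=None):
    """Hypothetical helper: choose up to `count` distinct hashes at random."""
    rng = random.Random(seed)
    # Deduplicate, then sort so that a fixed seed gives a reproducible sample.
    unique = sorted(set(cluster_hashes))
    # Assumption: a cluster smaller than `count` contributes all of its hashes.
    return rng.sample(unique, min(count, len(unique)))
```

For example, `pick_hints(hashes, seed=0)` returns at most a hundred distinct entries of `hashes`, and the same entries on every call with that seed.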
### Configuration
* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`) ClickHouse database connection URL
* `CH_CLUSTER`: (default: `dev_cluster`) name of the ClickHouse cluster
## License
Copyright (C) 2023 Afnic

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.