mirror of
https://gitlab.rd.nic.fr/labs/frcrawler/scripts.git
synced 2025-04-04 19:45:48 +02:00
# FrCrawler Scripts

This repository provides scripts for working with FrCrawler results, as well as a more complete database schema for ClickHouse.

## Getting started: initialize the database

The `dev` folder contains a `docker-compose.yml` that starts a ClickHouse cluster, useful for development:
* default credentials: user `test` with password `test`,
* cluster name: `dev_cluster`.

```sh
# Start the ClickHouse cluster
docker-compose -f dev/docker-compose.yml up -d

# Install additional Python modules
# The frcrawler and frcrawler_clustering packages are assumed to be already installed
pip install -r requirements.txt

# Initialize the database
python scripts/create-db.py --cluster dev_cluster --database crawler --schemas ./schemas --init-db --dictionnary-database external
```

The scripts are configured through environment variables. These can also be loaded from the `crawler.env` file; to use a different environment file, set the `FRCRAWLER_SCRIPT_ENV_FILE` environment variable.

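The lookup order described above can be sketched as follows. This is only an illustration, not the code used by the scripts: the `load_env_file` helper is hypothetical, the `KEY=VALUE` file format is an assumption, and so is the choice to let variables already present in the environment take precedence over the file.

```python
import os
from pathlib import Path


def load_env_file(default_path="crawler.env", environ=os.environ):
    """Load KEY=VALUE pairs from an env file into `environ` (hypothetical helper)."""
    # FRCRAWLER_SCRIPT_ENV_FILE, when set, replaces the default file name.
    path = Path(environ.get("FRCRAWLER_SCRIPT_ENV_FILE", default_path))
    if not path.is_file():
        return {}
    loaded = {}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        # Assumption: variables already set in the environment win over the file.
        if key not in environ:
            environ[key] = value
            loaded[key] = value
    return loaded
```

Calling `load_env_file()` before reading any setting would make values from the file visible through `os.environ`, while leaving explicitly exported variables untouched.
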
### Configuration

* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`)

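To see what the default `CH_DB_URL` encodes, the URL can be split with Python's standard library. The `describe_db_url` helper below is hypothetical and only meant for inspection; how the scripts themselves parse the URL is not specified here.

```python
import os
from urllib.parse import urlsplit

DEFAULT_CH_DB_URL = "clickhouse://test:test@localhost:9001/test"


def describe_db_url(url=None):
    """Split a ClickHouse-style URL into its components (hypothetical helper)."""
    # Fall back to the environment, then to the documented default.
    parts = urlsplit(url or os.environ.get("CH_DB_URL", DEFAULT_CH_DB_URL))
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


print(describe_db_url(DEFAULT_CH_DB_URL))
```

With the default value, this reports user `test`, password `test`, host `localhost`, port `9001`, and database `test`, which matches the credentials of the development cluster.
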
## Run the clustering script over the computed hashes of HTML pages

```sh
# Run the clustering scripts
python scripts/run-clustering.py run

# Clean up remaining temporary data
python scripts/run-clustering.py clean
```

### Configuration

The following environment variables can be provided to configure the script.
* `CLUSTERING_NB_THREAD`: (default: 1) number of threads used to run the similarity computation.
* `CLUSTERING_MAX_MEMORY`: (default: 2GB per thread) maximum memory usage for the similarity computation; this amount of memory is allocated on startup.
* `CLUSTERING_THRESHOLD`: (default: 80) similarity threshold, from 0 to 100, for two items to be considered part of the same cluster.
* `CLUSTER_MIN_SIZE`: (default: 10) minimum cluster size.
* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`)
* `CH_HTTP_URL`: (default: `http://test:test@localhost:8124/?database=test`)
* `CH_CLUSTER`: (default: `dev_cluster`)

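As a rough illustration of how these variables and their defaults fit together, the sketch below reads them into a small settings object. The `ClusteringSettings` class and `read_clustering_settings` function are hypothetical names, the values are assumed to be plain integers, and `CLUSTERING_MAX_MEMORY` is left out because its value format is not documented here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringSettings:
    nb_thread: int
    threshold: int
    min_size: int
    cluster: str


def read_clustering_settings(environ=os.environ):
    """Read the documented variables, falling back to the documented defaults."""
    settings = ClusteringSettings(
        nb_thread=int(environ.get("CLUSTERING_NB_THREAD", "1")),
        threshold=int(environ.get("CLUSTERING_THRESHOLD", "80")),
        min_size=int(environ.get("CLUSTER_MIN_SIZE", "10")),
        cluster=environ.get("CH_CLUSTER", "dev_cluster"),
    )
    # Assumption: at least one worker thread is required.
    if settings.nb_thread < 1:
        raise ValueError("CLUSTERING_NB_THREAD must be at least 1")
    # The threshold is documented as a value from 0 to 100.
    if not 0 <= settings.threshold <= 100:
        raise ValueError("CLUSTERING_THRESHOLD must be between 0 and 100")
    return settings
```

For example, exporting `CLUSTERING_NB_THREAD=4` before calling `read_clustering_settings()` would yield `nb_thread == 4` while every other field keeps its default.
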
## Name a cluster and create hints for subsequent runs

```sh
# Create a label
python scripts/label-cluster.py create-label --label-name error:not_found

# Label the given cluster (identified by its ID, see table clustering_results)
# * The cluster ID will be replaced by the label ID after this operation
# * A hundred hashes will be randomly copied from the cluster and used as hints to label clusters in subsequent runs of the clustering scripts
python scripts/label-cluster.py label-cluster --cluster-id e3812caa-597c-4f96-9956-0165c28861ed --label-name error:not_found

# List existing labels
python scripts/label-cluster.py list-labels
```

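The hint selection mentioned in the comments above (a hundred hashes picked at random from the labelled cluster) can be pictured with the following sketch. It is not the implementation used by `label-cluster.py`, which may well do the sampling inside the database; the `pick_hints` function and its parameters are hypothetical.

```python
import random


def pick_hints(cluster_hashes, max_hints=100, seed=None):
    """Return up to `max_hints` distinct hashes chosen at random from a cluster."""
    # Deduplicate first so the same hash is never stored twice as a hint.
    unique_hashes = sorted(set(cluster_hashes))
    rng = random.Random(seed)
    # Clusters smaller than `max_hints` simply contribute all of their hashes.
    return rng.sample(unique_hashes, min(max_hints, len(unique_hashes)))
```

Passing a `seed` makes the selection reproducible, which can help when comparing labelling results between runs.
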
### Configuration

* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`)
* `CH_CLUSTER`: (default: `dev_cluster`)

## License

Copyright (C) 2023 Afnic

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.