
# FrCrawler Scripts

This repository provides scripts for working with FrCrawler results, as well as a more complete database schema for ClickHouse.

## Getting started: initialize the database

The `dev` folder contains a `docker-compose.yml` that starts a ClickHouse cluster for development:
* default credentials: user `test` with password `test`,
* cluster name: `dev_cluster`.

```sh
# Start Clickhouse cluster
docker-compose -f dev/docker-compose.yml up -d

# Install additional Python modules
# The frcrawler and frcrawler_clustering packages are assumed to be already installed
pip install -r requirements.txt

# Initialise database
python scripts/create-db.py --cluster dev_cluster --database crawler --schemas ./schemas --init-db --dictionnary-database external
```


The scripts are configured through environment variables. These can also be loaded from the `crawler.env` file; to use a different file, set the `FRCRAWLER_SCRIPT_ENV_FILE` environment variable.
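
The lookup described above can be sketched in Python as follows. This is only an illustration, not the scripts' actual loader: the `KEY=VALUE` file syntax, the helper names (`resolve_env_file`, `parse_env_file`, `merged_config`) and the rule that process environment variables take precedence over the file are all assumptions.

```python
import os

DEFAULT_ENV_FILE = "crawler.env"


def resolve_env_file(environ):
    """Return the env file path, honouring FRCRAWLER_SCRIPT_ENV_FILE if set."""
    return environ.get("FRCRAWLER_SCRIPT_ENV_FILE", DEFAULT_ENV_FILE)


def parse_env_file(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are skipped.

    Assumption: the file uses plain KEY=VALUE syntax, optionally quoted.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def merged_config(file_values, environ):
    """Assumption: variables set in the process environment win over the file."""
    return {**file_values, **environ}


if __name__ == "__main__":
    path = resolve_env_file(os.environ)
    file_values = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            file_values = parse_env_file(handle.read())
    config = merged_config(file_values, os.environ)
    print(config.get("CH_DB_URL"))
```

Under these assumptions, running a script with `FRCRAWLER_SCRIPT_ENV_FILE=/path/to/other.env` would read that file instead of `crawler.env`.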

### Configuration

* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`) ClickHouse connection URL
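
To see which parts of the default `CH_DB_URL` map to which connection settings, the URL can be split with Python's standard library. This is a small sketch for inspection only; the helper name `describe_db_url` is hypothetical and the scripts may parse the URL differently.

```python
from urllib.parse import urlsplit

DEFAULT_CH_DB_URL = "clickhouse://test:test@localhost:9001/test"


def describe_db_url(url=DEFAULT_CH_DB_URL):
    """Split a connection URL into its user, host, port and database parts."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The URL path, minus its leading slash, is read as the database name
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    for key, value in describe_db_url().items():
        print(f"{key}: {value}")
```

With the default value, the user and password are both `test`, the host is `localhost`, the port is `9001` and the database is `test`.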

## Run the clustering script over the computed hashes of HTML pages

```sh
# Run the clustering scripts
python scripts/run-clustering.py run

# Cleanup remaining temporary data
python scripts/run-clustering.py clean
```

### Configuration

The following environment variables configure the script:
* `CLUSTERING_NB_THREAD`: (default: 1) number of threads used for the similarity computation
* `CLUSTERING_MAX_MEMORY`: (default: 2GB per thread) maximum memory usage for the similarity computation; this amount of memory is allocated at startup
* `CLUSTERING_THRESHOLD`: (default: 80) similarity threshold, from 0 to 100, for two items to be considered part of the same cluster
* `CLUSTER_MIN_SIZE`: (default: 10) minimum cluster size
* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`) ClickHouse connection URL
* `CH_HTTP_URL`: (default: `http://test:test@localhost:8124/?database=test`) ClickHouse HTTP interface URL
* `CH_CLUSTER`: (default: `dev_cluster`) name of the ClickHouse cluster
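
As an illustration of how the documented defaults and ranges fit together, the sketch below reads three of these variables from a mapping such as `os.environ` and validates them. It is not the script's own code: the `ClusteringConfig` class, the `load_clustering_config` helper, the validation rules and the choice to accept a fractional threshold are assumptions. `CLUSTERING_MAX_MEMORY` is left out because its value format is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringConfig:
    nb_thread: int
    threshold: float
    min_size: int


def load_clustering_config(environ):
    """Read clustering settings from a mapping such as os.environ."""
    config = ClusteringConfig(
        nb_thread=int(environ.get("CLUSTERING_NB_THREAD", "1")),
        threshold=float(environ.get("CLUSTERING_THRESHOLD", "80")),
        min_size=int(environ.get("CLUSTER_MIN_SIZE", "10")),
    )
    # Assumed sanity checks, derived from the ranges documented above
    if config.nb_thread < 1:
        raise ValueError("CLUSTERING_NB_THREAD must be at least 1")
    if not 0 <= config.threshold <= 100:
        raise ValueError("CLUSTERING_THRESHOLD must be between 0 and 100")
    if config.min_size < 1:
        raise ValueError("CLUSTER_MIN_SIZE must be at least 1")
    return config


if __name__ == "__main__":
    import os

    print(load_clustering_config(os.environ))
```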

## Name a cluster and create hints for subsequent runs

```sh
# Create a label
python scripts/label-cluster.py create-label --label-name error:not_found

# Label the given cluster (identified by its ID, see the clustering_results table)
# * The cluster ID will be replaced by the label ID after this operation
# * A hundred hashes will be randomly copied from the cluster and used as hints to label clusters in subsequent runs of the clustering script
python scripts/label-cluster.py label-cluster --cluster-id e3812caa-597c-4f96-9956-0165c28861ed --label-name error:not_found

# List existing labels
python scripts/label-cluster.py list-labels
```
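
The hint selection mentioned in the comments above amounts to drawing a random sample of at most one hundred hashes from the labelled cluster. A minimal Python sketch of that idea follows; the `sample_hints` helper is hypothetical, and the actual script may well perform the sampling inside the database rather than in Python.

```python
import random

HINT_COUNT = 100  # "a hundred hashes", as stated in the comments above


def sample_hints(cluster_hashes, count=HINT_COUNT, rng=None):
    """Pick up to `count` hashes at random, without replacement, from a cluster."""
    rng = rng or random.Random()
    hashes = list(cluster_hashes)
    # A cluster smaller than `count` contributes all of its hashes
    return rng.sample(hashes, min(count, len(hashes)))


if __name__ == "__main__":
    # Placeholder values standing in for real page hashes
    cluster = [f"hash-{i}" for i in range(250)]
    print(len(sample_hints(cluster)))
```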

### Configuration

* `CH_DB_URL`: (default: `clickhouse://test:test@localhost:9001/test`) ClickHouse connection URL
* `CH_CLUSTER`: (default: `dev_cluster`) name of the ClickHouse cluster

## License

Copyright (C) 2023 Afnic

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.