BERT Score¶

Module Interface¶

class torchmetrics.text.bert.BERTScore(model_name_or_path=None, num_layers=None, all_layers=False, model=None, user_tokenizer=None, user_forward_fn=None, verbose=False, idf=False, device=None, max_length=512, batch_size=64, num_threads=4, return_hash=False, lang='en', rescale_with_baseline=False, baseline_path=None, baseline_url=None, compute_on_step=None, **kwargs)[source]

BERTScore, introduced in "BERTScore: Evaluating Text Generation with BERT", leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. BERTScore also computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.

This implementation follows the original implementation from BERT_score.
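
The matching step described above can be illustrated with a minimal pure-Python sketch. It assumes the greedy matching scheme from the BERTScore paper: each token is paired with its most similar token in the other sentence, and the best similarities are averaged. The function names and the 2-d toy vectors are made up for illustration; this is not the library's internal code, which works on batched contextual embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_emb, ref_emb):
    """Return (precision, recall, f1) for one candidate/reference pair."""
    # Precision: every candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    # Recall: every reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-d vectors standing in for contextual token embeddings (illustrative only).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]
p, r, f = greedy_bertscore(candidate, reference)
print(round(p, 3), round(r, 3), round(f, 3))
```

Identical sentences yield a score of 1.0 for all three measures, which matches the first sentence pair in the Example sections of this page.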

Parameters

preds¶ – An iterable of predicted sentences.
target¶ – An iterable of target sentences.
model_name_or_path¶ (Optional[str]) – A name or a model path used to load a transformers pretrained model.
num_layers¶ (Optional[int]) – A layer of representation to use.
all_layers¶ (bool) – An indication of whether the representation from all model’s layers should be used. If all_layers = True, the argument num_layers is ignored.
model¶ (Optional[Module]) – A user’s own model. Must be a torch.nn.Module instance.
user_tokenizer¶ (Optional[Any]) – A user’s own tokenizer used with the user’s own model. This must be an instance with the __call__ method. This method must take an iterable of sentences (List[str]) and must return a Python dictionary containing “input_ids” and “attention_mask” represented by torch.Tensor. It is up to the user’s model whether “input_ids” is a torch.Tensor of input ids or of embedding vectors. This tokenizer must prepend an equivalent of the [CLS] token and append an equivalent of the [SEP] token, as a transformers tokenizer does.
user_forward_fn¶ (Optional[Callable[[Module, Dict[str, Tensor]], Tensor]]) – A user’s own forward function used in combination with the user’s model. This function must take the model and a Python dictionary containing “input_ids” and “attention_mask” represented by torch.Tensor as input and return the model’s output as a single torch.Tensor.
verbose¶ (bool) – An indication of whether a progress bar should be displayed during the embeddings calculation.
idf¶ (bool) – An indication of whether normalization using inverse document frequencies should be used.
device¶ (Union[str, device, None]) – A device to be used for calculation.
max_length¶ (int) – A maximum length of input sequences. Sequences longer than max_length are to be trimmed.
batch_size¶ (int) – A batch size used for model processing.
num_threads¶ (int) – A number of threads to use for a dataloader.
return_hash¶ (bool) – An indication of whether the corresponding hash_code should be returned.
lang¶ (str) – A language of input sentences.
rescale_with_baseline¶ (bool) – An indication of whether bertscore should be rescaled with a pre-computed baseline. When a pretrained model from transformers is used, the corresponding baseline is downloaded from the original bert-score package from BERT_score if available. In other cases, please specify a path to the baseline csv/tsv file, which must follow the formatting of the files from BERT_score.
baseline_path¶ (Optional[str]) – A path to the user’s own local csv/tsv file with the baseline scale.
baseline_url¶ (Optional[str]) – A URL to the user’s own csv/tsv file with the baseline scale.
compute_on_step¶ (Optional[bool]) –
Forward only calls update() and returns None if this is set to False.

Deprecated since version v0.8: This argument no longer has any effect and will be removed in v0.9.
kwargs¶ (Dict[str, Any]) – Additional keyword arguments, see Advanced metric settings for more info.

Returns

Python dictionary containing the keys precision, recall and f1 with corresponding values.

Example

>>> from torchmetrics.text.bert import BERTScore
>>> preds = ["hello there", "general kenobi"]
>>> target = ["hello there", "master kenobi"]
>>> bertscore = BERTScore()
>>> score = bertscore(preds, target)
>>> from pprint import pprint
>>> rounded_score = {k: [round(v, 3) for v in vv] for k, vv in score.items()}
>>> pprint(rounded_score)
{'f1': [1.0, 0.996], 'precision': [1.0, 0.996], 'recall': [1.0, 0.996]}
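
The idf option can be illustrated with a small sketch of inverse-document-frequency weighting. It assumes the scheme described in the BERTScore paper: idf values are estimated from the reference sentences with plus-one smoothing, then used to weight the per-token best-match similarities. The helper names, the exact smoothing formula, and the similarity values are assumptions for illustration; the library's internal computation may differ in detail.

```python
import math
from collections import Counter


def idf_weights(reference_corpus):
    """Map each token to a plus-one smoothed inverse document frequency."""
    num_docs = len(reference_corpus)
    doc_freq = Counter()
    for tokens in reference_corpus:
        doc_freq.update(set(tokens))  # count each token at most once per sentence
    return {tok: math.log((num_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}


def weighted_average(best_similarities, tokens, idf):
    """idf-weighted mean of the per-token best-match similarities."""
    weights = [idf[tok] for tok in tokens]
    return sum(w * s for w, s in zip(weights, best_similarities)) / sum(weights)


# Tokenized reference sentences (whitespace tokens used only to keep the sketch short).
references = [["hello", "there"], ["master", "kenobi"], ["hello", "master"]]
idf = idf_weights(references)

# Hypothetical best-match similarities for the tokens of the first reference.
recall = weighted_average([0.9, 0.5], ["hello", "there"], idf)
print(round(recall, 3))
```

Because "there" is rarer than "hello" in this toy corpus, its lower similarity pulls the weighted recall below the unweighted mean of 0.7.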

compute()[source]

Calculate BERT scores.

Return type: Dict[str, Union[List[float], str]]
Returns: Python dictionary containing the keys precision, recall and f1 with corresponding values.
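
The returned dictionary holds one value per sentence pair, so a corpus-level figure has to be aggregated by the caller. The sketch below assumes the values are plain Python lists of floats, as in the Example above, and that any extra string entry (such as a hash returned with return_hash=True) should be skipped. The summarize helper and the "hash" key with its placeholder value are illustrative assumptions, not part of the library.

```python
from statistics import mean


def summarize(score):
    """Average each per-sentence list and skip non-list entries such as a hash string."""
    return {key: mean(values) for key, values in score.items() if isinstance(values, list)}


# Values copied from the rounded Example output; the "hash" entry is a placeholder.
score = {
    "f1": [1.0, 0.996],
    "precision": [1.0, 0.996],
    "recall": [1.0, 0.996],
    "hash": "placeholder",
}
summary = summarize(score)
print(summary)
```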

update(preds, target)[source]

Store predictions/references for computing BERT scores. The sentences must be stored in tokenized form so that DDP mode works.

Parameters

preds¶ (List[str]) – An iterable of predicted sentences.
target¶ (List[str]) – An iterable of reference sentences.

Return type

None

Functional Interface¶

torchmetrics.functional.text.bert.bert_score(preds, target, model_name_or_path=None, num_layers=None, all_layers=False, model=None, user_tokenizer=None, user_forward_fn=None, verbose=False, idf=False, device=None, max_length=512, batch_size=64, num_threads=4, return_hash=False, lang='en', rescale_with_baseline=False, baseline_path=None, baseline_url=None)[source]¶

BERTScore, introduced in "BERTScore: Evaluating Text Generation with BERT", leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.

It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.

This implementation follows the original implementation from BERT_score.

Parameters

preds¶ (Union[List[str], Dict[str, Tensor]]) – Either an iterable of predicted sentences or a Dict[input_ids, attention_mask].
target¶ (Union[List[str], Dict[str, Tensor]]) – Either an iterable of target sentences or a Dict[input_ids, attention_mask].
model_name_or_path¶ (Optional[str]) – A name or a model path used to load transformers pretrained model.
num_layers¶ (Optional[int]) – A layer of representation to use.
all_layers¶ (bool) – An indication of whether the representation from all model’s layers should be used. If all_layers = True, the argument num_layers is ignored.
model¶ (Optional[Module]) – A user’s own model.
user_tokenizer¶ (Optional[Any]) – A user’s own tokenizer used with the user’s own model. This must be an instance with the __call__ method. This method must take an iterable of sentences (List[str]) and must return a Python dictionary containing "input_ids" and "attention_mask" represented by torch.Tensor. It is up to the user’s model whether "input_ids" is a torch.Tensor of input ids or of embedding vectors. This tokenizer must prepend an equivalent of the [CLS] token and append an equivalent of the [SEP] token, as a transformers tokenizer does.
user_forward_fn¶ (Optional[Callable[[Module, Dict[str, Tensor]], Tensor]]) – A user’s own forward function used in combination with the user’s model. This function must take the model and a Python dictionary containing "input_ids" and "attention_mask" represented by torch.Tensor as input and return the model’s output as a single torch.Tensor.
verbose¶ (bool) – An indication of whether a progress bar should be displayed during the embeddings calculation.
idf¶ (bool) – An indication of whether normalization using inverse document frequencies should be used.
device¶ (Union[str, device, None]) – A device to be used for calculation.
max_length¶ (int) – A maximum length of input sequences. Sequences longer than max_length are to be trimmed.
batch_size¶ (int) – A batch size used for model processing.
num_threads¶ (int) – A number of threads to use for a dataloader.
return_hash¶ (bool) – An indication of whether the corresponding hash_code should be returned.
lang¶ (str) – A language of input sentences. It is used when the scores are rescaled with a baseline.
rescale_with_baseline¶ (bool) – An indication of whether bertscore should be rescaled with a pre-computed baseline. When a pretrained model from transformers is used, the corresponding baseline is downloaded from the original bert-score package from BERT_score if available. In other cases, please specify a path to the baseline csv/tsv file, which must follow the formatting of the files from BERT_score.
baseline_path¶ (Optional[str]) – A path to the user’s own local csv/tsv file with the baseline scale.
baseline_url¶ (Optional[str]) – A URL to the user’s own csv/tsv file with the baseline scale.

Return type

Dict[str, Union[List[float], str]]

Returns

Python dictionary containing the keys precision, recall and f1 with corresponding values.

Raises

ValueError – If len(preds) != len(target).
ModuleNotFoundError – If tqdm package is required and not installed.
ModuleNotFoundError – If transformers package is required and not installed.
ValueError – If num_layers is larger than the number of model layers.
ValueError – If invalid input is provided.

Example

>>> from torchmetrics.functional.text.bert import bert_score
>>> preds = ["hello there", "general kenobi"]
>>> target = ["hello there", "master kenobi"]
>>> score = bert_score(preds, target)
>>> from pprint import pprint
>>> rounded_score = {k: [round(v, 3) for v in vv] for k, vv in score.items()}
>>> pprint(rounded_score)
{'f1': [1.0, 0.996], 'precision': [1.0, 0.996], 'recall': [1.0, 0.996]}
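
Raw scores such as the ones above tend to cluster near the top of the range, which is what rescale_with_baseline is meant to address. The sketch below assumes the linear rescaling documented by the original bert-score package, (score - baseline) / (1 - baseline). The baseline value and the third raw score are made up; real baselines come from the csv/tsv files described under baseline_path and baseline_url.

```python
def rescale_with_baseline(scores, baseline):
    """Linearly map `baseline` to 0 while keeping a perfect score at 1."""
    return [(score - baseline) / (1.0 - baseline) for score in scores]


# Hypothetical raw F1 scores and a hypothetical baseline for illustration.
raw_f1 = [1.0, 0.996, 0.85]
rescaled = rescale_with_baseline(raw_f1, baseline=0.83)
print([round(value, 3) for value in rescaled])
```

A perfect match stays at 1.0, while scores close to the baseline move toward 0, which spreads the values over a more readable range without changing their ranking.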