CLIP Score¶
Module Interface¶
- class torchmetrics.multimodal.clip_score.CLIPScore(model_name_or_path='openai/clip-vit-large-patch14', **kwargs)[source]
CLIP Score is a reference-free metric that can be used to evaluate the correlation between a generated caption for an image and the actual content of the image. It has been found to be highly correlated with human judgement. The metric is defined as:

\text{CLIPScore}(I, C) = \max(100 * \cos(E_I, E_C), 0)

which corresponds to the cosine similarity between the visual CLIP embedding E_I for an image I and the textual CLIP embedding E_C for a caption C. The score is bound between 0 and 100, and the closer to 100 the better.
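For intuition, here is a minimal sketch of what the formula computes, not the library's internal implementation; the embeddings E_I and E_C below are random placeholders standing in for CLIP's visual and textual outputs:

>>> import torch
>>> # placeholder embeddings in place of real CLIP image/text features
>>> E_I = torch.randn(512)
>>> E_C = torch.randn(512)
>>> cos = torch.nn.functional.cosine_similarity(E_I, E_C, dim=0)
>>> score = torch.clamp(100 * cos, min=0)  # clipped below at 0, bounded above by 100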
Note
Metric is not scriptable
- Parameters
  - model_name_or_path (Literal['openai/clip-vit-base-patch16', 'openai/clip-vit-base-patch32', 'openai/clip-vit-large-patch14-336', 'openai/clip-vit-large-patch14']) – String indicating the version of the CLIP model to use. Available models are "openai/clip-vit-base-patch16", "openai/clip-vit-base-patch32", "openai/clip-vit-large-patch14-336" and "openai/clip-vit-large-patch14".
  - kwargs (Any) – Additional keyword arguments, see Advanced metric settings for more info.
- Raises
ModuleNotFoundError – If the transformers package is not installed or its version is lower than 4.10.0.
Example
>>> import torch
>>> _ = torch.manual_seed(42)
>>> from torchmetrics.multimodal import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> score = metric(torch.randint(255, (3, 224, 224)), "a photo of a cat")
>>> print(score.detach())
tensor(25.0936)
- update(images, text)[source]
Updates CLIP score on a batch of images and text.
- Parameters
  - images (Union[Tensor, List[Tensor]]) – Either a single [N, C, H, W] tensor or a list of [C, H, W] tensors.
  - text (Union[str, List[str]]) – Either a single caption or a list of captions.
- Raises
ValueError – If not all images have the format [C, H, W].
ValueError – If the number of images does not match the number of captions.
- Return type
  None
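A minimal usage sketch of the accumulating interface; the random image tensors and captions below are placeholders. Scores are accumulated over batches with update() and aggregated with compute():

>>> import torch
>>> from torchmetrics.multimodal import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> for _ in range(2):
...     images = torch.randint(255, (2, 3, 224, 224))  # batch of [N, C, H, W] images
...     captions = ["a photo of a cat", "a photo of a dog"]  # one caption per image
...     metric.update(images, captions)
>>> score = metric.compute()  # score aggregated over all accumulated samples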
Functional Interface¶
- torchmetrics.functional.multimodal.clip_score.clip_score(images, text, model_name_or_path='openai/clip-vit-large-patch14')[source]
CLIP Score is a reference-free metric that can be used to evaluate the correlation between a generated caption for an image and the actual content of the image. It has been found to be highly correlated with human judgement. The metric is defined as:

\text{CLIPScore}(I, C) = \max(100 * \cos(E_I, E_C), 0)

which corresponds to the cosine similarity between the visual CLIP embedding E_I for an image I and the textual CLIP embedding E_C for a caption C. The score is bound between 0 and 100, and the closer to 100 the better.
Note
Metric is not scriptable
- Parameters
  - images (Union[Tensor, List[Tensor]]) – Either a single [N, C, H, W] tensor or a list of [C, H, W] tensors.
  - text (Union[str, List[str]]) – Either a single caption or a list of captions.
  - model_name_or_path (Literal['openai/clip-vit-base-patch16', 'openai/clip-vit-base-patch32', 'openai/clip-vit-large-patch14-336', 'openai/clip-vit-large-patch14']) – String indicating the version of the CLIP model to use. Available models are "openai/clip-vit-base-patch16", "openai/clip-vit-base-patch32", "openai/clip-vit-large-patch14-336" and "openai/clip-vit-large-patch14".
- Raises
ModuleNotFoundError – If the transformers package is not installed or its version is lower than 4.10.0.
ValueError – If not all images have the format [C, H, W].
ValueError – If the number of images does not match the number of captions.
Example
>>> import torch
>>> _ = torch.manual_seed(42)
>>> from torchmetrics.functional.multimodal import clip_score
>>> score = clip_score(torch.randint(255, (3, 224, 224)), "a photo of a cat", "openai/clip-vit-base-patch16")
>>> print(score.detach())
tensor(24.4255)
- Return type
  Tensor
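A sketch of batched functional usage, where a list of [C, H, W] tensors is paired with one caption per image; the tensors and captions here are placeholders:

>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> images = [torch.randint(255, (3, 224, 224)) for _ in range(2)]  # list of [C, H, W] images
>>> captions = ["a photo of a cat", "a photo of a dog"]  # one caption per image
>>> score = clip_score(images, captions, "openai/clip-vit-base-patch16")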