CV Image Captioning, Phase 1: Loading and Splitting the COCO Dataset

- November 30, 2025

COCO is a large-scale object detection, segmentation, and image captioning dataset. It has several notable features:

Object segmentation
Recognition in context
Superpixel stuff segmentation
330K images (>200K labeled)
1.5 million object instances
80 object categories
91 stuff categories
5 captions per image
250,000 people with keypoints

COCO dataset download links

http://images.cocodataset.org/zips/train2014.zip

http://images.cocodataset.org/zips/val2014.zip

http://images.cocodataset.org/annotations/annotations_trainval2014.zip

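After downloading, extract the archives into your Jupyter Notebook project. The archive sizes are listed below (the download takes a while):

val2014.zip: 6.18 GB

train2014.zip: 12.5 GB

annotations_trainval2014.zip: 241 MB

Remember to extract them into the relative directory ./data.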
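Before loading anything, it can help to confirm that the extracted files landed where the rest of this post expects them. The sketch below is only a convenience check: the helper name `missing_coco_paths` and the `EXPECTED` list are hypothetical, and the paths are simply the ones used later in this post.

```python
from pathlib import Path

# Expected layout under the data root, taken from the paths used in this post.
EXPECTED = [
    "train2014",
    "val2014",
    "annotations/captions_train2014.json",
    "annotations/captions_val2014.json",
]

def missing_coco_paths(root="./data"):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

missing = missing_coco_paths()
print("Missing:", missing if missing else "nothing, layout looks complete")
```

If the list is not empty, the archives were probably extracted one directory level too high or too low.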

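The captions for the training and validation images are in the annotations subfolder, in captions_train2014.json and captions_val2014.json respectively, while the images themselves are stored in the train2014 and val2014 folders.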

As for the JSON format, running the code below shows that each file has four top-level keys:

info, images, licenses, and annotations

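import json

# Load the validation and training caption annotations;
# the with-blocks make sure the files are closed after reading
with open('./data/annotations/captions_val2014.json', 'r') as f:
    valcaptions = json.load(f)

with open('./data/annotations/captions_train2014.json', 'r') as f:
    trcaptions = json.load(f)

# inspect the top-level keys of the annotations
print(trcaptions.keys())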

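The images key holds one record per image, with information such as its size, URL, file name, and the unique ID used to reference that image within the dataset.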
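To make the record layout concrete, here is a minimal sketch that lists the fields of the first image record and the first annotation record. The `sample` dict is a hand-made stand-in with made-up values, and `describe` is a hypothetical helper; only the field names shown follow the COCO captions format.

```python
# Hand-made stand-in for a loaded captions JSON; all values are illustrative.
sample = {
    "images": [
        {"id": 1, "file_name": "example_0001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "A made-up caption."},
    ],
}

def describe(captions_json):
    """Return the field names of the first image and annotation records."""
    image_fields = sorted(captions_json["images"][0].keys())
    annotation_fields = sorted(captions_json["annotations"][0].keys())
    return image_fields, annotation_fields

image_fields, annotation_fields = describe(sample)
print("image fields     :", image_fields)
print("annotation fields:", annotation_fields)
```

In the notebook, calling `describe(trcaptions)` on the real data should list a few more image fields than this toy sample carries. The link between the two lists is `image_id` in each annotation, which points at `id` in an image record.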

The initial goal is simply to produce a flat file with two columns: one holds the image file path and the other holds a caption for that image.

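The split follows the paper Deep Visual-Semantic Alignments for Generating Image Descriptions

https://arxiv.org/abs/1412.2306

This image captioning paper proposes a model that generates natural language descriptions of images and their regions.

Its approach uses datasets of images and their sentence descriptions to learn the cross-modal correspondence between language and visual data.

In the paper's experiments, 5,000 images are held out as the validation set, and the suggestion is to train on all the remaining training and validation images together. Because each COCO image comes with about five captions, the hold-out set cannot be carved out at the caption level: captions of the same image would end up on both sides of the split. The split therefore has to be made per image.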
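The idea of splitting per image rather than per caption can be sketched on toy data as follows. The function name `split_by_image` and the toy records are hypothetical and exist only for illustration; the real processing uses the COCO JSON loaded earlier.

```python
def split_by_image(image_ids, annotations, holdout=5000):
    """Hold out the last `holdout` image IDs; route each caption by its image."""
    # image_ids[-0:] would select everything, so guard the zero case explicitly
    held = set(image_ids[-holdout:]) if holdout > 0 else set()
    train, val = [], []
    for ann in annotations:
        (val if ann["image_id"] in held else train).append(ann)
    return train, val

# Toy example: 3 images with 2 captions each, holding out 1 image.
ids = [1, 2, 3]
anns = [{"image_id": i, "caption": f"caption {k} of image {i}"}
        for i in ids for k in range(2)]
train, val = split_by_image(ids, anns, holdout=1)
print(len(train), len(val))
```

Because every caption is routed by the image it belongs to, both captions of image 3 land in the held-out list and none of them leak into training.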

The second half of the processing code follows:

prefix = "./data/"
val_prefix = prefix + 'val2014/'
train_prefix = prefix + 'train2014/'

# training images: captions_train2014.json was loaded above with json.load,
# and its 'images' entry is a list. Build a dict from it keyed by image ID:
# for each record x in images,
#   key:   x['id']         (the numeric image ID)
#   value: x['file_name']  (the file name of that image)
trimages = {x['id']: x['file_name'] for x in trcaptions['images']}

# validation images
# hold out the last 5,000 validation images, in the spirit of the Karpathy split
# valcaptions['images'] is the images list from captions_val2014.json and
# covers every val2014 image.
# The last 5,000 images are reserved for validation/testing; all earlier ones go to training.
valset = len(valcaptions['images']) - 5000 # leave last 5k 
valimages = {x['id']: x['file_name'] for x in valcaptions['images'][:valset]} # index 0 to valset - 1
truevalimg = {x['id']: x['file_name'] for x in valcaptions['images'][valset:]} # index valset to the end

# Flatten to (caption, image_path) structure
data = list()        # training data: each entry is (caption, image_path)
errors = list()      # problematic annotations, e.g. an image_id with no matching image
validation = list()  # data held out for validation/testing

for item in trcaptions['annotations']:
    if int(item['image_id']) in trimages:
        fpath = train_prefix + trimages[int(item['image_id'])]
        caption = item['caption']
        data.append((caption, fpath))
    else:
        errors.append(item)

for item in valcaptions['annotations']:
    caption = item['caption']
    if int(item['image_id']) in valimages:
        fpath = val_prefix + valimages[int(item['image_id'])]
        data.append((caption, fpath))
    elif int(item['image_id']) in truevalimg: # reserved
        fpath = val_prefix + truevalimg[int(item['image_id'])]
        validation.append((caption, fpath))
    else:
        errors.append(item)
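Since the whole point of the per-image split is that no image ends up on both sides, a quick overlap check is cheap insurance. The sketch below uses a hypothetical helper `shared_paths` and toy pairs; it assumes the same (caption, image_path) tuple layout as `data` and `validation`.

```python
def shared_paths(train_pairs, val_pairs):
    """Return image paths that appear in both lists of (caption, path) pairs."""
    train_paths = {path for _, path in train_pairs}
    val_paths = {path for _, path in val_pairs}
    return train_paths & val_paths

# Toy pairs for illustration; in the notebook, pass `data` and `validation`.
toy_train = [("a cat", "img/1.jpg"), ("a dog", "img/2.jpg")]
toy_val = [("a bird", "img/3.jpg")]
print(shared_paths(toy_train, toy_val))
```

An empty set means the two lists share no image. Running `shared_paths(data, validation)` on the real lists should give the same result if the split worked as intended.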

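Next, shuffle the training data randomly so it is ready for training.

We also write the results out here, saving the training data and the held-out validation/test data to two separate CSV files.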

import random
import os
import time
import datetime
import csv
random.seed(42)  # for reproducibility
# lets shuffle the list in place
print("Before Shuffling: ", data[:5])

random.shuffle(data)
print("Post-shuffling: ", data[:5])
# persist for future use
# newline='' avoids blank rows on Windows; utf-8 keeps non-ASCII captions intact
with open(prefix + 'data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, quoting=csv.QUOTE_ALL)
    writer.writerows(data)
# persist for future use
with open(prefix + 'validation.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, quoting=csv.QUOTE_ALL)
    writer.writerows(validation)
print("TRAINING: Total Number of Captions: {},  Total Number of Images: {}".format(
    len(data), len(trimages) + len(valimages)))
print("VALIDATION/TESTING: Total Number of Captions: {},  Total Number of Images: {}".format(
    len(validation), len(truevalimg)))
print("Errors: ", errors)

TRAINING: Total Number of Captions: 591751,  Total Number of Images: 118287
VALIDATION/TESTING: Total Number of Captions: 25016,  Total Number of Images: 5000
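The CSV files can be read back with the same csv module in a later session. The sketch below is a round trip on a temporary file with one made-up row, so it runs without the dataset; `load_pairs` is a hypothetical helper name, and it assumes the two-column (caption, image_path) layout written above.

```python
import csv
import os
import tempfile

def load_pairs(csv_path):
    """Read (caption, image_path) rows back from a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [tuple(row) for row in csv.reader(f)]

# Round-trip demo on a temporary file with a made-up row.
rows = [('A man said "hi", then left.', "./data/val2014/example.jpg")]
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(tmp_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

loaded = load_pairs(tmp_path)
print(loaded == rows)
```

Captions often contain commas and quotation marks, which is why quoting every field and letting csv.reader undo it is safer than splitting lines by hand. In the notebook, `load_pairs(prefix + 'data.csv')` would return the shuffled training pairs.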

Next, we can display a few random images, each with one randomly picked caption out of its five.

# Show an image with its caption for a visual sanity check
from PIL import Image
import matplotlib.pyplot as plt
def show_random_samples(pairs, n=3):
    """
    pairs: list[(caption, image_path)]
    n: number of samples to show
    """
    for i in range(n):
        caption, img_path = random.choice(pairs)  # pick one pair at random

        # print the text information
        print(f"Sample {i+1}")
        print(f"Image path: {img_path}")
        print(f"Caption   : {caption}")
        print("-" * 80)

        # open and display the image
        img = Image.open(img_path)
        plt.imshow(img)
        plt.axis("off")
        plt.show()

# Look at a few samples from the training data
show_random_samples(data, n=3)

Here is one more helper that randomly samples 3 images and shows all of the captions for each one.

from collections import defaultdict

def build_imgid_to_file(captions_json):
    """Build image_id -> file name from the JSON"""
    return {img["id"]: img["file_name"] for img in captions_json["images"]}

def build_imgid_to_captions(captions_json):
    """Build image_id -> [caption1, caption2, ...] from the JSON"""
    mapping = defaultdict(list)
    for ann in captions_json["annotations"]:
        mapping[ann["image_id"]].append(ann["caption"])
    return mapping

# Build the mappings (run once, then reuse)
tr_imgid2file = build_imgid_to_file(trcaptions)
tr_imgid2caps = build_imgid_to_captions(trcaptions)

val_imgid2file = build_imgid_to_file(valcaptions)
val_imgid2caps = build_imgid_to_captions(valcaptions)

def show_image_with_all_captions(imgid2file, imgid2caps, prefix, n=1):
    """
    Pick n random images; show each image once and print all of its captions
    """
    img_ids = list(imgid2file.keys())

    for _ in range(n):
        img_id = random.choice(img_ids)
        file_name = imgid2file[img_id]
        fpath = prefix + file_name
        caps = imgid2caps[img_id]

        print("=" * 80)
        print(f"Image ID : {img_id}")
        print(f"Image path: {fpath}")
        print("Captions:")
        for i, c in enumerate(caps, 1):
            print(f"  {i}. {c}")

        # display the image
        img = Image.open(fpath)
        plt.imshow(img)
        plt.axis("off")
        plt.show()

# Look at a few training images with all of their captions
show_image_with_all_captions(tr_imgid2file, tr_imgid2caps, train_prefix, n=3)

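Browsing the output, you quickly find that the data collected back then by Microsoft and the other participating research teams really does contain all sorts of oddities....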
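One more thing the image_id -> captions mapping makes easy to check: the held-out set reported 25016 captions for 5000 images, slightly more than five per image, so presumably a few images carry extra captions. The sketch below counts how many images have each caption count; `caption_count_histogram` is a hypothetical helper and the toy mapping is made up.

```python
from collections import Counter

def caption_count_histogram(imgid2caps):
    """Map 'number of captions' -> 'how many images have that many'."""
    return Counter(len(caps) for caps in imgid2caps.values())

# Toy mapping for illustration; in the notebook, pass `val_imgid2caps`.
toy = {1: ["a", "b", "c", "d", "e"], 2: ["a", "b", "c", "d", "e", "f"]}
hist = caption_count_histogram(toy)
print(sorted(hist.items()))
```

On the real mappings, the histogram shows directly whether every image has exactly five captions or whether some have more.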
