National Diet Library Image Data Layout Recognition

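Background
Digitizing a library's holdings matters because it makes materials easier to search and to provide. In many cases, however, digitization stops at producing image data that carries no text information, so the information written in the body of a material cannot be searched as is. Recent advances in OCR technology make full-text search possible by creating text data from image data, and technical studies are under way. For example, the National Diet Library has released the "Next Digital Library" (次世代デジタルライブラリー, https://lab.ndl.go.jp/dl/), an experimental search service that applies machine learning technology and includes a full-text search function based on OCR. However, for images of materials published in or before the early Showa period, text conversion is far less accurate than OCR on newly published books, because of the condition of the materials when they were imaged and because their layouts differ from those of modern publications. To address this problem and improve OCR accuracy, participants are asked to build an algorithm that, as a step before OCR, performs layout recognition on image data and identifies text regions and other regions that are useful for research on material image data.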

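Task
For each material image, assign a rectangular region containing a layout label to be predicted as a bounding box = (x1, y1, x2, y2), and attach one of the labels to it. Each image is assigned one or more bounding boxes. A bounding box takes the top-left corner of the image as the origin (0,0) and is expressed by four values: the top-left coordinates (x1, y1) and the bottom-right coordinates (x2, y2) of the object region. Note that the layout labels to be predicted differ between "classical materials" (古典籍資料) and "materials published in the Meiji period or later" (明治期以降刊行資料).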
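As a minimal sketch of the bounding box convention described above, the following Python snippet represents one labeled box and checks that its corners are ordered consistently with a top-left origin. The `LabeledBox` class and its field names are hypothetical illustrations and are not part of the provided data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledBox:
    """One region: a layout label plus (x1, y1, x2, y2).

    Hypothetical helper for illustration; the competition does not define it.
    """
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    def __post_init__(self):
        # The origin (0, 0) is the top-left of the image, so the top-left
        # corner must have smaller coordinates than the bottom-right corner.
        if not (0 <= self.x1 < self.x2 and 0 <= self.y1 < self.y2):
            raise ValueError(f"invalid box: {self}")

    @property
    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)


# A tall, narrow box such as a vertical text line.
box = LabeledBox("8_textline", x1=120, y1=40, x2=160, y2=900)
print(box.area)  # 40 * 860 = 34400
```

Rejecting boxes with swapped corners early is a design choice of this sketch; it is not a rule stated by the competition.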

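Dataset

Breakdown

	Training dataset	Evaluation dataset
Classical materials	1219 images	211 images
Materials published in the Meiji period or later	1175 images	252 images
Total	2394 images	463 images

※This dataset is based on the layout dataset created and published by the National Diet Library (https://github.com/ndl-lab/layout-dataset), with a dataset of "National Diet Library publications" and data for model performance evaluation added.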

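Contents
・Material images (JPEG images)
・Annotation data (material type, whether the data may be published, meta information (author name, year of publication, etc.), and the layout labels and rectangular tag regions that appear) (JSON format)
　※The evaluation dataset does not include the layout labels and rectangular tag regions that appear.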

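Classical materials
Characteristics
Publications issued before the Meiji period, including ukiyo-e and Japanese and Chinese classical books. Some materials carry heavy noise, for example microform materials that were re-digitized. Multiple layouts often overlap, as when text is written inside an ukiyo-e print.

Layout labels included

Label name	Description
1_overall	Entire extent of the material
2_handwritten	Text lines in kuzushiji (cursive script)
3_typography	Text lines other than kuzushiji
4_illustration	Illustrations (including photographs)
5_stamp	Seal impressions (ownership stamps, etc.)

Image samples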

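Materials published in the Meiji period or later
Characteristics
Publications issued in or after the Meiji period that take the form of bound volumes. Many materials carry heavy noise, such as materials digitized from microform. Most were published before the early Showa period, but some postwar publications are included.

Layout labels included

Label name	Description
1_overall	Entire extent of the material
4_illustration	Illustrations (including photographs)
5_stamp	Seal impressions (ownership stamps, etc.)
6_headline	Headlines
7_caption	Figure and table captions
8_textline	Text lines other than 6_headline and 7_caption
9_table	Tables

※The leading number in each label name is a serial number shared across both material types.

Image samples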

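Layout labels subject to evaluation (prediction targets)

Label name	Description
1_overall	Entire extent of the material
2_handwritten	Text lines in kuzushiji (cursive script)
3_typography	Text lines other than kuzushiji
4_illustration	Illustrations (including photographs)
5_stamp	Seal impressions (ownership stamps, etc.)
6_headline	Headlines
7_caption	Figure and table captions
8_textline	Text lines other than 6_headline and 7_caption

※The label "9_table" (tables) is included in the training dataset but is excluded from evaluation (prediction).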
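The label sets above can be written down directly, which makes it easy to drop predictions that are not evaluated. The sketch below uses the label names from the tables; the `drop_unevaluated` function and the `(label, box)` tuple format are assumptions made for illustration, not the submission format.

```python
# Label names are taken from the tables above.
CLASSICAL_LABELS = {
    "1_overall", "2_handwritten", "3_typography", "4_illustration", "5_stamp",
}
MEIJI_OR_LATER_LABELS = {
    "1_overall", "4_illustration", "5_stamp",
    "6_headline", "7_caption", "8_textline", "9_table",
}
# 9_table appears in the training data but is not evaluated.
EVALUATED_LABELS = (CLASSICAL_LABELS | MEIJI_OR_LATER_LABELS) - {"9_table"}


def drop_unevaluated(predictions):
    """Keep only predictions whose label is subject to evaluation.

    `predictions` is assumed to be a list of (label, (x1, y1, x2, y2)) tuples;
    this format is hypothetical.
    """
    return [(label, box) for label, box in predictions if label in EVALUATED_LABELS]


predictions = [("9_table", (0, 0, 50, 50)), ("6_headline", (10, 10, 40, 20))]
print(drop_unevaluated(predictions))
```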

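Evaluation function
・The evaluation function is "mean IoU".
・The score takes a value from 0 to 1; the higher the accuracy, the larger the value.

Details of the evaluation function
　①Compute the overlap (IoU) between the predicted regions and the ground-truth regions (rectangles) for all labels
　②For each image, compute the mean IoU for each label
　③For each image, compute the per-image mean IoU (the mean of ②)
　④Compute the mean IoU over all evaluated images (the mean of ③)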
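The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the description does not say how predicted and ground-truth rectangles are paired, so the sketch pairs each ground-truth box with the best-overlapping predicted box of the same label, and it averages only over labels present in the ground truth. The official scorer may differ on both points. All function names and the dictionary layout are hypothetical.

```python
from statistics import mean


def iou(a, b):
    """Step 1: IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_score(pred_boxes, true_boxes):
    """Step 2: mean IoU for one label in one image.

    ASSUMPTION: each ground-truth box is scored against its best-overlapping
    prediction; a ground-truth box with no prediction scores 0.
    """
    return mean(max((iou(p, t) for p in pred_boxes), default=0.0) for t in true_boxes)


def image_score(pred, true):
    """Step 3: mean of the per-label scores for one image.

    `pred` and `true` map label name -> non-empty list of boxes.
    """
    return mean(label_score(pred.get(label, []), boxes) for label, boxes in true.items())


def mean_iou(preds, trues):
    """Step 4: mean of the per-image scores over all evaluated images."""
    return mean(image_score(preds.get(name, {}), true) for name, true in trues.items())


trues = {"a.jpg": {"1_overall": [(0, 0, 10, 10)], "5_stamp": [(0, 0, 2, 2)]}}
preds = {"a.jpg": {"1_overall": [(0, 0, 10, 5)], "5_stamp": [(0, 0, 2, 2)]}}
score = mean_iou(preds, trues)
print(score)  # (0.5 + 1.0) / 2 = 0.75
```

Under this pairing rule, extra predicted boxes are not penalized; a scorer that also penalizes false positives would produce lower values for the same predictions.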

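Determining the winners
1. Evaluation up to the final day of the contest (provisional evaluation) uses part of the evaluation dataset, and evaluation after the contest ends (final evaluation) uses the remaining part of the evaluation dataset.
　The leaderboard switches automatically to the final evaluation when the contest ends, and the final ranking is determined from it. As a result, rankings may change substantially between the contest period and after it ends.
2. If scores are tied, the participant who submitted at the earlier date and time is ranked higher.
3. Participants with high final rankings become winner candidates and will be contacted by the secretariat.
4. Winner candidates are asked to submit the following, on condition that they guarantee that no financial burden falls on the secretariat and that no rights are infringed.
・Source code of the prediction model
・Trained model
・Procedure document for reproducing the prediction results (clearly identifying the preprocessing, training, and prediction parts)
・Execution environment (OS version, software used, and analysis methods)
・Random seed (for methods that use random numbers, such as Random Forest)
・Contribution of each explanatory variable to the prediction model (when the method used allows contributions to be calculated)
・Interpretation of the data, points of ingenuity, insights gained from the modeling, etc.
5. In the reproducibility verification, a winner candidate loses eligibility for the prize if the candidate or the submitted model falls under any of the following.
・Failing to respond within the specified deadline to procedural communications or requests from the secretariat
・Not satisfying the participation conditions or rules
・The program does not run
・The score output by the trained model does not match the final evaluation score
・Any other case that the secretariat judges to be improper
6. Winners are selected from those whose reproducibility has been confirmed. (The programs in the final submissions are planned to be released as open source.)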

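Overall ranking
・This competition counts toward the overall ranking (score and medals).

Mindset
・Approach the task with practicality in mind, in line with the underlying goals such as achieving a company's objectives, solving social problems, and sharing research results. In particular, we expect implementations that take operability and extensibility into account. That is, we expect processing flows that assume training and inference based on mechanical, automatic processing and that do not depend on human judgment.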

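Use of the system
・Up to 5 submissions per day are allowed.
・Each participant needs one account. Using multiple accounts as one person, or sharing one account among multiple people, is prohibited.
・If you wish to participate as a team, please read the linked guidance and then create a team (team creation deadline: 1/31).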

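Handling of information
・Sharing or publishing data or source code related to predictions for this competition with other participants (other than members of your own team) is prohibited.
・However, sharing is allowed when the content is made public to all participants on this competition's forum.
・Winners' source code may be released under an open source license such as the MIT License.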

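Use of data
・Data other than the provided data, as well as trained models, APIs, and so on, may be used, provided that they are open, can be obtained by anyone free of charge, and do not infringe the rights of third parties.
・For example, modifying the training data (including manual labeling and rewriting of labels) is also allowed.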

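Implementation
・Tools used for model training are limited to those that are open and free of charge (Python, R, etc.).
・You must guarantee that the proposed method can be reproduced and used continuously in a typical environment without additional cost. It must be able to make predictions with the same logic when different data in the same format is given as input.
・Divide the source code into three parts (preprocessing, training, and prediction) as shown below, and implement it so that running each part in turn advances the processing. (This does not apply where there are unavoidable circumstances.)

①Preprocessing
　　A module that reads the provided data, applies preprocessing, and writes files in a state that can be fed to the model. Define functions that preprocess the training data and the evaluation data separately, such as get_train_data and get_test_data.
　　※The information passed to preprocess may mix training data and evaluation data, but keep the processing independent so that get_train_data returns the preprocessed training data and get_test_data returns the preprocessed evaluation data.
②Learning
　　A module that reads the files created in ① and trains the model. Also define functions that output the trained model, the features, and the cross-validation results.
③Predicting
　　A module that reads the test data created in ① and the model created in ②, and writes the prediction results to a file.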
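One way to lay out the three modules is sketched below. Only the names get_train_data and get_test_data come from the text above; the record format (`image`, `boxes`), the `train` and `predict` functions, the file names, and the stand-in "model" (the mean 1_overall box) are assumptions chosen to keep the example short and runnable. Feature and cross-validation outputs are omitted.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


# (1) Preprocessing: separate functions for training and evaluation data.
def get_train_data(records, out_path):
    """Write preprocessed training data (records that carry boxes)."""
    data = [r for r in records if r.get("boxes")]
    Path(out_path).write_text(json.dumps(data))


def get_test_data(records, out_path):
    """Write preprocessed evaluation data (records without boxes)."""
    data = [{"image": r["image"]} for r in records if not r.get("boxes")]
    Path(out_path).write_text(json.dumps(data))


# (2) Learning: read the training file and write a model file.
def train(train_path, model_path):
    """Stand-in model: the mean 1_overall box over the training data."""
    data = json.loads(Path(train_path).read_text())
    boxes = [b[1:] for r in data for b in r["boxes"] if b[0] == "1_overall"]
    model = {"1_overall": [mean(column) for column in zip(*boxes)]}
    Path(model_path).write_text(json.dumps(model))


# (3) Predicting: read the test file and the model, write predictions.
def predict(test_path, model_path, out_path):
    test = json.loads(Path(test_path).read_text())
    model = json.loads(Path(model_path).read_text())
    Path(out_path).write_text(json.dumps({r["image"]: model for r in test}))


# Hypothetical records; training and evaluation data may arrive mixed.
records = [
    {"image": "a.jpg", "boxes": [["1_overall", 0, 0, 100, 200]]},
    {"image": "b.jpg", "boxes": [["1_overall", 20, 10, 120, 220]]},
    {"image": "c.jpg"},
]
with tempfile.TemporaryDirectory() as d:
    get_train_data(records, Path(d, "train.json"))
    get_test_data(records, Path(d, "test.json"))
    train(Path(d, "train.json"), Path(d, "model.json"))
    predict(Path(d, "test.json"), Path(d, "model.json"), Path(d, "pred.json"))
    prediction = json.loads(Path(d, "pred.json").read_text())
print(prediction)
```

Because each stage communicates only through files, the stages can be rerun independently, which is what the reproducibility check relies on.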

Disclosure policy

As a general rule, in accordance with Article 4, Paragraph 1 of the terms of participation, disclosing any contents such as insights and deliverables obtained through the information or data provided by our company in relation to this competition is not permitted. However, after the completion of this competition and for non-commercial purposes only, such contents may be disclosed within the scope of the table below.

Model *1	Public
Analysis results *2	Public

Public : Posting to social media sites, blogs, and source repositories, and citing in papers
Restricted : Use in limited settings such as research, education, and seminars, where the content is not accessible to the general public

*1 Execution unit source code and learned models
*2 The insights obtained using the information and data provided, or the solutions including scripts and processed data such as summary statistics

※Notes

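Even during the competition period, content may be published if it is posted on this competition's forum. However, because the rights holder may release winners' final deliverables and related materials under an open source license such as the MIT License, publication by the winners themselves is not permitted as a general rule. In addition, when posting material images, whether during or after the competition period, only images whose annotation item "データ公開" (data publication) is "可" (permitted) may be published; publishing images marked "不可" (not permitted) is prohibited.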

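Q: For text lines, what is the criterion for splitting or not splitting within a single sentence?

A: It differs by layout label. The criteria are as follows.
- textline, handwritten, typography : split even within a single sentence if the parts are separated by roughly more than two characters at that line's font size
- headline, caption : do not split a single sentence even if its parts are separated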
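The splitting rule for textline, handwritten, and typography can be illustrated with a short sketch. It assumes that character extents along the reading direction and a font size in pixels are already known; the `split_line` function and its inputs are hypothetical, and the annotators applied the two-character threshold only approximately.

```python
def split_line(char_spans, font_size, max_gap_chars=2):
    """Group character spans into text lines using the gap rule.

    char_spans: (start, end) extents along the reading direction, sorted by
    start. A gap wider than `max_gap_chars` characters starts a new line.
    """
    if not char_spans:
        return []
    lines, current = [], [char_spans[0]]
    for prev, span in zip(char_spans, char_spans[1:]):
        if span[0] - prev[1] > max_gap_chars * font_size:
            lines.append(current)
            current = []
        current.append(span)
    lines.append(current)
    return lines


# Font size 10: the gap of 2 keeps the first two characters together,
# while the gap of 28 exceeds 2 * 10 and starts a new line.
print(split_line([(0, 10), (12, 22), (50, 60)], font_size=10))
```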

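Q: How are formulas that contain fractions handled?

A: Regardless of whether a formula contains a fraction, "a one-line formula is one text line".

Q: The training dataset includes 9_table, but is it not evaluated on the evaluation data?

A: It is not evaluated. It is kept in the training dataset because learning tables may help improve prediction results (for example, by reducing false detections of illustration).

Q: Why are text line rectangles cut off at "……" or "―" in the text?

A: Because these are used as delimiter symbols in tables of contents and similar places. They are uniformly treated as delimiter symbols.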

In order to participate in the Competition, you are required to agree to these Terms, in addition to the Terms of Use of SIGNATE.JP Site (hereinafter referred to as the “Terms of Use”). You should participate in the Competition after reading carefully and agreeing to these Terms. If you agree, these Terms, the matters that are added to these Terms as "additional matters", the Terms of Use and other terms and conditions that you have agreed to shall be binding on the relevant parties as integral documents.

Article 1. Definitions

１.For the purpose of these Terms, the following terms shall be defined as follows:

(１)"Site" means the website "SIGNATE (https://signate.jp)" on which the Competitions are posted.
(２)"Competition" means any competition on AI development or data analysis on the Site as held by the Host.
(３)"Host" is the host(s) of the Competitions. The Host may be SIGNATE, Inc. (hereinafter referred to as the “Company”) or the Company’s client companies, affiliated companies, schools or organizations, etc. (hereinafter referred to as the “Client(s)”).
(４)"Participant(s)" means the member(s) who participate in a Competition.
(５)"Submissions" means, collectively, the analysis and prediction results and reports, etc. as submitted in the Competition.
(６)"Final Submissions" means the Submissions that are specified by a Participant on the prescribed page in the Site by the time of completion of a Competition.
(７)"Winner Candidate" means the Participant who has received a notice from the Company that he/she is nominated as a winner candidate.
(８)"Submissions for Final Judgment" means the analysis and prediction model and learning data, etc. as submitted by a Winner Candidate pursuant to the instructions of the Company.
(９)"Final Judgment" means the acceptance inspection and judgment, including reproducibility verification, by the Company for the Final Submissions and Submissions for Final Judgment of a Winner Candidate.
(１０)"Winner" means the Winner Candidate who is informed by the Company that he/she has won a prize.
２.Unless otherwise defined in these Terms, the terms used in these Terms that are defined in the Terms of Use shall have the same meaning as defined in the Terms of Use.

Article 2. Competition

１.A member who desires to participate in a Competition shall be required to agree to these Terms and to satisfy the conditions for participation as specified in each such Competition. Any person who is not a member shall not participate in any Competition.
２.Participants shall participate in each Competition in the manner as advised by the Company and shall be obligated to comply with the rules as prescribed in each Competition.
３.Participants may submit the Submissions for the assignment of each Competition during the period of such Competition and submit a proposal on the method of solving the problem to the Host by the end of the period of the said Competition.
４.Participants may submit the Final Submissions in the form specified in each Competition by the time specified by the said Competition.
５.The Final Submissions as submitted shall be evaluated by the evaluation method as specified in each Competition and the final rank order shall be determined based on such evaluation.
６.Any Participant may, as a general rule, check the evaluation results of the Participant him/herself and each of the other Participants on the Site for the Submissions that may be evaluated quantitatively.
７.Participants shall be liable or otherwise responsible for their own Submissions, including their legality.
８.Participants shall not submit any Submissions that have no direct relationship to each Competition.
９.Unless otherwise provided for, Participants shall not directly communicate to, consult with, make a request to, solicit or take any other actions with the Host in respect of the matters related to a Competition during the period of the said Competition.
１０.Any Participant who has uncertainty or questions about any Competition shall make sure to contact the Company or its designee through the procedures prescribed by the Company as posted on the Site.
１１.The Company shall not be obligated to pay any remuneration or other consideration other than those prescribed in the following Article for any act of the Participants as prescribed in paragraphs hereof.

Article 3. Reward and Vesting of Rights

１.Unless otherwise provided for, any Participant shall satisfy the following requirements in order to be entitled to receive a reward in any Competition that offers a reward:

(１)To be a Winner;
(２)To agree to transfer to the Host and the relevant transferee of rights in such Competition all transferable rights, such as copyrights, rights to obtain patents and know-how, etc. in and to all analysis and prediction results, reports, analysis and prediction model, algorithm, source code and documentations for the model reproducibility, etc., and the Submissions contained in the Final Submissions and Submissions for Final Judgment (including the rights as prescribed in Article 27 and Article 28 of the Copyright Act and the rights to obtain patents; hereinafter referred to as the "Rights");
(３)To agree that any relevant transferee of rights exclusively has the right to use the know-how contained in the Final Submissions and Submissions for Final Judgment for its own business and other purpose without any restriction;
(４)To agree not to exercise moral rights to the Rights against the relevant transferee of rights;
(５)To enter into an agreement for the transfer of the Rights with the relevant eligible transferee of rights, including the agreement to the matter in the preceding three (3) items and other reasonable provisions;
(６)To have the personal identity of such Participant verified by the Company; and
(７)Not to breach any provision of these Terms and the Terms of Use.

２.Any Winner Candidate shall, after having received a notice from the Company that he/she is nominated as a winner candidate, submit the Submissions for Final Judgment on or before the designated date and communicate the matters requiring confirmation or response in relation to the Final Submissions and the Submissions for Final Judgment to the Company on or before the designated date, in accordance with the instructions of the Company. The Company shall carry out the final judgment based on such matters requiring confirmation or response. If the Company receives no confirmation or response satisfactory to the Company on or before the designated date, the Company may exclude such Winner Candidate from the subject of the final judgment.
３.If the Company considers that the Final Submissions or Submissions for Final Judgment need to be amended or modified, or there occur any additional matters requiring confirmation, in the course of the final judgment, any Winner Candidate shall take action or make response in relation to the matters that require amendment, etc. or the detailed information on the matters requiring confirmation, on or before the designated date in accordance with the instructions of the Company. If the Company receives no action or response satisfactory to the Company on or before the designated date, the Company may exclude such Winner Candidate from the final judgment.
４.The Company shall determine the Winner through the final judgment and inform the Winner to that effect.

Article 4. Confidentiality

１.Participants shall treat any information, data, or contents such as insights and deliverables transmitted through the service that they receive from the Company in relation to each Competition (hereinafter referred to as the "Company-Provided Information") as confidential information and shall not disclose the same to any third party or use the same for any purpose other than for such Competition and the purposes specified separately by the Company; provided, however, that the confidential information shall not include any information that falls under any of the following items:

(１)Information that is known to the public at the time of the disclosure;
(２)Information that is already possessed by the Participant at the time of the disclosure (only in the case where such Participant may demonstrate such fact by reasonable means);
(３)Information that becomes known to the public without the fault of the Participant after the disclosure;
(４)Information that is independently developed by the Participant without reference to any information as disclosed (except for those Submissions of the person eligible for a prize which are evaluated); or
(５)Information that is rightfully disclosed by any third party having a right to do so without the obligations of confidentiality (only in the case where such Participant may demonstrate such fact by reasonable means).

２.Any Participant shall delete or return to the Company the Company-Provided Information immediately after the completion of each Competition.
３.Any Winner shall handle his/her Final Submissions and Submissions for Final Judgment in the same manner as prescribed in paragraph 1 hereof.
４.If there is any separate arrangement in relation to the confidential information in each Competition, the provisions of such arrangement shall prevail over the provisions of these Terms.
５.If any dispute occurs between the Host or other third party and the Company due to the breach by any Participant of the provisions of this Article and such other party makes any claim against the Company, such Participant shall compensate for any damage, loss, expenses (including, but not limited to, attorneys’ fees), lost profits and lost revenues, etc. incurred by the Company.
６.The provisions of this Article shall survive the termination of the relevant Competition or the Participant’s completion of the procedures for withdrawal from the service of the Company, with respect to the Company-Provided Information and the Winner’s Final Submissions and Submissions for Final Judgment for a period of five (5) years thereafter.

Article 5. Prohibited Acts of Participants

１．The Company shall prohibit Participants from engaging in any of the following acts in any Competition:

(１)An act of cracking, cheating, spoofing or other misconduct;
(２)An act of directly communicating to, consulting with, making a request to, soliciting or responding to solicitation or other activities to other Participants or the Host (other than the Company) without the involvement of the Company;
(３)Any profitmaking activities using the Competition (including solicitation or scouting activities, and use for a third party in educational business, etc.) without the prior approval of the Company in writing or any other manner specified by the Company;
(４)Transfer, offering as collateral or other disposition of the status as a Participant or the rights or obligations as a Participant (except with the prior written consent of the Company); and
(５)Any other act in breach of the Terms of Use.

２.If the Company deems that a Participant engages in any of the prohibited acts as prescribed in the preceding paragraph, the Company may, without prior notice to the Participant, disqualify the Participant from the Competition in which the Participant participates, temporarily suspend the Participant from using the service of the Company, withdraw the Participant’s membership, claim damages from the Participant or take any other measures deemed necessary by the Company.

Article 6. Change, Discontinuation or Termination of Provision of Services under These Terms

１.The Company may change or temporarily suspend the services provided by the Company under these Terms without prior notice to the members.
２.Upon one (1) month prior notice to the members, the Company may suspend for a long period of time or terminate the services provided by the Company under these Terms.
３.The Company shall not be liable for any results or damage arising from the measures taken by the Company under this Article.

Article 7. Modification of Terms

１.The Company may modify, add or delete any provisions of these Terms from time to time without the approval of the members.

Enforced on April 1, 2018
Last updated on January 18, 2019
