ISO 28500:2009 Information and documentation - WARC file format

Standard number: ISO 28500:2009

Chinese title: 信息和文献 WARC文件格式

English title: Information and documentation — WARC file format

Publication date: 2009-05

Scope

ISO 28500:2009 specifies the WARC file format:

  • to store both the payload content and control information from mainstream Internet application layer protocols, such as the Hypertext Transfer Protocol (HTTP), Domain Name System (DNS), and File Transfer Protocol (FTP);
  • to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding);
  • to support data compression and maintain data record integrity;
  • to store all control information from the harvesting protocol (e.g. request headers), not just response information;
  • to store the results of data transformations linked to other stored data;
  • to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources);
  • to be extended without disruption to existing functionality;
  • to support handling of overly long records by truncation or segmentation, where desired.
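
To make the scope above more concrete, here is a minimal Python sketch of what a WARC record can look like on disk. It is illustrative only and assumes the WARC 1.0 record layout (a `WARC/1.0` version line, named header fields, a blank line, the content block, then two CRLFs) together with per-record gzip compression. The helper names `build_record` and `parse_record` are hypothetical and the parser handles only what this sketch writes, so it is not a conforming implementation of the standard.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_record(warc_type, target_uri, content_type, block):
    """Serialize one record: version line, named fields, blank line, block, two CRLFs.

    Hypothetical helper for illustration; the field set is a small subset.
    """
    fields = [
        ("WARC-Type", warc_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    header = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    return header.encode("utf-8") + block + b"\r\n\r\n"


def parse_record(raw):
    """Split one uncompressed record into (header fields, content block)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if lines[0] != "WARC/1.0":
        raise ValueError("not a WARC/1.0 record")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length delimits the block, so payload bytes are never scanned.
    return fields, rest[: int(fields["Content-Length"])]


# Store the request as well as the response, as the scope list requires.
request = build_record(
    "request", "http://example.com/", "application/http; msgtype=request",
    b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n",
)
response = build_record(
    "response", "http://example.com/", "application/http; msgtype=response",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello",
)

# Compress each record as its own gzip member and concatenate the members.
archive = gzip.compress(request) + gzip.compress(response)

fields, block = parse_record(gzip.decompress(gzip.compress(response)))
print(fields["WARC-Type"], len(block))
```

Compressing each record separately keeps the records independently readable, at some cost in compression ratio. Other record types defined by the format, such as `metadata`, `conversion`, `revisit` and `continuation`, correspond to the metadata, transformation, duplicate detection and segmentation items in the list above.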
