ISO 28500:2009 Information and documentation — WARC file format
Standard number: ISO 28500:2009
Chinese title: 信息和文献 WARC文件格式
English title: Information and documentation — WARC file format
Publication date: 2009-05
Scope
ISO 28500:2009 specifies the WARC file format:
- to store both the payload content and control information from mainstream Internet application layer protocols, such as the Hypertext Transfer Protocol (HTTP), Domain Name System (DNS), and File Transfer Protocol (FTP);
- to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding);
- to support data compression and maintain data record integrity;
- to store all control information from the harvesting protocol (e.g. request headers), not just response information;
- to store the results of data transformations linked to other stored data;
- to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources);
- to be extended without disruption to existing functionality;
- to support handling of overly long records by truncation or segmentation, where desired.
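As a rough illustration of the goals listed above, the sketch below assembles a single WARC 1.0 record (a version line, named header fields, a blank line, the content block, and two trailing CRLFs) and then gzip-compresses it as its own member. The helper name `build_warc_record`, the example URI, and the choice of a `resource` record are assumptions made for illustration; the standard itself should be consulted for the full set of record types and required fields.

```python
import gzip
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(record_type: str, target_uri: str, block: bytes,
                      content_type: str = "application/octet-stream") -> bytes:
    """Assemble one uncompressed WARC/1.0 record (hypothetical helper)."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Content-Length counts the bytes of the block only.
        ("Content-Length", str(len(block))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line separates the header from the block; two CRLFs end the record.
    return head + CRLF + block + CRLF + CRLF


# Example URI and payload are placeholders, not taken from the standard.
record = build_warc_record("resource", "http://example.com/", b"hello")

# Compressing each record as a separate gzip member keeps records
# individually addressable inside a larger .warc.gz file.
compressed = gzip.compress(record)
```

Because every record carries its own `Content-Length` and identifier, a reader can skip from record to record without parsing the payload, which is one way the format keeps record boundaries intact under compression.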