Research on the Automatic Extraction of Topic Text Information from Web Pages
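The Internet has grown into a global information center and the main source from which people obtain information. However, most tags in the HTML standard currently used on the Internet are aimed at presentation in the browser, and the browser is only responsible for laying out the structure and appearance of a page; it cannot analyze or process the content the page conveys. In addition, the openness and heterogeneity of HTML make it extremely difficult for a computer to obtain the topic information of a web page quickly and accurately. By analyzing how web pages on the Internet are currently structured and presented, and by analyzing content through their layout structure, this paper designs a new method for extracting the topic information of a web page, so that a computer can use the algorithm to extract a page's topic content and process it as data.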
Keywords HTML Webpage Topic Information Extraction Algorithm JAVA
Title Study of an Approach and Algorithm for Automatically Extracting
Topic Information from Semi-Structured Text (Webpage)
Abstract
The Internet has become a global information center and the main source from which the public obtains information. At present, however, most of the tags defined by the HTML standard used on the Internet are directed at browser presentation, and the browser is only responsible for the structure and layout of web pages; it cannot process or analyze their content. At the same time, the openness and heterogeneity of HTML make it extremely difficult for a computer to obtain the topic information of a web page quickly and accurately. By analyzing the structure and layout of current web pages, this paper designs a new method for extracting topic information from them, so that a computer can use the algorithm to extract the topic content of a page and make it available for data processing.
Keywords HTML Webpage Topic Information
Information Extraction JAVA