- 2024-12-17 15:06:21
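jsoup is a Java HTML parser that can directly parse a URL or a string of HTML. It provides a very convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods.
jsoup implements the HTML5 specification and parses HTML into the same DOM that modern browsers produce.
1) Parse HTML from a URL, a file, or a string
2) Find and extract data using DOM traversal or CSS selectors
3) Manipulate HTML elements, attributes, and text
Note: jsoup is released under the MIT license, so it can safely be used in commercial projects.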
jsoup getting-started example (web crawler)
http://www.javacui.com/opensource/463.html
Three ways to load HTML with jsoup
http://www.javacui.com/opensource/464.html
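Here I needed to parse an electronic medical record stored as HTML. Analysis showed that each section of the record is delimited by a P tag that holds the actual content, but the HTML inside the P tags follows no clear, consistent format, so parsing it step by step is difficult.
A colleague pointed out that, once the HTML of each P tag has been retrieved, a regular expression can remove all the HTML tags so that only the text remains. I tried it, and a short piece of code met the requirement.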
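The idea can be tried on a small fragment before touching the real record. The following is a minimal sketch that uses only the Java standard library; the class name `TagStripDemo`, the helper `stripTags`, and the sample fragment are illustrative and do not come from the original project. It assumes that no `>` character appears inside an attribute value, because the lazy pattern stops at the first `>` it finds.

```java
import java.util.regex.Pattern;

public class TagStripDemo {
    // Lazily matches everything from '<' up to the nearest '>', newlines included.
    private static final Pattern TAG = Pattern.compile("<[\\s\\S]*?>");

    /** Hypothetical helper: removes every tag-like span and trims the result. */
    public static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // Illustrative fragment shaped like one field of the record.
        String fragment = "<span style=\"font-size: 12px;\">Name: "
                + "<span>[</span><span>XXX</span><span>]</span></span>";
        System.out.println(stripTags(fragment)); // prints "Name: [XXX]"
    }
}
```

Note that this only deletes markup: entities such as `&amp;` or `&nbsp;` stay in the text as written and have to be handled separately.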
Below is the HTML content extracted from this medical record:
<!DOCTYPE html> <html> <head> <meta charset="UTF-8"> <title>电子病历手术记录</title></head> <body> <mate http-equiv="Content-Type" content="text/html; charset=utf-8" charset="utf-8"></mate> <div style="margin-bottom: 0px;"> <p style="text-align: center;"> <span style="font-family: 宋体, SimSun; font-size: 21px;"> <strong> <span sde-model="" contenteditable="false" id="id1524386428512" name="name1524386428512" ele-keyname="43"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="43" style="color:#808080;" contenteditable="false">XXX医院</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span> <br></strong> </span> </p> <p style="text-align: center;"> <span style="font-size: 12px;">姓名: <span sde-model="" contenteditable="false" id="id1524386542648" name="name1524386542648" ele-keyname="39"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="39" style="color:#808080;" contenteditable="false">XXX</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span> 性别: <span sde-model="" contenteditable="false" id="id1524386566826" name="name1524386566826" ele-keyname="35" keyval="W"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="35" style="color:#808080;" contenteditable="false">女</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>年龄: <span sde-model="" contenteditable="false" id="id1524386566827" name="name1524386566827" ele-keyname="34"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="34" style="color:#808080;" contenteditable="false">60</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span> <span sde-model="{"ID":"id1569204207166","TYPE":"text","ISPRINT":"Y","NAME":"","TAG":"","DESCNAME":"","VERIFYTYPE":"text","VALUE":"j","REQUIRED":0,"READONLY":0,"COLOR":"FF0808"}" contenteditable="false" id="id1569204207166" 
name="name1569204207166" ele-keyname="3661" keyval="4"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="3661" style="color: rgb(128, 128, 128);" contenteditable="false">岁</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>科室: <span sde-model="" contenteditable="false" id="id1524386566828" name="name1524386566828" ele-keyname="30" keyval="18"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="30" style="color:#808080;" contenteditable="false">内科</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>病室: <span sde-model="" contenteditable="false" id="id1524386566829" name="name1524386566829" ele-keyname="2678" keyval="7"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="2678" style="color:#808080;" contenteditable="false">1</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>床号: <span sde-model="" contenteditable="false" id="id1524386566830" name="name1524386566830" ele-keyname="27" keyval="1298"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="27" style="color:#808080;" contenteditable="false">3</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>住院号: <span sde-model="" contenteditable="false" id="id1524386566831" name="name1524386566831" ele-keyname="17"> <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span> <span title="17" style="color:#808080;" contenteditable="false">52525252</span> <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span> </span> </p> <hr></div> <p style="text-align: center;"> <span style="font-size: 20px;"> <strong>手术记录</strong></span> </p> <p> <span style="font-size: 16px;"> <span 
sde-model="{"ID":"id1532241260523","TYPE":"date","ISPRINT":"Y","NAME":"病程记录时间","TAG":"","DESCNAME":"病程记录时间","MAX":"","MIN":"","FORMAT":"Y-m-d H:i:S","VALUE":"","REQUIRED":0,"READONLY":0,"COLOR":"FF346A"}" contenteditable="false" id="id1532241260523" name="name1532241260523" ele-keyname="3109"> <span style="color:#808080;" contenteditable="false">[</span> <span title="3109" style="color:#808080;" contenteditable="true">2021-04-19 09:42:15</span> <span style="color:#808080;" contenteditable="false">]</span></span> </span> <span style="font-size: 20px;"> <strong> <br></strong> </span> </p> <p> <span style="font-size: 16px; font-family: 宋体, SimSun;"> <strong style="font-size: 16px; font-family: 宋体, SimSun;">手术开始时间:</strong> <span sde-model="{"ID":"id1532241220356","TYPE":"date","ISPRINT":"Y","NAME":"日期时间","TAG":"","DESCNAME":"日期时间","MAX":"","MIN":"","FORMAT":"Y-m-d H:i:S","VALUE":"2021-04-19 09:42:17","REQUIRED":0,"READONLY":0,"COLOR":"FF346A"}" contenteditable="false" id="id1532241220356" name="name1532241220356" ele-keyname="2564"> <span style="color:#0000FF;" contenteditable="false">[</span> <span title="2564" style="color:rgb(0,0,0);" contenteditable="true">2021-04-19 09:42:17</span> <span style="color:#0000FF;" contenteditable="false">]</span></span> <strong style="font-size: 16px; font-family: 宋体, SimSun;">手术结束时间:</strong> <span sde-model="{"ID":"id1532241236113","TYPE":"date","ISPRINT":"Y","NAME":"日期时间","TAG":"","DESCNAME":"日期时间","MAX":"","MIN":"","FORMAT":"Y-m-d H:i:S","VALUE":"2021-04-19 09:42:18","REQUIRED":0,"READONLY":0,"COLOR":"FF346A"}" contenteditable="false" id="id1532241236113" name="name1532241236113" ele-keyname="2565"> <span style="color:#0000FF;" contenteditable="false">[</span> <span title="2565" style="color:rgb(0,0,0);" contenteditable="true">2021-04-19 09:42:18</span> <span style="color:#0000FF;" contenteditable="false">]</span></span> </span> </p> <p> <span style="font-size: 16px; font-family: 宋体, SimSun;"> <span contenteditable="false" 
style="font-weight: bold; white-space: nowrap; font-size: 16px; font-family: 宋体, SimSun;" id="98_8326a">术前诊断:</span> <span style="color:#0000FF;" contenteditable="false">{</span> <span outlineid="98" outlinekeyname="sqzd" style="color:gbk(0,0,0);" contenteditable="true"> <span sde-model="{"ID":"id1615423213918","TYPE":"text","NAME":"术前诊断","TAG":"","DESCNAME":"术前诊断","VERIFYTYPE":"text","VALUE":"术前诊断","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" id="id1615423213918" name="name1615423213918" ele-keyname="4368__" title="术前诊断" keycode="A00.900"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="术前诊断" style="color:#000000;" contenteditable="true">霍乱</span> <span style="color:#0000FF" contenteditable="false">]</span></span> </span> <span style="color:#0000FF;" contenteditable="false">}</span></span> </p> <span> <strong> <span contenteditable="false" name="name1566953853584">手术类别:</span></strong> <span sde-model="{"ID":"id1566953853584","TYPE":"select","ISPRINT":"Y","NAME":"手术类别","TAG":"","DESCNAME":"手术类别","REQUIRED":0,"FREEINPUT":0,"COLOR":"000000","VALUE":"急诊手术","TEXT":"急诊手术","REMOTEURL":"","BINDINGDATA":[{"VALUE":"日间手术","TEXT":"日间手术","SELECTED":0},{"VALUE":"急诊手术","TEXT":"急诊手术","SELECTED":0},{"VALUE":"择期手术","TEXT":"择期手术","SELECTED":0}]}" contenteditable="false" id="id1566953853584" name="name1566953853584" ele-keyname="3643"> <span style="color:#0000FF;" contenteditable="false">[</span> <span title="3643" style="color: rgb(0, 0, 0);" contenteditable="true">急诊手术</span> <span style="color:#0000FF;" contenteditable="false">]</span></span> <strong> <span contenteditable="false" name="name1567158158826">是否微创:</span></strong> <span sde-model="{"ID":"id1567158158826","TYPE":"select","ISPRINT":"Y","NAME":"是否微创","TAG":"","DESCNAME":"是否微创","REQUIRED":0,"FREEINPUT":0,"COLOR":"000000","VALUE":"是","TEXT":"是","REMOTEURL":"","BINDINGDATA":[{"VALUE":"是","TEXT":"是","SELECTED":0},{"VALUE":"否","TEXT":"否","SELECTED":0}]}" 
contenteditable="false" id="id1567158158826" name="name1567158158826" ele-keyname="3660"> <span style="color:#0000FF;" contenteditable="false">[</span> <span title="3660" style="color: rgb(0, 0, 0);" contenteditable="true">是</span> <span style="color:#0000FF;" contenteditable="false">]</span></span> </span> </p> <p> <span style="font-size: 16px; font-family: 宋体, SimSun;"> <strong style="font-size: 16px; font-family: 宋体, SimSun;"> <span contenteditable="false" name="name1534267325574">手术者指导者:</span></strong> <span sde-model="{"ID":"id1615368733633","TYPE":"text","NAME":"手术者指导者","TAG":"","DESCNAME":"手术者指导者","VERIFYTYPE":"text","VALUE":"手术者指导者","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" id="id1615368733633" name="name1615368733633" ele-keyname="3335__" title="手术者指导者" keyval="6"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="手术者指导者" style="color:#000000;" contenteditable="true">XX</span> <span style="color:#0000FF" contenteditable="false">]</span></span> <strong style="font-size: 16px; font-family: 宋体, SimSun;"> <span contenteditable="false" name="name1532241467065">手术者:</span></strong> <span sde-model="{"ID":"id1615368899407","TYPE":"text","NAME":"手术者","TAG":"","DESCNAME":"手术者","VERIFYTYPE":"text","VALUE":"手术者","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" id="id1615368899407" name="name1615368899407" ele-keyname="2572__" title="手术者" isset="true" keyval="5,6"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="手术者" style="color:#000000;" contenteditable="true">WW,XX</span> <span style="color:#0000FF" contenteditable="false">]</span></span> <strong> <span contenteditable="false" name="name1562504319639">一助:</span></strong> <span contenteditable="false" name="name1562504319639"> <span sde-model="{"ID":"id1615368944630","TYPE":"text","NAME":"一助","TAG":"","DESCNAME":"一助","VERIFYTYPE":"text","VALUE":"一助","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" 
id="id1615368944630" name="name1615368944630" ele-keyname="3410__" title="一助" keyval="5,6"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="一助" style="color:#000000;" contenteditable="true">GG,HH</span> <span style="color:#0000FF" contenteditable="false">]</span></span> </span> <strong> <span contenteditable="false" name="name1562504319640">二助:</span></strong> <span sde-model="{"ID":"id1615368997929","TYPE":"text","NAME":"二助","TAG":"","DESCNAME":"二助","VERIFYTYPE":"text","VALUE":"二助","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" id="id1615368997929" name="name1615368997929" ele-keyname="3411__" title="二助" keyval="17"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="二助" style="color:#000000;" contenteditable="true">TTT</span> <span style="color:#0000FF" contenteditable="false">]</span></span> <br style="font-size: 16px; font-family: 宋体, SimSun;"></span> </p> <p> <strong> <span style="font-size: 16px; font-family: 宋体, SimSun;">手术麻醉方法:</span></strong> <span style="font-size: 16px; font-family: 宋体, SimSun;"> <span 
sde-model="{"ID":"id1615430630235","TYPE":"select","NAME":"麻醉方法","TAG":"","DESCNAME":"麻醉方法","REQUIRED":0,"FREEINPUT":0,"COLOR":"000000","VALUE":"11","TEXT":"吸入麻醉","REMOTEURL":"","BINDINGDATA":[{"VALUE":"1","TEXT":"全身麻醉","SELECTED":0},{"VALUE":"11","TEXT":"吸入麻醉","SELECTED":0},{"VALUE":"12","TEXT":"静脉麻醉","SELECTED":0},{"VALUE":"13","TEXT":"基础麻醉","SELECTED":0},{"VALUE":"2","TEXT":"稚管内麻醉","SELECTED":0},{"VALUE":"21","TEXT":"蛛网膜下腔阻滞麻醉","SELECTED":0},{"VALUE":"22","TEXT":"硬脊膜外腔阻滞麻醉","SELECTED":0},{"VALUE":"3","TEXT":"局部麻醉","SELECTED":0},{"VALUE":"31","TEXT":"神经丛阻滞麻醉","SELECTED":0},{"VALUE":"32","TEXT":"神经节阻滞麻醉","SELECTED":0},{"VALUE":"33","TEXT":"神经阻滞麻醉","SELECTED":0},{"VALUE":"34","TEXT":"区域阻滞麻醉","SELECTED":0},{"VALUE":"35","TEXT":"局部浸润麻醉","SELECTED":0},{"VALUE":"36","TEXT":"表面麻醉","SELECTED":0},{"VALUE":"4","TEXT":"复合麻醉","SELECTED":0},{"VALUE":"41","TEXT":"静吸复合全麻","SELECTED":0},{"VALUE":"42","TEXT":"针药复合麻醉","SELECTED":0},{"VALUE":"43","TEXT":"神经丛与硬膜外阻滞复合麻醉","SELECTED":0},{"VALUE":"44","TEXT":"全麻复合全身降温","SELECTED":0},{"VALUE":"45","TEXT":"全麻复合控制性降压","SELECTED":0},{"VALUE":"9","TEXT":"其他麻醉方法","SELECTED":0}]}" contenteditable="false" id="id1615430630235" name="name1615430630235" ele-keyname="2631" title="麻醉方法" isset="true"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="麻醉方法" style="color: rgb(0, 0, 0);" contenteditable="false">吸入麻醉</span> <span style="color:#0000FF" contenteditable="false">]</span></span> <strong style="font-size: 16px; font-family: 宋体, SimSun;"> <span contenteditable="false" name="name1534267356013">麻醉指导者:</span></strong> <span contenteditable="false" name="name1534267356013"> <span sde-model="{"ID":"id1615369061026","TYPE":"text","NAME":"麻醉指导者","TAG":"","DESCNAME":"麻醉指导者","VERIFYTYPE":"text","VALUE":"麻醉指导者","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" id="id1615369061026" name="name1615369061026" ele-keyname="3336__" title="麻醉指导者" keyval="5"> <span style="color:#0000FF" contenteditable="false">[</span> 
<span title="麻醉指导者" style="color:#000000;" contenteditable="true">WW</span> <span style="color:#0000FF" contenteditable="false">]</span></span> </span> <strong> <span contenteditable="false" name="name1532241729677" style="font-size: 16px; font-family: 宋体, SimSun;">麻醉者</span></strong> <span contenteditable="false" name="name1532241729677">:</span> <span sde-model="{"ID":"id1615369113136","TYPE":"text","NAME":"麻醉者","TAG":"","DESCNAME":"麻醉者","VERIFYTYPE":"text","VALUE":"麻醉者","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" id="id1615369113136" name="name1615369113136" ele-keyname="2632__" title="麻醉者" isset="true" keyval="6"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="麻醉者" style="color:#000000;" contenteditable="true">BB</span> <span style="color:#0000FF" contenteditable="false">]</span></span> <strong style="font-size: 16px; font-family: 宋体, SimSun;"> <br style="font-size: 16px; font-family: 宋体, SimSun;"></strong> <strong style="font-size: 16px; font-family: 宋体, SimSun;"></strong> </span> </p> <p> <span style="font-size: 16px; font-family: 宋体, SimSun;"> <strong style="font-size: 16px; font-family: 宋体, SimSun;"> <span contenteditable="false" name="name1615369665198">操作方法描述:</span></strong> <span sde-model="{"ID":"id1615369665198","TYPE":"text","NAME":"操作方法描述","TAG":"","DESCNAME":"操作方法描述","VERIFYTYPE":"text","VALUE":"操作方法描述","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" id="id1615369665198" name="name1615369665198" ele-keyname="98251265" title="操作方法描述"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="操作方法描述" style="color:#000000;" contenteditable="true">操作方法描述</span> <span style="color:#0000FF" contenteditable="false">]</span></span> <strong style="font-size: 16px; font-family: 宋体, SimSun;"> <br></strong> </span> </p> <p> <span contenteditable="false" name="name1524470891570"> <span contenteditable="false" name="name1533108118633"> <strong> </strong> </span> <span 
contenteditable="false" name="name1539756454252">医生: <span sde-model="{"ID":"id1615369381891","TYPE":"text","NAME":"医生","TAG":"","DESCNAME":"医生","VERIFYTYPE":"text","VALUE":"医生","REQUIRED":0,"READONLY":0,"COLOR":"000000"}" contenteditable="false" id="id1615369381891" name="name1615369381891" ele-keyname="3128__" title="医生" keyval="3"> <span style="color:#0000FF" contenteditable="false">[</span> <span title="医生" style="color:#000000;" contenteditable="true">RRR</span> <span style="color:#0000FF" contenteditable="false">]</span></span> </span> </span> <br></p> </body> </html>
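Opened in a browser, the page renders as a formatted surgical record.
We do not need the CSS styling, only the content that the page displays.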
The code is as follows. First, add the dependency to the POM:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>
Parsing code:
package com.example.demo;

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Uses jsoup plus a regular expression to strip all tags from an HTML file.
 * @author 崔素强
 */
public class HtmlParse {
    public static void main(String[] args) throws Exception {
        File input = new File("D:\\emr.html");
        Document doc = Jsoup.parse(input, "UTF-8");
        Elements paragraphs = doc.getElementsByTag("p");
        for (Element p : paragraphs) {
            // Get the full inner HTML of this paragraph
            String str = p.html();
            // Remove everything that starts with < and ends with >
            String regex = "<([\\s\\S]*?)>";
            str = str.replaceAll(regex, "");
            // Strip &nbsp; entities; ordinary spaces are kept
            str = str.replaceAll("&nbsp;", "");
            // Remove the bracket markers. | . * [ ] \ { } are regex
            // metacharacters and must be escaped when used in a pattern
            str = str.replaceAll("\\[", "");
            str = str.replaceAll("\\]", "");
            str = str.replaceAll("\\{", "");
            str = str.replaceAll("\\}", "");
            System.out.println(str);
        }
    }
}
Output after parsing:
XXX医院 姓名: XXX 性别: 女 年龄: 60 岁 科室: 内科 病室: 1 床号: 3 住院号: 52525252 手术记录 2021-04-19 09:42:15 手术开始时间: 2021-04-19 09:42:17 手术结束时间: 2021-04-19 09:42:18 术前诊断: 霍乱 手术者指导者: XX 手术者: WW,XX 一助: GG,HH 二助: TTT 手术麻醉方法: 吸入麻醉 麻醉指导者: WW 麻醉者 : BB 操作方法描述: 操作方法描述 医生: RRR
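The four bracket replacements in the parsing code can also be merged into a single pass with a character class. The sketch below uses only the standard library; the class name `BracketCleanDemo` and the helpers `clean` and `dropLiteral` are hypothetical names introduced for illustration. Inside a character class, `{` and `}` are literal, while `[` and `]` still need a backslash.

```java
public class BracketCleanDemo {
    /** Hypothetical helper: removes [, ], { and } in one pass. */
    public static String clean(String text) {
        // Character class holding the four marker characters.
        return text.replaceAll("[\\[\\]{}]", "");
    }

    /** String.replace treats its argument as plain text, so nothing is escaped. */
    public static String dropLiteral(String text, String literal) {
        return text.replace(literal, "");
    }

    public static void main(String[] args) {
        System.out.println(clean("Preoperative diagnosis: {[Cholera]}"));
        // prints "Preoperative diagnosis: Cholera"
        System.out.println(dropLiteral("a[b", "[")); // prints "ab"
    }
}
```

When only one literal character has to go, `String.replace` is the simpler choice because it sidesteps regex escaping altogether.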