Extremely slow parsing of XML #39
Comments
I agree with the word "Extremely". The performance is far below usable. Jsoup took 143 ms to parse this file, whereas Ksoup took 90 seconds, which is more than 600x slower. Jsoup was tested on a desktop JVM, and Ksoup was tested on iosSimulatorArm64. The simulator might run a bit slower, but not ~600x slower.
For a smaller 459 KB file, Ksoup took 20 s while Jsoup needed only 0.1 s (including VM startup time).
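For reference, a comparison like the one above can be reproduced with a small timing harness. This is a minimal sketch using only the Kotlin standard library. `bestOf` is a hypothetical helper name, and the tag-counting lambda is a stand-in for the real parser call (Ksoup or Jsoup), which is not shown here.

```kotlin
import kotlin.time.Duration
import kotlin.time.measureTime

// Hypothetical helper: runs `parse` on `input` several times and returns the fastest run.
// Taking the best of N runs reduces the influence of warm-up and one-off pauses.
fun <T> bestOf(runs: Int, input: String, parse: (String) -> T): Duration {
    require(runs > 0) { "runs must be positive" }
    var best = Duration.INFINITE
    repeat(runs) {
        val elapsed = measureTime { parse(input) }
        if (elapsed < best) best = elapsed
    }
    return best
}

fun main() {
    // Synthetic input; replace with the real file contents when benchmarking.
    val xml = "<rss>" + "<item><description>text</description></item>".repeat(10_000) + "</rss>"

    // Stand-in for a real parser call: counts opening angle brackets.
    val countTags: (String) -> Int = { s -> s.count { it == '<' } }

    println("best of 5 runs: ${bestOf(5, xml, countTags)}")
}
```

Running the same harness on both targets, with the same input string, keeps file IO out of the measurement so that only the parse step is timed.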
@vanniktech @Him188 Thanks for your feedback. I'm aware of this performance issue and will optimize it in the next few versions. @vanniktech I think it would be great if we had the option to ignore text, as it may save a lot of memory.
@itboy87 Thanks. This is an amazing project and I'm looking forward to the updates.
@Him188 Thanks. I'm working on it.
Could you please share the sample code with me? I'm testing it, and it took only 1 second to parse this file with Ksoup, while Jsoup took 0.1 seconds.
That's still 10x slower.
@vanniktech Yup, I know. I'm looking into this, but @Him188 mentioned 90 s.
@vanniktech @Him188 Currently, I'm working on two versions: one built with Ktor and kotlinx-io, and the other built with Korlibs. The Korio branch performs better: it took only 120 ms compared to 1100 ms. For now, I'm not sure which I will use for the upcoming versions. Korlibs is not widely used but is good; on the other hand, kotlinx-io and Ktor are more like standard libraries for Kotlin. I might optimize Ksoup with kotlinx-io and Ktor, or just go with Korio, which already performs well. I haven't decided yet. The upcoming version 0.1.3, which uses Korlibs, is ready to publish.
I would recommend kotlinx-io because it's official. People are likely to be using it already and don't want to have multiple IO libraries. Or we can introduce separate modules for IO support: Since you mentioned Ktor, let me also share some of my thoughts about it :) Ktor currently maintains two major versions, 2.x and 3.x. 3.x is still in alpha and is binary incompatible with 2.x. I would expect most existing projects to be using 2.x, and new projects are also likely to use the latest stable version, 2.x. However, Ksoup depends on ktor-client-core 3.x, forcing its consumers to use Ktor 3.x as well. If the consumer (like my project) is using Ktor 2.x, the code still compiles, but it throws ClassNotFoundError at runtime. I had to migrate my project to 3.x in order to use Ksoup. So Ksoup could also publish separate variants based on Ktor 2.x and 3.x. However, I would instead recommend not depending on Ktor at all, as it does not sound necessary for an XML parser to depend on an HTTP client library.
Testing code: Relevant code extracted: Note that the

    // Collects the subject ID from the link that wraps each ".an-info" element.
    fun parseMikanSubjectIdsFromSearch(document: Document): List<String> {
        return document.getElementsByClass("an-info").mapNotNull { anInfo ->
            anInfo.parent()?.let { a ->
                val attr = a.attr("href")
                if (attr.isEmpty()) return@let null
                // Keep only what follows "/Home/Bangumi/"; skip links without that prefix.
                attr.substringAfter("/Home/Bangumi/", "")
                    .takeIf { it.isNotBlank() }
            }
        }
    }

    @Test
    fun `can parse subject index`() {
        val ids = AbstractMikanMediaSource.parseMikanSubjectIdsFromSearch(
            Xml.parse(
                readTestResourceAsString("/mikan-search-无职转生.txt"),
            ),
        )
        assertEquals(listOf(3060, 2353, 2549, 3344).map { it.toString() }, ids)
    }

The resource is already read as a string, so I would not expect such a large performance difference on the IO side? Maybe the
Please note that I was comparing Ksoup on iosSimulatorArm64 and Jsoup on a desktop JVM. Maybe the simulator is actually far slower than I expected.
This would really be ideal. The fewer dependencies the better. I am also still using Ktor 2.0 and I can't upgrade to a beta version.
@Him188 I agree with you.
Yes, the simulator may not perform like a physical device, but it still needs a lot of improvement. I'm working on it and I will publish both kotlinx-io and Korio variants.
Even if you pass an HTML string to parse, it still uses a lot of IO operations for parsing and streaming.
Actually, I'm using Ktor for the charset encoder and decoder, which are currently not available in kotlinx-io.
@vanniktech @Him188 Version 0.1.4 has been released with Korio and the performance issues fixed, and I'm already working on the kotlinx-io + Ktor variant.
FWIW I'm trying out ksoup in my library, unfurl and I'm finding
@itboy87 Thanks for cutting a new release! I will try it out in the next few days. Regardless, I haven't checked this yet, but I feel like plain jsoup running on Android is slower than on desktop. Obviously a desktop is much more powerful, but maybe jsoup does something that Android phones don't like, something specific to the Android runtime.
@saket @Him188 Thanks for your feedback! I've been working on separating the IO dependency from the core code, which has now been implemented in the develop branch. I've also added performance comparison test code for Ksoup vs. Jsoup, which you can find in PerformanceComparisonTest.kt. Next, I'm finalizing the kotlinx-io variant, which is almost complete. After that, I'll focus on addressing performance issues.
@itboy87 That sounds very nice! I will give it a try when you finish the kotlinx-io variant.
@itboy87 Nice improvements. It's much faster, so much so that I'm making the switch. I will release a new version and take it from there.
I tested
@saket @vanniktech @Him188 Thanks for your feedback. I have released Ksoup Next, I’m also working on a variant with no external dependencies, which will be a lightweight version supporting only string HTML and XML parsing and UTF-8.
I tried to upgrade and use the
during Gradle sync.
@vanniktech It was not published correctly, but I just fixed it in version 0.1.6-alpha1 and published it. It will be available in 15 minutes.
@vanniktech Please use com.fleeksoft.ksoup:ksoup-ktor2 with ksoup-network-ktor2.
Amazing stuff!
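In a Gradle Kotlin DSL build, the suggestion above might look like the following. This is only a sketch: the version is assumed from the 0.1.6-alpha1 release mentioned earlier in the thread, and the group ID of the network artifact is assumed to match that of the core artifact.

```kotlin
// build.gradle.kts (sketch; version and the network artifact's group ID are assumptions)
dependencies {
    // Ksoup variant built against Ktor 2.x
    implementation("com.fleeksoft.ksoup:ksoup-ktor2:0.1.6-alpha1")
    // Matching network module for the Ktor 2.x variant
    implementation("com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.6-alpha1")
}
```

In a Kotlin Multiplatform project, the same two lines would typically go into the `commonMain` source set's `dependencies` block instead.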
Currently I have two native parser implementations. Android uses DocumentBuilderFactory and the like, and on iOS I use NSXMLParser. I'd like to replace these with Ksoup so I can share the parsing logic and have everything unified. Additionally, Ksoup is a lot more lenient when it comes to parsing, which is nice because RSS feeds often contain an unescaped &, which throws off both of my parsers right now, and Ksoup would solve this as well. However, Ksoup is in some instances substantially slower, for instance when trying to parse the XML from this site: https://www.1978.tokyo/rss
I've run a few tests on my phone and Ksoup is ~3x slower than DocumentBuilderFactory. My assumption is that it also parses all the 'text' contained in each <description> tag as Nodes, for instance. Is there any way to turn this off?
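To illustrate the unescaped `&` problem described above: strict XML parsers reject an ampersand that does not begin a character or entity reference. The sketch below shows one possible pre-processing workaround using only the Kotlin standard library. `escapeBareAmpersands` is a hypothetical helper, not part of Ksoup or either native parser, and it does not special-case CDATA sections, where a bare `&` is legal and would be escaped as well.

```kotlin
// Matches "&" that does not start a named, decimal, or hexadecimal reference.
val bareAmpersand = Regex("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

// Hypothetical helper: rewrites bare ampersands as "&amp;" and leaves
// existing references such as "&amp;" or "&#38;" untouched.
fun escapeBareAmpersands(xml: String): String = xml.replace(bareAmpersand, "&amp;")

fun main() {
    val feedTitle = "<title>Tom & Jerry &amp; friends &#38; co</title>"
    // prints: <title>Tom &amp; Jerry &amp; friends &#38; co</title>
    println(escapeBareAmpersands(feedTitle))
}
```

A lenient parser makes this step unnecessary, which is part of the appeal of switching.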