Skip to content
This repository has been archived by the owner on Jun 19, 2024. It is now read-only.

new validation seems quite slow #1051

Closed
jstrachan opened this issue Sep 14, 2017 · 29 comments
Closed

new validation seems quite slow #1051

jstrachan opened this issue Sep 14, 2017 · 29 comments
Assignees
Milestone

Comments

@jstrachan
Copy link
Contributor

now we've validation builds seem quite slow. e.g. if you clone and build this repo it used to be pretty fast: https://github.com/fabric8io/fabric8-platform

Now its about 9 minutes on my laptop - just doing YAML generation/validation mostly. After 1 build there should be no downloading from the internet or anything; so should be pretty snappy.

I wonder if the validation stuff could be tuned a bit? e.g. is it loading the schema from scratch as it validates every YAML file in every repo?

@hrishin
Copy link
Member

hrishin commented Sep 14, 2017

@jstrachan that's long. Btw how long it used to take without validation?
For the being you can skip the validation with mvn -Dfabric8.skipResourceValidation=false

Yes, work is going on for optimizing schema loading #1034

@jstrachan
Copy link
Contributor Author

jstrachan commented Sep 14, 2017

using that CLI still seems to validate....

$ mvn -Dfabric8.skipResourceValidation=false fabric8:resource
...
[INFO] Building Auth Service 4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- fabric8-maven-plugin:3.5.30:resource (default-cli) @ auth ---
[INFO] F8: Running in Kubernetes mode
[INFO] F8: using resource templates from /workspace/java/fabric8-platform/apps/auth/src/main/fabric8
[INFO] F8: fmp-revision-history: Adding revision history limit to 2
[INFO] F8: validating /workspace/java/fabric8-platform/apps/auth/target/classes/META-INF/fabric8/openshift/auth-configmap.yml resource
[INFO] F8: validating /workspace/java/fabric8-platform/apps/auth/target/classes/META-INF/fabric8/openshift/auth-deploymentconfig.yml resource

@hrishin
Copy link
Member

hrishin commented Sep 14, 2017

@jstrachan oh thats my bad, its should be true

mvn -Dfabric8.skipResourceValidation=true

@jstrachan
Copy link
Contributor Author

ok no validation:

$ mvn -Dfabric8.skipResourceValidation=true fabric8:resource
...
[INFO] Total time: 42.987 s

validation:

$ mvn fabric8:resource
...
[INFO] Total time: 06:58 min

so > 6 mins just validating

@hrishin
Copy link
Member

hrishin commented Sep 14, 2017

wohh that's too much!!
Need to fix it on high priority.

@nicolaferraro
Copy link
Member

I expected validation (until fixed by #1034) would download ~300k per resource from github, so ~2MB and few seconds of delay in the average case. @jstrachan what project are you trying to build?

@jstrachan
Copy link
Contributor Author

jstrachan commented Sep 15, 2017

am using this project: https://github.com/fabric8io/fabric8-platform which adds > 6 minutes of just validation to the build. I guess part of my issue is I don't have crazy fast wifi ;)

I wonder could we at least cache the ~300k resources in the local maven repo; it seems like its downloading ~300k to validate every YAML file - of which there are many of them in the fabric8-platform project - can we cache whatever is downloaded/parsed within a single maven build across different YAML files / maven modules?

@nicolaferraro
Copy link
Member

Yeah, in fact that is a collector of resources. The problem with current implementation is that no cache is being used at all. We were looking for a prepackaged jar with all resources to solve this problem, but also caching resources using tmp files or also the maven repo (to have them available across executions) is a good solution.

@hrishin are you looking into it?

@hrishin
Copy link
Member

hrishin commented Sep 15, 2017

@nicolaferraro yes I'm looking into it and maven package is ready but it has some issues, lets discuss it in #1034. Yes as suggested simple in memory caching of files sounds good idea as well, any pointers on how we can cache files in maven workflow?

Note: @jamesnetherton @nicolaferraro sorry to say that, but again problem in the current flow is, JSON schema has HTTP references to other schema definitions that may span couple of HTTP requests.
Not happy for this I/O behavior.

@hrishin
Copy link
Member

hrishin commented Sep 15, 2017

Following are observations for maven plugin (it has bunch of integration tests)

Validation:

  1. Maven packaged JSON schema:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:07 min
[INFO] Finished at: 2017-09-15T15:40:08+05:30
[INFO] Final Memory: 92M/1980M
[INFO] ------------------------------------------------------------------------
  1. GitHub hosted JSON schema:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:37 min
[INFO] Finished at: 2017-09-15T15:35:55+05:30
[INFO] Final Memory: 102M/1974M
[INFO] ------------------------------------------------------------------------

Without Validation:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:12 min
[INFO] Finished at: 2017-09-15T15:58:02+05:30
[INFO] Final Memory: 102M/1950M
[INFO] ------------------------------------------------------------------------

@hrishin
Copy link
Member

hrishin commented Sep 15, 2017

@jstrachan Observations for https://github.com/fabric8io/fabric8-platform

With Validation:

  1. Maven packaged JSON schema:
mvn clean fabric8:resource 

[INFO] Total time: 43.498 s
  1. Github hosted JSON schema:
mvn clean fabric8:resource

[INFO] Total time: 03:20 min
(Had a snacks mean while)

Without Validation:

mvn -Dfabric8.skipResourceValidation=true fabric8:resource

[INFO] Total time: 24.472 s

@jstrachan
Copy link
Contributor Author

looks like the maven packaged schema is great ;)

@nicolaferraro
Copy link
Member

Looks much better!
Is the validation itself adding that delay? It's small considering the amount of resources present in that repo, but can you identify other sources of delay?

@hrishin
Copy link
Member

hrishin commented Sep 15, 2017

@nicolaferraro yes its validation only. I'm looking into other things as well. We can optimize the process of validation

  1. Convert sequential flow to concurrent flow: Right now all resources are processed in sequential, we can try to convert it into concurrent flow
  2. Caching of schema file

@pradeepto
Copy link
Collaborator

From 3:20m to 43s. Woah!

@nicolaferraro @jstrachan Just had a discussion with @hrishin. We should figure out a way to package different versions as required. I am not sure how it can be pushed to Maven central after that but we should definitely have the option to package a version of your choice (maybe we can pass it as a parameter to the CLI or something).

Current results shown above are based on master-standalone version. I would like to see what happens if we use 3.6-standalone which is what we should probably use for fabric8-*. What do you folks think?

@jstrachan
Copy link
Contributor Author

we can parameterise how the validation works if we need to. We should be able to release different versions of the schemas / artifacts like any other resource in maven central right?

@hrishin
Copy link
Member

hrishin commented Sep 18, 2017

Parameterise how the validation works if we need to

Yes we can. We can include all schemas in the single jar or for each version we can have different jar.
Single version of K8/OS contributes to 11MB jar, so if we include X versions in single jar then size would be that order of magnitude of versions we include.
@pradeepto has suggested, how about including latest stable 2 or 3 versions schema in the jar? In that case jar size would go approximately 33-35 MB

Assumption: Default validation schema would be latest stable release of K8/OS.

@nicolaferraro @jstrachan any suggestions?

@jstrachan
Copy link
Contributor Author

using the latest version would probably do the trick; as the schema is usually fully backwards compatible. I guess folks could parameterise the version of the schema to validate against? Then there's just 1 version of the schema in 1 jar and fmp defaults to the latest one?

@hrishin
Copy link
Member

hrishin commented Sep 18, 2017

Parameterise the version of the schema to validate against?

We have to add support for the version parameters. At present state its not implemented yet. If needed we can implement one.
Does that answers your question @jstrachan ?

fmp defaults to the latest one?

Yes, its latest one.

@jstrachan
Copy link
Contributor Author

probably sticking with the same schema version as is used for the kubernetes-model / kubernetes-client will do for now? (i.e. the latest)

@nicolaferraro
Copy link
Member

Agree we should target the latest version only, so 1 single jar with latest schema should be ok, as it's backward compatible.

I've included some Maven bits in the spring-boot live reload part of the plugin to dynamically retrieve a specific jar from Maven on the fly (in that case, it was the project-specific version of the spring-boot devtools library). But in this case I'd prefer to bundle the latest schema version directly in the plugin.

@hrishin
Copy link
Member

hrishin commented Sep 18, 2017

using the latest version would probably do the trick; as the schema is usually fully backwards compatible.

Sounds valid, it may be good to stick with latest version.
Assuming this premise holds true, we have java packages in place now.
The question is how to host maven modules to central repository?
@nicolaferraro @jamesnetherton @pradeepto @rhuss

@jstrachan
Copy link
Contributor Author

we just release those maven jars to maven central & depend on them in the fabric8-maven-plugin?

@hrishin
Copy link
Member

hrishin commented Sep 19, 2017

@nicolaferraro looks wise suggestion, thanks :)
In that way it will be less dependency and easy to package the schema files.
Note: if we package in plugin then fabric8-maven-core.jar size would go beyond at least 3+ MB.
Are we not defeating the origin purpose of externalizing schema files from plugin? just 2 confirm

@pradeepto
Copy link
Collaborator

I think we should just put the schema jars in maven-central (for now let's just have 1 schema per jar and let's use the "latest" released version i.e. 3.6). This keeps the size of fmp to its current size and also fmp can download schema jar when it needs it and also do so when schema jar gets updated.

My only suggestion would be let's stick some version that we can fall back on like 3.6 as opposed to master (that is what I understood when latest was mentioned in the above comments. I agree that that API will be backward compatible but I am not sure if that will always be the case in future when it comes to master.

Also, @hrishin can you please link your changes/fork here?

@jstrachan
Copy link
Contributor Author

the kubernetes-model project, that generates the DTOs for the kubernetes + openshift schema - also generates and releases a json schema.
e.g.

I wonder if we can reuse that - then the kubernetes-model + json schema stay in sync?

@jstrachan
Copy link
Contributor Author

BTW you should be able to look for schema/kube-schema.json on the CLASSPATH as this schema is included inside the kubernetes-model jar

@hrishin
Copy link
Member

hrishin commented Sep 21, 2017

schema/kube-schema.json(model-schema) looks good and could be single source of truth for resources (model) definition across all the projects (kubernetes-model, kubernetes-client, this project)
To accommodate model schema in current scope we need to address following issues first.

  1. JSON structure:
    Current structure of JSON schema not suitable to use for validation and querying resources is difficult. We need some changes like this
{
  id:
  $schema:
  definitions:
  .....
  resources: {
    kubernetes: {}
    openshift: {}
  }
} 

Here we need a property like resources that contains all resources of Kubernetes, OpenShift. If validator wants to validate deployment or deploymentconfig then it can pick it's properties for respective resource accordingly. In current state model generator updates resource name with some prefix e.g. os_deploy_DeploymentConfig which is not constant and its hard to query the resource properties. Second thing is in schema all properties key has all required resources but in camel case and validation gets resource name which is in small case which makes querying resource properties difficult.

  1. Definition gaps:
    As per observation we found some gaps between the original resources definitions and model resources definitions. e.g. IntOrString definition, which looks incorrect. Hence it's not giving correct results.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants