
add document not working well #311

Closed
arthcras opened this issue Nov 16, 2019 · 24 comments

@arthcras

arthcras commented Nov 16, 2019

The issue is as follows:
For some reason, new documents added to the datashare folder are not shown when searching documents. I believe the problem is that indexing does not run when new documents are added.
Even after removing the whole application and reinstalling Datashare following the manual, an old document is still visible in the Datashare application even though it is no longer in the datashare folder, and the new documents added to the folder are still not visible.
Screen Shot 2019-11-16 at 1 33 04 PM
Screen Shot 2019-11-16 at 1 36 15 PM
2019-11-17 11:58:21,032 [pool-12-thread-2] ERROR DocumentConsumer - Exception while consuming file: "/home/datashare/data/saudi-aramco-prospectus-en.pdf".
java.net.ConnectException: Connection refused
at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:949)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1593)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1563)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1546)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1512)
at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:858)
at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.writeDocument(ElasticsearchSpewer.java:62)
at org.icij.spewer.Spewer.write(Spewer.java:56)
at org.icij.extract.extractor.Extractor.extract(Extractor.java:275)
at org.icij.extract.extractor.DocumentConsumer.lambda$accept$0(DocumentConsumer.java:125)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:171)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:145)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
... 1 common frames omitted
2019-11-17 11:58:21,215 [pool-2-thread-2] INFO DocumentConsumer - Terminated.
2019-11-17 11:58:22,448 [pool-2-thread-2] INFO IndexTask - exiting
2019-11-17 11:58:43,788 [pool-13-thread-2] ERROR DocumentConsumer - Exception while consuming file: "/home/datashare/data/saudi-aramco-prospectus-en.pdf".
java.net.ConnectException: Connection refused
    [same stack trace as above]
2019-11-17 11:58:43,836 [pool-2-thread-1] INFO DocumentConsumer - Terminated.
2019-11-17 11:58:44,232 [pool-2-thread-1] INFO IndexTask - exiting
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/datashare/lib/datashare-nlp-opennlp-4.21.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/datashare/lib/datashare-nlp-mitie-4.21.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/datashare/lib/datashare-nlp-ixapipe-4.21.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/datashare/lib/datashare-nlp-corenlp-4.21.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/datashare/lib/datashare-db-4.21.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/datashare/lib/datashare-app-4.21.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
15:03:07,995 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
15:03:07,997 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
15:03:07,998 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/datashare/lib/datashare-nlp-opennlp-4.21.0-jar-with-dependencies.jar!/logback.xml]
15:03:08,002 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.
15:03:08,006 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/datashare/lib/datashare-app-4.21.0-jar-with-dependencies.jar!/logback.xml]
15:03:08,006 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/datashare/lib/datashare-nlp-mitie-4.21.0-jar-with-dependencies.jar!/logback.xml]
15:03:08,006 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/datashare/lib/datashare-nlp-corenlp-4.21.0-jar-with-dependencies.jar!/logback.xml]
15:03:08,006 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/datashare/lib/datashare-nlp-ixapipe-4.21.0-jar-with-dependencies.jar!/logback.xml]
15:03:08,006 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/datashare/lib/datashare-nlp-opennlp-4.21.0-jar-with-dependencies.jar!/logback.xml]
15:03:08,086 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@6979e8cb - URL [jar:file:/home/datashare/lib/datashare-nlp-opennlp-4.21.0-jar-with-dependencies.jar!/logback.xml] is not of type file
15:03:08,508 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
15:03:08,537 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
15:03:08,569 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
15:03:08,594 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
15:03:09,036 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.rolling.RollingFileAppender]
15:03:09,048 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [FILE]
15:03:09,467 |-INFO in c.q.l.core.rolling.TimeBasedRollingPolicy@1983747920 - No compression will be used
15:03:09,476 |-INFO in c.q.l.core.rolling.TimeBasedRollingPolicy@1983747920 - Will use the pattern ./logs/datashare_opennlp.%d{yyyy-MM-dd}.log for the active file
15:03:09,504 |-INFO in c.q.l.core.rolling.DefaultTimeBasedFileNamingAndTriggeringPolicy - The date pattern is 'yyyy-MM-dd' from file name pattern './logs/datashare_opennlp.%d{yyyy-MM-dd}.log'.
15:03:09,504 |-INFO in c.q.l.core.rolling.DefaultTimeBasedFileNamingAndTriggeringPolicy - Roll-over at midnight.
15:03:09,524 |-INFO in c.q.l.core.rolling.DefaultTimeBasedFileNamingAndTriggeringPolicy - Setting initial period to Sun Nov 17 15:01:58 GMT 2019
15:03:09,534 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
15:03:09,566 |-INFO in ch.qos.logback.core.rolling.RollingFileAppender[FILE] - Active log file name: ./logs/datashare.log
15:03:09,566 |-INFO in ch.qos.logback.core.rolling.RollingFileAppender[FILE] - File property is set to [./logs/datashare.log]
15:03:09,575 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO
15:03:09,576 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
15:03:09,578 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [FILE] to Logger[ROOT]
15:03:09,578 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
15:03:09,581 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@5c0369c4 - Registering current configuration as safe fallback point
2019-11-17T15:03:09.600813300Z
2019-11-17 15:03:10,570 [main] INFO Main - Running datashare web server
2019-11-17 15:03:10,588 [main] INFO Main - with properties: {defaultUserName=local, nlpParallelism=1, parserParallelism=1, stages=SCAN,INDEX,NLP, elasticsearchAddress=http://elasticsearch:9200, defaultProject=local-datashare, messageBusAddress=redis, dataSourceUrl=jdbc:sqlite:/home/datashare/dist/database.sqlite, queueName=extract:queue, parallelism=2, mode=LOCAL, cors=no-cors, ocr=true, redisAddress=redis://redis:6379, dataDir=/home/datashare/data, clusterName=datashare}
2019-11-17 15:03:10,626 [main] INFO PropertiesProvider - reading properties from jar:file:/home/datashare/lib/datashare-app-4.21.0-jar-with-dependencies.jar!/datashare.properties
2019-11-17 15:03:10,776 [main] INFO PropertiesProvider - adding properties from env vars {mountedDataDir=/Users/arthurnamiasdecrasto/Datashare}
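The repeated `java.net.ConnectException: Connection refused` in the log above means Datashare's Elasticsearch backend was not reachable when indexing was attempted. A quick way to check whether the Elasticsearch container is actually up is to query its health endpoint directly; this sketch assumes the default Docker setup mapping Elasticsearch to port 9200 on localhost (the in-container address in the log is `http://elasticsearch:9200`):

```shell
# Probe the Elasticsearch cluster health endpoint (default port assumed).
# Degrades gracefully: prints a hint instead of failing if nothing answers.
if curl -s --max-time 5 "http://localhost:9200/_cluster/health?pretty"; then
  echo "Elasticsearch is reachable"
else
  echo "Elasticsearch is not reachable -- check 'docker ps' for the container"
fi
```

A green or yellow `status` in the JSON response means the cluster is serving requests; no response at all matches the `Connection refused` seen here.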

@arthcras arthcras reopened this Nov 16, 2019
@Soliine
Collaborator

Soliine commented Nov 18, 2019

Hi Arthcras, thanks a lot for using Datashare and for your feedback. I can't reproduce the bug so I have more questions for you:

"For some reason new documents added to the datashare folder are not shown with search document. I believe the issue relates to that the indexing is not working when adding new documents"
-> indeed, every time you add documents in the Datashare folder, you need to 'analyze documents' again so the new documents are indexed in Datashare and you can search them: https://icij.gitbook.io/datashare/all/analyze-documents. Does this answer your question?

"Even after reinstalling datashare conform manual and removing the whole application, still an old document is visible in the data share application; while it is not in the datashare folder anymore; also the new documents added to the datashare folder are not feasible"
-> if you want to remove documents from Datashare, you need to delete them from Datashare https://icij.gitbook.io/datashare/faq/can-i-remove-a-document-from-datashare
If you don't do this, they will remain indexed even if you uninstall the application and reinstall it later. Does this work on your side?
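For context, Datashare keeps indexed documents in Elasticsearch independently of the files on disk, which is why documents survive a reinstall. As a last resort, and only if losing all indexed data, stars, and tags is acceptable, one could also wipe the default index directly; the index name `local-datashare` and the port are assumptions taken from the defaults visible in the startup logs above:

```shell
# DANGER: deletes every indexed document in the default Datashare project.
# Index name (local-datashare) and port taken from the startup-log defaults.
curl -s -X DELETE "http://localhost:9200/local-datashare" \
  || echo "Elasticsearch not reachable"
```

After this, re-running 'analyze documents' rebuilds the index from whatever is currently in the datashare folder.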

Please let us know if this answers your questions or if there is a bug we haven't seen. Thanks a lot.

@Soliine Soliine self-assigned this Nov 18, 2019
@arthcras
Author

arthcras commented Nov 18, 2019 via email

@arthcras
Author

arthcras commented Nov 18, 2019 via email

@Soliine
Collaborator

Soliine commented Nov 18, 2019

OK, I am going to test with the Saudi Aramco document. Can you tell us which OS you use and how much RAM you have?

@arthcras
Author

arthcras commented Nov 18, 2019 via email

@arthcras
Author

arthcras commented Nov 20, 2019 via email

@Soliine
Collaborator

Soliine commented Nov 21, 2019

Hello Arthur,
We confirm that this particular document is too big and complex to be processed on some machines right now. The PDF is 11.8 MB and has a complex layout (columns, etc.). Indexing it didn't work on my computer (macOS Mojave 10.14.6 with 16 GB of RAM) but did work on my colleague's machine (Ubuntu Linux, 10th-generation Core i7, 16 GB of RAM), in part because we use an Elasticsearch cluster there; on my machine, Elasticsearch went down. This is something we plan to improve, notably through the Datashare Light project #317.
In the meantime, you can try indexing smaller and/or less complex documents.
Thanks a lot for your interest.
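One knob worth checking when Elasticsearch goes down on large documents is its JVM heap. The fragment below is a hypothetical docker-compose override, not an official fix: the service name and the 2g value are illustrative and should be matched to your own compose file and available RAM.

```yaml
# Hypothetical docker-compose override: give Elasticsearch a larger JVM heap.
# "2g" is illustrative; a common rule of thumb is up to half the machine's RAM.
services:
  elasticsearch:
    environment:
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
```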

@Soliine Soliine closed this as completed Nov 21, 2019
@arthcras
Author

arthcras commented Nov 21, 2019 via email

@Soliine
Collaborator

Soliine commented Nov 21, 2019

Hello Arthur,
I cannot reproduce the bug on my side: I put 25 documents in my Datashare folder, analyzed them, and got 25 documents in Datashare. So we will need more information to understand why some of your documents were not indexed. If it is not confidential, could you please share your terminal logs or the documents themselves (you said 'copy attached', but I cannot see any screenshot at the moment)? Thanks a lot.

@arthcras
Author

arthcras commented Nov 21, 2019 via email

@arthcras
Author

arthcras commented Nov 21, 2019 via email

@arthcras
Author

arthcras commented Nov 21, 2019 via email

@Soliine Soliine reopened this Nov 25, 2019
@bamthomas
Collaborator

OK, I think there is a database initialization issue here; I'm trying to reproduce it.

thanks for reporting @arthcras

@bamthomas
Collaborator

@arthcras I can't reproduce it with the latest version 4.21.6 on Mac 10.14 Mojave, with either an existing database or a new one.

If you don't have valuable stars and tags, I would recommend removing the database file located at /Users/<your_user>/Library/Datashare_Models/datashare.sqlite and restarting Datashare; it will recreate the database file.
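The step above can be sketched as a small shell snippet. The path is the one given in the comment (with `$HOME` standing in for `/Users/<your_user>`); the backup copy is my own cautious addition, not part of the original advice:

```shell
# Remove the local Datashare database so it is recreated on restart.
# Path as given above (macOS); back up first in case stars/tags matter after all.
DB="$HOME/Library/Datashare_Models/datashare.sqlite"
if [ -f "$DB" ]; then
  cp "$DB" "$DB.bak"   # keep a backup copy
  rm "$DB"
  echo "Removed $DB (backup at $DB.bak)"
else
  echo "No database file at $DB"
fi
```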

@annelhote
Contributor

See comment #299 (comment)

@annelhote
Contributor

@arthcras Did you try removing the database file before the fresh install, as @bamthomas suggested?

@arthcras
Author

arthcras commented Nov 27, 2019 via email

@arthcras
Author

arthcras commented Nov 27, 2019 via email

@Soliine
Collaborator

Soliine commented Nov 28, 2019

A call is planned on November 29th, 2019.

@Soliine
Collaborator

Soliine commented Nov 29, 2019

After the call, it appears that the "forbidden" error from Elasticsearch is a self-protection mechanism that triggers when Elasticsearch does not have enough resources (here because arthcras' machine is a MacBook Air).
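For reference, when Elasticsearch runs low on disk or memory it can flip indices into a read-only, self-protecting state, which surfaces as 403 "forbidden" responses on writes. Assuming that is what happened here, the block can be cleared once resources are freed; the call below is the standard Elasticsearch settings API, with `localhost:9200` assumed:

```shell
# Clear the read-only self-protection block on all indices (standard ES API).
# Only do this after freeing disk space or memory, or the block will return.
curl -s -X PUT "http://localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}' \
  || echo "Elasticsearch not reachable"
```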

@Soliine Soliine closed this as completed Nov 29, 2019
@arthcras
Author

arthcras commented Nov 29, 2019 via email

@arthcras
Author

arthcras commented Nov 29, 2019 via email

@arthcras
Author

arthcras commented Nov 30, 2019 via email

@Soliine
Collaborator

Soliine commented Dec 2, 2019

That's good news, thanks Arthur! We take note of this.
