openwayback could index a .warc file but can not display it #392
Comments
Hi @guitarscape, that error usually means that the version you are looking for was found in the index, as you say, but the application is unable to access the content from the WARC file where it expects to find it. Are you using the default configuration with BDB indexing etc.?
Hi lauren, thanks for your response!
I presume you put your .warc file in either of the default directories where owb will look for those files in either Sorry if these questions seem obvious--I am accustomed to seeing the |
We used the default wayback.basedir (/tmp/openwayback) for testing purposes and put a WARC file in /tmp/openwayback/files1. We also made sure that all files and folders are owned by tomcat:tomcat before starting Tomcat. We also tried removing the file-db, index, and index-data folders in /tmp/openwayback so that they are recreated when Tomcat is restarted, and the same outcome was observed. Is there a need to modify any other XML files besides wayback.xml? Thanks for your help!
It sounds like you are minimally changing the configuration and have followed the steps given at these wiki pages, in which case you should not need to change other xml files: As far as troubleshooting goes: Have you looked at the Tomcat log to see if there are any related messages? You could try launching OpenWayback without any configuration changes from the default (using ROOT context) to see if it works. To see if the issue is with the WARC file versus the setup, have you tried a different WARC file to see if you get the same result? You could also try your WARC file with a containerized instance of OpenWayback: |
We stopped Tomcat, removed the existing OWB and ROOT contexts, renamed the OWB .war file to ROOT.war, and removed the file-db, index, and index-data folders in /tmp/openwayback, but kept the .warc file in the files1 folder. We could play the same .warc file in pywb.
With a |
Unfortunately, there is no other log entry related to the error. Our setup uses vanilla CentOS 7.6 + OpenJDK 1.8, Tomcat 7.0.
Hi, have you been able to find a solution to this issue? We seem to be experiencing the same problem after an OS upgrade and a new BDB index for OpenWayback 2.3.0. Although the different captures are listed, OpenWayback returns
It would be great if you could give us some pointers! |
Hi @schmika, were the same WARCs working in Openwayback 2.3.0 before the OS upgrade? What OS did you upgrade to/from? What versions of Java and Tomcat are you running--did this change from when things were working? Do you see your WARCs listed with the correct paths in ${wayback.basedir}/file-db/incoming/files1 (replace "files1" with the directory name of your WARCs if needed)? |
Thank you for your reply, @ldko! |
Hi @schmika, I don't know if anyone has determined a rough size limit for a cutoff on BDB index. I pretty much only use BDB for small testing purposes; otherwise I use CDX. I wouldn't think this would be causing your problem though, since the lookups for captures are working. One thing that comes to mind to check is whether SELinux is enabled and enforcing on the machine, which could cause a problem with serving files. You can check this with the
Dear @ldko, Thank you very much for your answer. I am a colleague of @schmika. I just run the I found out that OpenWayback apparently tries to re-index WARC files that have already been indexed. In the Tomcat messages I find lines like this:
It seems that OpenWayback tries to index ABC.warc at 0:07, 1:27 and 3:43 again! At 6:55 OpenWayback removes the file from the index queue and at 7:41 it adds the file again! Could this phenomenon be related to using a BDB index with our collection? |
Hi @ThomasOsterman , One other thing I am wondering, are there duplicate WARC names with different paths in the file-db/state/* files? Regarding your question about switching to CDX, I would recommend trying CDX files or CDX Server and a path-index.txt configuration. Though you could try it with a small subset of your WARCs to make sure it is working before indexing everything. I also recommend upgrading to OpenWayback 2.4.0. |
Dear @ldko, Thank you very much for your explanations. I just checked some files and their paths in the file-db/state/* files. There are no duplicate WARC names with different paths. However, in file-db/state there are two files named files1 and files2. Some of the WARC files are listed in both files1 and files2, some are only listed in one of the files.
@ThomasOsterman for the duplicate WARC names that are listed in both files1 and files2, are they WARC files that happen to have the same name but different content (could potentially cause a problem as I believe the names are expected to be unique since the index of what content is inside the WARC specifies a WARC name but not path and wouldn't know which to use), are they duplicates of the same WARC that exist in both listed paths (shouldn't cause the problem), or does the duplicate WARC name only exist in one place (might cause a problem)? |
@ldko They are unique WARC files that exist only in one place but are listed both in files1 and files2. For example, both of the files contain the following line: |
@ldko I just noticed a configuration error in our BDBCollection.xml: We have two Wayback archive dirs, ${wayback.archivedir.1} and ${wayback.archivedir.2}. Probably due to a careless mistake, the DirectoryResourceFileSource beans for both directories were named "files1". That might explain the phenomenon with the duplicate entries.
@ThomasOsterman That's good to know :). I just tried setting two |
Dear @ldko, The configuration error in the BDBCollection.xml was actually the reason for our problems and the duplicate entries. After renaming the DirectoryResourceFileSource bean for ${wayback.archivedir.2} to "files2" and deleting the old index, the index was built correctly. Thank you very much for your help! |
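The mistake described above (two DirectoryResourceFileSource beans sharing the name "files1") can be caught with a quick check before re-indexing. This is only a minimal sketch: it assumes the beans are declared Spring-style with a `name` property, as in the hypothetical excerpt embedded below. The excerpt, the `prefix` property, and the function name `duplicate_source_names` are illustrative and were not copied from the configuration discussed in this thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical excerpt modelled on a Spring bean file such as BDBCollection.xml.
# Both beans are (wrongly) named "files1", mirroring the error described above.
SAMPLE = """
<beans xmlns="http://www.springframework.org/schema/beans">
  <bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
    <property name="sourceList">
      <list>
        <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
          <property name="name" value="files1" />
          <property name="prefix" value="${wayback.archivedir.1}" />
        </bean>
        <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
          <property name="name" value="files1" />
          <property name="prefix" value="${wayback.archivedir.2}" />
        </bean>
      </list>
    </property>
  </bean>
</beans>
"""


def local(tag):
    """Strip an XML namespace such as '{http://...}bean' down to 'bean'."""
    return tag.rsplit("}", 1)[-1]


def duplicate_source_names(xml_text):
    """Return the sorted 'name' values used by more than one file-source bean."""
    root = ET.fromstring(xml_text.strip())
    names = []
    for el in root.iter():
        if local(el.tag) != "bean":
            continue
        if not el.get("class", "").endswith("DirectoryResourceFileSource"):
            continue
        # Collect the value of the direct child <property name="name" .../>.
        for prop in el:
            if local(prop.tag) == "property" and prop.get("name") == "name":
                value = prop.get("value")
                if value is not None:
                    names.append(value)
    return sorted(n for n, count in Counter(names).items() if count > 1)


if __name__ == "__main__":
    print(duplicate_source_names(SAMPLE))
```

To check a real file, read its text and pass it to `duplicate_source_names`. An empty list means every file source has a distinct name.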
Hi! I am using OWB 2.4.1-SNAPSHOT to index and visualize .warc.gz files crawled with Heritrix, and I'm getting the same error (The Resource you have requested is temporarily unavailable. Please try again later), but I'm using CDX instead of BDB. I am trying with just 4 captures (1 warc.gz each) to find the error, but I still can't solve it. P.S.: I checked the content of both 'index.cdx' and 'path-index.txt' and they look OK. When I try to visualize, OWB returns the error. Repeating the process returns different results: one of these 4 can be visualized regularly and another one only sometimes. I tried indexing and visualizing each of them separately and that works just fine, so I thought I was merging and sorting wrong. Then I tried creating a single .cdx for each warc.gz and configuring CDXCollection.xml for multiple CDX files to visualize all 4 together, but I still get the error! Besides, I checked the permissions and ownership of the files and everything looks fine. catalina.out: Thank you. I hope you can help me.
Hi @pqhais, the error message "Resource you have requested is temporarily unavailable." is usually not an issue with the CDX file. It seems like you have already checked most of the things that should be checked in troubleshooting. Since you say you can successfully index and replay each WARC individually, I am wondering whether, in addition to sorting your index.cdx file, you also sorted your path-index.txt.
Hi @ldko, thanks for your quick response! I did sort the path-index.txt but still got the error. Btw, which issues usually cause this error? Thank you.
The error occurs when the URI is found in the CDX index, but when trying to access the WARC file given for the resource via lookup of its location in the path-index.txt, something goes wrong such as:
Could you share the content from your path-index.txt? |
It seems to be working now. I forgot to save the sorted 'path-index.txt' so I was just displaying the sorted output in the terminal but using the unsorted version in Wayback. I was going around in circles, thank you so much for your advice! |
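For anyone hitting the same trap, the two checks discussed above (is the path-index.txt actually in use sorted, and do its paths resolve?) can be sketched as a small script. It rests on assumptions not confirmed in this thread: that each line has the form `<WARC name><TAB><path>`, and that the expected order is plain byte order (what `LC_ALL=C sort` produces). The function name `check_path_index` and the sample entries are hypothetical.

```python
import os


def check_path_index(lines, exists=os.path.exists):
    """Return a list of human-readable problems found in path-index lines."""
    problems = []
    entries = [ln.rstrip("\n") for ln in lines if ln.strip()]

    # Compare as bytes, which is the order `LC_ALL=C sort` produces (assumed).
    encoded = [e.encode("utf-8") for e in entries]
    if encoded != sorted(encoded):
        problems.append("file is not sorted in byte order")

    for entry in entries:
        parts = entry.split("\t")
        if len(parts) < 2:
            problems.append(f"malformed line (expected name<TAB>path): {entry!r}")
            continue
        name, path = parts[0], parts[1]
        # Only local paths are checked; URL-style locations are skipped.
        if "://" not in path and not exists(path):
            problems.append(f"{name}: path not found: {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical entries, deliberately out of order.
    sample = [
        "b.warc.gz\t/data/warcs/b.warc.gz\n",
        "a.warc.gz\t/data/warcs/a.warc.gz\n",
    ]
    for problem in check_path_index(sample):
        print(problem)
```

Running it against the file that Wayback is configured to read, rather than against the output of a sort command in the terminal, is the point: `check_path_index(open("path-index.txt", encoding="utf-8"))` would report the unsaved-sort situation described above.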
Came here through Google, and since this seems to be the only issue addressing this, I'll add my solution: I was using the docker setup, and even though it indexed the In the end, it worked when using the exact same setup as in the wiki. I tried mapping port 8080 inside docker to 8082 outside while also setting the env variables accordingly, and the website loaded; however, it never found the example WARC when asked to display it. Once I freed port 8080 and ran with the original settings, it worked fine though. Maybe that's due to some kind of internal routing breaking when the outside port is different from the inside one? But running with
Hi @bjrne - I think this might be down to the awkward way that parameters like i.e. this should work:
Note the |
We are testing OpenWayback using a .warc file generated by Heritrix.
We run OpenWayback on CentOS 7 + Tomcat 7. OWB seems capable of indexing the URLs in the .warc file. However, when we click the version (date) shown in the search results, OWB reports:
Resource Not Available
The Resource you have requested is temporarily unavailable. Please try again later.
Any suggestions and help would be appreciated.