title: Data Delivery System
subtitle: Troubleshooting
toc: true
geometry: left=1cm, right=1cm, top=2cm, bottom=2cm
fontsize: 12pt
colorlinks: true
header-includes:
  \usepackage{lato}
  \renewcommand{\familydefault}{\sfdefault}
  \usepackage{fancyhdr}
  \pagestyle{fancy}
  \rhead{30-Jul-2024}

Users: Unit Admin/Personnel/Researchers

A. Invitation link invalid

Did you receive the invitation email more than 7 days ago?

  • Yes: An invitation is only valid for 7 days. Contact the SciLifeLab unit that sent the original invitation and ask them to invite you again.
  • No: Something else has gone wrong. Contact the SciLifeLab unit that sent the original invitation and ask them to:
    1. Remove the invitation with dds user delete --is-invite
    2. Invite you again
    3. Inform the Data Centre of the issue.
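
As a rough illustration of the 7-day rule above, the following Python sketch checks whether an invitation has expired. It assumes the validity period is counted from the moment the email was sent. The function name and the dates are made up for the example and are not part of the DDS.

```python
from datetime import datetime, timedelta, timezone

# Validity period stated in this section.
INVITE_VALIDITY = timedelta(days=7)


def invite_expired(received_at: datetime, now: datetime) -> bool:
    """Return True if more than 7 days have passed since the invitation was sent."""
    return now - received_at > INVITE_VALIDITY


# Hypothetical dates, for illustration only.
received = datetime(2024, 7, 1, 9, 0, tzinfo=timezone.utc)
print(invite_expired(received, datetime(2024, 7, 9, 9, 0, tzinfo=timezone.utc)))  # True
print(invite_expired(received, datetime(2024, 7, 5, 9, 0, tzinfo=timezone.utc)))  # False
```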

B. Too many authentication requests in one hour

If this message is displayed in the web interface or CLI, you have attempted to log in, or have logged in and out, more than 10 times within an hour. Please wait an hour and try again.

C. Web: Invalid username or password

1. Are you sure you are using the correct username?

  • Yes: Move on to step 2
  • No: It’s currently not possible to reset/change a username. How to find your username depends on your DDS account role:
    • Unit Admin or Unit Personnel:
      • Ask a colleague to list the users within your unit (dds user ls) and check the table for the row with your name.
      • Verify that the username matches the one you are using.
    • Researcher: Contact the SciLifeLab unit that invited you to the DDS.
      • Question for the Unit Admins / Personnel: Is the account connected to a project within the DDS?
        • Yes: A Unit Admin or Unit Personnel can list the researchers in a specific project: dds ls --project --users. They can then check the table for the user and find the correct username.
        • No: The Unit Admin or Personnel should contact the Data Centre.

2. Are you sure you are using the correct password?

  • Yes: Notify the Data Centre.

  • No: Try requesting a password reset in the web interface.

  • IMPORTANT: If you had access to any projects before the password reset, notify a Unit Admin or Unit Personnel immediately. They need to run the following command: dds project access fix

The user running this command needs to have access to the projects in question. This is shown in the “Access” column when running dds ls to list the projects.

D. CLI: Failed to authenticate user: Missing or incorrect credentials

Follow the same steps as in C above.

E. Not receiving emails

KI seems to have a spam filter which makes receiving emails very slow in some cases. The email should show up at some point, but it may take some time.

If you are certain that your email address does not go through the KI spam filter:

  • Verify that you are checking the correct email address
  • Check the junk folder

If you cannot find the email, contact the Data Centre.

F. CLI Documentation not accessible

You can download the CLI documentation as a PDF. Click here. The download should start automatically. Notify the Data Centre that the documentation page is down so that we can look into it.

G. Long error message (traceback)

If you get a long error message after running the CLI:

  • Unit Personnel / Admins: Report this to the Data Centre.
  • Researchers: Report this to the SciLifeLab unit delivering the data.

Include the full error message in the email. An error related to the DDS should always produce an understandable message, so we need to be told about tracebacks in order to replace them with one.

H. Token expires during delivery

Unfortunately, we cannot do anything about this at this time. If the token expires, re-authenticate and then start the upload or download again. To reduce the risk, authenticate right before every large delivery. We will look into this issue if and when it arises.

I. Windows OS: Warning message "Storing the login information locally - please ensure no one else can access the file at …"

Since the token file permissions check does not work on Windows, the DDS warns you to be extra careful about keeping this file secure. The dds client should be fully functional despite this warning.

J. Permission denied (authentication token) when using the DDS on a func account or similar

The default setting is that the authenticated token generated by dds auth login has permission 600. This means read and write permissions for the specific unix user that is running the command.

Running the dds auth login command with the option --allow-group will allow users from the same unix group to use the resulting token file. However, keep the following in mind:

  • The permissions of tokens cannot be changed after the tokens are established. If you began an authenticated session without the use of the --allow-group option, but want to use it in a new session, use dds auth logout to end the current session. Then use the --allow-group option and start a new session. This also applies to the reverse.
  • The --allow-group option is not recommended for users with the Researcher role. Use it with care.
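
To see which permissions a file actually has, a small Python sketch like the one below may help. It is only an illustration: it runs against a throwaway temporary file on a Unix system, not against a real token file, and the helper name token_mode is hypothetical.

```python
import os
import stat
import tempfile


def token_mode(path):
    """Return the permission bits as an octal string, and whether the group may read."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return oct(mode), bool(mode & stat.S_IRGRP)


# Demonstration on a throwaway file (Unix only), not on a real token file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

os.chmod(demo_path, 0o600)    # the default described above: owner read/write only
print(token_mode(demo_path))  # ('0o600', False)

os.chmod(demo_path, 0o640)    # owner read/write plus group read
print(token_mode(demo_path))  # ('0o640', True)

os.remove(demo_path)
```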

K. QuotaExceeded

The error could look something like this:

raise error_class(parsed_response, operation_name)

         botocore.exceptions.ClientError: An error occurred (QuotaExceeded) when calling the UploadPart operation: Unknown

This indicates that the cloud storage location has a specific limit for your unit and that it has been exceeded. Contact the Data Centre, inform them of the error (remember to include the full error message) and tell them the size of the data you are attempting to upload.

L. Unrecoverable key error

The most likely reason for a key related error is that you do not have access to the project you are trying to access. If you get an error message containing something similar to “Unrecoverable key error”, please go through the following steps:

1. Run dds ls

  • Find the row of the project you were trying to access or use
  • Check the “Access” column for that row - does it contain a red cross or a green check mark?
    • Green check: You have access to the project. This is not the issue.
      • Does the current project status allow for the action you are trying to perform? Please look at the technical overview and documentation. These are linked at the start of this document.
        • Yes: Go to step 2.
        • No: This is most likely the issue. Make sure the project status allows for the current action. Please report the error message, what you were trying to do, the exact command, what steps you have performed and any other information you may have, and we will look into this.
    • Red cross: You do not have access to the project. Important note: If no one else has access to the project, please note that the Data Centre cannot help you. As Super Admins in the DDS, we do not have access to any of your data and we cannot restore your access. Please read the technical overview, the link is displayed at delivery.scilifelab.se and also when running the CLI commands.
      • Researcher: You need to contact all SciLifeLab units responsible for the DDS projects you are involved in. They need to run dds project access fix for you. Try again after that is done.
      • Unit Admin / Personnel: You need to contact a colleague with access to the projects you are trying to use. They need to run dds project access fix for you. Try again after that is done.

2. If you have checked the project status and the project access, your access has been fixed and the error remains, or you have not been able to identify the issue, please contact the Data Centre. Include the full error message and the exact command you are running.

M. ERROR: Internal Server Error

Contact the Data Centre and provide the information listed here.

N. TooManyBuckets

Safespring has a maximum number of buckets per unit. This number is set quite high, so this error should not happen. If it does, contact the Data Centre and we will fix it. You can also check whether any of your active projects can be archived. Be careful with this: do not delete or archive (or abort) a project unless you are 100% sure that it is ready for it. This is not reversible.

O. Errors occurred during upload

When running dds data put, if there are issues, the following is a possible output:

Errors occurred during upload.

If you wish to retry the upload, re-run the `dds data put` command again, specifying the same options as you did now. To also overwrite the files that were uploaded, also add the `--overwrite` flag at the end of the command.

See /code/DataDelivery_2022-06-30_14-17-18/logs/dds_failed_delivery.json for more information.

Inform the Data Centre of the issue.
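
Before contacting the Data Centre, it can be useful to glance at the log file mentioned in the message. The Python sketch below only lists the top-level entries of a JSON file. The structure of dds_failed_delivery.json is not described in this document, so the sample content written in the demonstration is entirely hypothetical, as is the function name.

```python
import json
import tempfile
from pathlib import Path


def summarize_failed_delivery(path):
    """Return the top-level entries of a JSON log file as a list of strings."""
    with open(path, encoding="utf-8") as handle:
        content = json.load(handle)
    if isinstance(content, dict):
        return sorted(content)  # the keys of the JSON object
    if isinstance(content, list):
        return [str(item) for item in content]
    return [str(content)]


# Hypothetical sample content; the real file may look different.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "dds_failed_delivery.json"
    sample.write_text(
        json.dumps({"data/file_b.txt": {}, "data/file_a.txt": {}}),
        encoding="utf-8",
    )
    print(summarize_failed_delivery(sample))  # ['data/file_a.txt', 'data/file_b.txt']
```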

P. Not possible to change project status (Deadlock)

This has been reported before: a user got an error when changing the status of a project, but the status had actually been changed. Before contacting the Data Centre, check the project status and see if it has changed despite the error.

Q. Lost access to 2FA Authenticator App

If you have configured the Two Factor Authentication of your DDS account to use an authenticator app, but you have lost access to it (e.g. due to a broken or lost device), the Data Centre can deactivate the configuration in order to let you authenticate using the default email option. After this you can choose to activate the authenticator app method again.

  • Researcher role: Please contact a SciLifeLab unit responsible for one of your ongoing delivery projects and tell them you need a reset of the 2FA method.
  • Unit Admin / Personnel: Please contact the Data Centre and ask for a reset of the 2FA method. Inform them of the username and email address of the affected account.

R. Windows: DDS freezes after message

The output from the DDS could look something like this:

PS Z:\> dds auth login
dds : \ufe35
At line:1 char:1
+ dds auth login
+ ~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: ( \ufe35 :String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
\ufe35 ( ) \ufe35
( ) ) ( ( ) SciLifeLab Data Delivery System
\ufe36 ( ) ) ( https://delivery.scilifelab.se/
\ufe36 ( ) Version 2.1.1
\ufe36

INFO Attempting to create the session token

Unfortunately, this seems to be related to the Windows command prompt and the encoding (or logging), not to the DDS. We have attempted to reproduce the issue in both the Command Prompt and PowerShell without success. To attempt to solve this, go through the following steps:

How did you install the CLI?

  • From PyPI, with the instructions found here.
    • Please download the executable, following the instructions, and try running the same command again (but with the executable).
  • I downloaded the executable as instructed here.
    • Inform the Data Centre.

Depending on your level of experience with Stack Overflow and the command line, this comment could potentially help you to solve the issue.

S. The download is interrupted

1. Are you attempting to download the data within Sweden?

  • Yes, I’m in Sweden: Move on to 2.
  • No, I’m not in Sweden: The DDS stores data in a cloud service located within Sweden. Downloading data from abroad is most likely slower, but it should work. Verify that your internet connection is stable. Note that downloading data via a hotel WiFi connection a) is not recommended, and b) will most likely not be permitted due to hotel WiFi limits.

2. Have you checked that you have enough storage space available in the location you’re attempting to download the data to?

  • Yes, I have enough space: Move on to 3.
  • No: Please make sure you have enough storage space available.

3. Contact the SciLifeLab unit delivering your data.
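
The storage check in step 2 can be done with any tool you prefer. As one possible sketch in Python, the standard library function shutil.disk_usage reports the free space at a location. The helper name has_room and the 50 GB figure are made up for the example.

```python
import shutil


def has_room(destination: str, required_bytes: int) -> bool:
    """Return True if the file system holding `destination` has enough free space."""
    return shutil.disk_usage(destination).free >= required_bytes


# Example: is there room for a hypothetical 50 GB download in the current directory?
print(has_room(".", 50 * 10**9))
```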

T. “No data specified”

This message is displayed when you have not specified what data to download. In order to download data from your project, you need to use one of the following options:

--source/-s: Downloads a specific file or directory. You can use this option multiple times to specify separate files or directories.

--get-all/-a: Downloads the entire content of the project.

Data Centre

A. Invitation link invalid

Ask the user to follow the instructions in the Invitation link invalid section in the User part of this document. Unit Admins/Personnel can also run the following commands:

  • Delete the invite if it exists: dds user delete --is-invite. If they are experiencing issues with this, we as Super Admins can also do this.
  • Invite them again. Super Admins should not be doing this.

B. Too many authentication requests in one hour

If this message is displayed in the web interface or CLI, the user has attempted to log in or logged in and out more than 10 times within an hour. Ask the user to wait for an hour and try again later.
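
To illustrate why waiting an hour resolves the problem, here is a minimal sliding-window sketch in Python. It is not the DDS implementation; the class name, the choice not to count blocked attempts, and the use of plain seconds as timestamps are all assumptions made for the example.

```python
from collections import deque

WINDOW_SECONDS = 3600  # one hour
MAX_ATTEMPTS = 10      # attempts allowed within the window


class AttemptWindow:
    """Track attempt timestamps and block attempts beyond the limit."""

    def __init__(self):
        self._times = deque()

    def allow(self, now: float) -> bool:
        # Forget attempts that are at least one hour old.
        while self._times and now - self._times[0] >= WINDOW_SECONDS:
            self._times.popleft()
        if len(self._times) >= MAX_ATTEMPTS:
            return False  # blocked; this attempt is not recorded
        self._times.append(now)
        return True


window = AttemptWindow()
results = [window.allow(t) for t in range(11)]  # 11 attempts, one second apart
print(results.count(True), results[-1])  # 10 False
print(window.allow(3600))                # True
```

In this sketch the eleventh attempt is blocked, and a new attempt succeeds once the oldest recorded attempt is an hour old.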

C. Web: Invalid username or password

If a user cannot log in to the web interface, ask them to follow the steps in C above. If they have already done this, check that they have provided all the information listed here.

One possibility is that the web interface is not letting them in because they have attempted to log in more than 10 times in an hour. If that’s the case, tell them to try again later. If it’s not the case, there’s not much you can do. Offer to delete their account and invite them again.

It’s better if a Unit Admin or Personnel handles the new invite. If you do it, remember that you cannot invite Researchers to specific projects, only to the DDS in general, and that you need to specify the unit public ID to invite a Unit Personnel or Admin. Note that a re-invited user will still not have access to the unit’s active projects; they need to ask someone within their unit who has access to those projects to renew their project access.

D. CLI: Failed to authenticate user: Missing or incorrect credentials

Follow the same steps as in C above.

E. Not receiving emails

Check the information in the User section earlier in the document regarding this specific topic. If a user contacts us, the email address is correct, and it is not connected to the KI spam filter, create a card on the Jira board.

F. CLI Documentation not accessible

  1. Notify the users that they can download the CLI documentation as a PDF from here
  2. Investigate the issue, including whether it could be infrastructure-related.

G. Long error message (traceback)

  1. Make sure they have upgraded the CLI to the latest version. If they have, or the upgrade does not help, move on to 2.
  2. Make sure they have provided the information listed here.
  3. Look through the traceback. There could be an uncaught error or similar in the CLI; most often this is not caused by the backend, and most likely it is not something that can wait. If that’s the case, create a card in Jira.

H. Token expires during delivery

We cannot do anything about this. It’s on the todo list. No need to add a card about this, but add a comment to the existing ticket (DDS-1173).

I. Windows OS: Warning message "Storing the login information locally - please ensure no one else can access the file at …"

Since the token file permissions check does not work on Windows OS, we are warning the users to be extra cautious in keeping this file secure. The dds client should be fully functional despite this warning.

J. Permission denied (authentication token) when using the DDS on a func account or similar

Basic information

  • The default setting is that the authenticated token generated by dds auth login has permission 600. This means read and write permissions for the specific unix user that is running the command.
  • Running the dds auth login command with the option --allow-group will allow users from the same unix group to use the resulting token file.
  • The permissions of tokens cannot be changed after the tokens are established. If you began an authenticated session without the use of the --allow-group option, but want to use it in a new session, use dds auth logout to end the current session. Then use the --allow-group option and start a new session. This also applies to the reverse.
  • The --allow-group option is not recommended for users with the Researcher role. Use it with care.
  • As long as the dds auth login command works without the additional permissions option, that is what the users will have to work with for now.

Background / additional information:

Information from NGI and explanation of our (possibly temporary) solution: On Miarka, the deliveries are done through a web service which is run as a special user (func account) and the access for that user is heavily restricted. Usually the bioinformaticians that perform the data deliveries do not have access to that account and instead the delivery is done through API calls to the service. When starting the integration testing, their idea for the delivery with the DDS was that a user would generate an authentication token with their account on Miarka, place it in a folder that is accessible to both them and the func account, and then tell the DDS, run by the func account, to use that authentication token.

When performing the authentication, the DDS CLI originally collected the token from the API and saved that in a file with permission 600: -rw------- (only read and write for that specific user). Because of the issue mentioned above, there was a suggestion to change this to allow permission 660: -rw-rw---- (read and write for specific user and group). In the end, we have now added the option --allow-group as a flag to the dds auth login command so that they can (should be able to but as of writing this it has not yet been tested) allow the func account to use the authenticated token for deliveries with the DDS. However, the permissions are now set to 640: -rw-r----- (read and write permissions for the user that is using the command and only read permission for a group). This is a conscious choice since the func account should only need to read the token file and should not be able to edit it, etc.
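
For reference, the three permission values discussed above can be translated to the symbolic notation with Python's standard stat module. This is only a reading aid for the octal values and has nothing to do with how the DDS CLI sets them; the helper name symbolic is made up for the example.

```python
import stat


def symbolic(mode: int) -> str:
    """Render the permission bits of a regular file in `ls -l` notation."""
    return stat.filemode(stat.S_IFREG | mode)


for mode in (0o600, 0o660, 0o640):
    print(oct(mode)[2:], symbolic(mode))
# 600 -rw-------
# 660 -rw-rw----
# 640 -rw-r-----
```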

We have discussed the potential problems with this. For the time being we needed the DDS to be usable, and therefore this is implemented. We will look into it again when we can, and it is possible that it will change; we do not yet know if or how.

K. QuotaExceeded

The error could end with something like this:

raise error_class(parsed_response, operation_name)

         botocore.exceptions.ClientError: An error occurred (QuotaExceeded) when calling the UploadPart operation: Unknown

We should be catching these errors and displaying something more user-friendly, but if the error is not caught, do the following:

  • Check with the user (if they have not already included the information in their email) which unit they have an account within and the size of the data that they are attempting to upload.
  • Contact Safespring (the email address and other important information will be posted in the dds-development channel on Slack) and ask them to increase the quota for the specific project. It should currently be set to 100 TB for all NGI units. It doesn’t matter which value you choose when you ask Safespring to increase it; the quota is just a precaution so that data is not uploaded without someone’s knowledge. We cannot change the quota ourselves; it is set on Safespring's side.
  • When the quota has been changed, inform the users that they can try the upload again.

L. Unrecoverable key error

The most likely reason for a key related error is that the user does not have access to the project they are trying to access, and that there is an uncaught error somewhere. Add it as a card to the Jira board.

  1. Ask the user to perform the steps in the corresponding user section. If they have done this, first check whether the user is active: dds user ls. This should display all users, and the right-most column (“Active”) shows whether they are active (green check mark) or not (red cross). If they are inactive, then that’s the issue. An inactive account should not be able to cause this type of error, so report it as a bug on the Jira board. Inform the user that someone in their unit should activate them.
  2. Ask them to run the commands as specified in the Unrecoverable key error section earlier in the document and for them to send you the output of those commands. Does the current project status allow for the command they are trying to run? Does it say that they have access to the project?
  3. Ask them to create a new project and to do the same command with that one. If it works, the issue is with that specific project and this should be reported as a bug in the Jira board.

If users get error messages when running the dds project access fix command, the most likely reason is that the user running the command does not have access to the projects. If no one else has access to the project either, we cannot help. This information is clearly explained in the technical overview: As Super Admins in the DDS, we do not have access to any of the data and we cannot restore any access. They will need to create a completely new project and possibly perform the delivery a second time.

M. ERROR: Internal Server Error

The most likely reason for an Internal Server Error is an uncaught exception in the backend, often something database related. If someone contacts you regarding an Internal Server Error, begin with making sure they have provided the information mentioned here. When they have done this, look through the traceback.

If you cannot find the issue in the traceback, get the logs from the backend. The logs should not be posted on GitHub, in case they show anything that shouldn’t be public information (this shouldn’t happen, but it’s a precaution).

N. TooManyBuckets

Safespring has a maximum number of buckets per unit. This number is set quite high, so this error should not happen. If it does happen, it should only be possible when a user is creating a project. It could also be the cause of a failed project creation, so if a project cannot be created and a questionable error message is displayed, check the logs to see whether the bucket could be created. To increase the number of allowed buckets, contact Safespring and ask them to increase the limit for the affected Safespring project, which is listed and pinned in the dds-development channel on Slack. After this, the users can attempt to create a project again.

O. Errors occurred during upload

The message from the CLI could be something like this:

Errors occurred during upload.
If you wish to retry the upload, re-run the `dds data put` command again, specifying the same options as you did now. To also overwrite the files that were uploaded, also add the `--overwrite` flag at the end of the command.


See /code/DataDelivery_2022-06-30_14-17-18/logs/dds_failed_delivery.json for more information.

Note: Normally we would need to do the following, but as noted in the corresponding section in the User part above, users can currently just perform the upload again. This has not happened yet apart from during the testing phase, but there’s a first time for everything.

If a user contacts us with an upload issue that has generated the aforementioned file, do the following:

  1. Ask them to send the exact CLI command that they ran
  2. Get the JSON file.
  3. Download that file and run the following command in the cluster, in the DDS backend: flask update-uploaded-file. As options, specify the project that the user used in the command they emailed to us, and the path to the file. This command goes through the file and checks whether each file has been uploaded but not added to the database. If so, the backend adds the file information to the database, which should make the project contents accessible for download.

Make sure to do this as soon as possible since it’s possible that the upload has in fact been successful but the information has not been added to the database. This means that the data is located in the cloud and will take up space there, but you cannot access it since it is not recorded in the database.

P. Not possible to change project status (Deadlock)

This has been reported before: a user had issues changing the status of a project, and it turned out that the status had actually been changed but that a deadlock was happening somewhere in the backend. Ask the user to check the project status if they haven’t already; it may already be the status they were trying to change to. Otherwise, the only thing we can do at this point is ask them to wait a bit and try again, and tell them that we know this may happen and that we are looking into it.

Q. Lost access to 2FA Authenticator App

If you have configured the Two Factor Authentication of your DDS account to use an authenticator app, but you have lost access to it (e.g. due to a broken or lost device), another Super Admin can deactivate the configuration in order to let you authenticate using the default email option. After this you can choose to activate the authenticator app method again. Super Admins can also do this for other account roles.

Run the following command:

dds auth twofactor deactivate

R. Windows: DDS freezes after message

Ask them to follow the corresponding User section of this document. If they have done that already and it does not help, there’s nothing we can do. There’s a link in the section to a command they could run, but since we haven’t been able to reproduce the issue, we cannot guarantee that it would work. Usually it works if they use the executable instead of the CLI installed via PyPI.

S. The download is interrupted

None of the cases reported so far have been connected to the DDS specifically. The cause is most likely a network issue or a lack of storage space, as described in the corresponding section in the User part. Make sure that they provide the information specified and create an issue in Jira. It will have to wait unfortunately.

T. “No data specified”

The user has not specified what they wish to download. Make sure that the user has specified either --source/-s or --get-all/-a as described in the user section.