Proposal: A possible way of incrementally updating tags file #2697

AmaiKinono · 2020-11-14T23:58:23Z

We had some discussion before, see #2475 (comment). @masatake suggested the following scheme:

Drop tag entries for updated files (reported by git?) and write the remaining tag entries to a temporary tag file (tmp tag file),
Appende tag entries generated by running parsers for the updated files, and writing them to a new tag file,
Merge and Sort the two tags files.

2 and 3 can be done by the -a option.

I want to suggest a way to do 1 in this proposal. My thought is we use a group of pseudo tags to record the last modified time for each source file. It may look like this:

!_SOURCE_FILE_TIMESTAMP!/project/path/something.c	2020-11-11 23:33:28.908535595 +0800	/last modified time/

Then we offer an option like --increment to do incremental update. With this option, we compare this pseudo tag with the actual timestamp, and find files that are and are not updated (new files are counted as updated ones). This option should do nothing when generating a tags file for the first time.

For writing the temporary tags file, maybe we could use readtags like this:

$ readtags -t tags -Q '(not (#/(updated-file1)\|(updated-file2)\|.../ $input))' -Ene > temp_tags

Another problem is to ensure that the command line options are the same. A possible way we've discussed is to also put them in the pseudo tags, and @masatake said this is related to "multipass parsing". I'm not familiar with that idea, but for the scope of incremental tags file update, I think if people are doing this, then the command line is probably generated by a script/makefile/client tool/ctags configuration file, so it wouldn't become a problem if we ask for same options between multiple runs, rather than letting ctags ensure it.

The text was updated successfully, but these errors were encountered:

masatake · 2020-11-16T08:48:05Z

This includes multiple sub issues.
I made a Wiki page for this topic.
https://github.com/universal-ctags/ctags/wiki/Designing:-updating-tags-file-incrementally
@AmaiKinono, I assume you can edit the page.

I think about the initial goal.
"Adding -C action to readtags" is not so hard task but we need anyway.

$ readtags -t a.tags -C
--fields={...} --kinds-C=... ...

-C option reproduces the command line used for generating an existing tags file.
The option refers pseudo tags.

For achieving the goal, we have to do:
A. adding more pseudo tags
B. extending readtags command

masatake · 2020-11-16T09:07:18Z

I found I implemented TAG_EXTRA_DESCRIPTION and TAG_FIELD_DESCRIPTION ago.

$ u-ctags --list-pseudo-tags
#NAME                     ENABLED DESCRIPTION
JSON_OUTPUT_VERSION       off     the version of json output stream format
TAG_EXTRA_DESCRIPTION     off     the names and descriptions of enabled extras
TAG_FIELD_DESCRIPTION     off     the names and descriptions of enabled fields
...

I cannot remember when I implemented them.

Anyway, I should turn on the ptags by default.

AmaiKinono · 2020-11-16T09:42:04Z

I saw this in the wiki

Where do timestamps (and content hash values) go

Using hash is obviously another way to do the detection. Do we want one of them or both?

masatake · 2020-11-16T18:39:33Z

Attaching timestamps to 'F' kind tags is worked:

$ ./ctags --extras=+f -o - main/main.c | grep timestamp
main.c	main/main.c	1;"	F	language:C	timestamp:Sat Oct 24 01:31:30 2020
$ ./ctags --output-format=json --extras=+f -o - main/main.c | grep timestamp
{"_type": "tag", "name": "main.c", "path": "main/main.c", "pattern": false, "language": "C", "kind": "file", "timestamp": "Sat Oct 24 01:31:30 2020"}

Should I change the print format?

AmaiKinono · 2020-11-16T19:37:10Z

I think it's good as long as ctags itself could compare between timestamps. But if we assume that client tools may want to use it in some way, I suggest adding the time zone so we could convert it with the Unix time. The encode-time function in Elisp also requires the time zone.

Btw, is there any reason to create these tags for files, rather than using pseudo tags?

masatake · 2020-11-17T00:57:53Z

I found the output of stat command includes such a timestamp.

$ stat /
  File: /
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 96          Links: 22
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:root_t:s0
Access: 2020-11-14 04:55:32.965210497 +0900
Modify: 2020-10-31 13:45:40.465843725 +0900

This format may be good enough.
For parsing it, we can use strptime. If a platform doesn't have the function, we can use https://github.com/openssh/openssh-portable/blob/master/openbsd-compat/strptime.c .

Btw, is there any reason to create these tags for files, rather than using pseudo tags?

As I wrote in the Wikipage, there are three ways to store timestamps to a tags file.

In the prototyping, I chose F/file because F/file kind is already there.
I will enumerate benefits using F/file.

We can use binary search as libreadtags provides.
If one knows the name of a file whose timestamp the one wants to know, binary-search is helpful especially when
one runs ctags on a large source tree including many files. (e.g. linux kernel)
If one uses the ptags, one has to use linear-search.
We can keep the size of tags file smaller.
To implement jumping-tags-by-file-name, the one needs F/file anyway.
Putting the timestamp for the file to the F/file tag looks natural.
Adding more fields are not so hard.

weaknesses of F/file are

You cannot enumerate input files without the linear-search.
This approach makes tagEntryInfo data structure larger.

If a user gives the list of input files for updating, F/file is enough.
If a user gives only existing tags file, the ptags is better.
In the latter case, ctags must enumerate input files by reading the tags file.

I cannot do much at once. I would like to explore what we can do more with F/file kind.

masatake · 2020-11-23T16:22:46Z

I have studied the way to update a file from sed -i.

$ strace -y -o /tmp/LOG -f sed -i -e 's/a/b/g'  /tmp/i.tmp
1055322 openat(AT_FDCWD, "/tmp/i.tmp", O_RDONLY) = 3</tmp/i.tmp>
1055322 ioctl(3</tmp/i.tmp>, TCGETS, 0x7ffc82af0ac0) = -1 ENOTTY (Inappropriate ioctl for device)
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 lgetxattr("/tmp/i.tmp", "security.selinux", "unconfined_u:object_r:user_tmp_t"..., 255) = 36
1055322 access("/var/run/setrans/.setrans-unix", F_OK) = -1 ENOENT (No such file or directory)
1055322 futex(0x7f5a1647d5d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
1055322 futex(0x7f5a1647c2e8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDONLY|O_CLOEXEC) = 4</proc/1055322/task/1055322/attr/fscreate>
1055322 read(4</proc/1055322/task/1055322/attr/fscreate>, "", 4095) = 0
1055322 close(4</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDWR|O_CLOEXEC) = 4</proc/1055322/task/1055322/attr/fscreate>
1055322 write(4</proc/1055322/task/1055322/attr/fscreate>, "unconfined_u:object_r:user_tmp_t"..., 36) = 36
1055322 close(4</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 umask(077)                      = 022
1055322 getpid()                        = 1055322
1055322 openat(AT_FDCWD, "/tmp/sedZXwkGG", O_RDWR|O_CREAT|O_EXCL, 0600) = 4</tmp/sedZXwkGG>
1055322 umask(022)                      = 077
1055322 fcntl(4</tmp/sedZXwkGG>, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
1055322 openat(AT_FDCWD, "/proc/thread-self/attr/fscreate", O_RDWR|O_CLOEXEC) = 5</proc/1055322/task/1055322/attr/fscreate>
1055322 write(5</proc/1055322/task/1055322/attr/fscreate>, NULL, 0) = 0
1055322 close(5</proc/1055322/task/1055322/attr/fscreate>) = 0
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 read(3</tmp/i.tmp>, "b\n\n", 4096) = 3
1055322 fstat(4</tmp/sedZXwkGG>, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
1055322 read(3</tmp/i.tmp>, "", 4096)   = 0
1055322 fchown(4</tmp/sedZXwkGG>, 1000, 1000) = 0
1055322 fgetxattr(3</tmp/i.tmp>, "system.posix_acl_access", 0x7ffc82af0a70, 132) = -1 ENODATA (No data available)
1055322 fstat(3</tmp/i.tmp>, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
1055322 fsetxattr(4</tmp/sedZXwkGG>, "system.posix_acl_access", "\2\0\0\0\1\0\6\0\377\377\377\377\4\0\4\0\377\377\377\377 \0\4\0\377\377\377\377", 28, 0) = 0
1055322 close(3</tmp/i.tmp>)            = 0
1055322 write(4</tmp/sedZXwkGG>, "b\n\n", 3) = 3
1055322 close(4</tmp/sedZXwkGG>)        = 0
1055322 rename("/tmp/sedZXwkGG", "/tmp/i.tmp") = 0

AmaiKinono · 2020-11-23T17:14:23Z

I have studied the way to update a file from sed -i.

Very cool. In which way this would help us? I mean ctags already has the -a option, and we have to sort it afterwards. Or are you planning to use it in other parts? Actually my feeling on this is you can use it in the edittags command you've mentioned.

masatake · 2020-11-26T14:10:22Z

I would like to withdraw the goal "Adding -C action to readtags".
I made a pull request for adding timestamp: field in #2732.

masatake · 2020-11-27T20:02:10Z

Like the timestamp field, linking libreadtags to ctags is another step for the incremental updating.
The multipass parsing is just another feature that requires linking libreadtags to ctags.
There are two approaches to implement the multipass parsing:
(0) using in-memory database
(1) reading an existing tags file

ctags already infrastructure of (0). If you are interested in this idea, search "symtab" in our source tree.
(0) is nothing to do with the incremental updating.
Like the incremental updating, approach (1) requires linking libreadtags to ctags.

Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating.
@AmaiKinono, you wrote you are not familiar with the idea of "multipass parsing."
I'm bad in English. So I will show what it is only with the command output of the prototyping.

$ cat input.py 
from mymod import X 
# ...

$ cat mymod.py 
X = 1
$ u-ctags --extras=+r --fields=+r --fields=+K -o 1st-pass.tags input.py mymod.py 
$ u-ctags --extras=+r --fields=+r --fields=+K -o 2nd-pass.tags --_hint=1st-pass.tags input.py mymod.py 
$ diff -ruN 1st-pass.tags 2nd-pass.tags 
--- 1st-pass.tags	2020-11-28 04:19:03.296310967 +0900
+++ 2nd-pass.tags	2020-11-28 04:19:13.281347322 +0900
@@ -9,6 +9,6 @@
 !_TAG_PROGRAM_NAME	Universal Ctags	/Derived from Exuberant Ctags/
 !_TAG_PROGRAM_URL	https://ctags.io/	/official site/
 !_TAG_PROGRAM_VERSION	5.9.0	/7f84a443/
-X	input.py	/^from mymod import X $/;"	unknown	module:mymod	roles:imported
+X	input.py	/^from mymod import X $/;"	variable	module:mymod	roles:imported
 X	mymod.py	/^X = 1$/;"	variable	roles:def
 mymod	input.py	/^from mymod import X $/;"	module	roles:namespace

For recoding I would like to show the diff of python.c:-)

diff --git a/parsers/python.c b/parsers/python.c
index 7806a903..a506a044 100644
--- a/parsers/python.c
+++ b/parsers/python.c
@@ -23,6 +23,8 @@
 #include "debug.h"
 #include "xtag.h"
 #include "objpool.h"
+#include "hint.h"
+#include "routines.h"
 
 #define isIdentifierChar(c) \
 	(isalnum (c) || (c) == '_' || (c) >= 0x80)
@@ -135,7 +137,8 @@ static kindDefinition PythonKinds[COUNT_KIND] = {
 	{true, 'c', "class",    "classes"},
 	{true, 'f', "function", "functions"},
 	{true, 'm', "member",   "class members"},
-	{true, 'v', "variable", "variables"},
+	{true, 'v', "variable", "variables",
+	 .referenceOnly = false, ATTACH_ROLES(PythonUnknownRoles)},
 	{true, 'I', "namespace", "name referring a module defined in other file"},
 	{true, 'i', "module",    "modules",
 	 .referenceOnly = true,  ATTACH_ROLES(PythonModuleRoles)},
@@ -1054,6 +1057,75 @@ static bool parseClassOrDef (tokenInfo *const token,
 	return true;
 }
 
+struct foreachHintData
+{
+	const char *module_name;
+	size_t module_namelen;
+	int kind;
+};
+
+static bool langobjIsInModule (const char *name, hintEntry *hint, void *data)
+{
+	struct foreachHintData *hdata = data;
+
+	const char *f = strrstr (hint->file, hdata->module_name);
+
+	if (f == NULL)
+		return true;			/* continue the finding */
+
+
+	size_t len = hdata->module_namelen
+		? hdata->module_namelen
+		: (hdata->module_namelen = strlen(hdata->module_name)
+		   , hdata->module_namelen);
+	const char * suffix = f + len;
+	if (strcmp(suffix, ".py") != 0)
+		return true;			/* continue the finding */
+
+	if (!(hint->kind && *hint->kind))
+		return true;
+
+	kindDefinition *kdef = NULL;
+	if (hint->kind[1] == '\0')
+	{
+		/* kind letter */
+		kdef = getLanguageKindForLetter (Lang_python, *hint->kind);
+	}
+
+	if (kdef == NULL)
+	{
+		/* long name of letter */
+		kdef = getLanguageKindForName (Lang_python, hint->kind);
+	}
+
+	if (kdef == NULL)
+	{
+		/* Not found; please continue the finding. */
+		return true;
+	}
+
+	hdata->kind = kdef->id;
+	/* Found; please stop the finding. */
+	return false;
+}
+
+static int resolveKind (const char *name, int moduleIndex)
+{
+	tagEntryInfo *e = getEntryInCorkQueue (moduleIndex);
+	if (!e)
+		return K_UNKNOWN;
+
+	struct foreachHintData data = {
+		.kind = K_UNKNOWN,
+		.module_name = e->name,
+		.module_namelen = 0,
+	};
+
+	foreachHints (name, TAG_FULLMATCH, langobjIsInModule, &data);
+
+	return data.kind;
+}
+
 static bool parseImport (tokenInfo *const token)
 {
 	tokenInfo *fromModule = NULL;
@@ -1175,7 +1247,8 @@ static bool parseImport (tokenInfo *const token)
 						   x = (kind:module,  role:namespace),
 						   Y = (kind:unknown, role:imported, scope:module:x) */
 						/* Y */
-						int index = makeSimplePythonRefTag (name, NULL, K_UNKNOWN,
+						int kind  = resolveKind (vStringValue (name->string), moduleIndex);
+						int index = makeSimplePythonRefTag (name, NULL, kind,
 															PYTHON_UNKNOWN_IMPORTED,
 															XTAG_UNKNOWN);
 						/* fill the scope field for Y */

AmaiKinono · 2020-11-27T20:43:34Z

So I will show what it is only with the command output of the prototyping.

Oh, I got it immediately!

The example you gave is:

We have a symbol x defined in module M.
We know the kind of x when parsing it, but not when importing it in another file.
We can know the latter, using the reference tag for the import statement.

I just thought of this: #2727 (comment)

We have symbols defined in file F, and we don't know their scope when parsing it.
We have F included in a module. Suppose we could parse include and generate a reference tag for F.
We can know the scope of all symbols in F, using the tag for the include expression.

This is kind of a "reversed" example. But then I found that, F is actually allowed to be included in other modules (like mixin in Ruby), although it may never happen in real life (The common use of include is to just break a large module into files). So I wonder should we use multipass parsing technique for such situations, and what should we do.

AmaiKinono · 2020-11-27T20:53:58Z

(0) using in-memory database

I actually read a bit about "cork symbol table" in some PRs (but I can't say I'm familiar with the cork thing for now). I guess the approach is to use libreadtags to read a tags file into the cork queue, so we could look it up? And...

(0) is nothing to do with the incremental updating.

Is this because the cork queue is freed after parsing a file?

Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating.

So, does this mean we don't use -a option, but libreadtags -> cork queue -> merge with new tags -> tags file?

masatake · 2020-11-28T16:27:57Z

This is kind of a "reversed" example. But then I found that, F is actually allowed to be included in other modules (like mixin in Ruby), although it may never happen in real life (The common use of include is to just break a large module into files). So I wonder should we use multipass parsing technique for such situations, and what should we do.
@AmaiKinono

multipass parsing technique is needed to implement such tags. However, I'm not sure we should do as you wrote.
I cannot say well-defined criteria but I think tagging the name of the included file is enough in this case. Handling included
files may be part of client tools.

(0) using in-memory database

I actually read a bit about "cork symbol table" in some PRs (but I can't say I'm familiar with the cork thing for now). I guess the approach is to use libreadtags to read a tags file into the cork queue, so we could look it up? And...

If we can load a tags file into the cork symtab, it is the best. But we cannot because the information in a tags file is not enough to reproduce the table.

(0) is nothing to do with the incremental updating.

Is this because the cork queue is freed after parsing a file?

Yes. I'm thinking about extending the current implementation to keep cork queues/tables generated for each input file
till the ending of the ctags process. The in-memory database consists of the kept cork queues/tables.

Tonight I have prototyped the multipass parsing (1). Again, this is also an important step for incremental updating.

So, does this mean we don't use -a option, but libreadtags -> cork queue -> merge with new tags -> tags file?

It doesn't mean as you wrote. What I did is nothing to do with -a.
For implementing incremental updating, we have to link libreadtags to ctags at least.
About incremental updating, what I did is the linking.
In addition to linking, I wrote an application utilizing libreadtags in ctags, "multi-pass parsing".

masatake mentioned this issue Nov 28, 2020

SystemVerilog: implement unsupported features in processDesignElement() #2737

Merged

masatake mentioned this issue Dec 6, 2020

main: add timestamp field for file kind #2732

Merged

AmaiKinono mentioned this issue Jul 6, 2021

Request: pseudo tags to record command line arguments #3085

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: A possible way of incrementally updating tags file #2697

Proposal: A possible way of incrementally updating tags file #2697

AmaiKinono commented Nov 14, 2020

masatake commented Nov 16, 2020

masatake commented Nov 16, 2020

AmaiKinono commented Nov 16, 2020

masatake commented Nov 16, 2020

AmaiKinono commented Nov 16, 2020

masatake commented Nov 17, 2020

masatake commented Nov 23, 2020

AmaiKinono commented Nov 23, 2020

masatake commented Nov 26, 2020

masatake commented Nov 27, 2020

AmaiKinono commented Nov 27, 2020

AmaiKinono commented Nov 27, 2020

masatake commented Nov 28, 2020

Proposal: A possible way of incrementally updating tags file #2697

Proposal: A possible way of incrementally updating tags file #2697

Comments

AmaiKinono commented Nov 14, 2020

masatake commented Nov 16, 2020

masatake commented Nov 16, 2020

AmaiKinono commented Nov 16, 2020

masatake commented Nov 16, 2020

AmaiKinono commented Nov 16, 2020

masatake commented Nov 17, 2020

masatake commented Nov 23, 2020

AmaiKinono commented Nov 23, 2020

masatake commented Nov 26, 2020

masatake commented Nov 27, 2020

AmaiKinono commented Nov 27, 2020

AmaiKinono commented Nov 27, 2020

masatake commented Nov 28, 2020