Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow choice between platform independent and platform dependent API #168

Closed
edsko opened this issue Sep 14, 2024 · 13 comments
Closed

Allow choice between platform independent and platform dependent API #168

edsko opened this issue Sep 14, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@edsko
Copy link
Collaborator

edsko commented Sep 14, 2024

Explained in this comment of PrimType:

-- | Primitive type
--
-- The interpretation of the primitive types is in many cases implementation
-- and/or machine dependent. In @hs-bindgen@ we are dealing with one specific
-- implementation (@libclang@), and we are generating code for one specific
-- machine (possibly cross platform). This means we have a choice; suppose we
-- see a field of type @int@ in a C struct:
--
-- 1. We could produce a field of type 'CInt' in the generated Haskell code
-- 2. We could query @libclang@ to what choice it makes for the selected
--    target platform, and use 'CShort' or 'CLong' (or something else again.
--
-- Both options have advantages; most users will probably prefer (1), so that
-- we generate a /single/ API, independent of implementation details. However,
-- some users may prefer (2) in some cases, if they want to take advantage of
-- specific features of the target platform.
--
-- We don't force the decision here, but simply represent the C AST faithfully.
data PrimType =

We are defaulting to option (1) currently, but as the comment says, we may wish to give users a choice.

@edsko edsko added the enhancement New feature or request label Sep 14, 2024
@edsko edsko added this to the 1: `Storable` instances milestone Sep 14, 2024
@phadej
Copy link
Collaborator

phadej commented Sep 14, 2024

We are not really implementing (1). We hard code offsets and alignment for structs for example. For (1) one would really need to generate hsc2hs code or "formulas" for size and alignment.

I don't think that (2) even makes sense for TH generation, TH is run on the target architecture anyway, so doing something else feels unnecessary. TH could assume host = target, but I guess true (1) would be better at least from testing perspective.


  1. We could query @libclang@ to what choice it makes for the selected
    -- target platform, and use 'CShort' or 'CLong' (or something else again.

This doesn't feel right. I think that if we do (2) and libclang says "unsigned 4 byte integer" then we should use Word32 and not try to figure out which C-type on the host is the same size as the target type. In other words, be as explicit as possible.


For (1) we also need #134, if the header has struct foo { uint64_t bar };, the field is uint64_t on all machines, though libclang will tell us some "primitive" c-type (at least after we look through typedefs). Similarly, for something like uint_fast32_t which is actually quite tricky to represent otherwise than as uint_fast32_t. (Foreign.C.Types doesn't have analogues for these, so those are a challenge for FFI already because of that).

#131 shows that just setting the target triple is not enough, there are some differences in standard headers.

@edsko
Copy link
Collaborator Author

edsko commented Sep 14, 2024

We are not really implementing (1). We hard code offsets and alignment for structs for example. For (1) one would really need to generate hsc2hs code or "formulas" for size and alignment.

Yes, that's why I emphasize "API" (as opposed to implementation). Perhaps I should be clearer about that.

My main thinking why this is OK is that if we say, CInt, then this type itself provides the same amount of ambiguity: you don't know what its size is. Therefore picking a specific implementation (in terms of Storable instances, for example) is compatible with that: it's "implementation defined" after all. This way we keep the types the same, but the implementation differs. I think this is what most users would anyway expect?

I don't think that (2) even makes sense for TH generation, TH is run on the target architecture anyway

"TH is run on the target achitecture anyway" -- that would be nice, but not actually the case with current ghc is it?

This doesn't feel right. I think that if we do (2) and libclang says "unsigned 4 byte integer" then we should use Word32 and not try to figure out which C-type on the host is the same size as the target type. In other words, be as explicit as possible.

Yes, that's fair enough, if we do do 2, then indeed, we should be explicit.

For (1) we also need #134, (..)

Yes, we should definitely not always look through typedefs; I'm currently working on that.

@phadej
Copy link
Collaborator

phadej commented Sep 14, 2024

think this is what most users would anyway expect?

I don't understand, so are you saing that

data StructFoo = MkStructFoo
  { field1 :: CInt
  , field2 :: CInt  -- ^ system independent types
  }

instance Storable StructFoo where
  sizeOf _ = 64  -- 32 bit system specific value
  ...

is fine?

IMO it isn't. Either it's

(1)

data StructFoo = MkStructFoo
  { field1 :: CInt
  , field2 :: CInt
  }

instance Storable StructFoo where
  sizeOf _ = #{sizeof struct foo} -- or some formula, like sizeof_ @CInt + sizeof_ @CInt -- but also taking alignment into account.

or

(2)

data StructFoo = MkStructFoo
  { field1 :: Word32
  , field2 :: Word32
  }

instance Storable StructFoo where
  sizeOf _ = 64
  • that would be nice, but not actually the case with current ghc is it?

It is. TH is always run on the target, in a way or another (even GHCJS etc). The current multi-staged setup won't work otherwise. (TL;DR the Int is the same on all stages, there are no host Int and target Int) (EDIT: Maybe there are some cross-compilation workarounds in use where people run TH code on host, pretending it's on target - but that can easily cause problems. I can share examples privately)

@edsko
Copy link
Collaborator Author

edsko commented Sep 14, 2024

Why isn't it fine? CInt is defined to be implementation defined; here, there is one such implementation. I don't see a conflict here.

Option (1), generating esentially .hsc code, is explicitly not what the client wants: hs-bindgen should do the resolution, and it should not depend on invoking a C compiler. We could maybe offer this as choice, but it would strictly be an enhancement that we choose to implement.

I don't understand re TH, but yes, we don't need to discuss this in this ticket.

@phadej
Copy link
Collaborator

phadej commented Sep 14, 2024

The

data StructFoo = MkStructFoo
  { field1 :: CInt
  , field2 :: CInt  -- ^ system independent types
  }

instance Storable StructFoo where
  sizeOf _ = 64  -- 32 bit system specific value
  ...

is still system specific. Then there is really no difference between (1) and (2) if (1) means doing the above.

I couldn't do

hs-bindgen alib.h > ALib.hs

and commit that file to repository (i.e. run hs-bindgen before code distribution). ALib.hs will be system specific.


I think I understand, the interface of the module will appear to be system-independent in (1), but the implementation will be always specific to the target, whether it's (1) or (2).

In other words hs-bindgen will not generate system independent code, it's out of scope?

@edsko
Copy link
Collaborator Author

edsko commented Sep 14, 2024

Yeah, I take your point. I think this needs some discussion with the client.

@edsko
Copy link
Collaborator Author

edsko commented Sep 15, 2024

So I guess we have three modes:

  1. System independent API, system dependent implementation. The code that we generate (such as Storable instances) is specific to a specific platform (we know the size of CInt), but the API is system independent (we use CInt rather than Word64).
    • We would have to do "last minute code generation" (we couldn't just add in a single generated bindings to a package intended to be cross-platform), but
    • when programmer A writes code against this API on one machine, his code should ideally still work also when programmer B uses it on another machine (because the types don't allow machine specific choices).
  2. System dependent API, system dependent implementation. Unlike in (1), we also generate a machine specific API (Word64 instead of CInt).
    • This is primarily useful when writing code that uses a C library on a very specific device, for example when writing embedded code for a known platform.
  3. System independent API, system independent implementation (this probably involves generating .hsc instead of .hs files).

The mode that I had somewhat implicitly assumed we'd focus on first is (1), but indeed all three are valid. I thought from previous discussions with the client that (3) was not considered too desirable, but perhaps we should revisit this question. One difficulty with (3) is CPP: If the header uses CPP to make machine dependent choices, then it becomes unclear what we should do; in a way, option (3) implies #72, at least to some degree.

@edsko
Copy link
Collaborator Author

edsko commented Sep 21, 2024

So we had a long chat about this with the client. I will try to summarize our thoughts here, but I should note that the conclusions here are still tentative, and it's entirely possible that a strong argument in a different direction can change those conclusions again.

The first -- relatively uncontested -- conclusion is that generating true system independent bindings (in the form of .hsc files or otherwise) is, for now at least, outside the scope of this project.

This still leaves a lot of room for different design decisions for the machine dependent case. Let's first consider the two opposite extremes of the spectrum:

  1. In an ideal world, we might have a strict phase separation: generate a machine independent API (type declarations), and a machine dependent implementation (class instances etc.). Then the machine independent API could be checked into the repo, and would provide an API for everyone to code against, and the machine dependent implementation would need to be generated last minute as part of the regular compilation pipeline. Unfortunately, this scenario is not in general possible, because in principle even the API might look completely different on different machines (structs might have different fields, for example).
  2. Given that a true machine independent API is not possible, we could go to the other extreme, and make everything completely machine dependent. Instead of generating CInt, we generate Word64, etc.

The benefit of (2) is that the programmer can take advantage of machine specific features; if code written by programmer A on their machine doesn't work with different architectures, then programmer B will find out (hopefully) at compile time on their machine.

The main problem with (2) is that we lose a lot of valuable type information:

  • When a programmer writes int, they are signalling to the reader that this type might vary across platforms; that information is lost if we replace it with Word64: when writing code against such bindings, users will no longer realize that they need to be careful, until (perhaps) they compile the code on a different machine. (It is certainly true that the vast majority of uses of int are not in fact quite so intentional, but we should optimize for correct language usage, not incorrect language usage).

  • To illustrate this point a bit more clearly, consider

    #define __STD_TYPE            typedef
    #define __SQUAD_TYPE          __int64_t
    #define __SYSCALL_SLONG_TYPE  __SQUAD_TYPE
    #define __CLOCK_T_TYPE        __SYSCALL_SLONG_TYPE
    
    __STD_TYPE __CLOCK_T_TYPE __clock_t;	
    typedef __clock_t clock_t;

    I think it's relatively clear that we don't want to replace clock_t by Word64 in the generated bindings, but it's for the same reason: just like int signals "this is a variable length integer type", clock_t signals "this is an int that captures clock information". Types capture semantic information, and we should preserve as much of this information as we can.

  • Importantly, we might also want different class instances for these different types (that is, they should be newtypes on the Haskell side, not type aliases). Indeed, CInt may well behave differently to Word64 in some instances. One area where there might be particularly gnarly differences is when it comes to things like sign extension, shifting, etc.

The tentative conclusion therefore is, for now, to go with option (1) as enumerated in #168 (comment) , unless there are strong arguments against it.

@phadej
Copy link
Collaborator

phadej commented Sep 21, 2024

One area where there might be particularly gnarly differences is when it comes to things like sign extension, shifting, etc.

I would find it very confusing if CInt and Int32 had different behaviors in about any instance, assuming CInt is 32bit integer (as it's usually is, even on 64bit systems). They don't in base: https://github.com/ghc/ghc/blob/35eb4f428ab72b712ea78d6ef86b956e321c3bb2/libraries/ghc-internal/include/CTypes.h#L4 all instances are newtype derived AFAICT.

@TravisWhitaker
Copy link

I was not able to find an example of e.g. CInt having an instance that's different from Int32 in the wild. However, I think situations could arise where it could be reasonable to have different instances. For example, the Floating instance for CDouble could conceivably differ from the instance for Double to address differences in how floating point operations work on a given platform between C and Haskell.

@edsko
Copy link
Collaborator Author

edsko commented Dec 20, 2024

Closing; option (3) is out of the scope, and between (1) and (2) we decided that we want to preserve as much semantic type information as possible.

@edsko edsko closed this as completed Dec 20, 2024
@phadej
Copy link
Collaborator

phadej commented Dec 20, 2024

For the record, as (3) System independent API, system independent implementation is out of scope, the hs-bindgen cannot be used for bootstrap itself; thus #80 cannot be implemented.

@edsko
Copy link
Collaborator Author

edsko commented Jan 7, 2025

Makes sense! Will close #80.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants