Member enrichment ProgAI Linkedin scraper source #2674
Conversation
Walkthrough

The pull request introduces several enhancements to the member enrichment worker service.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (16)
services/libs/types/src/enums/enrichment.ts (1)
5-5
: Ensure compliance with LinkedIn's terms of service.

Web scraping should be implemented responsibly. Please verify:
- Compliance with LinkedIn's robots.txt and terms of service
- Implementation of appropriate delays between requests
- Proper error handling for blocked requests
services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (1)
27-27
: Consider implementing rate limiting for LinkedIn scraping.

LinkedIn has strict rate limiting policies. Consider implementing a rate limiter at the service level to prevent potential IP blocks or account restrictions.
Suggestions:
- Implement a token bucket rate limiter for LinkedIn requests
- Add configurable delays between requests
- Consider using proxy rotation if high throughput is needed
Would you like me to provide a code example for implementing these rate-limiting mechanisms?
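As a sketch of the token-bucket approach suggested above (the class name and parameters are illustrative assumptions, not existing crowd.dev code):

```typescript
// Illustrative token-bucket rate limiter; the class name and parameters are
// assumptions for this sketch, not part of the crowd.dev codebase.
class TokenBucket {
  private tokens: number
  private lastRefillMs: number

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number,
    nowMs: number = Date.now(),
  ) {
    this.tokens = capacity
    this.lastRefillMs = nowMs
  }

  // Refill proportionally to elapsed time, then try to consume one token.
  tryRemove(nowMs: number = Date.now()): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond)
    this.lastRefillMs = nowMs
    if (this.tokens >= 1) {
      this.tokens -= 1
      return true
    }
    return false
  }
}
```

A LinkedIn request would only be issued when `tryRemove()` returns true; otherwise the caller waits or reschedules the enrichment.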
services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (2)
Line range hint
63-84
: Consider optimizing GitHub identity selection.

The logic for handling multiple verified GitHub identities is correct but could be optimized and made more maintainable.
Consider extracting this logic into a helper function:
```diff
+ async function selectPrimaryGithubIdentity(
+   memberId: string,
+   identities: MemberIdentity[],
+ ): Promise<MemberIdentity | undefined> {
+   const verifiedGithubIdentities = identities.filter(
+     (i) =>
+       i.verified &&
+       i.platform === PlatformType.GITHUB &&
+       i.type === MemberIdentityType.USERNAME,
+   )
+
+   if (verifiedGithubIdentities.length <= 1) {
+     return verifiedGithubIdentities[0]
+   }
+
+   const identityWithMostActivities =
+     await findMemberIdentityWithTheMostActivityInPlatform(memberId, PlatformType.GITHUB)
+
+   return identityWithMostActivities
+     ? verifiedGithubIdentities.find((i) => i.value === identityWithMostActivities.username)
+     : verifiedGithubIdentities[0]
+ }

- if (verifiedGithubIdentities.length > 1) {
-   const ghIdentityWithTheMostActivities =
-     await findMemberIdentityWithTheMostActivityInPlatform(input.id, PlatformType.GITHUB)
-   if (ghIdentityWithTheMostActivities) {
-     enrichmentInput.github = input.identities.find(
-       (i) =>
-         i.verified &&
-         i.platform === PlatformType.GITHUB &&
-         i.type === MemberIdentityType.USERNAME &&
-         i.value === ghIdentityWithTheMostActivities.username,
-     )
-   }
- } else {
-   enrichmentInput.github = verifiedGithubIdentities?.[0] || undefined
- }
+ enrichmentInput.github = await selectPrimaryGithubIdentity(input.id, input.identities)
```

Consider caching the activity counts to improve performance for frequently accessed members.
Line range hint
115-115
: Track the pending data squasher implementation.

The TODO comment indicates a significant pending feature for LLM-based data squashing. This should be tracked separately to ensure it's not overlooked.
Would you like me to create a GitHub issue to track the implementation of the LLM-based data squasher and member entity enrichment logic?
services/apps/premium/members_enrichment_worker/src/sources/serp/service.ts (2)
Line range hint
39-49
: Consider keeping the method synchronous.

The method has been changed to async but contains no asynchronous operations. Unless there's a specific requirement for consistency across enrichment services or future async operations are planned, keeping it synchronous would be more efficient and avoid unnecessary Promise wrapping.

```diff
- async isEnrichableBySource(input: IEnrichmentSourceInput): Promise<boolean> {
+ isEnrichableBySource(input: IEnrichmentSourceInput): boolean {
```
Line range hint
39-129
: Consider enhancing the service's robustness.

As this service is crucial for member enrichment and LinkedIn profile discovery, consider these improvements:
- Add rate limiting to prevent SERP API quota exhaustion
- Implement exponential backoff for API failures
- Add documentation about expected LinkedIn URL formats and username patterns
- Consider caching SERP API results to optimize quota usage
Example rate limiting implementation:
```typescript
// Sketch only: note that express-rate-limit is Express HTTP middleware and
// cannot throttle outbound API calls, so this sketch relies on exponential
// backoff alone.
private async querySerpApiWithBackoff(...args) {
  const backoff = async (retries: number) => {
    if (retries === 0) throw new Error('Max retries reached');
    try {
      return await this.querySerpApi(...args);
    } catch (error) {
      if (error.response?.status === 429) { // Too Many Requests
        await new Promise((resolve) => setTimeout(resolve, Math.pow(2, 4 - retries) * 1000));
        return backoff(retries - 1);
      }
      throw error;
    }
  };
  return backoff(3);
}
```

services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (1)
104-108
: Add documentation and error handling

The new function would benefit from:
- JSDoc documentation explaining its purpose and return value
- Error handling for database failures
Consider updating the implementation:
```diff
+/**
+ * Retrieves enrichment cache data for all sources for a given member
+ * @param memberId - The ID of the member to fetch cache for
+ * @returns Promise resolving to an array of cache entries from all sources
+ * @throws {Error} If database query fails
+ */
 export async function findMemberEnrichmentCacheForAllSources(
   memberId: string,
 ): Promise<IMemberEnrichmentCache<IMemberEnrichmentData>[]> {
+  try {
+    // await here so that rejections are caught and logged below
+    return await findMemberEnrichmentCacheForAllSourcesDb(svc.postgres.reader.connection(), memberId)
+  } catch (error) {
+    svc.log.error('Failed to fetch enrichment cache for all sources', {
+      error,
+      memberId,
+    });
+    throw error;
+  }
 }
```

services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.ts (2)
59-61
: Consider adding email format validation.

Since this is a critical validation point for Clearbit enrichment, consider adding more robust email validation:
```diff
 async isEnrichableBySource(input: IEnrichmentSourceInput): Promise<boolean> {
-  return !!input.email?.value && input.email?.verified
+  const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
+  return !!input.email?.value &&
+    input.email?.verified &&
+    emailRegex.test(input.email.value);
 }
```
Line range hint
19-22
: Consider improving configuration management.

The service has several hardcoded values that could be externalized to configuration:
- `enrichMembersWithActivityMoreThan = 10`
- `cacheObsoleteAfterSeconds = 60 * 60 * 24 * 120` (120 days)

Consider moving these to environment variables or a configuration service for better maintainability and flexibility across environments.
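One hedged way to externalize such values is a small parsing helper with a fallback; the helper and the environment variable names below are hypothetical, not existing crowd.dev configuration:

```typescript
// Hypothetical config helper; the env var names used by the callers below
// are illustrative, not taken from the crowd.dev codebase.
function intFromEnv(
  name: string,
  fallback: number,
  env: Record<string, string | undefined> = process.env,
): number {
  const raw = env[name]
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10)
  return Number.isNaN(parsed) ? fallback : parsed
}

// Keep the current hardcoded values as defaults, so behavior is unchanged
// until an environment explicitly overrides them.
const enrichMembersWithActivityMoreThan = intFromEnv('CROWD_ENRICHMENT_MIN_ACTIVITY', 10)
const cacheObsoleteAfterSeconds = intFromEnv('CROWD_ENRICHMENT_CACHE_TTL_SECONDS', 60 * 60 * 24 * 120)
```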
services/apps/premium/members_enrichment_worker/src/sources/progai/service.ts (4)
Line range hint
121-125
: Consider optimizing the async implementation.

While the async signature change aligns with the interface updates, the current implementation doesn't perform any async operations. Consider using `Promise.resolve()` for better performance:

```diff
 async isEnrichableBySource(input: IEnrichmentSourceInput): Promise<boolean> {
   const enrichableUsingGithubHandle = !!input.github?.value
   const enrichableUsingEmail = this.alsoUseEmailIdentitiesForEnrichment && !!input.email?.value
-  return enrichableUsingGithubHandle || enrichableUsingEmail
+  return Promise.resolve(enrichableUsingGithubHandle || enrichableUsingEmail)
 }
```
Line range hint
279-309
: Enhance error handling and security for API calls.

The current error handling in API calls could be improved:
- Add specific error types for different failure scenarios
- Include retry mechanism for transient failures
- Validate API configuration at startup
- Add proper error context
Consider implementing this enhanced error handling:
```typescript
class ProgAIError extends Error {
  constructor(message: string, public readonly cause?: Error) {
    super(message);
    this.name = 'ProgAIError';
  }
}

async getDataUsingGitHubHandle(githubUsername: string): Promise<IMemberEnrichmentDataProgAI> {
  if (!process.env['CROWD_ENRICHMENT_PROGAI_API_KEY']) {
    throw new ProgAIError('ProgAI API key not configured');
  }

  try {
    const url = new URL(`${process.env['CROWD_ENRICHMENT_PROGAI_URL']}/get_profile`);
    const config = {
      method: 'get',
      url: url.toString(),
      params: {
        github_handle: githubUsername,
        with_emails: true,
        api_key: process.env['CROWD_ENRICHMENT_PROGAI_API_KEY'],
      },
      headers: {},
    };

    const response = (await axios(config)).data;
    return response.profile;
  } catch (err) {
    if (axios.isAxiosError(err)) {
      if (err.response?.status === 404) {
        throw new ProgAIError(`Profile not found for GitHub handle: ${githubUsername}`);
      }
      if (err.response?.status === 401) {
        throw new ProgAIError('Invalid ProgAI API key');
      }
      throw new ProgAIError(`Failed to fetch profile: ${err.message}`, err);
    }
    throw new ProgAIError('Unexpected error during API call', err as Error);
  }
}
```
Line range hint
143-157
: Add input validation for normalized data.

The normalize method handles complex data transformation but lacks input validation. Consider adding runtime checks to ensure data integrity:
```typescript
interface DataValidator {
  validateData(data: IMemberEnrichmentDataProgAI): void;
}

class ProgAIDataValidator implements DataValidator {
  validateData(data: IMemberEnrichmentDataProgAI): void {
    if (data.work_experiences?.some((exp) => !exp.company || !exp.title)) {
      throw new Error('Invalid work experience data: missing required fields');
    }
    if (data.skills?.some((skill) => typeof skill.weight !== 'number')) {
      throw new Error('Invalid skill data: weight must be a number');
    }
    // Add more validation rules
  }
}
```

Add validation before normalization:

```diff
 normalize(data: IMemberEnrichmentDataProgAI): IMemberEnrichmentDataNormalized {
+  const validator = new ProgAIDataValidator();
+  validator.validateData(data);
   let normalized: IMemberEnrichmentDataNormalized = {
     identities: [],
     attributes: {},
     contributions: [],
     memberOrganizations: [],
   }
   // ... rest of the method
 }
```
Line range hint
251-262
: Improve LinkedIn URL handling.

The current LinkedIn URL parsing is overly simplistic and might break with different URL formats. Consider implementing a more robust LinkedIn URL parser:
```typescript
function parseLinkedInHandle(url: string): string | null {
  try {
    const linkedinUrl = new URL(url);
    if (!linkedinUrl.hostname.includes('linkedin.com')) {
      return null;
    }
    // Handle different LinkedIn URL formats
    const path = linkedinUrl.pathname.split('/').filter(Boolean);
    if (path[0] === 'in' && path[1]) {
      return path[1].split('?')[0]; // Remove query parameters
    }
    return null;
  } catch {
    return null;
  }
}

// Usage in fillPlatformData:
if (data.linkedin_url) {
  const linkedinHandle = parseLinkedInHandle(data.linkedin_url);
  if (linkedinHandle) {
    normalized = normalizeSocialIdentity(
      {
        handle: linkedinHandle,
        platform: PlatformType.LINKEDIN,
      },
      MemberIdentityType.USERNAME,
      true,
      normalized,
    );
  }
}
```

services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (1)
535-550
: Consider adding tenant isolation and optimizing query performance.

Since this function retrieves cache data from all sources, consider the following improvements:
Add tenant isolation to prevent unauthorized access across tenants
Add composite index for query optimization
Add tenant isolation by modifying the query:
```diff
 select * from "memberEnrichmentCache"
 where
-  "memberId" = $(memberId) and data is not null
+  "memberId" = $(memberId)
+  and "tenantId" = $(tenantId)
+  and data is not null
```
- Add the following composite index to improve query performance:

```sql
CREATE INDEX IF NOT EXISTS idx_member_enrichment_cache_member_tenant
ON "memberEnrichmentCache" ("memberId", "tenantId", "updatedAt" DESC)
WHERE data IS NOT NULL;
```

services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (2)
59-62
: Consider using optional chaining to simplify the conditions.

The conditions checking for the presence and validity of the `input.linkedin` object can be simplified using optional chaining. This will make the code more concise and easier to read.

Apply this diff to refactor the conditions:
```diff
- return (
-   hasEnrichableLinkedinInCache ||
-   (input.linkedin && input.linkedin.value && input.linkedin.verified)
- )
+ return hasEnrichableLinkedinInCache || (input.linkedin?.value && input.linkedin.verified)
...
- if (
-   input.linkedin &&
-   input.linkedin.value &&
-   input.linkedin.verified &&
-   !linkedinUrlHashmap.get(input.linkedin.value)
- ) {
+ if (input.linkedin?.value && input.linkedin.verified && !linkedinUrlHashmap.get(input.linkedin.value)) {
```

Also applies to: 94-99
🧰 Tools
🪛 Biome
[error] 61-61: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
144-157
: Consider handling the case when `normalizedProfiles` is empty.

The `normalize` method currently returns `null` if `normalizedProfiles` is empty. However, the return type of the method is `IMemberEnrichmentDataNormalized[]`, which does not include `null`.

To handle this case and maintain consistency with the return type, consider returning an empty array instead of `null`:

```diff
- return normalizedProfiles.length > 0 ? normalizedProfiles : null
+ return normalizedProfiles
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (11)
- services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (3 hunks)
- services/apps/premium/members_enrichment_worker/src/factory.ts (2 hunks)
- services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/sources/progai/service.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/sources/serp/service.ts (2 hunks)
- services/apps/premium/members_enrichment_worker/src/types.ts (4 hunks)
- services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (1 hunks)
- services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (1 hunks)
- services/libs/types/src/enums/enrichment.ts (1 hunks)
🧰 Additional context used
🪛 Biome
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts
[error] 61-61: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 95-96: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
🔇 Additional comments (13)
services/apps/premium/members_enrichment_worker/src/factory.ts (2)
6-6
: LGTM!
The import statement follows the established pattern and is correctly organized within the sources directory.
28-29
: Verify configuration requirements for LinkedIn scraper.
Other enrichment services (e.g., ProgAI) utilize configuration flags like `ALSO_USE_EMAIL_IDENTITIES_FOR_ENRICHMENT`. Consider whether the LinkedIn scraper service needs similar configuration parameters for consistency and flexibility.
Let's check the implementation of the LinkedIn scraper service:
✅ Verification successful
Configuration flags are not needed for LinkedIn scraper service
The LinkedIn scraper service is designed to be a specialized service focused solely on enriching LinkedIn profiles, unlike the ProgAI service which supports multiple input types (GitHub, email). Key differences:
- LinkedIn scraper has a single, fixed enrichment path using LinkedIn URLs
- It already has built-in flexibility by consuming LinkedIn URLs from multiple sources (input, SERP cache, Clearbit cache)
- The service doesn't need email-based enrichment as it's specifically for LinkedIn profile scraping
- Configuration parameters like `alsoUseEmailIdentitiesForEnrichment` wouldn't add value to its focused functionality
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if the LinkedIn scraper service uses any configuration flags
# and compare its implementation with other services
# Check the LinkedIn scraper service implementation
echo "Checking LinkedIn scraper service implementation:"
cat services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts
# Compare with ProgAI service to see configuration patterns
echo -e "\nComparing with ProgAI service implementation:"
cat services/apps/premium/members_enrichment_worker/src/sources/progai/service.ts
Length of output: 15989
services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts (1)
27-27
: LGTM! Verify timeout settings for LinkedIn scraping.
The addition of `MemberEnrichmentSource.PROGAI_LINKEDIN_SCRAPER` follows the existing pattern. However, LinkedIn scraping operations might require different timing configurations.
Let's verify the timeout and retry configurations in the LinkedIn scraper service:
✅ Verification successful
Timeout configuration is appropriate for LinkedIn scraping
The current implementation is well-configured for LinkedIn scraping operations:
- The parent workflow's 15-minute timeout is sufficient because:
  - The LinkedIn scraper makes a single API call to ProgAI service
  - The enrichMember activities have a 20-second timeout with proper retry configuration
  - The implementation includes proper error handling and retries
- The retry strategy in enrichMember is well-configured:
  - Initial interval: 5 seconds
  - Backoff coefficient: 2.0
  - Maximum interval: 30 seconds
  - Maximum attempts: 4
- The service includes cache management:
  - 120-day cache duration
  - Proper cache invalidation checks
  - Deduplication of LinkedIn URLs
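The 120-day cache duration summarized above can be sketched as a simple staleness check; the function name and shape are illustrative, not the service's actual implementation:

```typescript
// Sketch of a cache-staleness check consistent with a 120-day TTL; the
// function name is an assumption, not crowd.dev's actual code.
const cacheObsoleteAfterSeconds = 60 * 60 * 24 * 120 // 120 days

function isCacheObsolete(updatedAt: Date, now: Date = new Date()): boolean {
  const ageSeconds = (now.getTime() - updatedAt.getTime()) / 1000
  return ageSeconds > cacheObsoleteAfterSeconds
}
```

A cache row older than the TTL would trigger a fresh scrape; anything newer is served from cache.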
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any specific timeout or rate limit configurations in the LinkedIn scraper service
# Test: Look for timeout or rate limit configurations
rg -A 5 "timeout|rate.?limit|delay" "services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/"
Length of output: 122
Script:
#!/bin/bash
# Let's try a broader search to understand the LinkedIn scraper implementation
# First, check if the directory exists and list its contents
fd . "services/apps/premium/members_enrichment_worker/src/sources"
# Then search for LinkedIn scraper related files in the entire codebase
rg -l "linkedin" "services/apps/premium/members_enrichment_worker/src/"
# Look for timeout configurations in the workflow file
rg -A 5 "timeout|executeChild" "services/apps/premium/members_enrichment_worker/src/workflows/getMembersToEnrich.ts"
# Check the enum definition to understand available sources
rg "enum MemberEnrichmentSource" -A 10 "services/apps/premium/members_enrichment_worker/src/"
Length of output: 3138
Script:
#!/bin/bash
# Let's check the implementation of the LinkedIn scraper service and enrichMember workflow
# Check the LinkedIn scraper service implementation
cat "services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts"
# Check the enrichMember workflow to understand how sources are handled
cat "services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts"
# Look for retry configurations in the LinkedIn scraper
rg "retry|catch|error" "services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts"
Length of output: 9618
services/apps/premium/members_enrichment_worker/src/types.ts (4)
17-17
: LGTM: Good addition of memberId field.
Adding a required memberId field improves the data structure by ensuring a stable identifier for member enrichment operations.
54-56
: LGTM: Return type properly supports array normalization.
The normalize method's return type now correctly supports both single and array results, maintaining consistency with the IMemberEnrichmentData type changes.
40-40
: LGTM: Good conversion to async operation.
Converting `isEnrichableBySource` to return a Promise is appropriate for external API checks. This matches the async pattern of `hasRemainingCredits`.
#!/bin/bash
# Verify all implementations handle the async change
ast-grep --pattern 'class $_ implements IEnrichmentService {
$$$
isEnrichableBySource($_) {
$$$
}
$$$
}'
29-29
: Verify handling of multiple ProgAI enrichment records.
The type now supports arrays of ProgAI enrichment data. Ensure that consumers of this type properly handle both single items and arrays.
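As a hedged illustration of how consumers could handle both shapes uniformly (the `toArray` helper is hypothetical, not an existing crowd.dev utility):

```typescript
// Hypothetical helper for consumers of IMemberEnrichmentData: treat a
// single enrichment record and an array of records uniformly.
function toArray<T>(value: T | T[]): T[] {
  return Array.isArray(value) ? value : [value]
}
```

With this, downstream code can iterate `toArray(data)` regardless of whether the ProgAI source returned one profile or several.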
✅ Verification successful
Based on the codebase analysis, I can now generate the final response since I have found sufficient evidence about how ProgAI enrichment data arrays are handled:
Array handling for ProgAI enrichment data is properly implemented
The codebase correctly handles arrays of ProgAI enrichment data:
- EnrichmentServiceProgAILinkedinScraper properly returns and processes arrays in its getData() and normalize() methods
- The normalize() method correctly handles arrays by iterating through profiles and normalizing each one
- The IMemberEnrichmentData type union correctly includes both single IMemberEnrichmentDataProgAI and IMemberEnrichmentDataProgAI[]
- The EnrichmentService interface's normalize() method is designed to handle both single items and arrays in its return type
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for places where IMemberEnrichmentData is consumed to verify array handling
rg -t typescript "IMemberEnrichmentData" -B 2 -A 5
Length of output: 118
Script:
#!/bin/bash
# Search for IMemberEnrichmentData usage with more context and including type definitions
rg -t typescript "IMemberEnrichmentData" -C 10
# Search for ProgAI related enrichment handling
rg -t typescript "ProgAI" -C 10
# Search for type imports of IMemberEnrichmentData
rg -t typescript "import.*IMemberEnrichmentData"
Length of output: 234
Script:
#!/bin/bash
# Search for IMemberEnrichmentData usage with more context
rg "IMemberEnrichmentData" -C 10
# Search for ProgAI related enrichment handling
rg "ProgAI" -C 10
# Look for files that might handle enrichment data
fd -e ts -e js enrichment
Length of output: 141360
services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (1)
46-46
: Verify proper handling of memberId across services.
The addition of `memberId` to enrichment input is architecturally sound and enables better tracking. However, we should verify its proper handling across services.
✅ Verification successful
Based on the results, I can now generate the final response:
The memberId usage in enrichment input is properly implemented and safe.
The verification shows that:
- `memberId` is correctly defined as a required field in the `IEnrichmentSourceInput` interface
- It's properly used in cache operations across the enrichment workflow
- The implementation in LinkedIn scraper service demonstrates consistent handling with proper cache lookups
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify memberId handling across enrichment services
# Expected: All enrichment services should handle memberId consistently
# Check enrichment service implementations
ast-grep --pattern 'class $_EnrichmentService {
$$$
getEnrichmentData($_: IEnrichmentSourceInput) {
$$$
}
}'
# Check cache operations with memberId
rg -A 5 "memberId.*IEnrichmentSourceInput"
Length of output: 165
Script:
#!/bin/bash
# Let's try a broader search to understand the enrichment service implementations
# Check for enrichment service classes
ast-grep --pattern 'class $_ implements IEnrichmentService {
$$$
}'
# Look for IEnrichmentSourceInput interface definition and usage
rg -A 10 "interface IEnrichmentSourceInput"
# Search for memberId usage in enrichment context
rg -A 5 "memberId.*enrich"
# Look for cache operations
rg -A 5 "cache.*memberId"
Length of output: 3578
services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (2)
4-4
: LGTM: Import addition is consistent with the codebase pattern
The new import follows the established pattern of importing database functions from the data access layer.
45-47
: Verify handling of array return type in downstream code
The return type change from a single object to potentially an array of normalized data suggests the LinkedIn scraper might return multiple profiles. This could impact downstream consumers expecting a single object.
Let's verify the consumers of this function:
✅ Verification successful
Let me analyze the consumer code further to understand how the array return type is handled.
Let me check one more aspect to understand how the data is being handled in the workflow.
Return type change is safely handled in the workflow
The array return type is safely handled in the workflow implementation:
- The normalized data is stored in a `toBeSquashed` object keyed by source
- There's a TODO for implementing a data squasher that will handle the consolidated data
- The array return type actually provides more flexibility for the upcoming LLM-based data squashing implementation
- No immediate consumers are making assumptions about single vs array return types
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for calls to normalizeEnrichmentData to verify array handling
rg -l "normalizeEnrichmentData" | while read -r file; do
echo "=== $file ==="
rg -A 5 "normalizeEnrichmentData" "$file"
done
Length of output: 2102
Script:
#!/bin/bash
# Check the implementation of data squashing logic and array handling
ast-grep --pattern 'const normalized = await normalizeEnrichmentData($source, $data)
$$$'
Length of output: 93
Script:
#!/bin/bash
# Check the type definition to understand the data structure
rg -A 10 "interface IMemberEnrichmentDataNormalized"
Length of output: 1780
Script:
#!/bin/bash
# Look for any data squashing implementation
rg -B 2 -A 5 "data squasher"
Length of output: 521
Script:
#!/bin/bash
# Check the implementation details around the normalized data usage
rg -B 5 -A 10 "const normalized = await normalizeEnrichmentData"
Length of output: 1540
services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.ts (1)
59-61
: Verify if async operation is necessary for email validation.
The method has been made async but contains no asynchronous operations. While this change aligns with the interface, it adds unnecessary Promise overhead for a synchronous operation.
services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (1)
535-550
: Verify the usage pattern across the codebase.
Let's verify how this new function aligns with existing patterns and its usage.
✅ Verification successful
Let me generate additional verification scripts to check the implementation details and usage patterns more thoroughly.
Based on the comprehensive verification results, I can now provide the final response:
The implementation follows established patterns and is correctly integrated
The function `findMemberEnrichmentCacheForAllSourcesDb` is:
- Properly typed with the generic `IMemberEnrichmentCache<T>` interface
- Consistently uses the `DbConnOrTx` transaction pattern like other DB functions
- Follows the same database access patterns as related functions
The only difference from `findMemberEnrichmentCacheDb` is the omission of the source filter, which is intentional as per the function's purpose to fetch all sources.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check usage patterns and similar functions
# Check for similar functions and their usage patterns
echo "Checking for similar functions and their patterns..."
rg --type ts "findMemberEnrichmentCache.*Db" -A 5
# Check for tenant isolation patterns in similar functions
echo "Checking tenant isolation patterns..."
rg --type ts "from \"memberEnrichmentCache\".*where" -A 5
# Look for potential callers of this function
echo "Looking for potential callers..."
rg --type ts "findMemberEnrichmentCacheForAllSourcesDb"
Length of output: 4552
Script:
#!/bin/bash
# Check the implementation of findMemberEnrichmentCache function to compare patterns
ast-grep --pattern 'function findMemberEnrichmentCache$_($$$) {
$$$
}'
# Check for any enrichment-related interfaces and types
rg --type ts "interface.*MemberEnrichment" -A 5
# Look for any test files that might use these functions
fd -e test.ts -e spec.ts -x rg -l "findMemberEnrichmentCache"
# Check for any transaction usage patterns in similar functions
rg --type ts "tx: DbConnOrTx" -B 5 -A 10
Length of output: 16279
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (1)
1-157
: LGTM!
The implementation of the `EnrichmentServiceProgAILinkedinScraper` class looks good overall. It follows the expected structure, implements the necessary methods from the `IEnrichmentService` interface, and handles the retrieval and normalization of LinkedIn profile data using the ProgAI service.
The class effectively utilizes caching mechanisms, checks for enrichable identities, and ensures no duplicates are processed. Error handling is also implemented when making API calls to the ProgAI service.
The integration with the `EnrichmentSourceServiceFactory` and the updates to the `getMembersToEnrich` function ensure that this new enrichment source is properly incorporated into the existing system.
Great job on implementing this new feature!
🧰 Tools
🪛 Biome
[error] 61-61: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 95-96: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
Resolved review threads:
- services/apps/premium/members_enrichment_worker/src/sources/serp/service.ts
- services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (outdated)
- services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (outdated)
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (6)
services/apps/premium/members_enrichment_worker/src/sources/progai/service.ts (2)
Line range hint 121-126: Consider documenting the async change rationale.

While making the method async aligns with other enrichment services, the implementation remains synchronous. Consider adding a comment explaining why this method is async for future maintainers.

```diff
+  // Method is async to maintain consistency with other enrichment services
   async isEnrichableBySource(input: IEnrichmentSourceInput): Promise<boolean> {
```
312-342: Consider adding architectural safeguards.

The service makes external API calls but lacks several important architectural safeguards:
- Environment variable validation
- Retry mechanism for transient failures
- Rate limiting to prevent API quota exhaustion

Consider implementing these improvements:
- Add environment variable validation in the constructor
- Implement retry logic using exponential backoff
- Add rate limiting using a token bucket algorithm

Would you like me to provide implementation examples for any of these improvements?
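As a rough sketch of the retry-with-exponential-backoff idea suggested above — not code from this PR, and the function names and delay values are made up for illustration:

```typescript
// Hypothetical sketch: retry an async call with exponential backoff.
// The delay computation is a separate pure function so the backoff
// schedule can be checked without real timers.
function computeBackoffDelay(attempt: number, baseMs = 100, maxMs = 5000): number {
  // Delay doubles each attempt (100, 200, 400, ...) capped at maxMs.
  return Math.min(baseMs * 2 ** attempt, maxMs)
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 100,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      // Sleep before the next attempt, except after the final failure.
      if (attempt < maxAttempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, computeBackoffDelay(attempt, baseMs)))
      }
    }
  }
  throw lastError
}
```

Keeping `computeBackoffDelay` pure makes the backoff schedule unit-testable; adding jitter on top of it would be a natural next step to avoid thundering-herd retries.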
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (4)

59-60: Simplify condition using optional chaining

You can enhance readability by simplifying the condition with optional chaining.
Apply this diff:

```diff
 return (
   hasEnrichableLinkedinInCache ||
-  (input.linkedin && input.linkedin.value && input.linkedin.verified)
+  (input.linkedin?.value && input.linkedin.verified)
 )
```
🪛 Biome
[error] 60-60: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
108-108: Simplify condition using optional chaining

Similarly, simplify the condition here using optional chaining for better clarity.
Apply this diff:

```diff
-    if (input.linkedin && input.linkedin.value && input.linkedin.verified) {
+    if (input.linkedin?.value && input.linkedin.verified) {
```
🪛 Biome
[error] 108-108: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
145-146: Return an empty array instead of null

Returning an empty array when no profiles are found is preferable to returning `null`. This approach avoids potential `null` reference errors and simplifies the handling in consuming code.
Apply this diff:

```diff
-    return profiles.length > 0 ? profiles : null
+    return profiles
```
180-181: Ensure consistent return type by returning an empty array

For consistency and to prevent `null` reference issues, consider returning an empty array instead of `null` when there are no normalized profiles.
Apply this diff:

```diff
-    return normalizedProfiles.length > 0 ? normalizedProfiles : null
+    return normalizedProfiles
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/types.ts (1 hunks)
- services/apps/premium/members_enrichment_worker/src/sources/progai/service.ts (2 hunks)
- services/apps/premium/members_enrichment_worker/src/types.ts (5 hunks)
- services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- services/apps/premium/members_enrichment_worker/src/types.ts
- services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts
🧰 Additional context used
🪛 Biome
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts
[error] 60-60: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 108-108: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
🔇 Additional comments (4)
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/types.ts (2)
1-1: LGTM!

Clean import statement following TypeScript best practices.

3-8: Add JSDoc documentation to improve maintainability.

The interface structure is well-defined, but adding JSDoc documentation would improve maintainability and provide better context for the metadata fields.
Consider adding documentation like this:
```diff
+/**
+ * Represents enriched member data from LinkedIn scraping.
+ * Extends the base ProgAI enrichment data with LinkedIn-specific metadata.
+ */
 export interface IMemberEnrichmentDataProgAILinkedinScraper extends IMemberEnrichmentDataProgAI {
+  /**
+   * Additional metadata for LinkedIn enrichment
+   * @property {number} repeatedTimesInDifferentSources - Number of times this data appeared in different sources
+   * @property {boolean} isFromVerifiedSource - Indicates if the data comes from a verified LinkedIn source
+   */
   metadata: {
     repeatedTimesInDifferentSources: number
     isFromVerifiedSource: boolean
   }
 }
```
Also, consider adding runtime validation for these fields when processing the enrichment data to ensure data integrity.
✅ Verification successful
Based on the search results, I can see that these fields are actively used in the service implementation with proper validation and tracking logic. The `service.ts` file shows that:
- `repeatedTimesInDifferentSources` is properly tracked using a HashMap to count occurrences
- `isFromVerifiedSource` is set based on the source validation
- Both fields are properly initialized and updated throughout the enrichment process
JSDoc documentation would enhance code clarity, but the implementation shows proper field handling.
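The map-based counting described above can be sketched in isolation like this — a simplified, hypothetical illustration (the real service tracks full identity objects coming from enrichment caches, not bare strings):

```typescript
// Simplified sketch: count how often the same LinkedIn handle shows up
// across enrichment results from different sources.
interface LinkedinIdentity {
  value: string
  repeatedTimesInDifferentSources: number
  isFromVerifiedSource: boolean
}

function collectIdentities(handlesPerSource: string[][]): LinkedinIdentity[] {
  const seen = new Map<string, LinkedinIdentity>()
  for (const handles of handlesPerSource) {
    for (const value of handles) {
      const existing = seen.get(value)
      if (existing) {
        // Same handle seen again in another source: bump the counter.
        existing.repeatedTimesInDifferentSources += 1
      } else {
        seen.set(value, { value, repeatedTimesInDifferentSources: 1, isFromVerifiedSource: false })
      }
    }
  }
  return [...seen.values()]
}
```

Using the handle as the map key is what gives the deduplication; the counter then doubles as a crude confidence signal when choosing which identity to scrape.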
Consider adding the JSDoc documentation suggested above to improve code readability and maintainability.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Check if validation is implemented for these fields
rg -A 5 "repeatedTimesInDifferentSources|isFromVerifiedSource" --type ts
```
Length of output: 6590
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (2)
42-55: Verify the necessity of the `break` statements within the loop

The `break` statements at lines 51 and 54 may cause the `for` loop over `caches` to exit prematurely after processing the first matching cache. This could prevent subsequent caches from being evaluated, potentially missing additional enrichable LinkedIn identities.
Please verify if this immediate exit is intentional. If not, consider removing or adjusting the `break` statements to ensure all relevant caches are processed.
161-163: Improve error handling for Axios request

As previously noted, the `getDataUsingLinkedinHandle` method lacks specific error handling for the Axios request. Enhancing error handling will make the code more robust and provide better error insights.
```ts
const url = `${process.env['CROWD_ENRICHMENT_PROGAI_URL']}/get_profile`
const config = {
  method: 'get',
  url,
  params: {
    email,
    with_emails: true,
    api_key: process.env['CROWD_ENRICHMENT_PROGAI_API_KEY'],
  },
  headers: {},
}

const response = (await axios(config)).data
return response?.profile || null
```
🛠️ Refactor suggestion
Add error handling and reduce code duplication.

The method has similar issues as `getDataUsingGitHubHandle` and contains duplicated HTTP request logic.

1. Add error handling and timeout:

```diff
 async getDataUsingEmailAddress(email: string): Promise<IMemberEnrichmentDataProgAI> {
   const url = `${process.env['CROWD_ENRICHMENT_PROGAI_URL']}/get_profile`
   const config = {
     method: 'get',
     url,
     params: {
       email,
       with_emails: true,
       api_key: process.env['CROWD_ENRICHMENT_PROGAI_API_KEY'],
     },
     headers: {},
+    timeout: 5000, // 5 seconds timeout
   }
-  const response = (await axios(config)).data
-  return response?.profile || null
+  try {
+    const response = await axios<IMemberEnrichmentDataProgAIResponse>(config)
+    return response.data?.profile || null
+  } catch (error) {
+    this.log.error('Failed to fetch email profile from ProgAI', {
+      error,
+      email,
+    })
+    return null
+  }
 }
```
2. Consider extracting the common HTTP request logic:

```ts
private async fetchProfile(params: Record<string, string>): Promise<IMemberEnrichmentDataProgAI> {
  const config = {
    method: 'get',
    url: `${process.env['CROWD_ENRICHMENT_PROGAI_URL']}/get_profile`,
    params: {
      ...params,
      with_emails: true,
      api_key: process.env['CROWD_ENRICHMENT_PROGAI_API_KEY'],
    },
    headers: {},
    timeout: 5000,
  }
  try {
    const response = await axios<IMemberEnrichmentDataProgAIResponse>(config)
    return response.data?.profile || null
  } catch (error) {
    this.log.error('Failed to fetch profile from ProgAI', {
      error,
      params,
    })
    return null
  }
}
```

Then update both methods to use it:

```ts
async getDataUsingGitHubHandle(githubUsername: string): Promise<IMemberEnrichmentDataProgAI> {
  return this.fetchProfile({ github_handle: githubUsername })
}

async getDataUsingEmailAddress(email: string): Promise<IMemberEnrichmentDataProgAI> {
  return this.fetchProfile({ email })
}
```
```diff
 const url = `${process.env['CROWD_ENRICHMENT_PROGAI_URL']}/get_profile`
 const config = {
   method: 'get',
   url,
   params: {
     github_handle: githubUsername,
     with_emails: true,
     api_key: process.env['CROWD_ENRICHMENT_PROGAI_API_KEY'],
   },
   headers: {},
 }

-return response.profile
+const response: IMemberEnrichmentDataProgAIResponse = (await axios(config)).data
+return response?.profile || null
```
Add error handling and request timeout.

The HTTP request lacks error handling and timeout configuration, which could lead to unhandled exceptions or hanging requests.
Consider applying these improvements:

```diff
 async getDataUsingGitHubHandle(githubUsername: string): Promise<IMemberEnrichmentDataProgAI> {
   const url = `${process.env['CROWD_ENRICHMENT_PROGAI_URL']}/get_profile`
   const config = {
     method: 'get',
     url,
     params: {
       github_handle: githubUsername,
       with_emails: true,
       api_key: process.env['CROWD_ENRICHMENT_PROGAI_API_KEY'],
     },
     headers: {},
+    timeout: 5000, // 5 seconds timeout
   }
-  const response: IMemberEnrichmentDataProgAIResponse = (await axios(config)).data
-  return response?.profile || null
+  try {
+    const response = await axios<IMemberEnrichmentDataProgAIResponse>(config)
+    return response.data?.profile || null
+  } catch (error) {
+    this.log.error('Failed to fetch GitHub profile from ProgAI', {
+      error,
+      githubUsername,
+    })
+    return null
+  }
 }
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```ts
const url = `${process.env['CROWD_ENRICHMENT_PROGAI_URL']}/get_profile`
const config = {
  method: 'get',
  url,
  params: {
    github_handle: githubUsername,
    with_emails: true,
    api_key: process.env['CROWD_ENRICHMENT_PROGAI_API_KEY'],
  },
  headers: {},
  timeout: 5000, // 5 seconds timeout
}
try {
  const response = await axios<IMemberEnrichmentDataProgAIResponse>(config)
  return response.data?.profile || null
} catch (error) {
  this.log.error('Failed to fetch GitHub profile from ProgAI', {
    error,
    githubUsername,
  })
  return null
}
```
…aper and gh progai
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (3)
61-61: Remove redundant break statement.

This break statement is redundant as there's already a break statement at line 59 within the same block.

```diff
-      break
```
75-131: LGTM with minor improvement suggestion.

The implementation effectively tracks and deduplicates LinkedIn identities while preserving metadata about verification status and occurrence frequency.
Consider using optional chaining for safer property access:

```diff
-if (input.linkedin && input.linkedin.value && input.linkedin.verified)
+if (input.linkedin?.value && input.linkedin?.verified)
```
🪛 Biome
[error] 115-115: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
133-165: LGTM with minor improvement suggestion.

The method effectively handles data fetching and cache management.
Consider using optional chaining for safer property access:

```diff
-if (existingProgaiCache && existingProgaiCache.data && (existingProgaiCache.data as IMemberEnrichmentDataProgAI).id == data.id)
+if (existingProgaiCache?.data && (existingProgaiCache.data as IMemberEnrichmentDataProgAI).id == data.id)
```
🪛 Biome
[error] 148-149: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (1 hunks)
🧰 Additional context used
🪛 Biome
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts
[error] 67-67: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 115-115: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
[error] 148-149: Change to an optional chain.
Unsafe fix: Change to an optional chain.
(lint/complexity/useOptionalChain)
🔇 Additional comments (3)
services/apps/premium/members_enrichment_worker/src/sources/progai-linkedin-scraper/service.ts (3)
23-42: LGTM! Well-structured class declaration with clear configuration.

The class is properly structured with appropriate inheritance and interface implementation. The configuration properties are well-defined, including a reasonable cache expiration time of 90 days for LinkedIn data.
185-201: LGTM! Well-implemented normalization logic.

The method correctly normalizes profiles while preserving metadata, following the established pattern of using the ProgAI service for base normalization.
167-183: Add proper error handling following established patterns.

The method needs robust error handling for API calls.
As mentioned in the previous review, implement proper error handling:

```diff
-    const response: IMemberEnrichmentDataProgAIResponse = (await axios(config)).data
-    return response?.profile || null
+    try {
+      const response: IMemberEnrichmentDataProgAIResponse = (await axios(config)).data
+      return response?.profile || null
+    } catch (err) {
+      if (axios.isAxiosError(err)) {
+        this.log.warn(
+          `Axios error occurred while getting ProgAI data: ${err.response?.status} - ${err.response?.statusText}`
+        )
+        if (err.response?.status === 404) {
+          throw new Error404NotFound('LinkedIn profile not found')
+        }
+        throw new Error(`ProgAI enrichment request failed with status: ${err.response?.status}`)
+      }
+      throw new Error500InternalServerError(`Unexpected error during ProgAI enrichment: ${err.message}`)
+    }
```
```ts
async hasRemainingCredits(): Promise<boolean> {
  return true
}
```
Implement actual credit checking.

The method currently always returns true without verifying available credits. This could lead to issues if the ProgAI service has rate limits or credit restrictions.
Consider implementing actual credit checking by:
- Checking remaining credits from ProgAI API
- Maintaining a local counter for rate limiting
- Implementing proper error handling for when credits are exhausted

Would you like me to help implement this functionality?
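For illustration only — ProgAI's actual credit or quota API is not referenced in this PR — a local counter with a refill window could look roughly like this (`LocalCreditTracker` and its limits are hypothetical):

```typescript
// Illustrative local credit tracker; the limits and window are made-up values.
// The clock is injectable so the window roll-over is deterministic in tests.
class LocalCreditTracker {
  private used = 0
  private windowStart: number

  constructor(
    private readonly maxCreditsPerWindow: number,
    private readonly windowMs: number,
    private readonly now: () => number = Date.now,
  ) {
    this.windowStart = this.now()
  }

  hasRemainingCredits(): boolean {
    this.rollWindowIfNeeded()
    return this.used < this.maxCreditsPerWindow
  }

  // Returns false (instead of throwing) when the window's budget is spent.
  consumeCredit(): boolean {
    this.rollWindowIfNeeded()
    if (this.used >= this.maxCreditsPerWindow) return false
    this.used += 1
    return true
  }

  private rollWindowIfNeeded(): void {
    if (this.now() - this.windowStart >= this.windowMs) {
      this.used = 0
      this.windowStart = this.now()
    }
  }
}
```

A real implementation would presumably reconcile this local counter against whatever credit balance the ProgAI API reports, since a purely local count drifts across multiple worker instances.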
Changes proposed ✍️
What
copilot:summary
copilot:poem
Why
How
copilot:walkthrough
Checklist ✅
Feature, Improvement, or Bug.
Release Notes
New Features
PROGAI_LINKEDIN_SCRAPER.
Documentation