Connecting Informatica to Twitter
Monday, December 29th, 2014In the past few months I have started to use Informatica Power Centre at work. During the course the instructor mentioned that Informatica had released an Express version of the full development tool. This is fully feature but limited in the number of records it can process. The limit is 250,000 records per day so this makes it useful for personal, non-commercial use. This has lead me to install the system on my home computer and to start to investigate some of the features I will not have the opportunity to use in my day job.
Informatica Power Centre Express can be connected to several data sources including social media site such as Twitter and Facebook. I am highly unlikely to have the opportunity to use Informatica in this area and so a weekend project came into being, the problem, how to connect Power Centre Express to Twitter, retrieve a list of followers and store this in a local database.
Problem Definition
Informatica Power Centre Express can performed a number of operations with Twitter:
- Search
- Lookup a profile
- Get a list of Followers for the account
- Get a list of Friends for the account
The two operations of interest are the Get list of Followers and Profile Lookup functions.
The general algorithm is as follows:
- Get a list of Followers for the Twitter user
- Compare the list to those in the local database.
- If the Twitter user ID is found then update the record to indicate that the user is still a follower at the time the list was retrieved.
- If the Twitter user is not in the list of Followers in the local database then retrieve information about the Twitter user and add to the local database.
One of the overriding aims of this is to only ever retrieve as much information as necessary and never any more. It is therefore essential that the data is filtered as soon as possible to prevent retrieving information which we already hold locally.
Connecting Power Centre Express to Twitter
Rather then duplicate existing knowledge, I draw your attention to the article Informatica PowerCenter Express – Connecting to Twitter. This is a fairly comprehensive article but it stops short of what will be achieved in this post, namely dynamic lookup of user information.
In order to implement the above article you will need to have the following completed:
- Download and install Power Centre Express
- Create a Twitter account and also register for a Twitter developer ID
At the end of the article you should have a Power Centre mapping which can retrieve static information from Twitter.
Retrieving Dynamic Data
If you have followed the above article you will be able to retrieve information about a user, a number of posts which match search criteria and lookup a profile. This is all achieved by changing the Query Parameter in the Twitter_Data_Object_Operation and running a mapping through the Developer user interface.
Hidden away in the help system is the key fact which we need to allow the retrieval of multiple user profiles where the user IDs are taken from a dynamic data source, namely a text file.
Twitter Project
The Informatica Twitter Project created following the guidelines in the above article looks like this:
This is slightly different from the project in the article as the project is intended to process the data retrieved.
conInformatica
This object is a connection from Informatica to a SQL Server local database.
rdoTwitterUser
This object is a relational database object in the local SQL Server database. It contains the mapping to a database table which can be used to hold information about the Twitter user. The SQL code to create a small table to hold the Twitter user details is as follows:
1 2 3 4 5 6 7 8 9 | CREATE TABLE [dbo].[TwitterUser] ( [ID] [ int ] IDENTITY(1,1) NOT NULL , [Description] [ varchar ](512) NULL , [TwitterID] [ varchar ](160) NULL , [IdStr] [ varchar ](160) NULL , [ScreenName] [ varchar ](160) NULL , [LastUpdated] [datetime] NULL DEFAULT (getdate()) ) ON [ PRIMARY ] |
This table holds a local ID, the Twitter ID (TwitterID and IdStr), Twitter account name (ScreenName), a description (Description) for the account and the date/time the record in the local database was last updated (LastUpdated).
When executing a Twitter Profile Lookup, the Twitter API returns much more information than the above. The following creates a table which would hold all of the information returned from the API:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | CREATE TABLE [dbo].[TwitterUser] ( [ID] [ int ] IDENTITY(1,1) NOT NULL , [ContributorsEnabled] [ bit ] NULL , [CreatedAt] [datetime] NULL , [DefaultProfile] [ bit ] NULL , [DefaultProfileImage] [ bit ] NULL , [Description] [ varchar ](512) NULL , [FavouritesCount] [ bigint ] NULL , [FollowRequestSent] [ bit ] NULL , [FollowersCount] [ bigint ] NULL , [Following] [ bit ] NULL , [FriendsCount] [ bigint ] NULL , [GeoEnabled] [ bit ] NULL , [TwitterID] [ varchar ](160) NULL , [IdStr] [ varchar ](160) NULL , [IsTranslator] [ bit ] NULL , [Lang] [ varchar ](40) NULL , [ListedCount] [ bigint ] NULL , [Location] [ varchar ](360) NULL , [ Name ] [ varchar ](220) NULL , [Notifications] [ bit ] NULL , [ProfileBackgroundColour] [ varchar ](120) NULL , [ProfileBackgroundImageURL] [ varchar ](940) NULL , [ProfileBackgoundImageURLHttps] [ varchar ](940) NULL , [ProfileBackgroundImageTile] [ bit ] NULL , [ProfileImageURL] [ varchar ](1600) NULL , [ProfileImageURLHttps] [ varchar ](1640) NULL , [ProfileLinkColour] [ varchar ](120) NULL , [ProfileSidebarBorderColour] [ varchar ](120) NULL , [ProfileSidebarFillColour] [ varchar ](120) NULL , [ProfileTextColour] [ varchar ](120) NULL , [ProfileUseBackgroundImage] [ bit ] NULL , [Protected] [ bit ] NULL , [ScreenName] [ varchar ](160) NULL , [ShowAllInlineMedia] [ varchar ](100) NULL , [StatusContributors] [ varchar ](512) NULL , [StatusCoordinates] [ varchar ](512) NULL , [StatusCreatedAt] [datetime] NULL , [StatusFavourited] [ bit ] NULL , [StatusGeo] [ varchar ](512) NULL , [StatusID] [ varchar ](360) NULL , [StatusInReplyToScreenName] [ varchar ](512) NULL , [StatusInReplyToStatusID] [ varchar ](512) NULL , [StatusInReplyToUserID] [ varchar ](512) NULL , [StatusPlace] [ varchar ](512) NULL , [StatusPossiblySensitive] [ varchar ](100) NULL , [StatusRetweetCount] [ bigint ] NULL , [StatusRetweeted] [ bit ] NULL , [StatusSource] [ varchar ](1440) NULL , [StatusText] [ varchar ](1900) NULL , [StatusTruncated] [ bit ] NULL , [StatusesCount] [ bigint ] NULL , [TimeZone] [ varchar ](120) NULL , [URL] [ varchar ](480) NULL , [UTCOffset] [ bigint ] NULL , [Verified] [ bit ] NULL , [LastUpdated] [datetime] NULL DEFAULT (getdate()) ) ON [ PRIMARY ] |
conNevynTwitterAccount
This connection object connects the Informatica project to a Twitter account.
tdoNevynsTwitter
tdoNevynsTwitter is a Twitter Data Object connected to a Twitter account.
adoGetFollowers
This data object gets a lit of Followers for the Twitter account. It retrieves a list Twitter user IDs and nothing more.
adoGetUserDetails
adoGetUserDetails retrieves the details for one or more users. In the article above it retrieved information about a single user by setting the query parameter to something like user_id=123456 where 123456 represents the Twitter user ID to lookup.
The Query Parameter can be modified to read the IDs from a text file by setting the value to something like user_id=file:///c:\tmp\UserIDs.txt.
Informatica will now use the file as the data source for the query.
ffUserIDs
ffUserIDs is a flat file object which will contain the Twitter IDs of the Followers to look up. This will be the same file specified in the Query Parameter of the adoGetUserDetails object.
Mappings
Two mappings are required to achieve the project goals, the first mapGetFollowers retrieves the full list of Followers for the Twitter account. The second mapping, mapUpdateTwitterUsersTable, looks up the new Followers and adds them to the local database.
mapGetFollowers
mapGetFollowers retrieves the full list of Followers for the Twitter account.
The User IDs are split into two groups by the mapping, the first group contains the IDs of the new Followers. These IDs are written to the flat file object ffUserIDs. The second group are used to update the local database to say that the IDs stored are still Followers on the date the mapping was run.
adoGetFollowers retrieves the full list of Followers for the Twitter account.
The list of Follower IDs are then used in lkpUser to see if the FollowerIDs retrieved exist in the local database. The IdStr field in the lookup will contain the ID of the Twitter user for Followers whose IDs are in the local database and will be null for new Followers.
expAddLastUpdatedField simply adds the current date/time to the record set.
rtrSplitExistingFromNewFollowers splits the records into two groups, new Followers and existing Followers. The IDs of new Followers are written to the flat file ffUserIDs. The IDs for existing followers are passed to the update strategy updExistingFollowers, this will update the local database by matching the Follower ID and updating the LastUpdated column in the database.
mapUpdateTwitterUsersTable
mapUpdateTwitterUsersTable reads the Follower IDs from the flat file ffUserIDs and passes them to Twitter in order to get the profile for the users. Twitter will send back the profile for each of the user IDs in the file.
The Required data is then extracted and converted from Twitter data (strings etc.) into SQL data types by the expression expConverTwitterDataToSQLDataTypes.
wfUpdateFollowersInformation
This workflow allows the two mappings to be executed:
The command tasks in the workflow simply audit the execution of the mappings recording start times, number of records etc. into a log file.
Data Types and Conversions
The current project simply looks at the number of Followers and the Twitter account names. Along the way the full data set returned by Twitter was examined and conversions for the various fields put together. If you wish to extend this project then you may find the following conversions useful.
Data Type | Conversion/Expression |
Boolean | IIF(LOWER(boolean_field) = ‘true’, 1, 0) |
BigInt | IIF(IS_NUMBER(big_int_field), TO_BIGINT(big_int_field), 0) |
Date / Time | IIF(ISNULL(date_field), NULL, to_date(substr(date_field, length(date_field) – 3) || ‘-‘ || substr(date_field, 5, 3) || ‘-‘ || substr(date_field, 9, 2) || ‘ ‘ || substr(date_field, 12, 8))) |
Conclusion
The free Express edition of Informatica Powercentre allows the home user to experiment with the full power of Informatica. The social media connectors allow data to be retrieved and analysed locally as in this case.
If you’ve found this useful and wish to use the above you should remember to read the terms and conditions of the Twitter API and remember to respect the privacy of Twitter users.