
Data Science & Agriculture

October 12, 2019

Hello everyone, and welcome to the National Agricultural
Library’s Ag Data Commons monthly webinar series. This month we are focusing on a theme of Data
Science as it applies to Agriculture, and will highlight the big data analysis projects
and work of two of NAL’s University of Maryland iSchool data science fellows. Shivam Saith will present his work on USDA
Dietary Guidelines Sentiment Analysis, and Akshat Pant will present on his work with
Gap Assessment of Accessible Agrobiodiversity Data using GBIF and the NAL Thesaurus. I’m sorry, we’re getting a little bit of feedback here, we’re trying to fix it. Hello, if everyone could make sure that their phones are muted for the presentation so we don’t get feedback, that would be wonderful, thank you. If you have questions during the webinar,
feel free to post them in the public chat, and at the end of both presentations we will
open up for questions through both audio and the online chat. So without further delay, I will turn it over
to Shivam.>>>
Hey everyone, my name is Shivam Saith and I am a Data Science Fellow here at the National
Agricultural Library in the Knowledge Services Division. I recently graduated with a degree in Information
Management from the University of Maryland, College Park. My primary interests are working with data to generate insights that support decision making, and creating impactful data visualizations. I am here to talk about the latest project
I accomplished here at the NAL, a project titled USDA Dietary Guidelines Sentiment Analysis. In this project I have done a sentiment analysis
of the statements relevant to 6 essential nutrients and generated visualizations to
demonstrate the sentiment associated with each nutrient. I have been guided in this project by my supervisor
here at NAL, Mrs. Cynthia Parr along with my supervisor at the University of Maryland,
Mr. Adam Kriesberg. So what are these USDA Dietary guidelines? In simple words, they include a little over
100 years of nutrition advice for Americans. The first one was published in 1894. They have reflected scientific findings and health marketing techniques. Since 1980 they have been released once every 5 years under the name Dietary Guidelines. The dietary guidelines are more detailed and
comprehensive in nature. The motivation behind this project was the
fact that recommendations regarding the different nutrients have changed over the years. For example, in the past, fats were usually presented in a negative tone, but over time it has also been acknowledged that some types of fats, unlike others, are good for the body. The goal was to create visualizations to easily
convey complex information. Basically, it is about analyzing all the statements
dealing with a particular nutrient, and not just understanding whether the sentiment is
positive or negative but also calculating the extent to which a statement is positive
or negative. The individual statement sentiments have been averaged over time to generate trendline visualizations. The project scope has been limited to the
dietary guidelines that were released starting 1980 until the latest one that was released
in 2015. There are 2 reasons for limiting the scope
to this time period. The first one being and as I mentioned before,
because the dietary guidelines are more detailed in nature as compared to the original nutrition
guidelines, there was a lot more data to work with and to use it to come up with reasonable
conclusions about the sentiment of nutrients. Secondly, the data was much cleaner, and Optical Character Recognition could recognize the original data with much higher accuracy than the data from earlier years. Now let's talk a little bit about the corpus. The corpus I had to work with included the
Dietary guidelines from 1980 till 2015, as I mentioned before. I have shared the link in case anyone is interested in having a look.
The corpus consisted of a total of 8 documents containing 182,000 words in total. It is interesting to note that the content in these dietary guidelines has followed a trend where the amount of content increases with every release. To put things into perspective, the dietary guidelines in 1980 had around 3,000 words whereas the latest one in 2015 had 57,000 words. Just wanted to talk in brief about the essential
nutrients that are the focus of this project. Wikipedia defines an essential nutrient as
a nutrient that the body cannot synthesize on its own and something that must be provided
by diet. There are 6 of them including Proteins, Vitamins,
Minerals, Carbohydrates, Fat and Water. It is also important to note at this stage that the project concentrates only on the parent level and not the subgroups under a particular nutrient. For example, there are multiple types of Vitamins
like A, B, C, but the project concentrates on the overall sentiment for Vitamins and
not individual vitamins, and this applies to all the other nutrients as well. As far as the process goes, I started by converting the PDFs to textual data. I had to use Optical Character Recognition for that, and Google Docs did a fairly good job for me, taking PDFs that mixed images with text and converting them to full textual data. As far as the data cleaning goes, I had to
remove erroneous new line characters and special characters that played no part in the sentiment. I also had to develop regular expressions
to identify the beginning of a new statement and this was later used in effectively separating
the different statements. Finally, after separating at the individual
statement level, I used the relevant package methods to give me the sentiment scores. I just want to talk in brief about the technologies
I have used to accomplish this project. Python is a very popular language in the Data
Science world and has been used for the project as well. I have personally used both Python and R and
went with Python in this case because I felt it had a variety of packages for data manipulation as well as Natural Language Processing. Jupyter notebooks have been used, which allow
us to create and share documents that contain live code, equations, and visualizations. It allows one to code right in the browser
and eliminates the need to install any other Integrated Development Environment and also
makes it very easy to share your code with others. Besides this, I have used Anaconda which is an open
source distribution and helps in simplifying package management and deployment. The 2 main Python packages which have been used for sentiment analysis are TextBlob and Vader Sentiment. Each of these packages is tuned to a specific type of data. Vader is more or less tuned to social media data and TextBlob is a beginner-level package [random noise from someone's line]
>>>[Cyndy Parr] Could you please mute? Somebody is not muted.
Thank you. So, as I was mentioning, each of these packages is tuned to a specific type of data. Vader is more or less tuned to social media data, and TextBlob is a beginner-level package not tuned to any specific type of data. I have used both of them and provided the
end user the option to use either of these packages to carry out the sentiment analysis. One package performs better on some kinds of statements, and the other on others. For visualizations, there were a couple of
options I contemplated including Matplotlib which is very popular in the Python world, but ended up using plotly due to the ease
as well as the quality of visualizations it produces with minimal code. It is important to note, though, that only a limited number of the generated visualizations are free, and it is not free for commercial use. Just a bit of detail about the packages. Packages have a predefined list of positive,
negative, and negation words, with scores for each word. For instance, TextBlob has polarity, subjectivity, and intensity scores for each word. Polarity ranges from -1 to 1, subjectivity from 0 to 1, and intensity from 0.5 to 2. Polarity refers to the extent of positivity or negativity of a word, the subjectivity score captures how objective or subjective a particular word is, and intensity is a measure of how a particular word modifies the next word. As an example, the word GREAT has a polarity
of 0.8, but when the word NOT is used before GREAT, the polarity of the two words together becomes -0.4, whereas the two words VERY GREAT together have a perfect polarity score of 1.0. For calculating the polarity score of a statement,
these individual scores for words and phrases are averaged out to calculate the final score. It is important to note here that in the scope
of this project, we are concerned only with the polarity scores and not the subjectivity
or the intensity scores. This slide shows 3 simple examples of positive
negative and neutral statements as calculated by the Text Blob package. The statement on the top has a perfect score
of +1.0, primarily because of the presence of the word EXCELLENT. The statement in the middle is neutral because
it does not have any positive or negative word within it. The third statement is slightly negative in
polarity primarily because of the presence of the word LIMIT. Once the sentiment scores of the nutrient
relevant statements were calculated, 3 different types of visualizations were generated. The polarity value scatter plot, the maximum
and minimum polarity trend plot, and the mean polarity time series plot. We'll talk about each of them one by one. As you can see, the polarity value scatter plot gives the individual polarity scores for all the nutrient-relevant statements, offering a quick look at the sentiment of the different statements related to a particular nutrient in that year. As we can see here, the minimum sentiment is -0.6 and the maximum sentiment is 0.45. Thus, even though water has some strongly negative statements, because there are so many more positive sentiment statements, the mean sentiment for water for the year 2010 is positive.
all the years, but, it runs, it runs across all the years but has only one, My apologies You’re looking at the max and min polarity trend plot right now it runs across all the years and this type of plot is used to give an idea about the range of positive and negative statements
for a nutrient over the years. And in this specific case we can see that Water related positive sentiments are increasing since 1995 . Coming to the mean polarity time series plot, these plots are also runing across all the years as you can see, The only difference is that they have one trend line as compared to two trend lines in the previous plot This type of plot basically displays the mean sentiment of a nutrient over the years to give an overall idea of how the sentiment for a nutrient has
been all these years. As can be seen here, the mean Fat sentiment
dipped in 1990, remained more or less consistent in 2005, but has started increasing ever
since 2010. In the interest of time, I won’t be able
to go through the examples of the many other visualizations which I have here, but I have included them in the slide deck so that those interested can have a look at them later. Let me come to a couple of open questions. The majority of the nutrient-relevant statements are not directly positive or negative in nature. Within the dietary guidelines, most of the statements are not in a form like "Cholesterol is bad for health" or "Vitamins are really good for health." Rather, they are like "ABC and XYZ contain vitamins" or "CDE and XYZ fulfil your daily quota of proteins." The polarity scores for the overall statements
might be affected by positive and negative words which are not necessarily directly about
the nutrient in question. There might be a better way to handle this. Vader Sentiment is a package that is well
tuned to social media data, as mentioned before. TextBlob, again, is for general data. Even though the language used in the USDA dietary guidelines is more or less generic in nature, it is an open question whether to use packages with pre-decided scores for words, or whether a separate lexicon should be created specifically for dietary or nutrition related data. That also might be a good idea. The final open question, I feel, is: Are the official dietary guidelines by USDA
the best or the only trusted source to determine the trends in the sentiment for
nutrients over the years? What other articles or papers can be used? Finally, coming to the conclusion and the
future scope. I learned many good things working on this project. Python has many good quality packages for sentiment analysis. I ended up using TextBlob and Vader, but there are many other packages for sentiment analysis in Python, like the old-school Natural Language Toolkit, a package called Gensim, a package by Stanford called Stanford CoreNLP, and another package called spaCy. As the project currently stands, the user has a choice to use TextBlob or Vader. I did not come across a package specifically
tuned to nutrition data. The same analysis can be tried by using or
creating a package especially for nutrition data, and that might give us different and possibly more accurate results. As far as the future scope goes, it would be very interesting to analyze, once the official guidelines are released, the reaction of the people on social media platforms like Twitter, or maybe on individual blogs and news articles. It would be interesting to see whether the sentiment for a particular nutrient changes after the official guidelines are released, and in fact the same technologies and packages used in this project can be utilized to analyze data on social media as well. It would also be interesting to analyze how the strength of the language used for the nutrients in the dietary guidelines affects the consumption behavior of consumers for the different nutrients. Well, that's all I have for you. At the end, I want to mention that the dataset
for this project will be available soon on the Ag Data Commons website on the link shared below.
It will have the code files, the original data I worked with, and also the descriptions
of the methods used. Akshat Pant will go next and present his work.>>>
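The averaging scheme Shivam describes (per-word polarities, a negation multiplier, a mean over the statement) can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the project's actual code or TextBlob itself; the lexicon values are assumptions, apart from GREAT (0.8), which follows his example:

```python
# Minimal sketch of lexicon-based polarity scoring as described in the talk.
# Lexicon values are assumptions, except "great" (0.8), which matches the example.
LEXICON = {"great": 0.8, "excellent": 1.0, "good": 0.7, "limit": -0.25}
NEGATIONS = {"not", "no", "never"}

def statement_polarity(statement: str) -> float:
    """Average per-word polarities; a preceding negation multiplies by -0.5."""
    words = statement.lower().split()
    scores = []
    for i, word in enumerate(words):
        if word in LEXICON:
            score = LEXICON[word]
            if i > 0 and words[i - 1] in NEGATIONS:
                score *= -0.5  # negation flips and dampens: NOT GREAT -> -0.4
            scores.append(score)
    # Statements with no lexicon words come out neutral, like the middle example.
    return sum(scores) / len(scores) if scores else 0.0
```

With this sketch, `statement_polarity("not great")` reproduces the -0.4 of the NOT GREAT example, and the per-statement scores could then be averaged per year to build the trendline plots described above.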
Hi all, I’m Akshat Pant, a graduate student at the University of Maryland, College Park. I am passionate about working with data and
creating visualizations to gather insights. This project for Gap Assessment of Accessible
Agrobiodiversity Data using GBIF and the NAL Thesaurus was overseen by my supervisor at
NAL, Cynthia Parr, and my supervisor at UMD, Adam Kriesberg. So what is GBIF? GBIF stands for the Global Biodiversity Information
Facility and is an international network and research infrastructure funded by the world’s
governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.
It provides data-holding institutions around the world with common standards and open-source
tools that enable them to share information about where and when species have been recorded. This knowledge derives from many sources,
including everything from museum specimens collected in the 18th and 19th century to
geotagged smartphone photos shared by amateur naturalists in recent days and weeks. The data are provided by many institutions
from around the world. Data available through the GBIF portal are
primarily distribution data on plants, animals, fungi, and microbes for the world, and scientific
names for data. It also includes some limited agricultural data such
as germplasm records. This slide shows data about a specific organism,
namely Zea mays, which is corn. It has over 100,000 georeferenced specimens
all over the world, as can be seen on the map. GBIF also provides metrics on the overall GBIF corpus. Here we can see the occurrence trends of kingdoms like Plantae and Animalia, and we can see that the trend increases with time. The NAL Thesaurus and Glossary are online vocabulary
tools of agricultural terms in English and Spanish and are cooperatively produced by
the National Agricultural Library, USDA, and the Inter-American Institute for Cooperation
on Agriculture as well as other Latin American agricultural institutions belonging to the
Agriculture Information and Documentation Service of the Americas.
The thesaurus was first released by the NAL in 2002. And in 2007, NAL and the Inter-American Institute
for Cooperation on Agriculture collaborated to develop the Spanish version of the thesaurus. It has in-depth coverage of agriculture, biology, and related disciplines, and contains over 135,000 terms. Each term has data like its taxonomic rank, its broader term (its parent), its narrower terms (its children), an RDF representation, and a persistent URI that links to the organism's web page. Over 50,000 of the 135,000 terms in the thesaurus are
organism names. So the first criterion to include a name in the thesaurus is number of hits in AGRICOLA. AGRICOLA contains bibliographic records from the National Agricultural Library, and provides millions
of citations relating to the field of agriculture. Citations are comprised of journal articles,
book chapters, theses, patents, software, audiovisual materials and technical reports
to support agricultural research. Sometimes entire groups are included if some members of the group have high numbers of hits in AGRICOLA. There can be many reasons for this, such as preventing wrongly indexed terms. Many times, epithets are spelled the same as a genus; in that case, all the narrower terms are added. Sometimes there are not that many hits, but
the taxon may be medicinally or economically important, so those are added as well. Also we focus on about 22,000 species-level
names to keep the analysis manageable, and also because any analysis above that level would be too broad. The goals outlined for this project were to get some descriptive statistics about agrobiodiversity data (ag data) in GBIF, statistics that can represent the whole
data in just a few numbers. We also wanted to visually see the occurrence trends of the GBIF corpus and ag data in GBIF, to determine gaps or biases, and to provide examples of, and code for, how agricultural researchers can work with the GBIF data. The questions that need to be answered in
order to achieve the goals outlined were: How many different species in the thesaurus have occurrences in GBIF, that is, what percent of the species in the thesaurus have occurrence records in GBIF? Which are the most and least common agricultural species in GBIF, and are there some particular species which have especially high or low occurrence records? What is the temporal trend of the occurrences in the overall GBIF corpus, what does this trend look like visually, and how can we compare it to the temporal trend of agriculturally relevant occurrence data? Also, what does the geographic distribution of the overall GBIF corpus look like as compared to the geographic distribution of agriculturally relevant occurrences?
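Answering the first question means looking each thesaurus name up in GBIF. A minimal sketch of that lookup, using endpoints from GBIF's public REST API (the API and its `species/match` and `occurrence/search` endpoints are real; the example name and the surrounding code are illustrative, not the project's actual scripts):

```python
# Sketch: resolve a scientific name to a GBIF taxon and count its occurrences.
from urllib.parse import urlencode

GBIF_API = "https://api.gbif.org/v1"

def match_url(name: str) -> str:
    """Species-match endpoint: resolves a scientific name to a rank and taxon key."""
    return f"{GBIF_API}/species/match?{urlencode({'name': name})}"

def count_url(taxon_key: int) -> str:
    """Occurrence-search endpoint; limit=0 asks only for the total count."""
    return f"{GBIF_API}/occurrence/search?{urlencode({'taxonKey': taxon_key, 'limit': 0})}"

if __name__ == "__main__":
    # The network calls themselves, with the requests package installed:
    # import requests
    # taxon = requests.get(match_url("Zea mays")).json()   # includes 'rank', 'usageKey'
    # total = requests.get(count_url(taxon["usageKey"])).json()["count"]
    print(match_url("Zea mays"))
```

Looping such a lookup over the roughly 22,000 species-level names would yield the per-species counts the statistics below are based on.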
in bold because this project is still ongoing and I'm working on the other questions. So, for this project, I used Python for data analysis because it has great support for both web scraping and data analysis, both of which were essential for this project. Jupyter notebooks were used, run through Anaconda. Project Jupyter is a non-profit, open-source
project, born out of the IPython Project in 2014 as it evolved to support interactive
data science and scientific computing across all programming languages. So you can use Jupyter notebook for not only Python but R as well. And it will run directly on your browser without the
need for internet access or any ID. Tableau has been used to create visualizations
after performing the analysis in Python. The main packages used in this project are
pandas for data manipulation, requests and json to interact with the GBIF API, and numpy
which adds support for array and matrix operations. For each organism, the GBIF API was used to determine its rank, whether species or not, and the count of its occurrences. It turns out that 94 percent of the species in NALT, the thesaurus, have occurrence records in GBIF; 6 percent have no occurrence records. Also, each species has 18,855 occurrence records on average. This is a Venn diagram depicting the same. As for the most common, it turns out that the eBird citizen science project is so successful that most of the occurrence records are bird sightings, and we can see that the least common occurrence records are those of insects and microbes. Also, doing a simple binning of the count values tells
us that occurrence counts in the range 1-1000 are the most common, followed by the range 1001-2000. There are around 14 thousand species with
fewer than a thousand occurrences. This graph shows us how Occurrence data in the
overall GBIF corpus changes over time. The horizontal axis depicts years and the
vertical axis depicts the total number of occurrences for each year. The vertical axis, the number of occurrences, is on a logarithmic scale, which means that the trend increases not linearly but exponentially. The scale has been changed to a log scale to view the changes in occurrence values a little better. Because of the high difference between occurrence
records during earlier years and the 21st century, these changes would not be so apparent
on a normal scale. GBIF already has similar visualizations. There are some challenges that remain for
the remainder of this project. Since all the most occurrence records are
for birds, I need to adjust for the GBIF bias in bird citizen science. The map on this slide shows the distribution of occurrences in the GBIF corpus. To create a geographic distribution for the
corpus and ag relevant data, I need to improve queries so the results have effective latitude
and longitude data so there are no null or missing values in the analyses. Also, to compare the distributions, I need
to decide on a good way to either show the geospatial differences or visualize the ag data in a manner similar to the gbif corpus that is on the slide. So, further analyses will be computationally intensive
and so I plan to take advantage of the Agriculture Research Service’s high performance computing
platform called SCINet. This is a high-performance computing infrastructure for ARS researchers, designed to enable large-scale computing and
large-scale storage. So in conclusion, we observe that
most of the species in the NAL Thesaurus have occurrence data in GBIF, 94 percent to be exact. And
occurrence records for the GBIF corpus have increased exponentially with time.
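The "simple binning" of per-species occurrence counts mentioned earlier can be sketched with the standard library alone. The 1,000-wide bins follow the talk (1-1000, 1001-2000, and so on); the sample data in the comment is made up for illustration:

```python
# Stdlib-only sketch of binning per-species occurrence counts into 1,000-wide ranges.
from collections import Counter

def bin_label(count: int, width: int = 1000) -> str:
    """Map a count to a range label like '1-1000' or '1001-2000'."""
    upper = ((max(count, 1) - 1) // width + 1) * width
    return f"{upper - width + 1}-{upper}"

def bin_counts(counts):
    """Tally how many species fall into each occurrence-count bin."""
    return Counter(bin_label(c) for c in counts)

# e.g. bin_counts([500, 800, 1500]) puts two species in '1-1000' and one in '1001-2000'
```

Applied to the roughly 22,000 species counts, a tally like this is what shows the 1-1000 bin dominating, with around 14 thousand species under a thousand occurrences.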
Python is a versatile language, suitable for both web scraping and data analysis. And for future work, this approach will be useful in answering the remaining questions and will provide some guidance for people interested in using GBIF
for agricultural analysis. It will also highlight where data is missing
from GBIF, and gaps may be filled by indexing data from agricultural records or by accounting
for biases. Future work can include below-species occurrences, like particular cultivars or varieties. Thank you, and we are open to questions.>>>Alright, before we open up for questions,
this is Cyndy Parr, and I’ve been helping to advise these students, and I want to thank
them for their work with us this year. As was mentioned, Shivam and Akshat are students of the Master of Information Management Program in the iSchool at the University
of Maryland. They presented on the projects they worked
on this year as they were serving as fellows here at the National Ag Library. Keep in mind that these students are not domain
experts themselves, but did consult with domain experts as they did their work. One of our goals for these projects was to
develop capacity here at the library to support these kinds of analyses,
and we hope to have more data science fellows working with us or working with you in the
future. So we’d like to open up for questions now,
and please feel free to ask either Shivam or Akshat about their work,
or to discuss data science at USDA in general. Remember, you can ask your questions in the
chat window, and we will also unmute everyone so that you can ask directly if you wish. Any questions?>>>
Hi, this is Matthew Lang,>>>
Hi, Matthew>>>
Go ahead Matthew>>>
[Matthew] I had a question, I was looking at those slides
and I saw, I believe it was Akshat’s presentation, but I’m sorry, I don’t remember. But one of the slides was terminologies going
back to the 1600s.>>>
[Akshat] Yes>>>
Did I see that right? And I’m curious where you collected that data
from, and…>>>[Akshat]
Yeah, so, in order to get data from the GBIF
APIs, you can run a query on GBIF to get occurrence records
starting from a particular year.>>>
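The year-based query Akshat describes can be sketched against GBIF's public occurrence API (the `year` parameter, `limit=0` behavior, and `count` response field are from the documented API; the year range in the comment is an illustrative assumption):

```python
# Sketch of the year-based GBIF occurrence query described above.
from urllib.parse import urlencode

def yearly_count_url(year: int) -> str:
    """With limit=0, GBIF returns just the total number of occurrence records for a year."""
    params = urlencode({"year": year, "limit": 0})
    return f"https://api.gbif.org/v1/occurrence/search?{params}"

if __name__ == "__main__":
    # With the requests package installed, the trend data could be gathered like:
    # import requests
    # counts = {y: requests.get(yearly_count_url(y)).json()["count"]
    #           for y in range(1800, 2020, 10)}
    print(yearly_count_url(1800))
```

Plotting such counts per year, on a log scale, would reproduce the kind of temporal trend chart shown in the talk.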
[Cyndy] So these are not terminologies, Matthew, these
are occurrence records. So when Linnaeus gathered specimens for his work, those ended up in museums, and the metadata about those specimens ends up in GBIF.>>>[Akshat]
So these are all the occurrence records for
a particular year.>>>[Matthew] Okay, so, I understand that, okay. So, do we have an explanation for the spike?>>>
[Akshat] So, the spike may represent the decade. So if data is not recorded properly they may
adjust it at the end of the decade. That’s why maybe we see the spikes.>>>
[Matthew] Okay>>>
[Cyndy] But there is another possible explanation, which is that there were major expeditions historically, where, for example,
Lewis and Clark would go out and collect a couple thousand specimens and then they would
all appear in the same year. And so in that age of exploration you’re gonna
see more spikes.>>>
[Matthew] I see. That’s fascinating. It would be interesting to correlate that
with expeditions. From a historical perspective I think that
would be pretty cool. Nice work.>>>[Akshat] Thank you.>>>
[Cyndy] So while we wait for more questions, let me
just follow on a little bit. So, one of the goals of Akshat’s study is
to get a sense of what the ag bio data landscape is like. We suspect that there is a lot of agriculturally
relevant data that has not been indexed by GBIF
and to the extent that that might represent huge chunks of activity, it might also generate spikes like we see for the more historical museum-related data. Um, so the point is that GBIF so far
has really only looked to museums and citizen science and perhaps some limited agricultural
sources for its data, but there’s probably a lot more out there. Any other questions?>>>
[Matthew] Is there any move toward integrating the GBIF
data with, into sort of logical, if you will, units of business so that we can see how specific
mentions of GBIF terms are being affected by various other aspects related to ag or
climate change, or what have you?>>>
[Akshat] We could do that,
Can you hear me? Hello?>>>[Cyndy]
Repeat it, go ahead, repeat it.>>>[Akshat]
We could do that,>>>
[Cyndy] As far as I know, that hasn’t happened, but
it’s an interesting direction to explore. The other thing is that we noticed some of
the more recent citizen science reports do include people who have, for example, uploaded
photos of corn, perhaps from cornfields. So it may be that a rise in citizen science
relating to agricultural activity might increase the representation of those organisms in GBIF. Alright, well, if there are no further questions,
we may remain on the line a little bit if people are still trying to formulate their
ideas. But, Erin, do you want to discuss what happens
[Erin] Um, what happens next is that we will email
the slide deck to everyone who attended and eventually… [Echo]
We will have the recording as well. That will also be posted on the Ag Data Commons
news item. [Echo] You can always email us at the last email
on the screen right now [email protected] if you have any
further questions for us, and Shivam's [[email protected]] and Akshat's [[email protected]] emails are here as well. Oh, we have another question, we're not going
just yet. [Mumbling]>>>
[Shivam] Yeah, so thank you for
the question, David, I just saw your question. I ended up only using the official dietary
guidelines, the official papers for USDA. And as I pointed out in my future scope and
you gave a good idea that we can actually go ahead, as a future scope of this project
we can go ahead and use the other supplement articles as a list
We know what people are talking about, I mean not just official USDA articles, but what
is the perception in general about these nutrients. So that is about the future scope of this
project, if that helps to answer your question. I guess it did.>>>
[Erin] Alright, if there are no further questions,
we’ll hang around for five minutes or so just in case anyone else has anything. Otherwise, we’ll see you next month for our
next webinar. Thank you.>>>
Hi, this is Dennis, can you hear me?>>>
[Shivam] Yes we can.>>>
[Erin] Did someone else have another question?
